How to Extract All Links from a Website Using Puppeteer

Extracting all links from a website is a common task in web scraping and automation. Whether you're building a crawler, analyzing a website's structure, or gathering data, having access to all links can be invaluable. In this guide, we'll explore two approaches: using Puppeteer for manual extraction and using CaptureKit API for a simpler solution.
Method 1: Using Puppeteer
Puppeteer is a powerful Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Here's how you can use it to extract all URLs from a website:
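If you haven't already, install Puppeteer first; by default the install step also downloads a compatible build of Chromium:

npm install puppeteer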
const puppeteer = require('puppeteer');

async function extractLinks(url) {
    // Launch the browser
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
        // Navigate to the URL and wait until the network is quiet
        await page.goto(url, { waitUntil: 'networkidle0' });

        // Extract the href of every <a> element on the page
        const links = await page.evaluate(() => {
            const anchors = document.querySelectorAll('a');
            return Array.from(anchors).map((anchor) => anchor.href);
        });

        // Remove duplicates
        const uniqueLinks = [...new Set(links)];
        return uniqueLinks;
    } catch (error) {
        console.error('Error:', error);
        throw error;
    } finally {
        await browser.close();
    }
}

// Usage example
async function main() {
    const url = 'https://example.com';
    const links = await extractLinks(url);
    console.log('Found links:', links);
}

main().catch(console.error);
This code will:
- Launch a headless browser using Puppeteer
- Navigate to the specified URL
- Extract all <a> tags from the page
- Get their href attributes
- Remove any duplicate links
- Return the unique list of URLs
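If you also need the visible link text (for example, to build a simple site map), a small variation of the page.evaluate call returns both pieces of information. The object shape and the linksWithText name below are just one illustrative choice, not part of the original example:

// Return both the anchor text and the href for each link
const linksWithText = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('a')).map((anchor) => ({
        text: anchor.textContent.trim(),
        href: anchor.href,
    }));
});

console.log(linksWithText);
// e.g. [{ text: 'Docs', href: 'https://example.com/docs' }, ...]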
Handling Dynamic Content
If you're dealing with a website that loads content dynamically, you might need to wait for the content to load:
// Wait for specific elements to load
await page.waitForSelector('a');
// Or wait for network to be idle
await page.waitForNetworkIdle();
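For pages that keep loading links as you scroll (infinite feeds, lazy-loaded lists), one common approach is to scroll programmatically until the page height stops growing and only then extract the links. The autoScroll helper below is a rough sketch, and the one-second delay is an arbitrary assumption you may need to tune:

// Scroll to the bottom repeatedly so lazily loaded links get a chance to render
async function autoScroll(page) {
    let previousHeight = 0;
    while (true) {
        const currentHeight = await page.evaluate(() => document.body.scrollHeight);
        if (currentHeight === previousHeight) break; // nothing new was loaded
        previousHeight = currentHeight;
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await new Promise((resolve) => setTimeout(resolve, 1000)); // give new content time to load
    }
}

// Usage: call autoScroll(page) before the page.evaluate that collects the links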
Filtering Links
You can also filter links based on specific criteria:
const links = await page.evaluate(() => {
    const anchors = document.querySelectorAll('a');
    return Array.from(anchors)
        .map((anchor) => anchor.href)
        .filter((href) => {
            // Filter out external links
            return href.startsWith('https://example.com');
            // Or filter by specific patterns
            // return href.includes('/blog/');
        });
});
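A prefix check like href.startsWith(...) can miss links served over a different protocol or from a subdomain. A slightly more robust sketch compares hostnames with the URL constructor instead; the internalLinks name is purely illustrative:

const internalLinks = await page.evaluate(() => {
    const currentHost = window.location.hostname;
    return Array.from(document.querySelectorAll('a'))
        .map((anchor) => anchor.href)
        .filter((href) => {
            try {
                // Keep only links whose hostname matches the page we are on
                return new URL(href).hostname === currentHost;
            } catch {
                return false; // skip hrefs that are not valid absolute URLs
            }
        });
});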
Method 2: Using CaptureKit API (Recommended)
While Puppeteer is powerful, setting up and maintaining a web scraping solution can be time-consuming and complex. That's where CaptureKit API comes in. Our API provides a simple, reliable way to extract all links from any website, with additional features like link categorization and metadata extraction.
Here's how to use CaptureKit API:
curl "https://api.capturekit.dev/content?url=https://tailwindcss.com&access_key=YOUR_ACCESS_KEY"
The API response includes categorized links and additional metadata:
{
    "success": true,
    "data": {
        "links": {
            "internal": ["https://tailwindcss.com/", "https://tailwindcss.com/docs"],
            "external": ["https://tailwindui.com", "https://shopify.com"],
            "social": [
                "https://github.com/tailwindlabs/tailwindcss",
                "https://x.com/tailwindcss"
            ]
        },
        "metadata": {
            "title": "Tailwind CSS - Rapidly build modern websites without ever leaving your HTML.",
            "description": "Tailwind CSS is a utility-first CSS framework.",
            "favicon": "https://tailwindcss.com/favicons/favicon-32x32.png",
            "ogImage": "https://tailwindcss.com/opengraph-image.jpg"
        }
    }
}
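If you prefer to call the API from Node.js rather than curl, here's a minimal sketch using the built-in fetch (Node 18+) with the same url and access_key parameters shown above; the getLinks helper name and the error handling are illustrative choices:

// Minimal sketch: fetch categorized links from the CaptureKit content endpoint
async function getLinks(targetUrl, accessKey) {
    const endpoint = new URL('https://api.capturekit.dev/content');
    endpoint.searchParams.set('url', targetUrl);
    endpoint.searchParams.set('access_key', accessKey);

    const response = await fetch(endpoint);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const { data } = await response.json();
    return data.links; // { internal: [...], external: [...], social: [...] }
}

getLinks('https://tailwindcss.com', 'YOUR_ACCESS_KEY')
    .then((links) => console.log(links))
    .catch(console.error);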
Benefits of Using CaptureKit API
- Categorized Links: Links are automatically categorized into internal, external, and social links
- Additional Metadata: Get website title, description, favicon, and OpenGraph image
- Reliability: No need to handle browser automation, network issues, or rate limiting
- Speed: Results are returned in seconds, not minutes
- Maintenance-Free: No need to update code when websites change their structure
Conclusion
While Puppeteer provides a powerful way to extract URLs programmatically, it requires significant setup and maintenance. For most use cases, using CaptureKit API is the recommended approach, offering a simpler, more reliable solution with additional features like link categorization and metadata extraction.
Choose the method that best fits your needs:
- Use Puppeteer if you need full control over the scraping process or have specific requirements
- Use CaptureKit API if you want a quick, reliable solution with additional features