Web Scraping with JavaScript and Playwright: A Modern Approach with Code Examples

Mar 13, 2025 - 20:42

Web scraping has evolved to tackle the challenges of modern web applications, where content is often loaded dynamically via JavaScript. Enter Playwright—a powerful, open-source automation library by Microsoft that simplifies scraping complex websites. Unlike older tools, Playwright supports Chromium, Firefox, and WebKit out of the box and handles SPAs, authentication, and even shadow DOMs with ease.

In this guide, you’ll learn how to scrape websites using JavaScript and Playwright, complete with practical code examples.

Why Playwright?

  • Cross-browser support: Scrape with Chromium, Firefox, or WebKit.
  • Auto-waiting: No more manual sleep() calls—Playwright waits for elements to load.
  • Mobile emulation: Test responsive sites or mimic mobile devices.
  • Stealth options: Playwright has no built-in stealth mode, but launch arguments and init scripts can hide some automation fingerprints.
  • Rich API: Handle file downloads, network interception, and more.

Setup

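First, initialize a Node.js project, install Playwright, and download the browser binaries (recent Playwright versions no longer fetch them during npm install):

npm init -y
npm install playwright
npx playwright install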

Basic Scraping: Extracting Data

Let’s scrape book titles and prices from a demo e-commerce site (https://books.toscrape.com).

const { chromium } = require('playwright');

(async () => {
  // Launch a headless browser
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate to the target page
  await page.goto('https://books.toscrape.com');

  // Extract book titles and prices
  const books = await page.$$eval('.product_pod', (items) => {
    return items.map(item => ({
      title: item.querySelector('h3 a').getAttribute('title'),
      price: item.querySelector('.price_color').innerText,
    }));
  });

  console.log(books);
  await browser.close();
})();

Explanation:

  • chromium.launch() starts a headless browser instance.
  • page.$$eval() runs a function in the browser context to query DOM elements.
  • The selector .product_pod targets each book container, and nested queries extract the data.
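
The scraped prices arrive as strings such as "£51.77", which are awkward to sort or sum. A minimal post-processing sketch follows; the helper names parsePrice and normalizeBook are hypothetical, and the sample records are hand-written stand-ins for the output of the $$eval call:

```javascript
// Convert a price string such as "£51.77" into a number.
// Returns null when no numeric part can be found.
function parsePrice(text) {
  const match = String(text).replace(/,/g, '').match(/\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Normalize one scraped record: trim the title and parse the price.
function normalizeBook(book) {
  return {
    title: book.title.trim(),
    price: parsePrice(book.price),
  };
}

// Hand-written sample records shaped like the scraper's output (assumption).
const sample = [
  { title: ' A Light in the Attic ', price: '£51.77' },
  { title: 'Tipping the Velvet', price: '£53.74' },
];

const normalized = sample.map(normalizeBook);
console.log(normalized);
```

In the scraper itself, the same step would be books.map(normalizeBook) just before console.log(books).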

Handling Dynamic Content

Modern sites often load data via AJAX or in response to user actions such as scrolling or clicking "Load More". The example below handles an infinite-scroll page by scrolling until the page height stops growing:

const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example-infinite-scroll.com');

  // Scroll to the bottom repeatedly until no more content loads
  let previousHeight;
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await page.waitForTimeout(2000); // Wait for content to load
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) break;
  }

  // Extract all loaded items
  const items = await page.$$eval('.item', elements => 
    elements.map(el => el.innerText)
  );

  console.log(`Loaded ${items.length} items.`);
  await browser.close();
})();
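
On a feed that never ends, a while (true) loop never exits. One way to guard against that is to cap the number of scroll rounds. The sketch below is an assumption-laden variant of the loop above: the helper name scrollUntilStable and its maxRounds and pauseMs options are hypothetical, and it only relies on the page.evaluate() and page.waitForTimeout() calls already used in this guide:

```javascript
// Scroll until the page height stops growing or maxRounds is reached.
// `page` only needs evaluate() and waitForTimeout(), as a Playwright Page has.
async function scrollUntilStable(page, { maxRounds = 20, pauseMs = 1000 } = {}) {
  let previousHeight = await page.evaluate('document.body.scrollHeight');

  for (let round = 1; round <= maxRounds; round++) {
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await page.waitForTimeout(pauseMs); // give new content time to load

    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) {
      // Height unchanged: assume nothing more will load.
      return { rounds: round, height: newHeight, stable: true };
    }
    previousHeight = newHeight;
  }

  // Gave up before the height settled.
  return { rounds: maxRounds, height: previousHeight, stable: false };
}
```

Calling await scrollUntilStable(page, { maxRounds: 30 }) in place of the loop returns a small summary object; stable: false signals that the cap was hit before the page stopped growing.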

Advanced Techniques

1. Authentication & Sessions

Log into a site and reuse cookies for future sessions:

const { webkit } = require('playwright');

(async () => {
  const browser = await webkit.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to login page
  await page.goto('https://example.com/login');
  await page.fill('#username', 'user123');
  await page.fill('#password', 'pass123');
  // Start waiting for the navigation before clicking, so a fast
  // redirect is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#submit'),
  ]);

  // Save cookies for reuse
  const cookies = await page.context().cookies();
  console.log('Cookies saved:', cookies);

  await browser.close();
})();
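
To actually reuse a session, the cookies need to be written somewhere and loaded into a later browser context. A minimal sketch, assuming a JSON file on disk is acceptable storage; the helper names saveCookies and loadCookies and the file name are hypothetical, while context.cookies() and context.addCookies() are Playwright's BrowserContext methods:

```javascript
const fs = require('fs/promises');

// Write the context's cookies to a JSON file and return how many were saved.
async function saveCookies(context, filePath) {
  const cookies = await context.cookies();
  await fs.writeFile(filePath, JSON.stringify(cookies, null, 2), 'utf8');
  return cookies.length;
}

// Read cookies from a JSON file and add them to a fresh context.
async function loadCookies(context, filePath) {
  const cookies = JSON.parse(await fs.readFile(filePath, 'utf8'));
  await context.addCookies(cookies);
  return cookies.length;
}
```

In the login script, await saveCookies(page.context(), 'cookies.json') would go just before browser.close(). A later run can create a context with browser.newContext(), call loadCookies(context, 'cookies.json'), and then open pages with context.newPage(). Playwright's context.storageState({ path }) is an alternative that also captures local storage.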

2. Avoiding Detection

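Playwright has no built-in stealth mode, but a launch flag and an init script can hide the most obvious automation signals. Dedicated anti-bot services check far more than this, so treat it as a starting point:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    headless: false,
    args: ['--disable-blink-features=AutomationControlled']
  });
  const page = await browser.newPage();

  // Hide the navigator.webdriver flag before any page script runs.
  // The property is inherited from the prototype, so a plain
  // `delete navigator.webdriver` does not remove it.
  await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });

  await page.goto('https://example-protected-site.com');
  // ... proceed with scraping

  await browser.close();
})();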

3. Intercepting Network Requests

Capture API responses to scrape data directly from XHR/Fetch calls:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

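  // Listen for network responses
  page.on('response', async (response) => {
    if (response.url().includes('/api/data')) {
      try {
        const data = await response.json();
        console.log('API Data:', data);
      } catch (err) {
        // The body was not JSON, or the page went away before it was read
        console.error('Could not read response:', err.message);
      }
    }
  });

  // Wait for network activity to settle so the handler can run before closing
  await page.goto('https://example-spa.com', { waitUntil: 'networkidle' });
  await browser.close();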
})();
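
When only one specific API call matters, an alternative to a global listener is to wait for that response explicitly. The sketch below wraps Playwright's page.waitForResponse(); the helper name captureJson and its trigger callback are hypothetical, and the '/api/data' path in the usage note is the same placeholder used in this guide:

```javascript
// Run `trigger` (for example a navigation or a click) and resolve with the
// parsed JSON body of the first response whose URL contains `urlPart`.
async function captureJson(page, urlPart, trigger) {
  // Start waiting before triggering, so a fast response is not missed.
  const [response] = await Promise.all([
    page.waitForResponse((res) => res.url().includes(urlPart)),
    trigger(),
  ]);

  if (!response.ok()) {
    throw new Error(`Request failed with status ${response.status()}`);
  }
  return response.json();
}
```

A call might look like const data = await captureJson(page, '/api/data', () => page.goto('https://example-spa.com')). Because the function returns the data instead of logging it from an event handler, the script can safely close the browser afterwards.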

Best Practices

  1. Rate Limiting: Use page.waitForTimeout() to space out requests.
  2. Error Handling: Wrap actions in try/catch blocks, and close the browser in a finally block so failed runs do not leave browser processes behind.
  3. Selectors: Prefer text= or role= selectors for reliability.
  4. Headless Mode: Use headless: false for debugging.
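
The first two practices can be combined in a small helper layer. This is a rough sketch in plain Node.js with no Playwright dependency; the names sleep, withRetry, and visitSequentially, and the default delays, are illustrative assumptions and should be tuned per site:

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task`, retrying on failure with exponential backoff.
// With retries = 3 the task runs at most 4 times.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt); // 500, 1000, 2000, ...
      }
    }
  }
  throw lastError;
}

// Visit URLs one at a time, pausing between them to limit the request rate.
async function visitSequentially(urls, visit, { delayMs = 1000 } = {}) {
  const results = [];
  for (const [index, url] of urls.entries()) {
    if (index > 0) await sleep(delayMs);
    results.push(await withRetry(() => visit(url)));
  }
  return results;
}
```

With a Playwright page in scope, a call such as visitSequentially(urls, async (url) => { await page.goto(url); return page.title(); }) would collect one title per URL, waiting between requests and retrying transient failures.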

Ethical Considerations

  • Respect robots.txt and website terms of service.
  • Avoid scraping personally identifiable information (PII).
  • Throttle your requests so you do not overload servers. Proxies and rotating IPs spread traffic across addresses but do not reduce the load on the target site.

Conclusion

Playwright covers the hard parts of scraping modern sites: dynamic content, authenticated sessions, and data captured straight from network responses. With its auto-waiting API and cross-browser support, it is a strong default for JavaScript scraping projects.

Next Steps:

Call to Action

Got stuck? Check out Playwright’s debugging guide or drop a comment below!