Building a Multithreaded Web Scraper in Python With Real-Time Progress Tracking
Web scraping can be slow when fetching data from many pages, unless you use multithreading. In this guide, we'll build a multithreaded scraper using Python's concurrent.futures module and track progress in real time with tqdm.
1. Requirements
pip install requests beautifulsoup4 tqdm
2. Basic Scraper Function
Let’s define a function that fetches and processes a single page:
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Fetch a single page and return its URL and title, or the error that occurred."""
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Guard against pages with a missing or empty <title> tag
        title = soup.title.string.strip() if soup.title and soup.title.string else "No title"
        return {'url': url, 'title': title}
    except Exception as e:
        return {'url': url, 'error': str(e)}
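A quick sanity check on a single URL (the expected output below assumes example.com still serves its usual placeholder page):

print(scrape_page('https://example.com'))
# e.g. {'url': 'https://example.com', 'title': 'Example Domain'}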
3. Running in Parallel With Threads
We'll use a thread pool for concurrent requests and tqdm for a live progress bar:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

urls = [
    'https://example.com',
    'https://www.python.org',
    'https://www.wikipedia.org',
    # Add more URLs...
]

results = []

with ThreadPoolExecutor(max_workers=10) as executor:
    # Submit every URL, then collect results as each fetch completes
    futures = {executor.submit(scrape_page, url): url for url in urls}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Scraping"):
        results.append(future.result())
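If you prefer the results in the same order as the input list, a minimal alternative sketch uses executor.map, which yields results in submission order rather than completion order while tqdm still tracks overall progress:

with ThreadPoolExecutor(max_workers=10) as executor:
    # map() preserves input order; the progress bar advances as each ordered result arrives
    ordered_results = list(tqdm(executor.map(scrape_page, urls),
                                total=len(urls), desc="Scraping"))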
4. Saving Results
You can write the results to a CSV file for further analysis:
import csv
with open('scrape_results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['url', 'title', 'error'])
    writer.writeheader()
    for r in results:
        writer.writerow({
            'url': r['url'],
            'title': r.get('title', ''),
            'error': r.get('error', '')
        })
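As an optional sanity check, you can read the file back with csv.DictReader and print the first few rows:

with open('scrape_results.csv', newline='', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row)
        if i >= 2:  # show only the first three rows
            break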
5. Notes on Rate Limiting
Always respect the target website’s robots.txt, and consider adding delays, rotating user agents, or routing through proxies if you scrape at scale. If performance is critical, look at aiohttp with asyncio: for I/O-bound work an event loop can keep far more simultaneous connections in flight than a modest thread pool.
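As a rough sketch of the "delays and user agents" idea (the header string, contact URL, and delay value below are placeholders, not values from any particular site's policy), you could wrap the worker like this:

import time

HEADERS = {'User-Agent': 'MyScraperBot/1.0 (+https://example.com/contact)'}  # hypothetical identity

def scrape_page_polite(url, delay=0.5):
    # Crude per-worker throttle: pause before every request
    time.sleep(delay)
    try:
        response = requests.get(url, timeout=5, headers=HEADERS)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.title.string.strip() if soup.title and soup.title.string else "No title"
        return {'url': url, 'title': title}
    except Exception as e:
        return {'url': url, 'error': str(e)}

Pass scrape_page_polite to the executor in place of scrape_page; with 10 workers and a 0.5 s pause this caps the request rate at roughly 20 requests per second, before accounting for network latency.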
Conclusion
With multithreading and proper tracking, you can dramatically improve your web scraping workflows in Python. This scalable approach works great for small-to-medium scraping tasks and can be adapted further for production use.
If this post helped you, consider supporting me: buymeacoffee.com/hexshift