Building a Multithreaded Web Scraper in Python With Real-Time Progress Tracking
Web scraping can be slow when fetching data from many pages, unless you use multithreading. In this guide, we'll build a multithreaded scraper using Python's concurrent.futures module and track progress in real time with tqdm.
1. Requirements
pip install requests beautifulsoup4 tqdm
2. Basic Scraper Function
Let’s define a function that fetches and processes a single page:
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Fetch a single page and return its URL and title, or the error that occurred."""
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Guard against pages with a missing or empty <title> tag
        title = soup.title.string.strip() if soup.title and soup.title.string else "No title"
        return {'url': url, 'title': title}
    except Exception as e:
        return {'url': url, 'error': str(e)}
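A quick sanity check on a single URL (the expected output below assumes example.com still serves its usual placeholder page):

print(scrape_page('https://example.com'))
# e.g. {'url': 'https://example.com', 'title': 'Example Domain'}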
3. Running in Parallel With Threads
We'll use a thread pool for concurrent requests and tqdm for a live progress bar:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

urls = [
    'https://example.com',
    'https://www.python.org',
    'https://www.wikipedia.org',
    # Add more URLs...
]

results = []

with ThreadPoolExecutor(max_workers=10) as executor:
    # Submit every URL, then collect results as each fetch completes
    futures = {executor.submit(scrape_page, url): url for url in urls}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Scraping"):
        results.append(future.result())
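If you prefer the results in the same order as the input list, a minimal alternative sketch uses executor.map, which yields results in submission order rather than completion order while tqdm still tracks overall progress:

with ThreadPoolExecutor(max_workers=10) as executor:
    # map() preserves input order; the progress bar advances as each ordered result arrives
    ordered_results = list(tqdm(executor.map(scrape_page, urls),
                                total=len(urls), desc="Scraping"))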
4. Saving Results
You can write the results to a CSV file for further analysis:
import csv
with open('scrape_results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['url', 'title', 'error'])
    writer.writeheader()
    for r in results:
        writer.writerow({
            'url': r['url'],
            'title': r.get('title', ''),
            'error': r.get('error', '')
        })
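As an optional sanity check, you can read the file back with csv.DictReader and print the first few rows:

with open('scrape_results.csv', newline='', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row)
        if i >= 2:  # show only the first three rows
            break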
5. Notes on Rate Limiting
Always respect the target website’s robots.txt, and consider adding delays, rotating user agents, or routing through proxies if you scrape at scale. If performance is critical, look at aiohttp with asyncio: for I/O-bound work an event loop can keep far more simultaneous connections in flight than a modest thread pool.
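As a rough sketch of the "delays and user agents" idea (the header string, contact URL, and delay value below are placeholders, not values from any particular site's policy), you could wrap the worker like this:

import time

HEADERS = {'User-Agent': 'MyScraperBot/1.0 (+https://example.com/contact)'}  # hypothetical identity

def scrape_page_polite(url, delay=0.5):
    # Crude per-worker throttle: pause before every request
    time.sleep(delay)
    try:
        response = requests.get(url, timeout=5, headers=HEADERS)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.title.string.strip() if soup.title and soup.title.string else "No title"
        return {'url': url, 'title': title}
    except Exception as e:
        return {'url': url, 'error': str(e)}

Pass scrape_page_polite to the executor in place of scrape_page; with 10 workers and a 0.5 s pause this caps the request rate at roughly 20 requests per second, before accounting for network latency.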
Conclusion
With multithreading and proper tracking, you can dramatically improve your web scraping workflows in Python. This scalable approach works great for small-to-medium scraping tasks and can be adapted further for production use.
If this post helped you, consider supporting me: buymeacoffee.com/hexshift