Beginner's Guide to Web Scraping with Python Scrapy

Introduction to Scrapy
Scrapy is a powerful Python framework for web scraping. It lets you extract data from websites efficiently, handle multi-step workflows (like following links), and export data in structured formats (JSON, CSV). It is well suited to data mining, monitoring, and automated testing.

Installation
Install Scrapy via pip:
```bash
pip install scrapy
```
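Optionally, you can install it inside a virtual environment so the dependency stays isolated (a common convention, not a requirement of this tutorial):

```bash
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install scrapy
```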
Create a Scrapy Project
Create a new project named tutorial:
```bash
scrapy startproject tutorial
cd tutorial
```
Project structure:

```
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py
```
Create Your First Spider
Spiders define how to crawl websites. Create quotes_spider.py in tutorial/spiders/:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # Unique identifier
start_urls = ["hxxps://xxx"]

    def parse(self, response):
        # Extract quotes from the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
Explanation:
- name: Used to run the spider.
- start_urls: Initial URLs to crawl.
- parse(): Handles the response, extracts data, and follows links.
- CSS selectors: Extract text using ::text and attributes with ::attr(attr_name).
- response.follow(): Automatically handles relative URLs (see the sketch after this list).
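For context, here is what response.follow() saves you from writing by hand; the equivalent formulation with scrapy.Request and response.urljoin() is shown only for illustration:

```python
# Equivalent to: yield response.follow(next_page, callback=self.parse)
# response.urljoin() resolves the relative href (e.g. '/page/2/') against
# the URL of the page currently being parsed.
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```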
Using Items for Structured Data
Define data fields in items.py:
```python
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```
Modify the spider to use items. First, import the item class:

```python
from tutorial.items import QuoteItem
```

Then, inside the parse method, build and yield the item instead of a plain dict:

```python
item = QuoteItem()
item['text'] = quote.css('span.text::text').get()
item['author'] = quote.css('small.author::text').get()
item['tags'] = quote.css('div.tags a.tag::text').getall()
yield item
```
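A practical reason to prefer Items over plain dicts (a side note, not part of the project code): assigning to a field that was never declared raises a KeyError, so typos surface immediately instead of silently producing bad data:

```python
item = QuoteItem()
item['txet'] = 'oops'  # raises KeyError: 'QuoteItem does not support field: txet'
```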
Run the Spider and Export Data
Execute the spider and save the results to quotes.json:
```bash
scrapy crawl quotes -O quotes.json
```
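The export format is inferred from the file extension, and the flag's case matters: uppercase -O overwrites the output file, while lowercase -o appends to it. For example, to export CSV instead:

```bash
scrapy crawl quotes -o quotes.csv  # -o appends; use -O to overwrite
```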
Handling Pagination
The example above already handles pagination by following the "Next" link with response.follow(). Scrapy also filters duplicate requests automatically, so revisited URLs are skipped by default.
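If you ever need to re-request a URL despite the duplicate filter (not needed for this tutorial, just an aside), Request accepts a dont_filter flag:

```python
# Bypass Scrapy's built-in duplicate filter for this one request.
yield scrapy.Request(url, callback=self.parse, dont_filter=True)
```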
Adjust Settings for Politeness
In settings.py:
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 2     # 2-second delay between requests
ROBOTSTXT_OBEY = True  # Respect robots.txt
```
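If a fixed delay feels too blunt, Scrapy ships with an AutoThrottle extension that adapts the delay to observed server latency; a minimal sketch for settings.py (the specific delay values here are illustrative):

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0  # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0   # ceiling when the server responds slowly
```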
Advanced: Item Pipelines
Use pipelines to process scraped data (e.g., save it to a database). Modify pipelines.py:
```python
class TutorialPipeline:
    def process_item(self, item, spider):
        # Add custom processing here
        return item
```
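As a concrete illustration (a sketch, not part of the generated project), a pipeline can also reject bad items by raising DropItem; here is one that discards quotes scraped without any text:

```python
from scrapy.exceptions import DropItem


class ValidateQuotePipeline:
    def process_item(self, item, spider):
        # Drop any item that was scraped without a quote body.
        if not item.get('text'):
            raise DropItem(f"Missing text in {item!r}")
        return item
```

A pipeline like this would need its own entry in ITEM_PIPELINES, exactly like the one enabled below.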
Enable the pipeline in settings.py:

```python
ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}
```

The number (300 here) controls execution order when several pipelines are enabled: lower values run first, and values are conventionally kept in the 0-1000 range.
Tips and Best Practices
- Ethical scraping: Respect website terms of service and robots.txt.
- Debugging: Use scrapy shell "URL" to test selectors interactively (see the example after this list).
- Avoiding bans: Rotate user agents and use proxies if needed.
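For instance, assuming the quotes site used throughout this tutorial, you can verify selectors before committing them to the spider:

```bash
scrapy shell "https://quotes.toscrape.com"
```

At the resulting interactive prompt, the tutorial's selectors can be tried directly:

```python
>>> response.css('div.quote span.text::text').get()  # text of the first quote
>>> response.css('li.next a::attr(href)').get()      # href of the "Next" link, e.g. '/page/2/'
```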
Conclusion
You’ve built a Scrapy spider that extracts quotes, handles pagination, and exports data! Explore the official Scrapy docs to learn more about middlewares, extensions, and advanced techniques.
This tutorial covers the essentials to get started. Happy scraping!