Web Scraping with Python: Scraping Data Using Requests and lxml

In the world of web scraping, the lxml library is a powerful tool for parsing HTML and XML documents. Combined with the requests library, which simplifies sending HTTP requests, it provides a lightweight and efficient solution for scraping structured data from the web. In this article, we'll explore how to use requests and lxml to scrape data from websites and parse HTML content.

Step 1: Install Required Libraries

Before we start scraping, we need to install requests and lxml. These two libraries work together to fetch and parse web pages.

To install the required libraries, run:

pip install requests lxml
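If you want to confirm that both libraries installed correctly, a quick check like this should print their version numbers (the exact values will vary on your system):

# Quick sanity check: import both libraries and print their versions
import requests
from lxml import etree

print("requests:", requests.__version__)
print("lxml:", etree.LXML_VERSION)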

Step 2: Fetch Web Page with Requests

The first step in any web scraping task is to fetch the page you want to scrape. With requests, it's easy to send an HTTP GET request and retrieve the HTML content of the page.

Here's a simple script to fetch a web page:

import requests

# Define the URL of the page
url = "https://example.com"

# Send a GET request to fetch the page content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print("Page loaded successfully!")
    page_content = response.text
else:
    print("Failed to retrieve the webpage.")

This will send a GET request to the specified URL and store the HTML content of the page in page_content.
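In practice, some sites respond differently to requests without a browser-like User-Agent header, and a slow server can otherwise hang the script, so it's often worth passing explicit headers and a timeout. The header value and the ten-second timeout below are just illustrative choices, not requirements of example.com:

import requests

# Illustrative header and timeout values -- adjust them for the site you scrape
headers = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"}

response = requests.get("https://example.com", headers=headers, timeout=10)

if response.status_code == 200:
    page_content = response.text
else:
    print(f"Request failed with status code {response.status_code}")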

Step 3: Parse HTML with lxml

Now that we have the HTML content of the page, we can parse it with lxml. The library provides a powerful API for parsing HTML documents and searching for elements using XPath or CSS selectors.

Here’s an example of parsing the HTML content and extracting specific data from it:

from lxml import html

# Parse the HTML content with lxml
tree = html.fromstring(page_content)

# Use XPath to extract data
titles = tree.xpath('//h2[@class="title"]/text()')

# Print the extracted titles
for title in titles:
    print(title)

In this example, we're using XPath to extract the text of all h2 tags with the class title. You can modify the XPath query to target different elements based on the HTML structure of the page.
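The same idea extends to other parts of a page. As a rough sketch (the element names, ids, and classes below are hypothetical and will depend on the site you're scraping), XPath can also pull out attributes or text nested deeper in the document:

# Hypothetical selectors -- replace the ids/classes with ones from the real page
# All href attributes of links inside a div with id="content"
hrefs = tree.xpath('//div[@id="content"]//a/@href')

# Text of every paragraph inside article elements
paragraphs = tree.xpath('//article//p/text()')

print(hrefs)
print(paragraphs)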

Step 4: Extract Data Using CSS Selectors (Optional)

Alternatively, if you prefer CSS selectors to XPath, lxml supports them as well (this relies on the separate cssselect package, which you can install with pip install cssselect). Here's how you can extract data using CSS selectors:

# Extract data using CSS selectors
links = tree.cssselect('a.some-class')

# Print the href attributes of all matching links
for link in links:
    print(link.get('href'))

This will find all anchor (a) tags with the class some-class and print the value of the href attribute, which is the link's URL.
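Each element returned by cssselect also exposes its visible text via text_content(), so you can grab the link text and the URL together. The class name here is again just an example:

# Example class name; .text_content() returns the element's visible text
for link in tree.cssselect('a.some-class'):
    print(link.text_content().strip(), "->", link.get('href'))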

Step 5: Handling Pagination

Many websites display large sets of data across multiple pages. To scrape all the data, you may need to handle pagination. This can be done by finding the "Next" button or page URL and sending requests to each page sequentially.

Here’s an example that walks through numbered pages, using the presence of a "Next" link on each page to decide whether to continue:

# Scraping multiple pages using pagination
current_page = 1
while True:
    # Define the URL of the current page
    page_url = f"https://example.com/page/{current_page}"

    # Send a GET request to fetch the page content
    response = requests.get(page_url)

    # If the page was fetched successfully
    if response.status_code == 200:
        print(f"Scraping page {current_page}...")
        page_content = response.text

        # Parse the HTML content with lxml
        tree = html.fromstring(page_content)

        # Extract the data from the page (example: titles)
        titles = tree.xpath('//h2[@class="title"]/text()')
        for title in titles:
            print(title)

        # Check whether a "Next" link is present on this page
        next_page = tree.xpath('//a[@class="next"]')
        if next_page:
            current_page += 1  # Move to the next page
        else:
            break  # No next page found, exit the loop
    else:
        print(f"Failed to retrieve page {current_page}.")
        break

This script scrapes the site page by page, incrementing the page number and checking each page for a "Next" link, and stops once no such link is found or a request fails.
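When scraping many pages like this, it's also good practice (and often necessary to avoid being rate-limited or blocked) to pause briefly between requests. One simple approach is to sleep at the end of each loop iteration; the one-second delay here is an arbitrary example value:

import time

# Add this at the end of each iteration of the pagination loop
time.sleep(1)  # arbitrary one-second pause between requests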

Step 6: Save the Scraped Data

Once you've extracted the data, you may want to save it for further analysis. A simple way to save data is by writing it to a CSV file:

import csv

# Sample data
data = [["Title 1", "Description 1"], ["Title 2", "Description 2"]]

# Write the data to a CSV file
with open("scraped_data.csv", mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("Data saved to scraped_data.csv")

This script will write the extracted data to a CSV file, where each row contains a title and description.
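In a real scraper you would normally collect the rows as each page is parsed and write them all out once at the end, including a header row. Here is a minimal sketch of that pattern (the "page" and "title" fields are just example column names):

import csv

rows = []  # collect one row per scraped item across all pages

# ... inside the pagination loop, for each extracted title:
# rows.append([current_page, title])

# After the loop finishes, write everything out with a header row
with open("scraped_data.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["page", "title"])
    writer.writerows(rows)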

✅ Pros of Using Requests and lxml for Scraping