How to Scrape a Website that Requires Login with Python

This is a step-by-step guide for those who want to scrape data from a site that requires login, but don’t know where to start. I’ll walk you through the process and explain the common issues you might face along the way.
Step 1: Inspect the Login Form
First, you need to find out what data is sent during authentication and check if it can be handled with simple methods. Open the login page in your browser.
Then open DevTools (F12 or right-click and Inspect), go to the Network tab, and try logging in to see what gets sent:
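On a simple form, it’s a POST request whose form data contains just the credentials, roughly like this (the path and field names vary by site):
POST /login HTTP/1.1
Content-Type: application/x-www-form-urlencoded

email=admin%40example.com&password=password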
We’ll cover this type of auth in Step 2, so if this matches what you see, skip ahead.
In some cases, a dynamically generated token is sent along with the request.
You can often find it in the page’s HTML. Go to the Elements tab in DevTools and look for a hidden input field with the token:
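Such a field usually looks something like this (the name varies by framework: _token is common in Laravel, csrfmiddlewaretoken in Django):
<input type="hidden" name="_token" value="randomly-generated-string">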
We’ll cover this in Step 3, so jump there if that’s your case.
Another scenario: you can’t see the request details at all. If that happens, move on to Step 4, where we handle such forms using Selenium.
And finally, if the site needs full user interaction (like filling out the form manually) and has extra protection, we’ll deal with that in Step 5 using SeleniumBase.
By the way, all examples in this article use demo sites and are for educational purposes only. Scraping login-protected data isn’t always legal.
Step 2: Reproduce Login with requests
To make requests and scrape data from such sites, we’ll use the requests library together with BeautifulSoup. If you don’t have them installed, run this in your terminal or command line:
pip install beautifulsoup4 requests
Then, create a script and import the libraries:
import requests
from bs4 import BeautifulSoup
If you just send a single request with your login and password, you might get the page that normally loads after login, but any follow-up request will require you to authenticate again, because nothing preserves the session cookies between requests.
To avoid that, you can use a session. Let’s create one:
session = requests.Session()
Set up your login data in a payload variable:
payload = {
    'email': 'admin@example.com',
    'password': 'password'
}
Make sure the parameter names match the ones the site actually uses; they vary from site to site. Also, replace the values with your real login and password.
Then send a POST request and check the response status:
post_response = session.post("login-URL-here", data=payload)
post_response.raise_for_status()
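Keep in mind that many sites return a 200 status even when the credentials are rejected, so raise_for_status() alone doesn’t prove the login worked. A simple sanity check is to look for something that only appears when you’re logged in (the marker below is just an example; pick one that exists on your site):
# "Logout" is a placeholder marker; use text that only appears after a successful login
if "Logout" not in post_response.text:
    raise RuntimeError("Login appears to have failed")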
Now you can access the protected page and scrape it:
protected_page = session.get("protected-URL")
soup = BeautifulSoup(protected_page.text, 'html.parser')
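From here, scraping works as usual. For example, to print the page title and every link (purely illustrative; adjust the selectors to your target page):
print(soup.title.string if soup.title else "No title")
for link in soup.select("a"):
    print(link.get("href"), link.get_text(strip=True))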
If the site uses HTTP Basic Auth instead of a login form, you can pass the credentials directly in the request:
url = "https://postman-echo.com/basic-auth"
username = "postman"
password = "password"
response = requests.get(url, auth=(username, password))
Basic auth sends the credentials with every request and isn’t very secure, so you probably won’t see it often on modern sites.
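For the record, passing a tuple to auth= is shorthand for the HTTPBasicAuth class in requests; the explicit form does exactly the same thing:
from requests.auth import HTTPBasicAuth

response = requests.get(url, auth=HTTPBasicAuth(username, password))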
Step 3: Extract Tokens (CSRF or Auth Tokens)
A more secure type of authentication often requires not just a login and password, but also a unique token (like a CSRF token) generated by the server.
If the token is in the HTML (e.g. inside a form), you can extract it like this:
response = session.get("login-URL")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
token_input = soup.select_one("input[name='_token']")
csrf_token = token_input['value'] if token_input else None
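Some sites put the token in a <meta> tag in the page head instead of a form field (Rails does this, for example). In that case, the lookup looks like this:
meta = soup.select_one("meta[name='csrf-token']")
csrf_token = meta['content'] if meta else None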
Then, include the token in the form data and send a POST request:
payload = {
    'email': 'admin@example.com',
    'password': 'password',
    '_token': csrf_token
}
post_response = session.post("login-URL", data=payload)
post_response.raise_for_status()
If the token is stored in cookies instead, get it like this:
csrf_token = response.cookies.get('_csrf')
And send it in a header or reuse the cookies:
response = session.post(
    "login-URL",
    data=payload,
    headers={"X-CSRF-Token": csrf_token},
    cookies=response.cookies
)
Which method works depends on how the site handles authentication.
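Putting Step 3 together, here’s a minimal end-to-end sketch (the URLs, field names, and token location are assumptions you’ll need to adapt to your site):
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# 1. Load the login page and pull the token out of the hidden form field
response = session.get("login-URL")
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
token_input = soup.select_one("input[name='_token']")
csrf_token = token_input['value'] if token_input else None

# 2. Log in with the credentials plus the token
payload = {'email': 'admin@example.com', 'password': 'password', '_token': csrf_token}
post_response = session.post("login-URL", data=payload)
post_response.raise_for_status()

# 3. The session now carries the auth cookies, so protected pages are reachable
protected_page = session.get("protected-URL")
soup = BeautifulSoup(protected_page.text, 'html.parser')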
Step 4: Handle JavaScript-Based Forms
If other authentication methods don’t work or aren’t available, you can use Selenium. It lets you simulate real user actions: opening pages, entering login credentials, clicking buttons, and so on.
First, import the required modules:
from selenium import webdriver
from selenium.webdriver.common.by import By
Then create a web driver:
driver = webdriver.Chrome()
Go to the login page:
driver.get("login-URL")
Fill in the login and password:
driver.find_element(By.NAME, "email").send_keys("admin@example.com")
driver.find_element(By.NAME, "password").send_keys("password")
Click the submit button:
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
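On JavaScript-heavy sites, the redirect after login can take a moment, so it’s worth waiting for something that only exists on the post-login page before doing anything else. A sketch, assuming the landing page has an element with the id "dashboard":
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a post-login element (the id here is an assumption)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dashboard"))
)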
And finally, close the driver:
driver.quit()
If there’s no extra protection on the site, this approach should work.
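One handy trick: once Selenium has logged in, you can copy its cookies into a requests.Session (before calling driver.quit()) and do the rest of the scraping without the browser, which is much faster. A sketch:
import requests

session = requests.Session()
# Transfer the browser's session cookies to requests before quitting the driver
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

protected_page = session.get("protected-URL")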
Step 5: Access Protected Pages
The last and most difficult case is when the site adds extra protection like JavaScript challenges, User-Agent checks, or CAPTCHAs. In these cases, regular Selenium might not be enough; it’s easy to detect as a bot.
To get around this, you can use SeleniumBase, which supports undetectable mode via undetected-chromedriver. This helps mimic real user behavior.
First, install SeleniumBase:
pip install seleniumbase
Then import it:
from seleniumbase import SB
Start the browser in "stealth" mode:
with SB(uc=True) as sb:
Open the login page with auto-reconnect:
    sb.uc_open_with_reconnect("login-URL", reconnect_time=6)
Fill in the form:
    sb.type('input[name="email"]', "admin@example.com")
    sb.type('input[name="password"]', "password")
Click the submit button:
    sb.click('button[type="submit"]')
Once logged in, you can access protected pages and scrape data as needed.
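Here’s the whole Step 5 flow assembled, with the scraping added at the end (the selectors and protected URL are placeholders to adapt):
from seleniumbase import SB

with SB(uc=True) as sb:
    # Open the login page in undetected mode, reconnecting to get past initial checks
    sb.uc_open_with_reconnect("login-URL", reconnect_time=6)
    sb.type('input[name="email"]', "admin@example.com")
    sb.type('input[name="password"]', "password")
    sb.click('button[type="submit"]')

    # Once logged in, open a protected page and grab its HTML
    sb.open("protected-URL")
    html = sb.get_page_source()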
You can also find more detailed examples in our other article about authentication in Python.