Have you ever needed up-to-date job listings but struggled to find one clean source? In this project, I built a full ETL (Extract, Transform, Load) pipeline using Python to scrape internship job data from websites, clean and process it, and load it into an Excel file for analysis or tracking.
This project is perfect for anyone interested in web scraping, data pipelines, or automating jobs and tasks with Python and cron!
The project's source code is available on GitHub.
What the Project Entails
The pipeline performs three main tasks:
- Extract
  - Scrapes internship listings from MyJobMag Kenya
  - Applies filters like:
    - q=Internship
    - location=Nairobi
    - currentpage={page_number}
- Transform
  - Cleans the data using BeautifulSoup for HTML parsing and pandas for data manipulation
  - Tasks performed:
    - Removes malformed descriptions
    - Drops duplicate entries
    - Filters out rows with missing or invalid data
- Load
  - Saves the cleaned data into an Excel file: internships.xlsx
  - Useful for:
    - Job search organization
    - Data analysis
    - Dashboards and reporting
Step by Step Explanation.
a. Extract.
I used requests to make GET requests to the website's server and beautifulsoup4 to scrape content from the website.
I also used dotenv to hide environmental variables like the website's URL.
To install these packages, run the following command in your terminal:
pip install beautifulsoup4 requests python-dotenv
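dotenv reads key=value pairs from a .env file that sits next to the script and stays out of version control. For this pipeline, the file would hold something like the following (the search path is a placeholder; the q, location, and currentpage filters are the ones listed above):
# .env -- keep this file out of version control
URL=https://myjobmag.co.ke/<search-path>?q=Internship&location=Nairobi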
I built a script that extracts each job's title, description, posted date, and the link to follow when applying to the internship.
Import the required libraries and initialize empty lists to hold each column of scraped data.
import os
import json
import html
import time
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# Read the hidden base URL from the .env file (the variable name URL is assumed here)
load_dotenv()
URL = os.getenv('URL')

internships = []
titles = []
descriptions = []
opened_dates = []
links = []
The extract() function loops through the result pages, requests each page from the server, and passes the response into BeautifulSoup. It then loops through the listings returned by soup's find_all(), which finds HTML elements by tag and class.
def extract():
    for i in range(1, 6):
        url = f'{URL}&currentpage={i}'
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            listings = soup.find_all('li', class_='job-list-li')
            for listing in listings:
                # Job title
                title = listing.find('h2')
                title_text = clean_text(title.get_text()) if title else 'N/A'
                titles.append(title_text)
                # Application link (relative href turned into a full URL)
                a_tag = listing.find('a')
                link = a_tag['href'] if a_tag and a_tag.has_attr('href') else 'N/A'
                job_links = f'https://myjobmag.co.ke{link}' if a_tag and a_tag.has_attr('href') else 'N/A'
                links.append(job_links)
                # Short description
                description = listing.find('li', class_='job-desc')
                description_text = clean_text(description.get_text()) if description else 'N/A'
                descriptions.append(description_text)
                # Date the job was posted
                opened_at = listing.find('li', id='job-date')
                opened_text = opened_at.get_text(strip=True) if opened_at else 'N/A'
                opened_dates.append(opened_text)
        time.sleep(response.elapsed.total_seconds())
    # Gather the scraped columns into a single record so load_data() has something to write
    internships.append({
        'title': titles,
        'description': descriptions,
        'opened_on': opened_dates,
        'link': links,
    })
HINT: time.sleep(response.elapsed.total_seconds()) pauses for roughly as long as the previous request took before the next one is sent. This is a respectful way to avoid hammering the server with too many requests at once.
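If the server responds quickly, that pause can shrink to almost nothing. One small variation (my tweak, not part of the original script) is to enforce a minimum one-second gap between requests:
# Wait at least one second, longer if the previous response was slow
time.sleep(max(1.0, response.elapsed.total_seconds()))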
To clean this data collected from scraping, I loaded the data into a JSON file using the json module.
def load_data(data):
    with open('data.json', 'w') as f:
        json.dump(data, f, indent=4)
    print('Data loaded into JSON successfully')
To run this script:
if __name__ == '__main__':
    extract()
    load_data(internships)
HINT: The if __name__ == '__main__' guard lets other scripts import this one (for reuse or testing) without the extraction running automatically on import.
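For example, assuming the scraper is saved as extract.py (the filename is my assumption), a separate test file can import and exercise clean_text without triggering a scrape:
# test_extract.py -- a minimal sketch; assumes the scraper module is named extract.py
from extract import clean_text

def test_clean_text_collapses_whitespace():
    assert clean_text('  Data\nEngineering\tIntern ') == 'Data Engineering Intern'
    assert clean_text(None) == 'N/A'
Running this with pytest imports extract.py, but the guard keeps extract() and load_data() from executing.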
b. Transform/Clean.
The content scraped from the website contained unwanted characters, such as Unicode escapes like \u2023.
I created a clean_text function that strips these characters out; it is called inside extract() on the title and description.
def clean_text(text):
    if not text:
        return 'N/A'
    text = html.unescape(text)
    text = (text.replace('\r', '').replace('\n', ' ').replace('\t', ' ')
                .replace('\u00a0', ' ').replace('\u2019', ' ')
                .replace('\u2023', ' ').replace('\u2013', ' '))
    text = ' '.join(text.split())
    return text.strip()
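The chain of replace() calls works, but the same cleanup can also be written with a translation table, which is easier to extend if more stray characters turn up later. This is an alternative sketch, not the version used in the project:
import html

# Map each unwanted character to a space ('\r' is removed outright)
_CHAR_MAP = str.maketrans({
    '\r': None, '\n': ' ', '\t': ' ',
    '\u00a0': ' ', '\u2019': ' ', '\u2023': ' ', '\u2013': ' ',
})

def clean_text_alt(text):
    if not text:
        return 'N/A'
    text = html.unescape(text).translate(_CHAR_MAP)
    return ' '.join(text.split())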
I used pandas and numpy to build a script that cleans this data and loads it into an Excel sheet.
Loading the data from JSON format into a Pandas dataframe:
import pandas as pd
import numpy as np
df = pd.read_json('data.json')
When printing df, each column's cells hold Python lists, which is not what we want. We need to explode the columns into rows and reset the index.
df = df.explode(['title', 'description', 'opened_on', 'link']).reset_index(drop=True)
After exploding, the first row sits at index 0. To make the index start at 1:
df.index = df.index + 1
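To see what explode() does, here is a toy example with made-up data (just for illustration):
import pandas as pd

toy = pd.DataFrame({
    'title': [['Intern A', 'Intern B']],
    'link':  [['/job/1', '/job/2']],
})
toy = toy.explode(['title', 'link']).reset_index(drop=True)
toy.index = toy.index + 1
print(toy)
#       title    link
# 1  Intern A  /job/1
# 2  Intern B  /job/2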
Rows containing 'N/A' hold no useful information, so we drop them. First we replace the 'N/A' strings with np.nan, which pandas can then drop:
df = df.replace('N/A', np.nan)
df.dropna(inplace=True)
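The project overview also mentions dropping duplicate entries. If the same posting shows up on more than one page, something like this (my addition, not shown in the original script) removes the repeats:
# Treat rows with the same title and link as the same posting and keep the first one
df = df.drop_duplicates(subset=['title', 'link'], keep='first')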
Now, the data is clean and ready to be loaded into an Excel sheet.
c. Load.
After transforming the data, I loaded the transformed data into an Excel sheet for better visualization and interaction with the end user.
Before loading the data into an Excel sheet, install openpyxl, the package pandas uses to write dataframes to Excel files. To install:
pip install openpyxl
After installation, add the following line to write the data to an Excel sheet:
df.to_excel('internships.xlsx')
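By default, to_excel() also writes the dataframe index as the first column. If you only want the job data itself, an optional tweak (not in the original) is:
# Write only the data columns, on a named sheet
df.to_excel('internships.xlsx', sheet_name='internships', index=False)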
End result.
As a student and job-seeker, I wanted to automate the process of checking for new internship listings. Building this ETL pipeline allowed me to:
- Learn scripting
- Build ETL pipelines with Python
- Automate manual tracking tasks
Next, I'll schedule the scraper with cron to run on weekdays at 9:30 AM and email me when fresh data is available in the Excel file.
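A crontab entry for that schedule could look like this (the interpreter and script paths are placeholders):
# Run the pipeline at 9:30 AM, Monday to Friday
30 9 * * 1-5 /usr/bin/python3 /path/to/pipeline.py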