Peer Review 3: France Data Engineering Job Market Analysis Pipeline Infra (Part 1)

Introduction

Welcome to the third post in my peer review series for the DataTalksClub Data Engineering Zoomcamp. In this post, I’ll be dissecting a real-world data engineering project that analyzes the French Data Engineering job market. The goal? To break down the project’s infrastructure, orchestration, and cloud design—spotlighting what works well, what could be improved, and, most importantly, what we can all learn as practicing data engineers.

Why do this? Because reviewing and sharing feedback on real-world projects sharpens our own skills, encourages open knowledge sharing, and helps us all grow together. Let’s dig in.

Project Overview

Project: DE-Job-Market-Analysis (GitHub)

Objective: Build an end-to-end, cloud-native pipeline to collect, store, transform, and visualize Data Engineering job postings for the French market.

Key questions addressed:

  • What is the demand for Data Engineering roles in France?
  • Which skills and tools are most sought after?
  • Which companies are hiring, and what are their workforce sizes?
  • What are the salary trends and geographic patterns?
  • How do job posting trends evolve over time?

Evaluation Criteria: How I Review

For this peer review series, I use a structured rubric inspired by the DataTalksClub Data Engineering Zoomcamp project guidelines. The main areas of focus are:

  1. Problem Description
  2. Cloud Infrastructure (and Infrastructure as Code)
  3. Data Ingestion & Orchestration
  4. Transformations, Dashboarding, Reproducibility, and Actionable Feedback (covered in Part 2)

1. Problem Description

Right from the start, the project’s README does an excellent job of motivating the work. It clearly explains why understanding the French Data Engineering job market matters, lays out the business context, and lists the specific insights the pipeline aims to deliver.

“The project aims to provide valuable insights into the demand for Data Engineering roles, most sought-after skills, key hiring companies, salary trends, locations, and job posting trends over time.”

Comment:

Excellent articulation! The clarity of context and objectives makes it easy for any reader (technical or not) to quickly understand the project’s purpose and value.

2. Cloud Infrastructure & IaC

This project is cloud-native, leveraging Google Cloud Platform (GCP) as the backbone.

Key services used:

  • BigQuery: The analytical data warehouse.
  • Google Cloud Storage (GCS): For raw and processed data storage.
  • Terraform: Infrastructure as Code for reproducible, automated cloud resource provisioning.

What Stands Out

  • The use of Terraform (with main.tf and variables.tf) to provision GCS buckets, BigQuery datasets, and service accounts is a mark of maturity—no click-ops here!
  • The README provides step-by-step instructions for configuring GCP variables, applying Terraform, and setting up service accounts.
  • The infrastructure is cleanly separated into its own directory, making the project modular and easy to maintain.

Comment:

Strong implementation of cloud and IaC best practices. The use of Terraform for GCP infra shows a solid grasp of production-grade deployments.
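Since the repo’s Terraform files aren’t reproduced here, one quick way to picture what the IaC layer hands to the rest of the pipeline is a post-`terraform apply` sanity check. The sketch below is mine, not the project’s: the key path, bucket name, and dataset ID are hypothetical placeholders.

```python
# Minimal sketch (not from the repo): a quick post-"terraform apply" check
# that the provisioned GCS bucket and BigQuery dataset are reachable.
import os
from google.cloud import bigquery, storage

# The project provisions a service account via Terraform; pointing ADC at its
# JSON key is one common way to consume it locally (path is a placeholder).
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "keys/service-account.json")

storage_client = storage.Client()
bq_client = bigquery.Client()

bucket = storage_client.bucket("de-job-market-raw")      # hypothetical bucket name
print("Bucket exists:", bucket.exists())

dataset = bq_client.get_dataset("job_market_analysis")   # hypothetical dataset ID
print("Dataset location:", dataset.location)
```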

3. Workflow Orchestration: Batch Pipelines with Kestra

The pipeline’s automation is orchestrated using Kestra, a modern workflow orchestration tool (think: Apache Airflow alternative, but YAML-first and developer-friendly).

How It’s Used

  • Kestra flows automate job posting scraping (using JobSpy), data uploads to GCS, and the triggering of dbt transformations.
  • The orchestration logic is defined in YAML files, located in a dedicated kestra/ directory.
  • The workflow covers end-to-end batch scheduling: daily scraping, loading, and transformation, ensuring up-to-date analytics.

Batch vs. Streaming

This project focuses exclusively on batch processing—scraping and updating the dataset on a periodic schedule. There’s no streaming ingestion (like Kafka), which is appropriate for the type of data source used here (static job listings).

Comment:

Great use of Kestra for orchestrating a robust, modular DAG. For future iterations, consider adding diagrams or screenshots of the Kestra flows to make the orchestration even clearer for newcomers.
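I haven’t copied the actual Kestra YAML here, but the daily sequence the flows orchestrate boils down to roughly four steps. The plain Python sketch below is my own illustration of that sequence; the site list, bucket name, and paths are assumptions, not the author’s values.

```python
# Illustrative sketch of the batch sequence the Kestra flow orchestrates:
# scrape -> save CSV -> upload to GCS -> run dbt. Names and paths are hypothetical.
import subprocess
from datetime import date

import pandas as pd
from jobspy import scrape_jobs          # python-jobspy, the scraper the project uses
from google.cloud import storage

# 1. Scrape French Data Engineering postings (parameters are illustrative).
jobs: pd.DataFrame = scrape_jobs(
    site_name=["indeed", "linkedin"],
    search_term="data engineer",
    location="France",
    results_wanted=100,
)

# 2. Persist the batch as a dated CSV.
csv_path = f"jobs_{date.today():%Y-%m-%d}.csv"
jobs.to_csv(csv_path, index=False)

# 3. Upload the CSV to the raw-data bucket (hypothetical bucket name).
bucket = storage.Client().bucket("de-job-market-raw")
bucket.blob(f"raw/{csv_path}").upload_from_filename(csv_path)

# 4. Trigger the dbt transformations (the project drives these from Kestra).
subprocess.run(["dbt", "build", "--project-dir", "dbt/job_market_analysis"], check=True)
```

In the real project, this sequencing, the daily schedule, and retry behavior live in the Kestra flow definitions rather than in a script—which is exactly the point of using an orchestrator.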

4. Data Ingestion: Batch (and What About Streaming?)

The ingestion process is classic batch ETL:

  • Scraping: Job postings are scraped with JobSpy and saved as CSV files.
  • Loading: CSVs are uploaded to GCS and registered as external tables in BigQuery.
  • Automation: All steps are orchestrated via Kestra.

Why not streaming?

The data source (job boards) doesn’t support real-time feeds, so batch scraping is pragmatic. If live job posting APIs were ever available, a streaming pipeline could be an exciting next step.

Comment:

The batch pipeline is well-automated and fit-for-purpose. The README makes it easy to understand and reproduce the process.
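To make the Loading step concrete: registering GCS-hosted CSVs as a BigQuery external table can look roughly like the sketch below with the Python client. The project drives this from its Kestra flow; the project ID, dataset, table, and bucket names here are placeholders I’ve made up.

```python
# Minimal sketch: expose CSVs in GCS as a BigQuery external table so dbt can
# query them without a separate load job. Names and URIs are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://de-job-market-raw/raw/jobs_*.csv"]
external_config.autodetect = True                 # infer the schema from the CSVs
external_config.options.skip_leading_rows = 1     # skip the header row

table = bigquery.Table("my-gcp-project.job_market_analysis.raw_jobs_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)        # idempotent across daily re-runs
```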

5. Contents of Interest (Project Structure Highlights)

  • dbt/ Directory:
    • Contains a full dbt project (job_market_analysis) with:
      • dbt_project.yml (project config and structure).
      • models/ subdirectory with staging, core, and marts models.
      • schema.yml for dbt model/table testing and documentation.
      • macros/, tests/, seeds/, and snapshots/ folders are present but currently empty.
  • kestra/ Directory:
    • YAML flow definitions for workflow orchestration.
  • terraform/ Directory:
    • main.tf and variables.tf for GCP infrastructure provisioning.
  • docker-compose.yml:
    • Used for local orchestration of services.
  • images/ Folder:
    • Includes dashboard screenshots for reference.
  • README.md:
    • Comprehensive, clear, and actionable documentation.

Conclusion & What’s Next

This project demonstrates a strong grasp of modern data engineering infrastructure design: cloud-native, reproducible, and automated. In Part 2, I’ll dive into the transformation layer (dbt), data warehouse design, dashboarding, reproducibility, and provide actionable feedback for the project author—and for all of us as data engineers.