InsightFlow Part 2: Setting Up the Cloud Infrastructure with Terraform

In this post, I’ll walk you through how I set up the cloud infrastructure for my project, InsightFlow, using Terraform. InsightFlow is a data engineering project that integrates retail and fuel price data from Malaysia, processes it with dbt, and enables analysis via AWS Athena. The infrastructure is fully managed using Terraform, ensuring reproducibility and scalability.

Why Terraform?

Terraform is an open-source Infrastructure as Code (IaC) tool that allows you to define and provision cloud resources in a declarative way. For InsightFlow, Terraform was the perfect choice because:

  1. Reproducibility: The same infrastructure can be deployed across development and production environments.
  2. Version Control: Infrastructure changes are tracked in Git, ensuring a clear history of modifications.
  3. Coverage: Terraform's AWS provider supports S3, Glue, Athena, and Batch, all of which are core to InsightFlow.

Project Overview

InsightFlow’s infrastructure is divided into two main layers:

  1. Storage Layer: Manages S3 buckets for raw and processed data.
  2. Compute Layer: Manages AWS Batch for ingestion, Glue for ETL, and Kestra for workflow orchestration.

Each layer is defined in separate Terraform modules for better organization and reusability.

Step 1: Setting Up the Storage Layer

The storage layer consists of two S3 buckets:

  • Raw Data Bucket: Stores unprocessed data ingested from external sources.
  • Processed Data Bucket: Stores cleaned and transformed data ready for analysis.

Terraform Configuration for Storage

Here’s how the storage layer is defined in Terraform:

# Raw Data Bucket
resource "aws_s3_bucket" "raw_data" {
  bucket = "${var.project_name}-prod-raw-data"
  tags   = local.common_tags
}

# Processed Data Bucket
resource "aws_s3_bucket" "processed_data" {
  bucket = "${var.project_name}-prod-processed-data"
  tags   = local.common_tags
}

# Enable versioning for both buckets
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
  bucket = aws_s3_bucket.raw_data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_versioning" "processed_data_versioning" {
  bucket = aws_s3_bucket.processed_data.id
  versioning_configuration {
    status = "Enabled"
  }
}

Key Features

  • Versioning: Keeps a history of every object, so accidental overwrites or deletions can be rolled back.
  • Access Control: Public access is blocked for both buckets to keep the data private (see the sketch below).
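
The access-control point is normally enforced with a separate aws_s3_bucket_public_access_block resource per bucket. The project's actual block isn't shown in this post, so the snippet below is only a minimal sketch for the raw bucket; the processed bucket would get an identical one.

resource "aws_s3_bucket_public_access_block" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  # Reject public ACLs and public bucket policies entirely
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}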

Step 2: Setting Up the Compute Layer

The compute layer handles data ingestion, transformation, and orchestration. It includes:

  • AWS Batch: Runs ingestion jobs to fetch and upload data to the raw data bucket.
  • AWS Glue: Scans the raw data bucket and creates schema definitions in the Glue Data Catalog (a Terraform sketch of the crawler appears at the end of this step).
  • Kestra: Orchestrates the entire workflow, including ingestion, transformation, and testing.

Terraform Configuration for AWS Batch

Here’s how the AWS Batch job definition is configured:

resource "aws_batch_job_definition" "ingestion_job_def" {
  name = "${var.project_name}-prod-ingestion-job-def"
  type = "container"

  container_properties = jsonencode({
    image = "864899839546.dkr.ecr.ap-southeast-2.amazonaws.com/insightflow-ingestion:latest"
    command = ["python", "main.py"]
    environment = [
      {
        name  = "TARGET_BUCKET"
        value = "insightflow-prod-raw-data"
      },
      {
        name  = "AWS_REGION"
        value = var.aws_region
      }
    ]
    resourceRequirements = [
      { type = "VCPU",   value = "1" },
      { type = "MEMORY", value = "2048" }
    ]
  })
}
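
A job definition alone doesn't run anything: the Kestra workflow later in this post submits jobs to a queue named insightflow-prod-job-queue, which in turn needs a compute environment. Those resources aren't shown in the original snippets, so the following is only a hedged sketch of an EC2-backed managed environment; the IAM role, instance profile, subnet, and security-group references are assumptions, and attribute names vary slightly across AWS provider versions.

resource "aws_batch_compute_environment" "ingestion" {
  compute_environment_name = "${var.project_name}-prod-compute-env"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service_role.arn        # assumed IAM role

  compute_resources {
    type               = "EC2"
    instance_type      = ["optimal"]
    min_vcpus          = 0                                              # scale to zero when idle
    max_vcpus          = 4                                              # illustrative ceiling
    instance_role      = aws_iam_instance_profile.batch_instance.arn    # assumed instance profile
    subnets            = var.private_subnet_ids                         # assumed list variable
    security_group_ids = [var.batch_security_group_id]                  # assumed variable
  }
}

resource "aws_batch_job_queue" "main" {
  name                 = "${var.project_name}-prod-job-queue"           # matches the queue the Kestra flow submits to
  state                = "ENABLED"
  priority             = 1
  compute_environments = [aws_batch_compute_environment.ingestion.arn]
}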

Key Features

  • Containerized Jobs: The ingestion script runs in a Docker container, ensuring consistency across environments.
  • Dynamic Resource Allocation: the managed compute environment provisions capacity per job and scales back down when the queue is empty.
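
The Glue side of the compute layer isn't shown above. As a rough sketch only (the catalog database name and crawler IAM role below are assumptions, not the project's actual values), a crawler pointed at the raw data bucket could be defined like this:

resource "aws_glue_catalog_database" "raw" {
  name = "${var.project_name}_prod_raw_db"
}

resource "aws_glue_crawler" "raw_data" {
  name          = "${var.project_name}-prod-raw-crawler"
  database_name = aws_glue_catalog_database.raw.name
  role          = aws_iam_role.glue_crawler_role.arn   # assumed IAM role with S3 read access

  # Crawl the raw bucket and infer table schemas into the Data Catalog
  s3_target {
    path = "s3://${aws_s3_bucket.raw_data.bucket}/"
  }

  tags = local.common_tags
}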

Step 3: Orchestrating Workflows with Kestra

Kestra is used to orchestrate the entire pipeline. It submits AWS Batch jobs, triggers Glue crawlers, and runs dbt commands for data transformation.

Example Kestra Workflow

Here’s a snippet of the Kestra workflow for submitting an AWS Batch job:

- id: submit_batch_ingestion_job_cli
  type: io.kestra.core.tasks.scripts.Bash
  commands:
    - |
      JOB_DEF_NAME="insightflow-prod-ingestion-job-def"
      JOB_QUEUE_NAME="insightflow-prod-job-queue"
      TARGET_BUCKET_NAME="insightflow-prod-raw-data"
      AWS_REGION="ap-southeast-2"

      aws batch submit-job \
        --region "$AWS_REGION" \
        --job-name "ingestion-job-{{execution.id}}" \
        --job-queue "$JOB_QUEUE_NAME" \
        --job-definition "$JOB_DEF_NAME" \
        --container-overrides '{
          "environment": [
            {"name": "TARGET_BUCKET", "value": "'"$TARGET_BUCKET_NAME"'"}
          ]
        }'

Step 4: Managing Infrastructure State

Terraform uses an S3 bucket as the backend to store the state file. This ensures that the state is shared and consistent across team members.

Backend Configuration

terraform {
  backend "s3" {
    bucket         = "insightflow-terraform-state-bucket"
    key            = "env:/prod/prod/compute.tfstate"
    region         = "ap-southeast-2"
    dynamodb_table = "terraform-state-lock-dynamo"
    encrypt        = true
  }
}
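
The dynamodb_table entry in the backend is what gives Terraform state locking, so two people can't run apply against the same state at once. The table has to exist before terraform init, so it is usually created by hand or in a small bootstrap configuration. A minimal sketch of that table (LockID is the partition key Terraform expects):

resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock-dynamo"   # must match the backend's dynamodb_table
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}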

Step 5: Validating and Deploying

Before deploying the infrastructure, it’s important to validate the configuration and generate a plan.

Commands

  1. Validate Configuration:

     terraform validate

  2. Generate a Plan:

     terraform plan -var-file=prod.tfvars

  3. Apply the Changes:

     terraform apply -var-file=prod.tfvars
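
The -var-file flag points at a tfvars file holding the per-environment values. Only project_name and aws_region appear in the snippets above, so the following prod.tfvars is illustrative, with values inferred from the bucket names and region used earlier; the real file may contain more variables.

# prod.tfvars (illustrative values inferred from the snippets above)
project_name = "insightflow"
aws_region   = "ap-southeast-2"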

Conclusion

By using Terraform, I was able to set up a robust and scalable cloud infrastructure for InsightFlow. The modular approach ensures that the infrastructure is easy to manage and extend as the project grows. Whether you’re building a data pipeline or deploying a web application, Terraform is a powerful tool to have in your arsenal.