InsightFlow Part 8: Setting Up AWS Athena for Data Analysis in InsightFlow

InsightFlow GitHub Repo In this post, we’ll explore how Amazon Athena was set up for querying and analyzing data in the InsightFlow project. Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It’s an essential component of the InsightFlow pipeline, enabling efficient querying of both raw and transformed data. Why Amazon Athena? Amazon Athena is an ideal choice for InsightFlow due to its: Serverless Architecture: No infrastructure to manage; you only pay for the queries you run. Seamless Integration with S3: Queries data directly from S3 without requiring ETL processes to move data elsewhere. Support for Open Formats: Works with Parquet, ORC, JSON, and other formats, ensuring compatibility with the data pipeline. Partitioning and Optimization: Supports partitioning and compression to reduce query costs and improve performance. For InsightFlow, Athena is used to: Query raw and processed data stored in S3. Analyze trends in retail sales and fuel prices. Serve as the backend for dashboards in AWS QuickSight. Step 1: Preparing the Data in S3 The data pipeline stores both raw and transformed data in S3 buckets: Raw Data: Stored in the insightflow-prod-raw-data bucket under the raw/ prefix. Processed Data: Stored in the insightflow-prod-processed-data bucket under the processed/ prefix. Partitioning the Data To optimize query performance, the processed data is partitioned by year and month. For example: s3://insightflow-prod-processed-data/fct_retail_sales_monthly/year=2025/month=04/ Partitioning allows Athena to scan only the relevant data, reducing query costs and improving performance. Step 2: Setting Up the Glue Data Catalog Athena relies on the AWS Glue Data Catalog to store metadata about the datasets. Glue Crawlers were used to automatically discover schemas and populate the Data Catalog. Glue Crawler Configuration The Glue Crawler scans the processed S3 bucket and creates tables in the Glue database: resource "aws_glue_crawler" "processed_data_crawler" { name = "insightflow-prod-processed-data-crawler" role = aws_iam_role.glue_crawler_role.arn database_name = aws_glue_catalog_database.dbt_database.name s3_target { path = "s3://insightflow-prod-processed-data/processed/" } schema_change_policy { update_behavior = "UPDATE_IN_DATABASE" delete_behavior = "LOG" } tags = { Environment = "prod" Project = "InsightFlow" } } Once the crawler is run, the processed data is available as tables in the Glue Data Catalog. Step 3: Configuring Athena Creating a Workgroup Athena Workgroups help manage query costs and monitor usage. A workgroup was created for the project: resource "aws_athena_workgroup" "insightflow_workgroup" { name = "insightflow-prod-workgroup" configuration { enforce_workgroup_configuration = true publish_cloudwatch_metrics_enabled = true result_configuration { output_location = "s3://insightflow-prod-processed-data/athena-results/" } } tags = { Environment = "prod" Project = "InsightFlow" } } Setting Query Results Location Athena query results are stored in the athena-results/ prefix of the processed S3 bucket: s3://insightflow-prod-processed-data/athena-results/ This ensures that query results are accessible for debugging and downstream processing. Step 4: Querying Data with Athena Once the Glue Crawler has populated the Data Catalog, the data can be queried using SQL in the Athena console or programmatically via the AWS CLI or SDK. Example Query: Analyzing Retail Sales and Fuel Prices The following query analyzes the correlation between retail sales and fuel prices: SELECT r.year, r.month, r.sales_value_rm_mil, f.avg_ron95_price, f.avg_ron97_price, f.avg_diesel_price FROM fct_retail_sales_monthly r JOIN fuelprice_monthly f ON r.year = f.year AND r.month = f.month WHERE r.year = 2025 ORDER BY r.month; This query joins the fct_retail_sales_monthly fact table with the fuelprice_monthly table to analyze trends. Step 5: Optimizing Athena Queries 1. Use Partitioning Partitioning the data by year and month ensures that Athena scans only the relevant partitions, reducing query costs. 2. Use Parquet Format The data is stored in Parquet format, which is optimized for analytical queries due to its columnar storage and compression. 3. Limit Data Scanned Use SELECT statements to query only the required columns and apply filters (e.g., WHERE year = 2025) to minimize the amount of data scanned. 4. Monitor Query Costs Athena Workgroups provide metrics in CloudWatch to monitor query costs and performance. Step 6: Integrating A

Apr 29, 2025 - 04:45

InsightFlow Part 8: Setting Up AWS Athena for Data Analysis in InsightFlow

In this post, we’ll explore how Amazon Athena was set up for querying and analyzing data in the InsightFlow project. Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It’s an essential component of the InsightFlow pipeline, enabling efficient querying of both raw and transformed data.

Why Amazon Athena?

Amazon Athena is an ideal choice for InsightFlow due to its:

Serverless Architecture: No infrastructure to manage; you only pay for the queries you run.
Seamless Integration with S3: Queries data directly from S3 without requiring ETL processes to move data elsewhere.
Support for Open Formats: Works with Parquet, ORC, JSON, and other formats, ensuring compatibility with the data pipeline.
Partitioning and Optimization: Supports partitioning and compression to reduce query costs and improve performance.

For InsightFlow, Athena is used to:

Query raw and processed data stored in S3.
Analyze trends in retail sales and fuel prices.
Serve as the backend for dashboards in AWS QuickSight.

Step 1: Preparing the Data in S3

The data pipeline stores both raw and transformed data in S3 buckets:

Raw Data: Stored in the insightflow-prod-raw-data bucket under the raw/ prefix.
Processed Data: Stored in the insightflow-prod-processed-data bucket under the processed/ prefix.

Partitioning the Data

To optimize query performance, the processed data is partitioned by year and month. For example:

s3://insightflow-prod-processed-data/fct_retail_sales_monthly/year=2025/month=04/

Partitioning allows Athena to scan only the relevant data, reducing query costs and improving performance.

Step 2: Setting Up the Glue Data Catalog

Athena relies on the AWS Glue Data Catalog to store metadata about the datasets. Glue Crawlers were used to automatically discover schemas and populate the Data Catalog.

Glue Crawler Configuration

The Glue Crawler scans the processed S3 bucket and creates tables in the Glue database:

resource "aws_glue_crawler" "processed_data_crawler" {
  name          = "insightflow-prod-processed-data-crawler"
  role          = aws_iam_role.glue_crawler_role.arn
  database_name = aws_glue_catalog_database.dbt_database.name

  s3_target {
    path = "s3://insightflow-prod-processed-data/processed/"
  }

  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"
    delete_behavior = "LOG"
  }

  tags = {
    Environment = "prod"
    Project     = "InsightFlow"
  }
}

Once the crawler is run, the processed data is available as tables in the Glue Data Catalog.

Step 3: Configuring Athena

Creating a Workgroup

Athena Workgroups help manage query costs and monitor usage. A workgroup was created for the project:

resource "aws_athena_workgroup" "insightflow_workgroup" {
  name = "insightflow-prod-workgroup"

  configuration {
    enforce_workgroup_configuration = true
    publish_cloudwatch_metrics_enabled = true
    result_configuration {
      output_location = "s3://insightflow-prod-processed-data/athena-results/"
    }
  }

  tags = {
    Environment = "prod"
    Project     = "InsightFlow"
  }
}

Setting Query Results Location

Athena query results are stored in the athena-results/ prefix of the processed S3 bucket:

s3://insightflow-prod-processed-data/athena-results/

This ensures that query results are accessible for debugging and downstream processing.

Step 4: Querying Data with Athena

Once the Glue Crawler has populated the Data Catalog, the data can be queried using SQL in the Athena console or programmatically via the AWS CLI or SDK.

Example Query: Analyzing Retail Sales and Fuel Prices

The following query analyzes the correlation between retail sales and fuel prices:

SELECT
    r.year,
    r.month,
    r.sales_value_rm_mil,
    f.avg_ron95_price,
    f.avg_ron97_price,
    f.avg_diesel_price
FROM
    fct_retail_sales_monthly r
JOIN
    fuelprice_monthly f
ON
    r.year = f.year AND r.month = f.month
WHERE
    r.year = 2025
ORDER BY
    r.month;

This query joins the fct_retail_sales_monthly fact table with the fuelprice_monthly table to analyze trends.

Step 5: Optimizing Athena Queries

1. Use Partitioning

Partitioning the data by year and month ensures that Athena scans only the relevant partitions, reducing query costs.

2. Use Parquet Format

The data is stored in Parquet format, which is optimized for analytical queries due to its columnar storage and compression.

3. Limit Data Scanned

Use SELECT statements to query only the required columns and apply filters (e.g., WHERE year = 2025) to minimize the amount of data scanned.

4. Monitor Query Costs

Athena Workgroups provide metrics in CloudWatch to monitor query costs and performance.

Step 6: Integrating Athena with QuickSight

Athena serves as the backend for dashboards in AWS QuickSight. QuickSight connects to Athena using the Glue Data Catalog, enabling interactive visualizations of retail sales and fuel price trends.

Challenges and Lessons Learned

Schema Evolution: Managing schema changes in Glue required careful configuration of the schema_change_policy.
Partitioning Strategy: Choosing the right partitioning strategy was critical for optimizing query performance.
Cost Management: Monitoring query costs in Athena Workgroups helped identify and optimize expensive queries.

Conclusion

Amazon Athena is a powerful tool for querying and analyzing data directly in S3. By integrating Athena with the Glue Data Catalog and optimizing the data layout, InsightFlow enables efficient, cost-effective data analysis.