Migrating Hadoop Workloads to AWS: On-Premises HDFS, Spark, Kafka, Airflow to AWS S3, Iceberg, and EMR

Table of Contents

  1. Introduction
  2. Why Migrate from On-Premises Hadoop to AWS?
  3. Target AWS Architecture with Iceberg
  4. Step-by-Step Migration Process
  5. Code Snippets & Implementation
  6. Lessons Learned & Best Practices

1. Introduction

Many enterprises still run on-premises Hadoop (HDFS, Spark, Kafka, Airflow) for big data processing. However, challenges like high operational costs, scalability bottlenecks, and maintenance overhead make cloud migration attractive.

This blog provides a 6-step guide for migrating to AWS S3, Apache Iceberg, and EMR, including:

  • Architecture diagrams
  • Code snippets for Spark, Kafka, and Iceberg
  • Lessons learned from real-world migrations

2. Why Migrate from On-Premises Hadoop to AWS?

Challenges with On-Prem Hadoop

Issue → AWS Solution
  • Expensive hardware & maintenance → Pay-as-you-go pricing (EMR, S3)
  • Manual scaling (YARN/HDFS) → Auto-scaling EMR clusters
  • HDFS limitations (durability, scaling) → S3 (11 nines of durability) + Iceberg (ACID tables)
  • Complex Kafka & Airflow management → Amazon MSK (Managed Kafka) & Amazon MWAA (Managed Airflow)

Key Benefits of AWS + Iceberg

  • Cost savings (no upfront hardware, spot instances)
  • Modern table format (Iceberg for schema evolution, time travel)
  • Serverless options (Glue, Athena, EMR Serverless)

3. Target AWS Architecture with Iceberg

Current On-Premises Setup

(architecture diagram)

New AWS Architecture (Iceberg + EMR)

(architecture diagram)

Key AWS Services

  • S3 – Data lake storage (replaces HDFS)
  • EMR – Managed Spark with Iceberg support
  • AWS Glue Data Catalog – Metastore for Iceberg tables
  • MSK – Managed Kafka for streaming
  • MWAA – Managed Airflow for orchestration

4. Step-by-Step Migration Process

Phase 1: Assessment & Planning

  • Inventory existing workloads: HDFS paths, Spark SQL jobs, Kafka topics (a rough inventory sketch follows this list)
  • Choose Iceberg as the table format (schema evolution, upserts, time travel)
  • Plan networking and security (VPC, security groups, IAM roles)
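
A rough way to build that inventory is to script it. The sketch below is hypothetical: it assumes the hdfs CLI is available where it runs and that your datasets live under /data, and it simply prints per-directory sizes so you can prioritize what to migrate first.

  import subprocess

  # List dataset directories with their HDFS footprint; FsShell expands the glob itself
  result = subprocess.run(
      ["hdfs", "dfs", "-du", "-s", "-h", "/data/*"],
      capture_output=True, text=True, check=True,
  )
  for line in result.stdout.splitlines():
      print(line)  # e.g. "1.2 T  3.6 T  /data/transactions"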

Phase 2: Data Migration (HDFS → S3 + Iceberg)

  • Option 1: Use distcp to copy the files from HDFS to S3 as-is
  hadoop distcp hdfs://namenode/path s3a://bucket/path
  • Option 2: Use Spark to rewrite the data as an Iceberg table (registered in the Glue catalog configured in Phase 3)
  df = spark.read.parquet("hdfs://namenode/path")
  df.writeTo("glue_catalog.db.transactions").using("iceberg").createOrReplace()
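
A third option, useful when the Parquet files have already been copied to S3 with distcp, is to register them into an Iceberg table in place instead of rewriting them. This is a sketch using Iceberg's add_files procedure; the table and S3 path are placeholders, and it assumes the Glue catalog configuration from Phase 3 and that the target Iceberg table already exists.

  spark.sql("""
      CALL glue_catalog.system.add_files(
          table => 'db.transactions',
          source_table => '`parquet`.`s3://bucket/path`'
      )
  """)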

Phase 3: Compute Migration (Spark → EMR with Iceberg)

  • Configure Spark on EMR for Iceberg and the Glue Data Catalog. On EMR 6.5+ the Iceberg runtime ships with the release; you only need to point Spark at it, e.g. with a bootstrap script that appends to spark-defaults.conf:
  #!/bin/bash
  sudo tee -a /etc/spark/conf/spark-defaults.conf <<'EOF'
  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
  spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
  spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
  spark.sql.catalog.glue_catalog.warehouse=s3://bucket/warehouse
  EOF

Phase 4: Streaming Migration (Kafka → MSK)

  • Mirror topics into MSK with Kafka Connect, using MirrorMaker 2's MirrorSourceConnector (replication factors, authentication, and the companion checkpoint/heartbeat connectors are omitted for brevity)
  {
    "name": "msk-mirror",
    "config": {
      "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
      "source.cluster.alias": "onprem",
      "target.cluster.alias": "msk",
      "source.cluster.bootstrap.servers": "on-prem-kafka:9092",
      "target.cluster.bootstrap.servers": "b-1.msk.aws:9092",
      "topics": ".*"
    }
  }
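
Once mirroring is running, spot-check that topics are actually flowing. The sketch below uses the kafka-python client to compare end offsets between the two clusters; the topic name and broker addresses are placeholders, any TLS/IAM authentication your MSK cluster requires is omitted, and the "onprem." prefix matches the source cluster alias above (MirrorMaker 2 prepends it to replicated topic names by default).

  from kafka import KafkaConsumer, TopicPartition

  source = KafkaConsumer(bootstrap_servers="on-prem-kafka:9092")
  target = KafkaConsumer(bootstrap_servers="b-1.msk.aws:9092")

  def total_end_offset(consumer, topic):
      # Sum the latest offset across all partitions of the topic
      partitions = consumer.partitions_for_topic(topic) or set()
      tps = [TopicPartition(topic, p) for p in partitions]
      return sum(consumer.end_offsets(tps).values())

  print("on-prem:", total_end_offset(source, "transactions"))
  print("msk:    ", total_end_offset(target, "onprem.transactions"))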

Phase 5: Orchestration Migration (Airflow → MWAA)

  • Export DAGs to the MWAA S3 bucket and update paths (replace hdfs:// with s3://)
  • Store credentials in AWS Secrets Manager instead of hard-coding them in DAGs (a minimal DAG sketch follows)
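
As a concrete example of both points, here is a minimal sketch of a migrated DAG: the task reads from an s3:// path instead of hdfs://, and credentials come from an Airflow connection that MWAA can resolve through its Secrets Manager backend. The DAG ID, connection name, and bucket are hypothetical.

  from datetime import datetime

  from airflow import DAG
  from airflow.hooks.base import BaseHook
  from airflow.operators.python import PythonOperator

  def load_transactions():
      # Connection "warehouse_db" is stored in AWS Secrets Manager, not in the DAG
      conn = BaseHook.get_connection("warehouse_db")
      print(f"Loading s3://bucket/iceberg_db/transactions as {conn.login}")

  with DAG("daily_load", start_date=datetime(2025, 1, 1),
           schedule_interval="@daily", catchup=False) as dag:
      PythonOperator(task_id="load_transactions", python_callable=load_transactions)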

Phase 6: Validation & Optimization

  • Verify data consistency (compare row counts and checksums between source and target; a quick check is sketched below)
  • Optimize Iceberg tables (compact small files, rely on partition pruning), for example:
  CALL glue_catalog.system.rewrite_data_files('db.table', strategy => 'binpack')
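
A quick consistency check can be as simple as comparing counts (and, if you need more assurance, per-column aggregates) between the old Parquet data and the new Iceberg table. The paths, table name, and amount column below are placeholders.

  old_df = spark.read.parquet("hdfs:///data/transactions")
  new_df = spark.table("glue_catalog.db.transactions")

  # Row counts should match exactly after a full copy
  assert old_df.count() == new_df.count(), "row count mismatch"

  # A cheap checksum: compare the sum of a numeric column on both sides
  old_sum = old_df.agg({"amount": "sum"}).collect()[0][0]
  new_sum = new_df.agg({"amount": "sum"}).collect()[0][0]
  assert old_sum == new_sum, "checksum mismatch on amount"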

5. Code Snippets & Implementation

1. Reading/Writing Iceberg Tables in Spark

# Read from HDFS (old)
df = spark.read.parquet("hdfs:///data/transactions")

# Write to an Iceberg table in the Glue catalog (new)
df.writeTo("glue_catalog.db.transactions").using("iceberg").createOrReplace()

# Query with time travel (read a specific snapshot by ID)
spark.read.option("snapshot-id", "12345") \
    .table("glue_catalog.db.transactions")
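
Iceberg also supports row-level upserts, one of the reasons it was chosen in Phase 1. A sketch is below, assuming a DataFrame of incoming changes called updates_df, a transaction_id key column, and the Iceberg SQL extensions enabled by the Phase 3 configuration.

# Register the incoming changes as a temp view, then MERGE into the Iceberg table
updates_df.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO glue_catalog.db.transactions t
    USING updates u
    ON t.transaction_id = u.transaction_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")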

2. Kafka to Iceberg (Structured Streaming)

df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "b-1.msk.aws:9092") \
    .option("subscribe", "transactions") \
    .load()

# Kafka keys and values arrive as bytes; cast (or parse) them before writing
events = df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Write the stream to an Iceberg table (the target table must already exist;
# streaming writes also require a checkpoint location)
events.writeStream.format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://bucket/checkpoints/streaming") \
    .toTable("glue_catalog.db.streaming")

3. Airflow DAG for Iceberg Maintenance

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

dag = DAG(
    "iceberg_maintenance",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
)

compact_task = EmrAddStepsOperator(
    task_id="compact_iceberg",
    job_flow_id="j-EMRCLUSTER",
    steps=[{
        "Name": "Compact Iceberg",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-sql", "--executor-memory", "8G",
                     "-e", "CALL glue_catalog.system.rewrite_data_files('db.transactions')"],
        },
    }],
    dag=dag,
)
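
Compaction alone does not reclaim storage, because older snapshots still reference the original small files. A companion maintenance step, sketched below with placeholder retention settings, expires old snapshots and removes orphaned files; it could run as additional steps in the same DAG.

spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'db.transactions',
        older_than => TIMESTAMP '2025-01-01 00:00:00',
        retain_last => 5
    )
""")

spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.transactions')")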

6. Lessons Learned & Best Practices

Key Challenges & Fixes

Issue → Solution
  • Slow S3 writes → Use the EMRFS S3-optimized committer
  • Hive metastore conflicts → Migrate to the AWS Glue Data Catalog
  • Kafka consumer lag → Increase MSK broker size & optimize partitions

Best Practices

  • Use EMR 6.8+ for native Iceberg support
  • Partition Iceberg tables by time for better query performance
  • Enable S3 lifecycle policies to keep storage costs down
  • Monitor MSK consumer lag with CloudWatch (a sketch follows below)
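
For the last point, here is a sketch of pulling MSK consumer-lag numbers with boto3. It assumes the MaxOffsetLag metric exposed by MSK's consumer-lag monitoring is available at your monitoring level, and the cluster, consumer group, and topic names are placeholders.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kafka",
    MetricName="MaxOffsetLag",
    Dimensions=[
        {"Name": "Cluster Name", "Value": "analytics-msk"},
        {"Name": "Consumer Group", "Value": "iceberg-writer"},
        {"Name": "Topic", "Value": "transactions"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Maximum"],
)
print(response["Datapoints"])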

Final Thoughts

Migrating to AWS S3 + Iceberg + EMR modernizes data infrastructure, reduces costs, and improves scalability. By following this guide, enterprises can minimize downtime and maximize performance.

Next Steps

Would you like a deeper dive into Iceberg optimizations or Kafka migration strategies? Let me know in the comments!