Leveling Up: Provisioning EKS with Terraform for My DevOps Project

Following the successful containerization and CI/CD setup for my TV Shows application (as detailed in my previous post, “From CosmosDB to DynamoDB: Migrating and Containerizing a FastAPI + React App”), the next logical step in my “Learn to Cloud” adventure was to embrace a more robust and scalable orchestration platform: Amazon Elastic Kubernetes Service (EKS). To manage this new layer of infrastructure as code (IaC), I turned to Terraform. This post chronicles the experience, the hurdles encountered while setting up EKS with Terraform, and the solutions that paved the way.

Why EKS? The Push for Orchestration

While running Docker containers on a single EC2 instance with Docker Compose worked for the initial phase, the goal was always to explore true cloud-native practices. Kubernetes (and EKS as its managed AWS offering) promised:

  • Scalability: Easily scale my application pods up or down based on demand.
  • Resilience: Automatic recovery of pods and better fault tolerance.
  • Service Discovery & Load Balancing: Integrated mechanisms for inter-service communication and exposing applications.
  • Declarative Configuration: Defining my desired application state with Kubernetes manifests.

Terraform: The Tool for the Job

Manually clicking through the AWS console to set up an EKS cluster, VPC, node groups, and all associated IAM roles and security groups is complex and error-prone. With its declarative IaC approach, Terraform was the clear choice to automate this provisioning, ensuring repeatability and version control for my infrastructure. I decided to leverage the official terraform-aws-modules/eks/aws module, which significantly simplifies EKS setup.
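For a sense of what that looks like, here is a stripped-down sketch of the VPC and EKS modules wired together; the module versions, names, CIDRs, and instance types below are placeholders rather than my exact configuration:

    # Minimal sketch only: versions, names, CIDRs, and instance types are placeholders.
    module "vpc" {
      source  = "terraform-aws-modules/vpc/aws"
      version = "~> 5.0"

      name            = "ltc-eks-vpc"
      cidr            = "10.0.0.0/16"
      azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
      private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
      public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

      enable_nat_gateway = true

      # Subnet tags EKS expects so load balancers land in the right subnets.
      public_subnet_tags  = { "kubernetes.io/role/elb" = 1 }
      private_subnet_tags = { "kubernetes.io/role/internal-elb" = 1 }
    }

    module "eks" {
      source  = "terraform-aws-modules/eks/aws"
      version = "~> 20.0"

      cluster_name    = "ltc-eks-cluster"
      cluster_version = "1.29"

      vpc_id     = module.vpc.vpc_id
      subnet_ids = module.vpc.private_subnets

      eks_managed_node_groups = {
        default = {
          instance_types = ["t3.medium"]
          min_size       = 1
          desired_size   = 2
          max_size       = 3
        }
      }
    }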

Challenge 1: Mastering Terraform Remote Backend State

Before starting on EKS, best practice dictates setting up a remote backend for Terraform state. This is crucial for collaboration and preventing state loss. I opted for S3 with DynamoDB for locking.
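For reference, the backend block I was aiming for looks roughly like this; the bucket, key, table, and region names are placeholders, not the exact ones from my setup:

    # Hypothetical names: substitute your own bucket, key, table, and region.
    terraform {
      backend "s3" {
        bucket         = "ltc-terraform-state"
        key            = "eks-cluster/terraform.tfstate"
        region         = "eu-west-1"
        dynamodb_table = "ltc-terraform-locks"
        encrypt        = true
      }
    }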

  • The Stumble: I ran into a really weird issue. My initial mistake was including the resource definitions for the S3 bucket and DynamoDB table within the same Terraform configuration that was also defining the EKS cluster and using that S3 bucket as its backend.
  • The Symptom: When I ran terraform apply for my EKS cluster, it seemed to be destroying the very S3 bucket and DynamoDB table that held my Terraform state! Terraform would start creating EKS resources, then attempt to "manage" (and destroy) the S3 bucket it was actively using. This caused the apply to fail spectacularly mid-operation, as Terraform couldn't save its state to a bucket it had just deleted, or release a lock from a DynamoDB table that no longer existed. The errors were clear: NoSuchBucket and ResourceNotFoundException for the lock table. As you can imagine, this made a mess, losing my Terraform state and leaving stale Terraform locks behind.

The Solution (Initial Part): The key was strict separation.

  • Create a completely separate Terraform configuration (e.g., in a remote-backend subdirectory) whose sole purpose is to define and create the S3 bucket (with versioning and encryption) and the DynamoDB table for state locking. This configuration uses the default local backend.
  • Run terraform apply on this backend-only configuration once to provision these resources.
  • Then, in my main EKS Terraform configuration, I only included the backend "s3" {} block in providers.tf, pointing to the pre-existing bucket and table. The EKS configuration itself contained no resource blocks for the backend components.
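As a rough sketch (names are placeholders, and the exact resources may differ slightly from mine), the backend-only configuration boils down to something like this, kept in its own directory with a local state file:

    # remote-backend/main.tf -- applied once, using the default local backend.
    provider "aws" {
      region = "eu-west-1"
    }

    resource "aws_s3_bucket" "terraform_state" {
      bucket = "ltc-terraform-state"
    }

    # Versioning so old state files can be recovered.
    resource "aws_s3_bucket_versioning" "terraform_state" {
      bucket = aws_s3_bucket.terraform_state.id
      versioning_configuration {
        status = "Enabled"
      }
    }

    # Server-side encryption for the state objects.
    resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
      bucket = aws_s3_bucket.terraform_state.id
      rule {
        apply_server_side_encryption_by_default {
          sse_algorithm = "AES256"
        }
      }
    }

    # DynamoDB table used purely for state locking.
    resource "aws_dynamodb_table" "terraform_locks" {
      name         = "ltc-terraform-locks"
      billing_mode = "PAY_PER_REQUEST"
      hash_key     = "LockID"

      attribute {
        name = "LockID"
        type = "S"
      }
    }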

A Lingering Ghost (The Full Solution): Even after separating the code, when I ran terraform plan from the eks-cluster directory, I noticed it still planned to destroy four resources, which I knew were my state resources! The cause was that a previous (failed) apply had recorded those backend resources in the EKS configuration's state file. When I ran terraform state list, the four resources it showed were indeed the backend state resources. Terraform compared my current code (no backend resources defined) with the state file (backend resources recorded) and concluded that, since the code no longer defined them, they should be destroyed. The fix was explicitly removing these resources from the EKS cluster's state file by running terraform state rm for each of the four backend resources. After doing this, terraform plan correctly showed 0 to destroy, acknowledging that they are managed elsewhere.
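The cleanup looked roughly like this; the resource addresses below are illustrative (matching the sketch above), so take the real ones from the terraform state list output:

    # See which backend resources are (wrongly) tracked in this state file.
    terraform state list

    # Remove them from state WITHOUT destroying the real resources.
    terraform state rm aws_s3_bucket.terraform_state
    terraform state rm aws_s3_bucket_versioning.terraform_state
    terraform state rm aws_s3_bucket_server_side_encryption_configuration.terraform_state
    terraform state rm aws_dynamodb_table.terraform_locks

    # Verify the plan no longer wants to destroy anything.
    terraform plan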

Challenge 2: VPC Configuration — To Create or To Use?

The EKS module can create a new VPC or use an existing one.

  • Initial Approach: I let the module create a new VPC. This approach is often cleaner and ensures that all EKS tagging requirements are met automatically.

  • The “VpcLimitExceeded” Wall: During one terraform apply, I encountered the VpcLimitExceeded error. My AWS account had reached its default limit of 5 VPCs per region, partly because earlier failed terraform apply runs had created VPCs that were never cleaned up.

Solutions & Considerations:

Delete Unused VPCs: The quickest fix was to log into the AWS console and delete old, unused VPCs.
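Before deleting anything, a couple of AWS CLI commands (run against whichever region you are working in) make it easy to see what has accumulated:

    # How many VPCs exist in the current region?
    aws ec2 describe-vpcs --query 'length(Vpcs)'

    # List them with their Name tags and CIDRs to spot leftovers from failed applies.
    aws ec2 describe-vpcs \
      --query 'Vpcs[].{Id:VpcId,Name:Tags[?Key==`Name`]|[0].Value,Cidr:CidrBlock}' \
      --output table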

Challenge 3: “Resource Already Exists” — Cleaning Up After Failed Runs

EKS cluster creation is a lengthy process. If an apply fails midway, some resources might be created while others aren't, and Terraform might not have recorded their creation in the state file if it crashed before saving, just like what happened with the VPCs.

  • The Problem: On subsequent terraform apply attempts, I encountered errors like AlreadyExistsException for a KMS Alias (alias/eks/ltc-eks-cluster) and a CloudWatch Log Group (/aws/eks/ltc-eks-cluster/cluster). Terraform was trying to create these, but they already existed from a previous partial run.

The Solution:

Manual Deletion (Use with Caution): For the KMS alias, since I was certain it was an unused remnant, the option I took was to manually delete it via the AWS console before re-running apply. This is riskier if you're unsure of the resource's origin.
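The same check-then-delete can be done from the CLI if you prefer; these commands assume the resource names from my error messages:

    # Confirm the orphaned resources actually exist and look like remnants.
    aws kms list-aliases --query "Aliases[?AliasName=='alias/eks/ltc-eks-cluster']"
    aws logs describe-log-groups --log-group-name-prefix /aws/eks/ltc-eks-cluster/cluster

    # Delete them only if you are sure nothing else owns them.
    aws kms delete-alias --alias-name alias/eks/ltc-eks-cluster
    aws logs delete-log-group --log-group-name /aws/eks/ltc-eks-cluster/cluster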

Ideally, Terraform modules are idempotent, but sometimes with complex, multi-step creations like EKS, partial failures can leave things in a state that requires manual intervention.
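The gentler alternative is terraform import, which adopts the existing resource into state instead of deleting and recreating it. The module-internal resource address below is my best guess for the log group, so confirm it against the plan or "already exists" error output before importing:

    # Import the pre-existing log group into the EKS configuration's state.
    # The address depends on the module version -- verify it from `terraform plan`.
    terraform import 'module.eks.aws_cloudwatch_log_group.this[0]' /aws/eks/ltc-eks-cluster/cluster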

The Sweet Success: A Running EKS Cluster

After navigating these hurdles, terraform apply finally completed, and the outputs.tf provided the command to configure kubectl. Seeing kubectl get nodes return the worker nodes from my new EKS cluster was a very satisfying moment!
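For completeness, that step boils down to the following (the region here is a placeholder for whichever one you deployed to):

    # Merge the new cluster's credentials into your local kubeconfig.
    aws eks update-kubeconfig --region eu-west-1 --name ltc-eks-cluster

    # Confirm the worker nodes have joined the cluster.
    kubectl get nodes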

Key Learnings from EKS & Terraform:

  • Backend State is Sacred: Treat its setup carefully and keep it separate. Understand how state inconsistencies can lead to unexpected plans.
  • Understand Module Defaults & Requirements: The EKS module is powerful but has expectations (like VPC subnet tagging for load balancers).
  • AWS Service Limits are Real: Be aware of them, especially for resources like VPCs.
  • Failed Applies Need Cleanup/Import: Don’t just keep re-running apply. Investigate "already exists" errors and either import the resource or safely delete it if it's an orphaned remnant.
  • Patience: EKS cluster provisioning takes time.

You can find the repository here.

Next Steps: Deploying the Application

With the EKS cluster provisioned by Terraform and my CI/CD pipeline pushing images to ECR, the next phase is to write the Kubernetes YAML manifests (Deployments, Services, ConfigMaps, Secrets) to deploy my React frontend and FastAPI backend onto the cluster. This will involve configuring Ingress for external access and IRSA for secure AWS service access from my backend pods. The journey to a fully orchestrated application continues!