Replacing Lambda Triggers with EventBridge in S3-to-Glue Workflows

In one of our production data platforms, we used Lambda functions to trigger AWS Glue jobs every time a file landed in an S3 folder. That setup worked fine when there were only two or three data sources.

But as the system expanded to support more than 10 folders, it required deploying and maintaining an equal number of nearly identical Lambda functions, each wired to specific prefixes and jobs. The architecture became increasingly brittle and harder to manage.

This post outlines how that structure was replaced using EventBridge, with prefix-based filtering and direct Glue job targets. No Lambda. No maintenance overhead.

Scaling Limits and Operational Gaps

Using S3 events to trigger Lambda comes with several limitations:

A single Lambda function can’t be mapped to multiple S3 prefixes

Each one requires separate deployment and IAM permissions

Failures due to cold starts, dependency packaging, or misconfiguration often go undetected

Event tracing is difficult without additional logging

As more sources were added, silent failures became a recurring issue. In some cases, downstream data loads were missed completely.

Objectives for a Scalable Triggering System

A more resilient and maintainable system was needed, one that could:

Support multiple S3 prefixes cleanly

Trigger different Glue jobs based on the prefix

Include native retry behavior

Offer better traceability and alerting

EventBridge addressed these requirements directly.

Event-Driven Architecture Overview

The updated solution routes S3 events through EventBridge:

S3 emits object-created events to EventBridge

EventBridge rules filter events by prefix (e.g., applications/batch_load/)

Each rule targets a specific Glue job

Retries are handled natively

Failures route to an SQS dead-letter queue (DLQ)

An SNS topic forwards DLQ alerts to Slack for visibility

This approach eliminated the need for intermediary Lambda functions.

Step-by-Step Implementation

  1. Enable EventBridge Notifications on S3

Open the S3 bucket

Go to Properties → Event notifications

Enable "Send events to Amazon EventBridge"

No further configuration is needed on the S3 side.
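If you prefer infrastructure-as-code over console clicks, the same setting can be applied with boto3. A minimal sketch, assuming the bucket from this walkthrough (my-etl-data-bucket):

import boto3

s3 = boto3.client("s3")

# Turn on "Send events to Amazon EventBridge" for the bucket.
# Note: this call replaces the bucket's existing notification configuration,
# so merge in any SNS/SQS/Lambda notifications you still need.
s3.put_bucket_notification_configuration(
    Bucket="my-etl-data-bucket",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)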

  2. Create EventBridge Rules by Prefix

To map the applications/batch_load/ prefix to a job called batch_job_glue:

Go to EventBridge → Rules → Create rule

Name the rule: trigger-batch-glue-job

Use the following event pattern:


{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {"name": ["my-etl-data-bucket"]},
    "object": {
      "key": [{"prefix": "applications/batch_load/"}]
    }
  }
}

Set the target as the corresponding Glue job

Under Input transformer, configure:

Input Path:

{
  "bucket": "$.detail.bucket.name",
  "key": "$.detail.object.key"
}

Input Template:

{
  "--BUCKET_NAME": "<bucket>",
  "--OBJECT_KEY": "<key>"
}

The input path extracts the <bucket> and <key> variables from the event, and the template passes them to the Glue job as the --BUCKET_NAME and --OBJECT_KEY job arguments that getResolvedOptions expects.

Repeat this setup for other prefixes and Glue jobs (e.g., applications/batch_job/, applications/daily_runs/)
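Once the prefix count grows, the per-rule clicking gets tedious, so the same setup can be scripted. A hedged sketch with boto3, assuming a hypothetical prefix-to-job mapping, placeholder account and region values, and an existing IAM role (eventbridge-glue-start-role) that allows glue:StartJobRun:

import json
import boto3

events = boto3.client("events")

ACCOUNT_ID = "123456789012"   # placeholder
REGION = "us-east-1"          # placeholder
BUCKET = "my-etl-data-bucket"
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/eventbridge-glue-start-role"  # assumed role name

# Hypothetical mapping of prefixes to Glue jobs
PREFIX_TO_JOB = {
    "applications/batch_load/": "batch_job_glue",
    "applications/daily_runs/": "daily_runs_glue",
}

for prefix, job_name in PREFIX_TO_JOB.items():
    rule_name = f"trigger-{job_name}"

    # Prefix-filtered rule on the default event bus
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps({
            "source": ["aws.s3"],
            "detail-type": ["Object Created"],
            "detail": {
                "bucket": {"name": [BUCKET]},
                "object": {"key": [{"prefix": prefix}]},
            },
        }),
        State="ENABLED",
    )

    # Glue job target with the input transformer from the previous section
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": f"{job_name}-target",
            "Arn": f"arn:aws:glue:{REGION}:{ACCOUNT_ID}:job/{job_name}",
            "RoleArn": ROLE_ARN,
            "InputTransformer": {
                "InputPathsMap": {
                    "bucket": "$.detail.bucket.name",
                    "key": "$.detail.object.key",
                },
                "InputTemplate": '{"--BUCKET_NAME": "<bucket>", "--OBJECT_KEY": "<key>"}',
            },
        }],
    )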

Receiving S3 Context in Glue Jobs

Glue jobs only need minimal logic to accept the input parameters:

import sys
from awsglue.utils import getResolvedOptions

# Resolve the arguments passed in by the EventBridge input transformer
args = getResolvedOptions(sys.argv, ['BUCKET_NAME', 'OBJECT_KEY'])

bucket = args['BUCKET_NAME']
key = args['OBJECT_KEY']

print(f"Triggered for file: s3://{bucket}/{key}")

# ... continue with the rest of the job logic
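From there, the job can point Spark directly at the object that triggered the run. A small sketch continuing from the snippet above, assuming the incoming files are JSON (swap the reader for CSV or Parquet as appropriate):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Standard Glue boilerplate to obtain a Spark session
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read only the object that triggered this run, not the whole prefix
df = spark.read.json(f"s3://{bucket}/{key}")  # bucket/key come from getResolvedOptions above
df.printSchema()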

Example: Failure Detection with DLQ and Alerting

In one case, a Glue job failed because the incoming data had a schema change that wasn't compatible with the job's Spark read logic. The job attempted to deserialize a new column that didn't exist in the table definition, which triggered a runtime AnalysisException. Since this was a data issue, retries didn’t help. EventBridge retried the event twice, both failed, and the event was then routed to the SQS DLQ. From there, SNS sent an alert email.

This kind of alert-driven feedback loop has helped us catch data contract violations early, something we couldn’t reliably do with Lambda triggers.
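The retry and DLQ behavior described above is configured on the EventBridge target itself. A hedged sketch of how it could be wired with boto3, assuming the rule from earlier and an SQS queue named glue-trigger-dlq already exist (input transformer omitted for brevity; include it as shown earlier when updating the target):

import boto3

events = boto3.client("events")

ACCOUNT_ID = "123456789012"  # placeholder
REGION = "us-east-1"         # placeholder

events.put_targets(
    Rule="trigger-batch-glue-job",
    Targets=[{
        "Id": "batch_job_glue-target",
        "Arn": f"arn:aws:glue:{REGION}:{ACCOUNT_ID}:job/batch_job_glue",
        "RoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/eventbridge-glue-start-role",  # assumed role name
        # Retry the StartJobRun call a couple of times before giving up
        "RetryPolicy": {
            "MaximumRetryAttempts": 2,
            "MaximumEventAgeInSeconds": 3600,
        },
        # Events that still can't be delivered land in this queue,
        # where an SNS subscription or alarm can pick them up
        "DeadLetterConfig": {
            "Arn": f"arn:aws:sqs:{REGION}:{ACCOUNT_ID}:glue-trigger-dlq",
        },
    }],
)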

Why Lambda Was a Misfit for This Use Case

Lambda wasn’t a good fit here for multiple reasons. First, Glue jobs are often long-running, and Lambda has a 15-minute maximum timeout. If the job takes longer, you’re forced to use asynchronous invocation, which means Lambda fires the job and exits without waiting for it to complete. That leaves you with no visibility into whether the job failed halfway or succeeded, unless you bolt on custom polling, logging, and error handling.

EventBridge, on the other hand, ensures reliable triggering with built-in retry and optional dead-letter queues. But it's important to note that this setup doesn't solve job-level failures either. If the Glue job fails midway due to bad data or resource issues, EventBridge considers its job done as long as the job was triggered successfully.

To monitor job completion and handle runtime failures, teams still need to subscribe to Glue Job State Change events, wire up CloudWatch alerts, or implement a status feedback loop. But the triggering side becomes fully managed and cost-effective with no code, no maintenance, just clean event routing.
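That job-level gap can be closed with a second, much simpler rule that listens for Glue's own state-change events. A sketch, assuming an existing SNS topic named glue-job-failures for notifications:

import json
import boto3

events = boto3.client("events")

ACCOUNT_ID = "123456789012"  # placeholder
REGION = "us-east-1"         # placeholder

# Match failed (and timed-out or stopped) runs of the triggered job
events.put_rule(
    Name="glue-job-failure-alerts",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": ["batch_job_glue"],
            "state": ["FAILED", "TIMEOUT", "STOPPED"],
        },
    }),
    State="ENABLED",
)

# Forward matching events to an SNS topic for alerting
events.put_targets(
    Rule="glue-job-failure-alerts",
    Targets=[{
        "Id": "glue-failure-sns",
        "Arn": f"arn:aws:sns:{REGION}:{ACCOUNT_ID}:glue-job-failures",
    }],
)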

Replacing a set of Lambda functions with a few well-scoped EventBridge rules significantly reduced operational complexity. The architecture is leaner, traceable, and cheaper to run. For teams managing multi-source S3 ETL pipelines, this approach simplifies the start of the workflow, even though runtime monitoring still requires its own layer.