When COBOL Fails: Real-Time Error Management with S3 and JSON

Introduction
In any system that processes large volumes of data, failure is inevitable. And when you're running legacy COBOL applications as part of a modern pipeline, error handling becomes even more critical. COBOL programs weren’t built to emit structured error logs or integrate with cloud-native monitoring tools. So when something breaks—bad input, missing fields, logic bugs—you need a system that captures, logs, and routes those failures in a way that’s actionable.
In this article, we’ll dive deep into the error handling strategy behind our eks_cobol project. You’ll learn how we built a fault-tolerant process that logs COBOL errors to Amazon S3 in structured JSON format, enabling real-time observability and downstream integration with ML services for predictive insights. This is a story of wrapping an ancient workhorse in battle armor and plugging it into the cloud.
Why Error Handling Needs to Be Rethought for Legacy Code
COBOL’s default behavior when encountering errors is to crash, print something vaguely helpful to STDOUT, or continue silently failing in ways that break downstream logic. That might have flown in the 1980s, but today’s systems demand traceability, alerts, and remediation.
The goal isn’t just to capture when a COBOL job fails, but why it failed, and then to package that information for:
- Debugging by engineers
- Reruns with fixed data
- Machine learning analysis
- Visual dashboards or ticket automation
Error Capture Strategy: Let COBOL Do Its Thing, Then Intercept
We don’t modify the COBOL program much—instead, we catch its output and behavior externally. Here's how it works:
- The COBOL job runs inside a Kubernetes pod using a shell wrapper script.
- STDOUT is redirected to an output JSON file, and STDERR is redirected to an error log.
- Exit codes and log contents are evaluated at the end of the job to determine success or failure.
- If a failure is detected, the error log is parsed into an .error.json file and uploaded to a dedicated S3 bucket using the AWS CLI or SDK.
The shell script looks like this:
#!/bin/bash
set -e
cobc -x -free TransformCSV.cbl -o TransformCSV
if ./TransformCSV > /mnt/data/output/output.json 2> /mnt/data/output/error.log; then
  echo "Job completed successfully."
else
  echo "Job failed. Parsing errors..."
  python3 parse_error.py /mnt/data/output/error.log /mnt/data/output/error.json
  aws s3 cp /mnt/data/output/error.json s3://my-cobol-errors-bucket/errors/
  exit 1
fi
This keeps the actual COBOL code clean and lets the outer logic handle the complexity of error interpretation and routing.
Structured Error Files: JSON as a Contract
One of the key improvements we made was transforming COBOL errors into structured JSON. This creates a consistent contract for downstream consumers. Here’s an example of what one of those error files looks like:
{
"jobId": "e8f3d9d4-1c9b-4c7b-b9e2-f2345a3a9c92",
"timestamp": "2025-04-03T19:32:10Z",
"status": "failed",
"errorType": "DataFormatError",
"message": "Invalid date format in field 6",
"inputFile": "customers_202504.csv",
"line": 42,
"rawRecord": "A123,John,Doe,04/35/2024,ACTIVE"
}
These JSON error logs are easier to search, visualize, and feed into automated systems than raw console logs. We’ve built tooling to parse STDERR into this format using a small Python script (parse_error.py) with regex patterns customized for our COBOL compiler output.
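For illustration, here is a minimal sketch of what such a parser could look like. The regex pattern, the fallback values, and the way jobId and rawRecord are populated are assumptions for this example; the real parse_error.py is tuned to our compiler's actual message format.

#!/usr/bin/env python3
"""Minimal sketch of parse_error.py: turn a raw COBOL STDERR log into the
structured JSON contract shown above. The regex below is a placeholder;
real patterns depend on the compiler's message format."""
import json
import re
import sys
import uuid
from datetime import datetime, timezone

def parse(log_path, out_path):
    with open(log_path) as f:
        raw = f.read()

    # Hypothetical message shape: "<file>: line <n>: <ErrorType>: <message>"
    match = re.search(r"(?P<file>\S+): line (?P<line>\d+): (?P<type>\w+): (?P<msg>.+)", raw)

    record = {
        "jobId": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "failed",
        "errorType": match.group("type") if match else "UnknownError",
        "message": match.group("msg") if match else raw.strip()[:200],
        "inputFile": match.group("file") if match else None,
        "line": int(match.group("line")) if match else None,
        "rawRecord": None,  # populated when the log echoes the offending record
    }

    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)

if __name__ == "__main__":
    parse(sys.argv[1], sys.argv[2])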
S3 as a Scalable, Searchable Error Store
S3 gives us durability, versioning, and cost-effective long-term storage. Error logs are pushed into paths that follow this structure:
s3://my-cobol-errors-bucket/errors/yyyy/mm/dd/job-id-error.json
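The wrapper script shown earlier uploads to the errors/ prefix with the AWS CLI; a boto3 version that builds the date-partitioned key from the error record itself might look like the sketch below. The bucket name and local path come from the examples above, everything else is illustrative.

import json
import boto3
from datetime import datetime

def upload_error(error_path="/mnt/data/output/error.json", bucket="my-cobol-errors-bucket"):
    with open(error_path) as f:
        record = json.load(f)

    # Partition by the job timestamp so Athena/Glue can prune by date.
    ts = datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
    key = f"errors/{ts:%Y/%m/%d}/{record['jobId']}-error.json"

    boto3.client("s3").upload_file(error_path, bucket, key)
    return key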
With S3 events and Lambda, we can even trigger workflows when new errors are detected (a minimal handler sketch follows this list), such as:
- Notifying a Slack channel
- Creating a Jira ticket
- Invoking a SageMaker pipeline to retrain our error prediction model
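As a sketch of the first of these, the Lambda below reads the new error object and posts a summary to a Slack webhook. The webhook environment variable and the message format are assumptions; the same shape works for creating Jira tickets or kicking off a SageMaker pipeline.

import json
import os
import urllib.request
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by ObjectCreated events on the errors/ prefix.
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        error = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

        text = f"COBOL job {error['jobId']} failed: {error['errorType']} - {error['message']}"
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],  # hypothetical configuration
            data=json.dumps({"text": text}).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)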
We also periodically batch-query this data using Amazon Athena or AWS Glue to produce metrics like the following (an example query is sketched after this list):
- Top 10 most common errors
- Failures by input file type
- Average failure rate by job type
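For example, a "top 10 errors" query might be kicked off like this. The database and table names assume a Glue table has been defined over the errors/ prefix; adjust them to your own catalog.

import boto3

athena = boto3.client("athena")

TOP_ERRORS_SQL = """
SELECT errortype, COUNT(*) AS occurrences
FROM cobol_errors
GROUP BY errortype
ORDER BY occurrences DESC
LIMIT 10
"""

response = athena.start_query_execution(
    QueryString=TOP_ERRORS_SQL,
    QueryExecutionContext={"Database": "cobol_observability"},  # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-cobol-errors-bucket/athena-results/"},
)
print("Started query:", response["QueryExecutionId"])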
Connecting Error Data to Amazon SageMaker
The structured error logs we store in S3 serve double duty. Beyond observability, we use them as labeled training data for a SageMaker model that predicts whether a COBOL job is likely to fail, based on characteristics of the input file (filename patterns, content, size, date range, etc.).
When a new file hits the ingestion queue, it’s first evaluated by this model (a scoring sketch follows this list). If the model flags it as high risk, the system can:
- Route it to a special validation lane
- Run a “dry run” job with stricter logging
- Alert the data owner for review
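A scoring call against the deployed model could look like the sketch below. The endpoint name, feature encoding, response field, and risk threshold are all assumptions for illustration, not the project's published configuration.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def is_high_risk(filename, size_bytes, record_count,
                 endpoint="cobol-failure-predictor", threshold=0.7):
    # Features mirror the input-file characteristics described above.
    features = {"filename": filename, "size_bytes": size_bytes, "record_count": record_count}
    response = runtime.invoke_endpoint(
        EndpointName=endpoint,
        ContentType="application/json",
        Body=json.dumps(features),
    )
    score = float(json.loads(response["Body"].read())["failure_probability"])
    return score >= threshold

# Hypothetical routing decision for a newly ingested file.
if is_high_risk("customers_202505.csv", size_bytes=1_048_576, record_count=50_000):
    print("Routing to the validation lane with dry-run logging enabled.")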
This proactive capability didn’t exist in the mainframe era—but it’s made possible by converting COBOL’s black box behavior into structured, analyzable events.
Observability and Tracing
Once errors are stored in JSON and surfaced through S3, we can plug them into the following tools (a CloudWatch example is sketched after this list):
- CloudWatch Metrics: Tracking success/failure over time
- QuickSight Dashboards: Showing trends per job type or region
- Prometheus/Grafana: For real-time job status visualization
- OpenTelemetry: For tracing execution from data ingestion to COBOL run to S3 upload
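As a concrete example of the CloudWatch case, a small helper can publish a failure count with job and error dimensions each time an error record lands. The namespace and dimension names here are illustrative.

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_failure(job_type, error_type):
    cloudwatch.put_metric_data(
        Namespace="CobolPipeline",  # hypothetical namespace
        MetricData=[{
            "MetricName": "JobFailures",
            "Dimensions": [
                {"Name": "JobType", "Value": job_type},
                {"Name": "ErrorType", "Value": error_type},
            ],
            "Value": 1,
            "Unit": "Count",
        }],
    )

record_failure("TransformCSV", "DataFormatError")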
Each error record contains a jobId that links all logs, inputs, outputs, and metrics together. This gives engineers full end-to-end traceability for debugging or audits.
Conclusion
Legacy COBOL systems don't have to be black boxes. By wrapping them in smart containers, capturing their errors in structured JSON, and offloading that data to S3, we've created a system that is observable, maintainable, and even trainable.
This error handling architecture is key to unlocking modernization. It helps developers respond faster, empowers data teams to improve quality, and provides ML teams with real-world data to build predictive systems. Even when COBOL fails, it’s now part of a smarter system that learns and improves over time.