Predicting Legacy Failures: Training and Hosting ML Models in SageMaker

Introduction
Legacy systems are infamous for failing silently—or catastrophically—with no early warning signs. In our eks_cobol pipeline, COBOL batch jobs handle sensitive data transformations. When something goes wrong, we don’t just want to know after it fails—we want to know before it runs. Enter machine learning.
This article covers how we use Amazon SageMaker to train a model that predicts COBOL job failures based on input metadata and content characteristics. You’ll see how we take the structured error data from Article 4, create features, train a model using XGBoost, host it with a live endpoint, and wire it into our processing pipeline for real-time inference.
The Prediction Problem
The goal is to predict whether a COBOL job will fail, before running it, using data available at ingest time. Features include:
- Filename (which may encode customer, date, region, etc.)
- File size (bytes)
- Record count
- Presence of null fields or format anomalies
- Job type or business logic variant
We label previously failed jobs with isFailure = True and successful jobs with isFailure = False. The model learns correlations between input patterns and known failures.
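To make the labeling concrete, a single training example (before one-hot encoding) might look like the hypothetical row below. The column names and values are purely illustrative, not the exact schema from Article 4:
fileName,fileSizeBytes,recordCount,hasNullFields,jobType,isFailure
CUST_EU_20240312.dat,1048576,14200,1,claims_batch,True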
Building the Training Dataset
We merge two CSVs:
- One from failed COBOL jobs (errors_flat.csv)
- One from successful jobs (success_flat.csv)
A preprocessing script ensures both datasets are aligned, normalized, and balanced; the core feature-engineering step looks like this:
import pandas as pd

# Combine failed and successful job records into one labeled dataset
errors = pd.read_csv('errors_flat.csv')
success = pd.read_csv('success_flat.csv')
df = pd.concat([errors, success], ignore_index=True)
# Derive simple features from the raw record payload and the input filename
df['fileSize'] = df['rawRecord'].apply(lambda x: len(str(x).encode('utf-8')))
df['fileExtension'] = df['inputFile'].apply(lambda x: str(x).split('.')[-1])
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['errorType', 'fileExtension'])
# Put the label first: SageMaker's built-in XGBoost expects the target in column 0 for CSV input
feature_cols = [col for col in df.columns if col.startswith('errorType_') or col.startswith('fileExtension_')]
df = df[['isFailure', 'fileSize'] + feature_cols]
df['isFailure'] = df['isFailure'].astype(int)  # booleans -> 0/1 so the CSV is purely numeric
df.to_csv('ml_input.csv', index=False)
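The training job in the next section reads train and validation channels from S3 via train_s3_path and test_s3_path, which aren’t defined above. Here is a minimal sketch of how they might be produced; the default bucket, prefix, and 80/20 split are placeholder choices:
import sagemaker
from sklearn.model_selection import train_test_split

session = sagemaker.Session()
bucket = session.default_bucket()        # or a dedicated training bucket
prefix = 'eks-cobol/failure-predictor'   # placeholder prefix

# Hold out 20% of rows for validation, stratified on the label
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['isFailure'])
# The built-in XGBoost container expects headerless CSVs with the label in the first column
train_df.to_csv('train.csv', index=False, header=False)
test_df.to_csv('validation.csv', index=False, header=False)
train_s3_path = session.upload_data('train.csv', bucket=bucket, key_prefix=f'{prefix}/train')
test_s3_path = session.upload_data('validation.csv', bucket=bucket, key_prefix=f'{prefix}/validation')
# Keep a feature-only hold-out frame for smoke-testing the endpoint later
X_test = test_df.drop(columns=['isFailure'])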
Training the Model in SageMaker
We use SageMaker’s built-in XGBoost container for binary classification, so no custom training script is required; the job is launched as a SageMaker training job, typically from a SageMaker Studio notebook.
import sagemaker
from sagemaker.inputs import TrainingInput

role = sagemaker.get_execution_role()  # IAM role with SageMaker and S3 permissions

# Resolve the built-in XGBoost container image for this region and version
container = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.3-1")

xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # a CPU instance such as ml.m5.xlarge also works for tabular XGBoost
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=session
)

xgb_estimator.set_hyperparameters(
    objective="binary:logistic",  # output a failure probability between 0 and 1
    num_round=100,
    max_depth=5,
    eta=0.2,
    subsample=0.8,
    colsample_bytree=0.8
)

# The train and validation channels point at the headerless CSVs uploaded earlier
xgb_estimator.fit({
    "train": TrainingInput(train_s3_path, content_type="csv"),
    "validation": TrainingInput(test_s3_path, content_type="csv")
})
This trains a binary classifier that predicts failure probability (0.0 to 1.0) given new job metadata.
Hosting the Inference Endpoint
Once the model is trained and stored in S3, we deploy it to a real-time SageMaker endpoint:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Name the endpoint explicitly so the ingestion service can reference it later
predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",  # a small CPU instance such as ml.m5.large would also suffice here
    endpoint_name="cobol-failure-predictor"
)
predictor.serializer = CSVSerializer()       # send features as a CSV string
predictor.deserializer = JSONDeserializer()  # parse the returned score
# Smoke-test with one feature-only row from the hold-out set (no isFailure column)
sample = X_test.head(1).to_csv(header=False, index=False).strip()
print("Sample row:", sample)
print("Prediction:", predictor.predict(sample))
Now we can send job metadata in real-time and receive a prediction before running the COBOL job.
Integrating Inference into the Pipeline
Before a COBOL job runs, the ingestion service sends a prediction request to the SageMaker endpoint. If the prediction is above a threshold (say 0.8), we mark the job as "high risk" and route it to a validation or quarantine path.
import boto3

runtime = boto3.client('sagemaker-runtime')

def get_failure_score(fileSize, ext_onehot, error_type_onehot):
    # Features must match the training column order: fileSize, errorType_* dummies, fileExtension_* dummies
    payload = f"{fileSize}," + ",".join(map(str, error_type_onehot + ext_onehot))
    response = runtime.invoke_endpoint(
        EndpointName='cobol-failure-predictor',
        ContentType='text/csv',
        Body=payload
    )
    # The endpoint returns the failure probability as a plain string, e.g. "0.87"
    score = float(response['Body'].read().decode())
    return score
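Wiring that score into the routing decision is then a simple threshold check. The sketch below is illustrative: FAILURE_THRESHOLD, quarantine_job, and run_cobol_job are placeholder names, not functions from the actual ingestion service:
FAILURE_THRESHOLD = 0.8  # tune against observed precision/recall

def route_job(job_metadata):
    score = get_failure_score(
        job_metadata['fileSize'],
        job_metadata['ext_onehot'],
        job_metadata['error_type_onehot']
    )
    if score >= FAILURE_THRESHOLD:
        quarantine_job(job_metadata, score)   # placeholder: send to the validation/quarantine path
    else:
        run_cobol_job(job_metadata)           # placeholder: submit to the normal batch path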
This gives us predictive observability—no more surprises when a job fails after burning through hours of runtime.
Model Monitoring and Retraining
We use SageMaker Model Monitor to detect drift in prediction distributions. As more jobs are processed, both successful and failed, we continuously push new records to the training bucket and retrain the model weekly via a scheduled SageMaker pipeline or Lambda-triggered training job.
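For Model Monitor to see that traffic, the endpoint has to capture its requests and responses. A minimal sketch, assuming capture data lands under our training bucket (the S3 path is a placeholder):
from sagemaker.model_monitor import DataCaptureConfig

# Passed as data_capture_config=... to the deploy() call shown earlier
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,                                  # capture every request while traffic is low
    destination_s3_uri=f's3://{bucket}/{prefix}/datacapture'  # placeholder capture location
)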
The retraining process includes:
- Collect new .json logs from S3
- Run the same flatten + preprocess script
- Update the dataset
- Launch a training job with versioned output
- Replace the endpoint via blue/green deployment
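The weekly schedule itself can be an EventBridge rule that invokes a small Lambda to kick off retraining. A minimal sketch of the Lambda-triggered variant, assuming a SageMaker Pipeline named cobol-failure-retraining (a placeholder) wraps the preprocess, train, and deploy steps:
import boto3

sm = boto3.client('sagemaker')

def lambda_handler(event, context):
    # Start the retraining pipeline; the pipeline runs preprocess -> train -> register -> deploy
    response = sm.start_pipeline_execution(
        PipelineName='cobol-failure-retraining',        # placeholder pipeline name
        PipelineExecutionDisplayName='weekly-retrain'
    )
    return {'pipelineExecutionArn': response['PipelineExecutionArn']}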
Conclusion
Machine learning isn’t just for flashy new systems—it can massively improve how legacy pipelines operate. By training and hosting a binary classifier in SageMaker, we’ve added a predictive safety net to our COBOL workflows. With every job that fails or succeeds, the model gets smarter, reducing wasted compute and catching bad inputs early.
This is the kind of hybrid future that actually works: COBOL + Kubernetes + JSON + SageMaker, working in concert. And it all starts with clean training data and good feature engineering.