Predicting Legacy Failures: Training and Hosting ML Models in SageMaker

Introduction

Legacy systems are infamous for failing silently—or catastrophically—with no early warning signs. In our eks_cobol pipeline, COBOL batch jobs handle sensitive data transformations. When something goes wrong, we don’t just want to know after it fails—we want to know before it runs. Enter machine learning.

This article covers how we use Amazon SageMaker to train a model that predicts COBOL job failures based on input metadata and content characteristics. You’ll see how we take the structured error data from Article 4, create features, train a model using XGBoost, host it with a live endpoint, and wire it into our processing pipeline for real-time inference.

The Prediction Problem

The goal is to predict, before a COBOL job runs, whether it will fail, using only data available at ingest time. Features include:

  • Filename (which may encode customer, date, region, etc.)
  • File size (bytes)
  • Record count
  • Presence of null fields or format anomalies
  • Job type or business logic variant

We label previously failed jobs with isFailure = True and successful jobs with isFailure = False. The model learns correlations between input patterns and known failures.
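As a hypothetical illustration (the values below are made up), a flattened record carries roughly these fields before encoding; errorType is only populated for failed jobs:

inputFile,rawRecord,errorType,isFailure
claims_eu_20250314.dat,"00412JOHNSON  ...",PICTURE-MISMATCH,True
claims_us_20250315.dat,"00413SMITH    ...",,False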

Building the Training Dataset

We merge two CSVs:

  • One from failed COBOL jobs (errors_flat.csv)
  • One from successful jobs (success_flat.csv)

A preprocessing script ensures both datasets are aligned, normalized, and balanced.

import pandas as pd

# Load the flattened failure and success records from Article 4
errors = pd.read_csv('errors_flat.csv')
success = pd.read_csv('success_flat.csv')

df = pd.concat([errors, success], ignore_index=True)

# Derive simple features from the raw record and input filename
df['fileSize'] = df['rawRecord'].apply(lambda x: len(str(x).encode('utf-8')))
df['fileExtension'] = df['inputFile'].apply(lambda x: x.split('.')[-1])

# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['errorType', 'fileExtension'])

# SageMaker's built-in XGBoost expects a numeric label in the first column
# and a CSV with no header row
df['isFailure'] = df['isFailure'].astype(int)
feature_cols = [col for col in df.columns if col.startswith('errorType_') or col.startswith('fileExtension_')]
df = df[['isFailure', 'fileSize'] + feature_cols]
df.to_csv('ml_input.csv', index=False, header=False)
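
The aligned dataset then needs to be balanced, split, and uploaded to S3 before training. A minimal sketch using scikit-learn for the split (the bucket and prefix values are examples; train_s3_path, test_s3_path, and X_test are reused in the steps below):

import sagemaker
from sklearn.model_selection import train_test_split

session = sagemaker.Session()
bucket = session.default_bucket()        # example: the account's default SageMaker bucket
prefix = 'eks-cobol/failure-predictor'   # example prefix

# Downsample the majority class so failures and successes are roughly balanced
failed = df[df['isFailure'] == 1]
ok = df[df['isFailure'] == 0].sample(n=len(failed), random_state=42)
balanced = pd.concat([failed, ok]).sample(frac=1, random_state=42)

# 80/20 train/validation split, written without headers for the XGBoost container
train, test = train_test_split(balanced, test_size=0.2, random_state=42)
train.to_csv('train.csv', index=False, header=False)
test.to_csv('validation.csv', index=False, header=False)

# Feature-only frame for smoke-testing the endpoint later (label is the first column)
X_test = test.iloc[:, 1:]

train_s3_path = session.upload_data('train.csv', bucket=bucket, key_prefix=f'{prefix}/data')
test_s3_path = session.upload_data('validation.csv', bucket=bucket, key_prefix=f'{prefix}/data')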

Training the Model in SageMaker

We use SageMaker’s built-in XGBoost container for binary classification, so no custom training script is needed; the training job can be launched from a SageMaker Studio notebook or any environment with the SageMaker Python SDK.

import sagemaker
from sagemaker.inputs import TrainingInput

# IAM role the training job will assume (or pass an explicit role ARN when
# running outside SageMaker); session, bucket, and prefix come from the
# preprocessing step above
role = sagemaker.get_execution_role()

# Built-in XGBoost container image for this region
container = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.3-1")

xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # a CPU instance such as ml.m5.xlarge also works for this tabular workload
    output_path=f's3://{bucket}/{prefix}/output',
    sagemaker_session=session
)

xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    max_depth=5,
    eta=0.2,
    subsample=0.8,
    colsample_bytree=0.8
)

# Train/validation channels point at the CSV splits uploaded earlier
xgb_estimator.fit({
    "train": TrainingInput(train_s3_path, content_type="csv"),
    "validation": TrainingInput(test_s3_path, content_type="csv")
})

This trains a binary classifier that predicts failure probability (0.0 to 1.0) given new job metadata.
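
Before deploying, it is worth a quick check of the final metrics the training job reported. A minimal sketch using boto3's DescribeTrainingJob call (assuming the job name can be read from the estimator; the reported metric names depend on the eval_metric XGBoost used for binary:logistic):

import boto3

sm = boto3.client('sagemaker')

# Pull the final train/validation metrics from the completed job
job_name = xgb_estimator.latest_training_job.name
desc = sm.describe_training_job(TrainingJobName=job_name)

for metric in desc.get('FinalMetricDataList', []):
    print(metric['MetricName'], metric['Value'])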

Hosting the Inference Endpoint

Once the model is trained and stored in S3, we deploy it to a real-time SageMaker endpoint:

from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Stand up a real-time HTTPS endpoint backed by the trained model
predictor = xgb_estimator.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")
predictor.serializer = CSVSerializer()        # send feature rows as CSV text
predictor.deserializer = JSONDeserializer()   # parse the returned score

# Smoke-test with one feature row from the held-out split
sample = X_test.head(1).to_csv(header=False, index=False).strip()
print("Sample row:", sample)
print("Prediction:", predictor.predict(sample))

Now we can send job metadata in real time and receive a prediction before running the COBOL job.

Integrating Inference into the Pipeline

Before a COBOL job runs, the ingestion service sends a prediction request to the SageMaker endpoint. If the prediction is above a threshold (say 0.8), we mark the job as "high risk" and route it to a validation or quarantine path.

import boto3

runtime = boto3.client('sagemaker-runtime')

def get_failure_score(fileSize, ext_onehot, error_type_onehot):
    # Column order must match the training CSV features:
    # fileSize, errorType_* dummies, then fileExtension_* dummies
    payload = f"{fileSize}," + ",".join(map(str, error_type_onehot + ext_onehot))
    response = runtime.invoke_endpoint(
        EndpointName='cobol-failure-predictor',
        ContentType='text/csv',
        Body=payload
    )
    # The XGBoost endpoint returns the failure probability as plain text
    score = float(response['Body'].read().decode())
    return score

This gives us predictive observability—no more surprises when a job fails after burning through hours of runtime.
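
Wiring the score into the ingestion service is then a small gate around job submission. A minimal sketch, assuming hypothetical submit_cobol_job and send_to_quarantine helpers and the 0.8 threshold mentioned above:

FAILURE_THRESHOLD = 0.8  # scores above this mark the job as high risk

def route_job(job_metadata):
    score = get_failure_score(
        job_metadata['fileSize'],
        job_metadata['fileExtensionOneHot'],
        job_metadata['errorTypeOneHot']
    )
    if score >= FAILURE_THRESHOLD:
        # Divert to the validation/quarantine path instead of running the job
        send_to_quarantine(job_metadata, score)
    else:
        submit_cobol_job(job_metadata)
    return score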

Model Monitoring and Retraining

We use SageMaker Model Monitor to detect drift in prediction distributions. As more jobs are processed, both successful and failed, we continuously push new records to the training bucket and retrain the model weekly via a scheduled SageMaker pipeline or Lambda-triggered training job.
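
Model Monitor needs the endpoint's request/response traffic captured to S3 before it can compare live traffic against a baseline. A minimal sketch of enabling data capture at deploy time (this variant of the deploy call replaces the plain one shown earlier; the capture path is an example):

from sagemaker.model_monitor import DataCaptureConfig

capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # capture every request while traffic volume is low
    destination_s3_uri=f"s3://{bucket}/{prefix}/data-capture"  # example location
)

predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    data_capture_config=capture_config
)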

The retraining process includes:

  • Collect new .json logs from S3
  • Run the same flatten + preprocess script
  • Update the dataset
  • Launch a training job with versioned output (sketched below)
  • Replace the endpoint via blue/green deployment
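
A minimal sketch of the versioned training-job launch, reusing the estimator configuration above (the date-based prefix is an assumption; the weekly trigger could be an EventBridge schedule invoking a Lambda that runs this code):

from datetime import datetime, timezone

# Version model artifacts and data by retraining date
version = datetime.now(timezone.utc).strftime("%Y-%m-%d")
xgb_estimator.output_path = f"s3://{bucket}/{prefix}/output/{version}"

# Retrain on the refreshed splits produced by the flatten + preprocess step
xgb_estimator.fit({
    "train": TrainingInput(f"s3://{bucket}/{prefix}/data/{version}/train.csv", content_type="csv"),
    "validation": TrainingInput(f"s3://{bucket}/{prefix}/data/{version}/validation.csv", content_type="csv")
})

The blue/green swap then points the production endpoint at the new model once it passes validation.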

Conclusion

Machine learning isn’t just for flashy new systems—it can massively improve how legacy pipelines operate. By training and hosting a binary classifier in SageMaker, we’ve added a predictive safety net to our COBOL workflows. With every job that fails or succeeds, the model gets smarter, reducing wasted compute and catching bad inputs early.

This is the kind of hybrid future that actually works: COBOL + Kubernetes + JSON + SageMaker, working in concert. And it all starts with clean training data and good feature engineering.