AWS DMS Fails After Azure SQL Managed Instance Failover? Here’s How to Fix It

Problem Statement A customer is using AWS Database Migration Service (DMS) to migrate data from Azure SQL Managed Instance to AWS RDS PostgreSQL over a VPN in Full Load + Change Data Capture (CDC) mode. The migration runs smoothly until Azure performs maintenance, triggering a failover. When this happens, DMS fails to locate the last tracked LSN (Log Sequence Number), causing a fatal error that stops the task. Currently, the only workaround is to manually restart the DMS task and perform a full load again, which is inefficient and time-consuming. Understanding the Issue When an Azure SQL Managed Instance failover occurs: The old primary is demoted, and a new primary takes over. The transaction logs on the new primary may not have the same LSN sequence as the old primary. AWS DMS attempts to resume CDC from the last tracked LSN, but if that LSN doesn’t exist on the new primary, DMS throws an error and stops the task. This issue arises because SQL Server Availability Groups do not synchronize backup history across replicas. Potential Solutions & Trade-offs Since this scenario is easy to replicate, testing these solutions in a non-production environment first is recommended. 1️⃣ Temporary Fix – Manual Action Required on Every Failover Find the latest LSN on the new primary: SELECT sys.fn_cdc_get_max_lsn() AS CurrentMaxLSN; Restart DMS using the new LSN via AWS CLI or console. Trade-offs: ✅ Simple and requires no architectural changes. ❌ Requires manual intervention every time a failover occurs. ❌ Increases downtime and operational overhead. 2️⃣ Permanent Fix – Adjust DMS Settings (May Not Work for Azure SQL MI) Set: "AlwaysOnSharedSynchedBackupIsEnabled": false This allows DMS to poll all nodes in the Always On cluster for transaction backups. Works for on-prem SQL Server, but may prevent reading from the old primary in Azure SQL MI. Trade-offs: ✅ Fully automated once configured. ✅ No need for manual intervention. ❌ May not be supported in Azure SQL MI, requiring additional testing. ❌ Potential data consistency risks if backups are not fully synchronized. 3️⃣ Permanent Fix – Use Transaction Log Backups Instead of Live Logs Azure SQL Managed Instance creates transaction log backups every ~10 minutes. These backups are consistent across failovers, avoiding LSN loss. DMS v3.5.3+ supports reading from log backups for Amazon RDS for SQL Server. Grant necessary permissions for DMS to read log backups: GRANT EXEC ON msdb.dbo.rds_dms_tlog_download TO rds_user; GRANT EXEC ON msdb.dbo.rds_dms_tlog_read TO rds_user; GRANT EXEC ON msdb.dbo.rds_dms_tlog_list_current_lsn TO rds_user; GRANT EXEC ON msdb.dbo.rds_task_status TO rds_user; More details: AWS DMS SQL Server Permissions.

Feb 13, 2025 - 18:16

AWS DMS Fails After Azure SQL Managed Instance Failover? Here’s How to Fix It

Problem Statement

A customer is using AWS Database Migration Service (DMS) to migrate data from Azure SQL Managed Instance to AWS RDS PostgreSQL over a VPN in Full Load + Change Data Capture (CDC) mode. The migration runs smoothly until Azure performs maintenance, triggering a failover. When this happens, DMS fails to locate the last tracked LSN (Log Sequence Number), causing a fatal error that stops the task.

Currently, the only workaround is to manually restart the DMS task and perform a full load again, which is inefficient and time-consuming.

Understanding the Issue

When an Azure SQL Managed Instance failover occurs:

The old primary is demoted, and a new primary takes over.
The transaction logs on the new primary may not have the same LSN sequence as the old primary.
AWS DMS attempts to resume CDC from the last tracked LSN, but if that LSN doesn’t exist on the new primary, DMS throws an error and stops the task.
This issue arises because SQL Server Availability Groups do not synchronize backup history across replicas.

Potential Solutions & Trade-offs

Since this scenario is easy to replicate, testing these solutions in a non-production environment first is recommended.

1️⃣ Temporary Fix – Manual Action Required on Every Failover

Find the latest LSN on the new primary:

  SELECT sys.fn_cdc_get_max_lsn() AS CurrentMaxLSN;

Restart DMS using the new LSN via AWS CLI or console.

Trade-offs:
✅ Simple and requires no architectural changes.

❌ Requires manual intervention every time a failover occurs.

❌ Increases downtime and operational overhead.

2️⃣ Permanent Fix – Adjust DMS Settings (May Not Work for Azure SQL MI)

Set:

  "AlwaysOnSharedSynchedBackupIsEnabled": false

This allows DMS to poll all nodes in the Always On cluster for transaction backups.
Works for on-prem SQL Server, but may prevent reading from the old primary in Azure SQL MI.

Trade-offs:
✅ Fully automated once configured.

✅ No need for manual intervention.

❌ May not be supported in Azure SQL MI, requiring additional testing.

❌ Potential data consistency risks if backups are not fully synchronized.

3️⃣ Permanent Fix – Use Transaction Log Backups Instead of Live Logs

Azure SQL Managed Instance creates transaction log backups every ~10 minutes.
These backups are consistent across failovers, avoiding LSN loss.
DMS v3.5.3+ supports reading from log backups for Amazon RDS for SQL Server.
Grant necessary permissions for DMS to read log backups:

  GRANT EXEC ON msdb.dbo.rds_dms_tlog_download TO rds_user;
  GRANT EXEC ON msdb.dbo.rds_dms_tlog_read TO rds_user;
  GRANT EXEC ON msdb.dbo.rds_dms_tlog_list_current_lsn TO rds_user;
  GRANT EXEC ON msdb.dbo.rds_task_status TO rds_user;

More details: AWS DMS SQL Server Permissions.