High Availability Multi-Region RDS Instance


Apr 5, 2025 - 04:45

Amazon Relational Database Service (Amazon RDS) is a managed relational database service. Amazon RDS automates undifferentiated database management tasks, such as provisioning, configuring, backing up, and patching.

AWS currently supports highly available, durable relational databases deployed across up to three Availability Zones (AZs). In a Multi-AZ deployment, a standby instance is created in another Availability Zone to enhance availability and resilience, so that if one zone fails, the system automatically switches to the standby.

Multi-AZ does not protect against the loss of an entire region. The solution outlined here keeps the RDS instance available during regional unavailability. In this setup, resources are provisioned using CloudFormation templates.

Primary region architecture:

[Image: primary region architecture diagram]

The CloudFormation stack outlined above provisions the following resources: two Lambda functions that create RDS-related resources, an SSM parameter containing the RDS information, and an alarm parameter. The Lambda code resides in an S3 bucket. The template expects the RDS parameters and alarm parameters as input.

The Create_resource_primary_region Lambda function is defined as part of the CloudFormation template and is triggered directly by the stack itself. During stack creation, this Lambda function runs and creates the RDS instance and its alarms based on the provided input. It also writes the parameters to the SSM parameter that is part of the stack.
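As a rough illustration of the last step, the following sketch shows one way a Lambda function could write the RDS information to an SSM parameter. It is not the article's actual code: the function name `store_rds_info`, the parameter name, and the JSON layout are assumptions, and `ssm_client` is assumed to be a boto3 SSM client (for example `boto3.client("ssm")`).

```python
import json


def store_rds_info(ssm_client, parameter_name, rds_info):
    """Write the RDS details to an SSM parameter as a JSON string.

    ssm_client is assumed to be a boto3 SSM client; parameter_name and the
    shape of rds_info are hypothetical and chosen only for illustration.
    """
    response = ssm_client.put_parameter(
        Name=parameter_name,
        Value=json.dumps(rds_info, sort_keys=True),
        Type="String",
        Overwrite=True,  # the parameter already exists as part of the stack
    )
    return response["Version"]
```

Storing the details as a single JSON value keeps everything the failover Lambda needs in one lookup, at the cost of having to parse the value on every read.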

The Create_stack_secondary_region Lambda function retrieves the CloudFormation template from S3 and, based on the specified secondary region parameter, creates a stack in that region. This completes the deployment in the primary region, which is the halfway point of the overall solution.
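A minimal sketch of that cross-region stack creation might look like the following. The function name, stack name, and parameter keys are hypothetical; `cfn_client` is assumed to be a boto3 CloudFormation client created with `region_name` set to the secondary region, and the `Capabilities` value assumes the secondary template creates named IAM resources for its Lambda function.

```python
def create_secondary_stack(cfn_client, stack_name, template_url, parameters):
    """Create the secondary-region stack from a template stored in S3.

    cfn_client is assumed to be a boto3 CloudFormation client bound to the
    secondary region; parameters is a plain dict of template inputs.
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=[
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
        # Assumption: the template defines named IAM resources.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]
```

Because the client is bound to a region when it is created, the same function can target any secondary region simply by passing in a differently configured client.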

Secondary region architecture:

[Image: secondary region architecture diagram]

This stack provisions the following resources: two EventBridge rules, an SSM parameter, a Lambda function, and an SNS topic.

During stack creation, the Create_resource_failover_fallback_action Lambda function is triggered. This function creates a read replica in the secondary region from the primary region's RDS instance. It also updates the RDS instance parameters and alarm parameters in the SSM Parameter Store. This completes the deployment.
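Creating a read replica across regions could be sketched as follows. This assumes `rds_client` is a boto3 RDS client created in the secondary region; the helper names, account ID, and identifiers are placeholders, not values from the article. For a cross-region replica, the source instance is referenced by its ARN rather than by its identifier.

```python
def build_db_instance_arn(region, account_id, db_identifier):
    """Build the ARN of an RDS DB instance (standard 'aws' partition)."""
    return f"arn:aws:rds:{region}:{account_id}:db:{db_identifier}"


def create_cross_region_replica(rds_client, replica_id, source_arn, source_region):
    """Create a read replica in the client's region from a source in another region.

    rds_client is assumed to be a boto3 RDS client bound to the secondary region.
    """
    response = rds_client.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_arn,  # ARN, since the source is in another region
        SourceRegion=source_region,
    )
    return response["DBInstance"]["DBInstanceStatus"]
```

A call could then look like `create_cross_region_replica(rds, "mydb-replica", build_db_instance_arn("us-east-1", "123456789012", "mydb"), "us-east-1")`, with all of those values being illustrative.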

During a primary region outage, the RDS_failover_event EventBridge rule triggers the Create_resource_failover_fallback_action Lambda function, which performs the following actions:

  1. Notifies users of the outage through the SNS topic.
  2. Enables the failover_status_check rule, which is disabled initially. Once enabled, it invokes the Lambda function every 5 minutes.
  3. Promotes the RDS read replica. (This takes some time, during which the database is unavailable.)
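The three failover actions above can be sketched roughly as follows. This is an illustration under assumptions, not the article's Lambda code: the function name and message text are made up, and the three clients are assumed to be boto3 SNS, EventBridge (`events`), and RDS clients in the secondary region.

```python
def start_failover(sns_client, events_client, rds_client, topic_arn, replica_id,
                   status_rule="failover_status_check"):
    """Notify users, enable the polling rule, and start promoting the replica."""
    # 1. Tell subscribers that a failover is starting.
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="RDS failover started",
        Message=f"Primary region outage detected; promoting {replica_id}.",
    )
    # 2. Turn on the scheduled rule that polls the promotion status.
    events_client.enable_rule(Name=status_rule)
    # 3. Begin promoting the read replica to a standalone instance.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
```

The promotion call only starts the process, which is why a separate scheduled rule is needed to find out when it has finished.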

The failover_status_check rule triggers the Lambda function, which verifies the promotion status. If the instance has been promoted successfully, the Lambda function creates an alarm. Once all components are set up, the failover process is complete: the failover_status_check rule is disabled again, and users are notified through the SNS topic.
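One possible shape for this status check is sketched below. The check itself is an assumption: it treats the instance as promoted once it is `available` and no longer reports a replication source. The function names are hypothetical, `create_alarm` is a caller-supplied callback standing in for the alarm creation, and the clients are assumed to be boto3 clients.

```python
def is_promoted(rds_client, instance_id):
    """Return True once the instance is available and no longer a read replica."""
    instance = rds_client.describe_db_instances(
        DBInstanceIdentifier=instance_id
    )["DBInstances"][0]
    return (
        instance["DBInstanceStatus"] == "available"
        and not instance.get("ReadReplicaSourceDBInstanceIdentifier")
    )


def check_failover(rds_client, events_client, sns_client, instance_id, topic_arn,
                   create_alarm, status_rule="failover_status_check"):
    """Finish the failover if promotion is done; otherwise wait for the next poll."""
    if not is_promoted(rds_client, instance_id):
        return False  # the rule will invoke the function again in 5 minutes
    create_alarm(instance_id)
    events_client.disable_rule(Name=status_rule)
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="RDS failover completed",
        Message=f"{instance_id} has been promoted in the secondary region.",
    )
    return True
```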

If the primary region recovers during the promotion phase, the RDS_failover_event rule notifies the Lambda function, which deletes the instance being promoted and creates a new read replica in the secondary region from the primary region instance. Users are notified through the topic in this scenario as well.

Once the primary region comes back, the RDS_failover_event rule notifies the Lambda function, which follows a similar sequence for fallback:

  1. Notifies users through the SNS topic.
  2. Enables the failover_status_check rule, which is disabled initially. Once enabled, it invokes the Lambda function every 5 minutes.
  3. Creates a read replica in the primary region from the secondary region instance. The old primary region instance and its alarm are deleted, and the execution exits.
  4. The failover_status_check rule triggers the Lambda function every 5 minutes to check whether the replica has finished creating.
  5. Once the replica is created, the Lambda function promotes the read replica in the primary region. (This takes some time, during which the database is unavailable.) The failover_status_check rule keeps invoking the Lambda function, which checks the promotion status. Once the instance is promoted, the Lambda function creates the alarm and a new read replica in the secondary region. When everything has been created, the old secondary region instance and its alarm are deleted and the fallback is complete. The Lambda function then disables the failover_status_check rule and notifies users through the topic.
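Because the same Lambda function is invoked every 5 minutes throughout the fallback, it has to work out which step it is at on each run. One way to think about this is as a small decision function over the state of the primary region instance. The sketch below is an assumption about how that might be structured, not the article's logic; `instance` is assumed to be one entry from the `DBInstances` list returned by boto3's `describe_db_instances`, and the step names are made up.

```python
def next_fallback_step(instance):
    """Decide what the polling Lambda should do on this invocation.

    Returns one of three hypothetical step names:
      "wait"     - the instance is still being created or promoted
      "promote"  - the replica is ready and can be promoted
      "finalize" - promotion is done; create the alarm and the new replica
    """
    if instance["DBInstanceStatus"] != "available":
        return "wait"
    if instance.get("ReadReplicaSourceDBInstanceIdentifier"):
        return "promote"
    return "finalize"
```

A real handler would also need to guard against issuing the promotion twice, for example if the instance still reports `available` with a replication source for a short time after the promotion call.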

Note: Fallback can also be performed during maintenance windows. If budget is a concern, RDS automated backups can be used instead of read replicas. This reduces cost, but backup replication is limited to the supported regions.

Following is the RDS_failover_event event rule pattern:
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "service": ["RDS"],
    "eventTypeCategory": ["issue"],
    "eventTypeCode": ["AWS_RDS_API_ISSUE", "AWS_RDS_CONNECTIVITY_ISSUE", "AWS_RDS_OPERATIONAL_ISSUE"],
    "statusCode": ["open", "closed"]
  },
  "resources": [!Ref pDBIdentifier],
  "region": [!Ref pPrimaryRegion]
}
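Since the pattern matches both `open` and `closed` status codes, the Lambda function has to tell an outage apart from a recovery. The sketch below shows one way it might branch on the incoming event. The mapping of `open` to failover and `closed` to fallback is an assumption inferred from the description above, and the function name and sample event are illustrative only.

```python
def choose_action(event):
    """Map an AWS Health event to the action the Lambda should take.

    Assumption: an "open" issue means the primary region is impaired
    (failover), and a "closed" issue means it has recovered (fallback).
    """
    status = event["detail"]["statusCode"]
    if status == "open":
        return "failover"
    if status == "closed":
        return "fallback"
    raise ValueError(f"Unexpected statusCode: {status!r}")


# Trimmed, hypothetical event carrying only the fields used by the rule pattern.
sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "region": "us-east-1",
    "resources": ["mydb"],
    "detail": {
        "service": "RDS",
        "eventTypeCategory": "issue",
        "eventTypeCode": "AWS_RDS_OPERATIONAL_ISSUE",
        "statusCode": "open",
    },
}
print(choose_action(sample_event))  # failover
```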