QA Design Gurus: Disaster Recovery/High availability testing

Mar 13, 2015

Disaster Recovery/High availability testing

High availability means that a system is designed to avoid loss of service by reducing or minimizing failures, as well as minimizing planned downtime.

Ref: goo.gl/ocaLI0

For example, we expect electrical service to be highly available, since we have geared our lives to depend on electricity for refrigeration, heating and lighting, in addition to less important daily needs.
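To put a number on "highly available": availability is commonly expressed as the fraction of time a service is up, and a target such as 99.9% translates into a downtime budget. The sketch below is only an illustration. The function names and the 99.9% target are assumptions, not figures from this post.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years for simplicity


def availability(uptime_hours, downtime_hours):
    """Fraction of the total observed time during which the service was up."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


def downtime_budget_hours(target, period_hours=HOURS_PER_YEAR):
    """Hours of downtime (planned plus unplanned) allowed by an availability target."""
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return (1 - target) * period_hours


if __name__ == "__main__":
    # Hypothetical target of 99.9% ("three nines") over one year.
    print(f"{downtime_budget_hours(0.999):.2f} hours of downtime per year")
```

With a 99.9% target this works out to roughly 8.76 hours of total downtime per year, which is why both failures and planned maintenance have to be kept small.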

Let’s try to understand this concept through an example from cloud-based software products. Assume we have a cloud platform that uses the following Amazon Web Services (AWS) components to achieve high availability:

Elastic Load Balancer (ELB)
Auto Scaling Groups (ASG)
Amazon Relational Database Service (RDS)
Amazon Simple Email Service (SES)

High availability testing in Amazon Cloud:

Assume the cloud platform described above has a Cloud Services layer that caters for the common services required across the platform. These could include user management, notifications, a check-out process for purchasing items, and so on. We will run the following checks to make sure the Cloud Services layer stays highly available when failures occur, so that customers see no difference when back-end systems fail.

1. Stop one of the HA instances (we maintain two HA instances for a few services). Users should be able to access the product and its services without any difference.
2. Amazon should not spin up a new HA instance when one of the HA instances is stopped (as per our product's EC2 deployment).
3. Terminate one of the HA instances (we maintain two HA instances for a few services). Amazon should spin up a new instance with the same set of services, ready to use. All services should function properly on the new instance, and the ELB should balance the load between the new instance and the old one.
4. Run continuous load tests against the HA services. The ELB should balance the load between the two HA instances. If both instances reach 100% utilization, a new HA instance should be spun up (as per our product deployment and configuration on Amazon EC2). Amazon EC2 resource usage can be monitored through Amazon CloudWatch.
5. Verify the alarms in your preferred logging system, such as Sumo Logic, in all of the above cases, since we configured alarms in Amazon CloudWatch to fire.
6. Verify that RDS is auto scaled when RDS is full, and that CloudWatch sends an alarm about this, as per our implementation.
7. Verify that our services are designed and developed to support high availability.
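The terminate-and-recover check and the alarm check above could be automated roughly as follows. This is a minimal sketch, not the harness used for this product: it assumes the HA instances belong to an Auto Scaling group, and the helper names (`in_service_ids`, `wait_for_replacement`, `alarms_in_state`) are hypothetical. The client objects are passed in, so the functions can take boto3 clients (`boto3.client("autoscaling")`, `boto3.client("cloudwatch")`) or simple test doubles.

```python
import time


def in_service_ids(autoscaling, group_name):
    """Return IDs of instances the Auto Scaling group reports as healthy and InService."""
    response = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = response["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"Auto Scaling group not found: {group_name}")
    return {
        instance["InstanceId"]
        for instance in groups[0]["Instances"]
        if instance["LifecycleState"] == "InService"
        and instance["HealthStatus"] == "Healthy"
    }


def wait_for_replacement(autoscaling, group_name, terminated_id, expected_count,
                         timeout_s=600, poll_s=15, sleep=time.sleep):
    """Poll until the group is back at expected_count without the terminated instance.

    Returns the set of in-service instance IDs, or raises TimeoutError.
    """
    waited = 0
    while True:
        ids = in_service_ids(autoscaling, group_name)
        if terminated_id not in ids and len(ids) >= expected_count:
            return ids
        if waited >= timeout_s:
            raise TimeoutError(
                f"{group_name} did not recover within {timeout_s}s; in service: {sorted(ids)}"
            )
        sleep(poll_s)
        waited += poll_s


def alarms_in_state(cloudwatch, alarm_names, state="ALARM"):
    """Return the names of the given CloudWatch alarms currently in `state`.

    Pagination is ignored here, which is fine for a handful of alarm names.
    """
    response = cloudwatch.describe_alarms(AlarmNames=list(alarm_names))
    return [
        alarm["AlarmName"]
        for alarm in response["MetricAlarms"]
        if alarm["StateValue"] == state
    ]
```

In a real run, the test would first terminate one instance (for example with the EC2 client's `terminate_instances(InstanceIds=[...])`), then call `wait_for_replacement` with the group name and an expected count of two, and finally check `alarms_in_state` for the alarms configured for that group. The group name, alarm names and timeouts all depend on the deployment.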

Examples:

An e-commerce transaction should fail if the service (HA instance) goes down in the middle of the transaction.
All services should respond with a proper error message when the services (HA instances) are down.
Services should have a failover mechanism to keep them highly available (requests should go to the new instance after recovery).
Service failures should be handled gracefully during the recovery window.
No service will be available while RDS is down, and proper errors should be returned.
All services should start accepting requests again once the RDS service is back up.
Services should function properly when RDS is full and has been auto scaled.
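The behaviours listed above can be illustrated with a small sketch of graceful error handling and failover. Everything here is hypothetical and only stands in for the real services: `DatabaseUnavailable` models RDS being down, `query_db` stands for the data layer, and the callables passed to `call_with_failover` stand for HTTP calls to each HA instance. In the actual deployment the ELB would normally do the routing.

```python
import json


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) data layer when the database cannot be reached."""


def handle_request(query_db, payload):
    """Serve one request, turning a database outage into a clear 503 error."""
    try:
        result = query_db(payload)
    except DatabaseUnavailable:
        error = {
            "error": "service_unavailable",
            "message": "The database is temporarily unavailable. Please retry later.",
        }
        return 503, json.dumps(error)
    return 200, json.dumps({"result": result})


def call_with_failover(instances, payload):
    """Send the request to each HA instance in turn until one answers.

    `instances` is a list of callables standing in for calls to each
    instance; a ConnectionError models an instance that is down.
    """
    for send in instances:
        try:
            return send(payload)
        except ConnectionError:
            continue  # this instance is down, try the next one
    error = {"error": "all_instances_down", "message": "No HA instance is reachable."}
    return 503, json.dumps(error)
```

A test for the RDS cases would call `handle_request` while the database is down and assert on the 503 status and error body, then bring the database back and assert that the same request succeeds.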

2 comments:

Phani, March 18, 2015 at 8:30 PM
Thank you for the write-up. Points 2 and 3 contradict each other: in point 2 you say Amazon should not spin up an HA instance, and in point 3 you say it should. Which is correct? Also, just so you are aware, what you list under Examples explains how the system should behave.
Unknown, March 20, 2015 at 11:16 AM
Thanks, Phani, for reading this blog. Regarding your comment, #2 and #3 do not contradict each other: #2 is about stopping the instance and #3 is about terminating it. It all depends on the Amazon settings and the Auto Scaling group. If we configure Amazon to spin up a new instance both when an instance is stopped and when it is terminated, then it should spin up a new instance in both cases.

In general, an Amazon Auto Scaling group acts on instance termination: it spins up a new instance when an instance is terminated, not when one is stopped.
