Incident: #22 at Amazon Web Services on 2011/04/21

Full report: https://aws.amazon.com/message/65648/

How it happened: A network change intended to increase capacity was performed as part of normal maintenance. During the change, EBS traffic was inadvertently shifted onto a router with insufficient capacity (the secondary network, intended only for node-to-node traffic), isolating many nodes in the EBS cluster as both networks became unavailable simultaneously. When traffic was shifted back onto the primary network, the extra traffic on the secondary network (from many primary nodes looking for new replica space) exceeded its capacity, leaving nodes "stuck" in a "re-mirroring storm". The control plane's requests to create new volumes, which have long timeout values, backed up and led to thread starvation on the control plane servers.

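The last step, long-timeout requests exhausting a worker pool, is a general failure mode. The sketch below is illustrative only: the fixed-size "control plane" thread pool, the create_volume function, and the numbers are hypothetical and not taken from the AWS report.

```python
import concurrent.futures
import time

# Illustrative sketch of thread starvation, not the AWS implementation.
# POOL_SIZE and LONG_TIMEOUT_S are hypothetical numbers.
POOL_SIZE = 4
LONG_TIMEOUT_S = 5.0

def create_volume(cluster_is_stuck: bool) -> str:
    """Simulates a control-plane call to an EBS cluster."""
    if cluster_is_stuck:
        time.sleep(LONG_TIMEOUT_S)  # the thread is held until the long timeout expires
        return "timed out"
    return "created"

with concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    # Requests to the degraded cluster occupy every worker thread...
    stuck = [pool.submit(create_volume, True) for _ in range(POOL_SIZE)]
    # ...so even a request to a healthy cluster cannot start until one of
    # the stuck calls times out: the pool is starved.
    healthy = pool.submit(create_volume, False)
    print(healthy.result())  # only returns after roughly LONG_TIMEOUT_S seconds
```
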
Architecture: A distributed three-layer system: (1) virtual machines (EC2), (2) a distributed, replicated block data store backing the virtual machine instances (EBS), and (3) a control plane that coordinates user requests to EBS clusters. The nodes in an EBS cluster are connected by two networks, the second of which is intended only for node-to-node traffic.

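As a reading aid only, the toy model below restates those three layers and the dual networks as data types; every name is hypothetical and does not correspond to real AWS APIs or internals.

```python
from dataclasses import dataclass, field

# Toy restatement of the three layers described above; all names are
# hypothetical and do not reflect AWS's actual design.

@dataclass
class EBSNode:
    node_id: str
    primary_network: str = "primary"      # carries client-facing EBS traffic
    secondary_network: str = "secondary"  # node-to-node replication only

@dataclass
class EBSCluster:
    cluster_id: str
    nodes: list[EBSNode] = field(default_factory=list)

@dataclass
class ControlPlane:
    # Coordinates user requests (e.g. create/attach volume) across clusters.
    clusters: dict[str, EBSCluster] = field(default_factory=dict)

@dataclass
class EC2Instance:
    # A virtual machine whose block device is backed by a replicated EBS volume.
    instance_id: str
    volume_cluster_id: str
```
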
Technologies: Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon Relational Database Service (RDS), Amazon Virtual Private Cloud (VPC), Regions

Root cause: During routine maintenance, traffic from the primary network was redirected to the secondary network, leaving both unavailable; in addition, a race condition in the code on the EBS nodes caused them to fail when a large number of replication requests were closed concurrently.

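To make the second factor concrete, the sketch below shows one generic way such a race can surface when many requests are closed at once: an unsynchronized read-modify-write on shared state. It is purely illustrative; the class, counter, and numbers are hypothetical and not taken from the EBS code.

```python
import threading
import time

class ReplicationTracker:
    """Hypothetical per-node bookkeeping for open replication requests."""

    def __init__(self, in_flight: int):
        self.in_flight = in_flight  # count of open replication requests

    def close_request(self):
        # Unsynchronized read-modify-write: two threads can read the same
        # value and both store value - 1, so one close is silently lost.
        current = self.in_flight
        time.sleep(0)               # yield the GIL to widen the race window
        self.in_flight = current - 1

CLOSES_PER_THREAD = 2_000
THREADS = 8
tracker = ReplicationTracker(in_flight=THREADS * CLOSES_PER_THREAD)

def worker():
    for _ in range(CLOSES_PER_THREAD):
        tracker.close_request()

threads = [threading.Thread(target=worker) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every request was closed, yet lost updates usually leave the counter above
# zero, so the node wrongly believes replication work is still outstanding.
# A lock (or an atomic counter) around close_request removes the race.
print("in_flight after closing everything:", tracker.in_flight)
```
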
Failure: EBS volumes became unable to service read and write operations; virtual machines reading from these volumes became "stuck" waiting for disk operations; and the control plane experienced thread starvation due to "hung" requests to EBS.

Impact: Customers experienced elevated error rates on virtual machine instances with affected block storage attached, and attempts to create new instances failed.

Mitigation: Severed all communication between the degraded EBS cluster and the control plane, allowing instances to recover over several days; gradually reconnected the cluster to the network, and manually restored volumes that did not recover on their own, using snapshots taken at the beginning of the event.

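The first mitigation step, cutting a degraded dependency off so requests fail fast instead of parking threads on long timeouts, can be pictured with the minimal kill-switch sketch below; the client class, method names, and error type are hypothetical and not drawn from AWS tooling.

```python
class ClusterDisabledError(RuntimeError):
    """Raised immediately when the target cluster has been cut off."""

class ControlPlaneClient:
    """Hypothetical control-plane client with an operator-driven kill switch."""

    def __init__(self):
        self.disabled_clusters: set[str] = set()

    def disable(self, cluster_id: str) -> None:
        # Operator action: sever communication with a degraded cluster.
        self.disabled_clusters.add(cluster_id)

    def enable(self, cluster_id: str) -> None:
        # Gradual reconnection once the cluster has recovered.
        self.disabled_clusters.discard(cluster_id)

    def create_volume(self, cluster_id: str) -> str:
        if cluster_id in self.disabled_clusters:
            # Fail fast: no thread is parked waiting on a long timeout.
            raise ClusterDisabledError(f"{cluster_id} is isolated")
        return f"volume created on {cluster_id}"

client = ControlPlaneClient()
client.disable("cluster-a")            # sever the degraded cluster
try:
    client.create_volume("cluster-a")
except ClusterDisabledError as err:
    print("rejected immediately:", err)
client.enable("cluster-a")             # reconnect gradually after recovery
print(client.create_volume("cluster-a"))
```
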