Incident: #22 at Amazon Web Services on 2011/04/21

Full report: https://aws.amazon.com/message/65648/

How it happened: A network change intended to increase capacity was performed as part of normal maintenance. During the change, EBS traffic was inadvertently shifted onto a router with insufficient capacity (the secondary network, intended only for node-to-node traffic), isolating many nodes in the EBS cluster as both networks became unavailable simultaneously. When traffic was shifted back onto the primary network, the extra traffic on the secondary network (from many primary nodes looking for new replica space) exceeded its capacity, leaving nodes "stuck" in a "re-mirroring storm". The control plane's requests to create new volumes, which have long timeout values, backed up and led to thread starvation on the control plane servers.

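The last step, long-timeout requests exhausting a worker pool, is a general failure mode. The sketch below is illustrative only: the fixed-size "control plane" thread pool, the create_volume function, and the numbers are hypothetical and not taken from the AWS report.

```python
import concurrent.futures
import time

# Illustrative sketch of thread starvation, not the AWS implementation.
# POOL_SIZE and LONG_TIMEOUT_S are hypothetical numbers.
POOL_SIZE = 4
LONG_TIMEOUT_S = 5.0

def create_volume(cluster_is_stuck: bool) -> str:
    """Simulates a control-plane call to an EBS cluster."""
    if cluster_is_stuck:
        time.sleep(LONG_TIMEOUT_S)  # the thread is held until the long timeout expires
        return "timed out"
    return "created"

with concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    # Requests to the degraded cluster occupy every worker thread...
    stuck = [pool.submit(create_volume, True) for _ in range(POOL_SIZE)]
    # ...so even a request to a healthy cluster cannot start until one of
    # the stuck calls times out: the pool is starved.
    healthy = pool.submit(create_volume, False)
    print(healthy.result())  # only returns after roughly LONG_TIMEOUT_S seconds
```
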
Architecture: A distributed three-layer system: (1) virtual machines (EC2), (2) a distributed, replicated block data store backing the virtual machine instances (EBS), and (3) a control plane that coordinates user requests to EBS clusters. The nodes in an EBS cluster are connected by two networks, the second of which is intended only for node-to-node traffic.

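As a reading aid only, the toy model below restates those three layers and the dual networks as data types; every name is hypothetical and does not correspond to real AWS APIs or internals.

```python
from dataclasses import dataclass, field

# Toy restatement of the three layers described above; all names are
# hypothetical and do not reflect AWS's actual design.

@dataclass
class EBSNode:
    node_id: str
    primary_network: str = "primary"      # carries client-facing EBS traffic
    secondary_network: str = "secondary"  # node-to-node replication only

@dataclass
class EBSCluster:
    cluster_id: str
    nodes: list[EBSNode] = field(default_factory=list)

@dataclass
class ControlPlane:
    # Coordinates user requests (e.g. create/attach volume) across clusters.
    clusters: dict[str, EBSCluster] = field(default_factory=dict)

@dataclass
class EC2Instance:
    # A virtual machine whose block device is backed by a replicated EBS volume.
    instance_id: str
    volume_cluster_id: str
```
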
Technologies: Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), Amazon Relational Database Service (RDS), Amazon Virtual Private Cloud (VPC), Regions

Root cause: During routine maintenance, traffic from the primary network was redirected to the secondary network, leaving both unavailable; in addition, a race condition in the code on the EBS nodes caused them to fail when a large number of replication requests were closed concurrently.

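To make the second factor concrete, the sketch below shows one generic way such a race can surface when many requests are closed at once: an unsynchronized read-modify-write on shared state. It is purely illustrative; the class, counter, and numbers are hypothetical and not taken from the EBS code.

```python
import threading
import time

class ReplicationTracker:
    """Hypothetical per-node bookkeeping for open replication requests."""

    def __init__(self, in_flight: int):
        self.in_flight = in_flight  # count of open replication requests

    def close_request(self):
        # Unsynchronized read-modify-write: two threads can read the same
        # value and both store value - 1, so one close is silently lost.
        current = self.in_flight
        time.sleep(0)               # yield the GIL to widen the race window
        self.in_flight = current - 1

CLOSES_PER_THREAD = 2_000
THREADS = 8
tracker = ReplicationTracker(in_flight=THREADS * CLOSES_PER_THREAD)

def worker():
    for _ in range(CLOSES_PER_THREAD):
        tracker.close_request()

threads = [threading.Thread(target=worker) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every request was closed, yet lost updates usually leave the counter above
# zero, so the node wrongly believes replication work is still outstanding.
# A lock (or an atomic counter) around close_request removes the race.
print("in_flight after closing everything:", tracker.in_flight)
```
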
Failure: EBS volumes became unable to service read and write operations; virtual machines reading from these volumes became "stuck" waiting for disk operations; and the control plane experienced thread starvation due to "hung" requests to EBS.

Impact: Customers experienced elevated error rates on virtual machine instances with affected block storage attached, and attempts to create new instances failed.

Mitigation: Severed all communication between the degraded EBS cluster and the control plane, allowing instances to recover over several days; gradually reconnected the cluster to the network, and manually restored volumes that did not recover on their own, using snapshots taken at the beginning of the event.

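The first mitigation step, cutting a degraded dependency off so requests fail fast instead of parking threads on long timeouts, can be pictured with the minimal kill-switch sketch below; the client class, method names, and error type are hypothetical and not drawn from AWS tooling.

```python
class ClusterDisabledError(RuntimeError):
    """Raised immediately when the target cluster has been cut off."""

class ControlPlaneClient:
    """Hypothetical control-plane client with an operator-driven kill switch."""

    def __init__(self):
        self.disabled_clusters: set[str] = set()

    def disable(self, cluster_id: str) -> None:
        # Operator action: sever communication with a degraded cluster.
        self.disabled_clusters.add(cluster_id)

    def enable(self, cluster_id: str) -> None:
        # Gradual reconnection once the cluster has recovered.
        self.disabled_clusters.discard(cluster_id)

    def create_volume(self, cluster_id: str) -> str:
        if cluster_id in self.disabled_clusters:
            # Fail fast: no thread is parked waiting on a long timeout.
            raise ClusterDisabledError(f"{cluster_id} is isolated")
        return f"volume created on {cluster_id}"

client = ControlPlaneClient()
client.disable("cluster-a")            # sever the degraded cluster
try:
    client.create_volume("cluster-a")
except ClusterDisabledError as err:
    print("rejected immediately:", err)
client.enable("cluster-a")             # reconnect gradually after recovery
print(client.create_volume("cluster-a"))
```
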