Incident
|
#33 at Amazon Web Services on 2017/02/28
|
Full report
|
https://aws.amazon.com/message/41926/
|
How it happened
|
While investigating slowness in the billing system, an engineer ran a command meant to remove a small number of servers from a subsystem, but entered it incorrectly. More servers were removed than intended, causing outages in two subsystems and in many dependent services.
|
Architecture
|
Multiple regional datacenters, each with multiple subsystems, each of which runs on multiple servers.
|
Technologies
|
Amazon Simple Storage Service (S3), Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), AWS Lambda
|
Root cause
|
A command executed by an engineer with incorrect parameters.
|
Failure
|
A large number of servers were removed from two subsystems, more than the subsystems could tolerate losing.
|
Impact
|
Complete outage of the S3 service API in one region (US-EAST-1).
|
Mitigation
|
A full restart of the two subsystems restored functionality within 5 hours.
|
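The root cause and failure above suggest a simple class of safeguard: check a removal request against a capacity floor before acting, and apply it in small batches. The following is a minimal sketch of that idea, not AWS tooling. `Subsystem`, `plan_removal`, `min_required`, and `max_batch` are hypothetical names chosen for illustration, and the numbers are made up.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal would leave a subsystem under its capacity floor."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor; below this the subsystem cannot serve traffic


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> List[int]:
    """Validate a removal request and split it into small batches.

    Returns the batch sizes to remove in order. Raises CapacityError if the
    request would take the subsystem below its minimum required capacity.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        # Reject the whole request up front instead of removing part of it.
        raise CapacityError(
            f"removing {requested} servers from {subsystem.name} leaves "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Split the request so capacity is removed gradually, batch by batch.
    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(max_batch, left)
        batches.append(step)
        left -= step
    return batches
```

With a hypothetical subsystem of 100 servers and a floor of 80, `plan_removal(Subsystem("index", 100, 80), 12)` returns `[5, 5, 2]`, while a mistyped request for 50 servers raises `CapacityError` before anything is removed.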