Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Incident #33 at Amazon Web Services on 2017/02/28
Full report https://aws.amazon.com/message/41926/
How it happened While debugging slowness in the S3 billing system, an engineer ran a command meant to remove a small number of servers from one subsystem, but entered one of its inputs incorrectly. A larger set of servers than intended was removed, causing outages in two subsystems and in many dependent services.
Architecture Multiple regional datacenters, each running multiple subsystems, each of which spans multiple servers.
Technologies Amazon Simple Storage Service (S3), Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), AWS Lambda
Root cause An engineer executed a command with incorrect parameters.
Failure A large number of servers were removed from two subsystems, more than those subsystems could tolerate losing.
Impact Complete outage of the S3 service API in one region (US-EAST-1).
Mitigation A full restart of the two subsystems restored functionality within 5 hours.
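
Illustration The root cause and failure above come down to a removal command that accepted any input, including one that took a subsystem below the capacity it needed. The sketch below shows one way such a command could be guarded: reject a request that would breach a minimum capacity floor, and split an accepted request into small batches. This is not AWS's actual tooling. The names Subsystem, plan_removal, min_required, and max_batch, and all the numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal would leave a subsystem under its minimum capacity."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical floor below which the subsystem cannot serve requests


def plan_removal(subsystem: Subsystem, requested: int, max_batch: int = 5) -> List[int]:
    """Validate a removal request and split it into small batches.

    Returns the batch sizes to remove, in order. Raises CapacityError if the
    request would drop the subsystem below its minimum required capacity.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        # Refuse the whole request instead of removing part of it.
        raise CapacityError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Remove capacity slowly: never more than max_batch servers per step.
    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(max_batch, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    index = Subsystem(name="index", active_servers=100, min_required=80)
    print(plan_removal(index, 12))  # a small, valid request
    try:
        plan_removal(index, 30)  # a mistyped, oversized request
    except CapacityError as err:
        print("rejected:", err)
```

With a check like this, a mistyped input asking for far more servers than intended would fail before any server is removed, and a valid request would proceed in steps small enough to stop if something looks wrong. The full report linked above describes remediation in the same spirit: removing capacity more slowly and blocking removals that would take a subsystem below its minimum required capacity.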