Incident
|
#25 at Amazon Web Services on 2012/12/24
|
Full report
|
https://aws.amazon.com/message/680587/
|
How it happened
|
An engineer inadvertently ran a maintenance process against the production load balancer control plane. The process deleted state data, and the engineer did not notice the deletion. Some types of API calls to the control plane then experienced high latency and error rates. As the control plane made modifications to load balancers, their performance degraded because of the missing state data.
|
Architecture
|
Load balancer service, with a control plane that manages the configuration of the load balancers (for one region) and is controlled via an API.
|
Technologies
|
Elastic Load Balancing (ELB)
|
Root cause
|
A maintenance process was inadvertently run against production, deleting state data.
|
Failure
|
High latency and error rates for API calls to the control plane of the load balancer system; later, load balancers began to experience performance issues.
|
Impact
|
Customers could not manage existing load balancers, though they could create new load balancers. Some load balancers were also degraded.
|
Mitigation
|
Temporarily disabled the control plane features that were causing problematic modifications to load balancers; restored the deleted state and merged it into the system state for each affected load balancer; then re-enabled the disabled features.
|
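The merge step described under Mitigation can be illustrated with a minimal sketch. It is not AWS's actual procedure, which the summary does not detail. It assumes each load balancer's state is a flat key/value mapping, that values still present in the live state are newer than the restored snapshot and should win, and that load balancers absent from the live state are not to be recreated. The names `merge_restored_state`, `recover_all`, and the example keys are hypothetical.

```python
from typing import Dict

State = Dict[str, str]  # hypothetical: one load balancer's state as key/value pairs


def merge_restored_state(current: State, restored: State) -> State:
    """Fill in keys missing from the live state using the restored snapshot.

    Assumption: where both sides have a key, the live (current) value wins,
    because it may reflect changes made after the snapshot was taken.
    """
    merged = dict(restored)   # start from the restored snapshot
    merged.update(current)    # live values override snapshot values
    return merged


def recover_all(current_by_lb: Dict[str, State],
                snapshot_by_lb: Dict[str, State]) -> Dict[str, State]:
    """Merge restored state into each load balancer that still exists.

    Load balancers that appear only in the snapshot are skipped, so nothing
    deleted on purpose after the snapshot is brought back.
    """
    return {
        lb_id: merge_restored_state(current, snapshot_by_lb.get(lb_id, {}))
        for lb_id, current in current_by_lb.items()
    }


if __name__ == "__main__":
    live = {"lb-1": {"listeners": "443"}}
    snapshot = {"lb-1": {"listeners": "80", "backends": "pool-a"}}
    print(recover_all(live, snapshot))
```

In this sketch, `lb-1` keeps its live `listeners` value and regains the `backends` entry that was lost. A real recovery would also need the problematic control plane features to stay disabled until the merge completes, as the Mitigation entry notes.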