Incident
|
#41 at
Amazon Web Services on
2020/11/25
|
Full report
|
https://aws.amazon.com/message/11201/
|
How it happened
|
New servers were added to the front-end fleet in one region, increasing the number of operating system threads used by each front-end server, new and existing, until the thread limit was exceeded. A wide variety of errors then began appearing in the logs, and operations began failing.
|
Architecture
|
A streaming service (Kinesis) with front-end servers for routing requests and back-end server clusters for processing streams. Routing is based on a sharding strategy, with the shard-map cached by front-end servers. Communication between front-end servers uses one operating system thread per other server.
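The thread-per-peer design above means per-server thread usage grows linearly with fleet size, so adding capacity raises thread counts on every server at once. A minimal sketch of that arithmetic (the `THREAD_LIMIT` value and function names are hypothetical, not from the report):

```python
THREAD_LIMIT = 10_000  # hypothetical per-process OS thread limit


def peer_threads(fleet_size: int) -> int:
    """Each front-end server holds one OS thread per *other* server."""
    return fleet_size - 1


def can_add_capacity(current_fleet: int, new_servers: int) -> bool:
    """Adding servers raises thread usage on every server, old and new."""
    return peer_threads(current_fleet + new_servers) <= THREAD_LIMIT


# A fleet near the limit: new capacity pushes every server over at once.
print(can_add_capacity(10_000, 0))    # True  (9_999 threads, at the edge)
print(can_add_capacity(10_000, 500))  # False (10_499 threads, fleet-wide breach)
```

This illustrates why the failure was fleet-wide rather than isolated to the new servers: the limit is breached on existing servers too, purely as a function of total fleet size.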
|
Technologies
|
Amazon Kinesis
|
Root cause
|
The new capacity caused all front-end servers to exceed the maximum number of allowed operating system threads.
|
Failure
|
Cache construction failed on front-end servers (leaving them with an out-of-date shard-map), and request routing failed.
|
Impact
|
Service outage in one region; errors and failures in dependent services (Cognito, CloudWatch, AutoScaling and Lambda).
|
Mitigation
|
Removed the additional capacity; made a configuration change so front-end servers fetch data from the authoritative metadata store rather than from peer servers (to avoid servers being judged unhealthy during startup); performed a multi-hour restart of the front-end fleet.
|