Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region

“The trigger, though not root cause, for the event was a relatively small addition of capacity”
Incident #41 at Amazon Web Services on 2020/11/25
Full report https://aws.amazon.com/message/11201/
How it happened New servers were added to the front-end fleet in one region, increasing the number of operating system threads used by each new and existing front-end server and exceeding the thread limit. A wide variety of errors then began appearing in the logs, and operations began failing.
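
The arithmetic behind this can be sketched as follows; the thread counts and limits here are illustrative assumptions, not figures from the report:

```python
# Hypothetical illustration of why a small capacity addition pushed every
# front-end server past the operating system thread limit: each server holds
# one thread per peer, so the per-server thread count grows with fleet size.
BASE_THREADS = 500        # assumed: threads for request handling, timers, etc.
OS_THREAD_LIMIT = 10_000  # assumed: per-process operating system limit

def threads_per_server(fleet_size: int) -> int:
    # One thread for every *other* front-end server, plus baseline work.
    return BASE_THREADS + (fleet_size - 1)

print(threads_per_server(9_400))  # 9899  -- under the limit, fleet is healthy
print(threads_per_server(9_700))  # 10199 -- every server now exceeds the limit
```
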
Architecture A streaming service (Kinesis) with front-end servers that route requests and back-end server clusters that process streams. Routing is based on a sharding strategy, with the shard-map cached by each front-end server. For peer communication, each front-end server uses one operating system thread per other front-end server.
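
A minimal sketch of this routing scheme, with assumed names and hash ranges (the report does not describe the shard-map format):

```python
# Minimal sketch of shard-map routing (names and layout assumed): the cached
# map sends a hashed stream key to the back-end cluster owning that hash range.
import bisect

class ShardMap:
    def __init__(self, upper_bounds: list[int], clusters: list[str]):
        self.upper_bounds = upper_bounds  # sorted upper bound of each shard's hash range
        self.clusters = clusters          # back-end cluster owning each range

    def route(self, key_hash: int) -> str:
        # Find the first hash range whose upper bound covers the hashed key.
        return self.clusters[bisect.bisect_left(self.upper_bounds, key_hash)]

# Each front-end server caches a map like this and consults it on every request.
cached = ShardMap([1_000, 2_000, 3_000], ["backend-a", "backend-b", "backend-c"])
print(cached.route(1_500))  # -> "backend-b"
```
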
Technologies Amazon Kinesis
Root cause New capacity caused all front-end servers to exceed the maximum allowed number of operating system threads.
Failure Cache construction failed on front-end servers (leaving them with out-of-date shard-maps), and routing failed.
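
One way to picture how exceeding the thread limit aborts cache construction; the structure and names below are assumptions, since the report gives no code-level detail:

```python
# Assumed structure: past the OS limit, thread creation raises RuntimeError
# ("can't start new thread"), aborting the shard-map rebuild and leaving the
# server to route with whatever stale map it last built successfully.
import threading

class CacheConstructionError(Exception):
    pass

def fetch_shard_info(peer: str, results: dict) -> None:
    results[peer] = f"shard-info-from-{peer}"  # placeholder for a real peer RPC

def rebuild_shard_map(peers: list[str]) -> dict:
    results: dict = {}
    workers = []
    for peer in peers:
        try:
            t = threading.Thread(target=fetch_shard_info, args=(peer, results))
            t.start()  # one OS thread per peer; this is what hits the limit
            workers.append(t)
        except RuntimeError as err:
            raise CacheConstructionError(f"could not spawn thread for {peer}") from err
    for t in workers:
        t.join()
    return results

print(rebuild_shard_map(["fe-1", "fe-2", "fe-3"]))
```
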
Impact Service outage in one region; errors and failures in dependent services (Cognito, CloudWatch, AutoScaling and Lambda).
Mitigation Removed the additional capacity; made a configuration change so that front-end servers obtain shard-map data from the authoritative metadata store rather than from peer servers (to avoid restarting servers being judged unhealthy during startup); performed a multi-hour restart of the front-end fleet.
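
The configuration change can be pictured as a startup-time switch; the flag name and store API below are hypothetical:

```python
# Hypothetical sketch of the mitigation: on startup, read the shard-map from
# the authoritative metadata store in a single call instead of assembling it
# from peers, so a server is not judged unhealthy while waiting on the fleet.
class MetadataStore:
    def read_shard_map(self) -> dict:
        # Stand-in for a read from the authoritative metadata store.
        return {"shard-0": "backend-a", "shard-1": "backend-b"}

USE_METADATA_STORE = True  # hypothetical configuration flag

def load_shard_map_on_startup(store: MetadataStore, peers: list[str]) -> dict:
    if USE_METADATA_STORE:
        return store.read_shard_map()  # one authoritative read, no peer threads
    # Original path: query every peer with one thread each (omitted here).
    raise NotImplementedError("peer-by-peer build not shown in this sketch")

print(load_shard_map_on_startup(MetadataStore(), peers=[]))
```
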