Postmortem of Sevice Outage at 3.4M CCU

“Fortnite hit a new peak of 3.4 million concurrent players last Sunday... and that didn't come without issues!”
Incident #8 at Epic on 2018/02/03 by Epic Team
Full report https://www.epicgames.com/fortnite/en-US/news/postmortem-of-service-outage-at-3-4m-ccu
How it happened As the load increased, one shard of the MongoDB database became unresponsive and the service experienced thread pool starvation. Memcached failed underload which saturated NGINX because it was stuck waiting on long running failure calls to Memcached. This in turn lead to failed health checks leading to the load balancer pulling all nodes out of rotation. A load balancer for the messaging service was overloaded and failed, leading to clients disconnecting followed by a flood of reconnection requests, exacerbating the problem.
Architecture (1) A service with a database connection pool that connects to a sidecar process which has its own connection pool for connecting to the shards in the 9 MongoDB shards, where each shard has a writer, two read replicas, and a hidden replica for redundancy. (2) A service with a sidecar process (as a shortcut for some request types) which has an NGINX proxy calling a Memcached instance. (3) A publish-subscribe messaging system based on XMPP.
Technologies MongoDB, Amazon Elastic Compute Cloud (EC2), NGINX
Root cause The overall service hit a new peak of 3.4 million concurrent users, exceeding various datastore (MongoDB and Memcached) and loadbalancer limits. A latent configuration defect limited the number of service threads (in the db pool) to a smaller number than necessary.
Failure Multiple services were overwhelmed and failed. The MongoDB and memcached became unresponsive. A service experienced database thread pool starvation.
Impact There were 6 different incidents, with a mix of partial and total service disruptions.
Mitigation Manually failed over unresponsive MongoDB datastore; rolled back to previous thread pool configuration (raising limit on number of service threads); fixed NGINX issue by increasing Memcached capacity (and simplifying the architecture to remove a potential bottleneck).