Slack’s Incident on 2-22-22

“What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.”
Incident #43 at Slack on 2022/02/22 by Laura Nolan, Glen D. Sanford, Jamie Scheinblum, Chris Sullivan
Full report: https://slack.engineering/slacks-incident-on-2-22-22/
How it happened: A deliberate restart of 25% of the cache fleet to upgrade monitoring software on the hosts.
Architecture: Messaging clients connecting to the backend; MySQL + Vitess database cluster; Memcached caching fleet managed with Mcrouter (a read-path sketch follows below).
Technologies: MySQL, Vitess clustering system, Memcached, Mcrouter
Root cause: A reduced cache fleet (and therefore a high miss rate) combined with an expensive query (see the load estimate below).
Failure: Resource exhaustion in the database tier.
Impact: Many slow and failed requests to the API layer; newly started messaging clients could not boot.
Mitigation: Responders throttled client boot operations at the API layer so that requests from already booted clients could succeed; modified the query to be more efficient; and gradually increased the limit for boot operations, allowing the caches to fill (a throttling sketch follows below).
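
Illustration of the read path implied by the architecture above: a cache-aside lookup that falls through to the MySQL/Vitess tier on a Memcached miss, which is why a partly emptied cache fleet translates directly into extra database load. This is a minimal sketch with hypothetical client objects and key names, not Slack's actual code.

# Illustrative cache-aside read path: a Memcached miss falls through
# to the MySQL/Vitess tier, so every cold cache node adds database load.
# The cache/database objects and key scheme are hypothetical, not Slack's code.

CACHE_TTL_SECONDS = 300

def get_channel_membership(cache, database, channel_id):
    key = f"channel_membership:{channel_id}"
    value = cache.get(key)          # Mcrouter routes this to a Memcached host
    if value is not None:
        return value                # cache hit: no database work
    # Cache miss: run the (expensive) query against MySQL/Vitess,
    # then repopulate the cache so later readers hit it.
    value = database.query(
        "SELECT user_id FROM channel_members WHERE channel_id = %s",
        (channel_id,),
    )
    cache.set(key, value, expire=CACHE_TTL_SECONDS)
    return value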
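
A rough back-of-envelope estimate of the amplification described in the root cause; the request rate and miss rates here are made up for illustration and are not figures from the report.

# Back-of-envelope: database load scales with the cache miss rate.
# All numbers below are illustrative, not taken from Slack's report.
requests_per_second = 100_000          # reads hitting the cache tier
normal_miss_rate = 0.01                # warm cache fleet
degraded_miss_rate = 0.25              # ~25% of cache hosts restarted cold

normal_db_qps = requests_per_second * normal_miss_rate      # 1,000 qps
degraded_db_qps = requests_per_second * degraded_miss_rate  # 25,000 qps

print(f"DB load amplification: {degraded_db_qps / normal_db_qps:.0f}x")
# With an expensive query behind each miss, a jump of this size in query
# rate is enough to exhaust resources in the database tier.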
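
A minimal sketch of the kind of admission control described in the mitigation: a token bucket in front of boot operations at the API layer, with a rate that operators can raise gradually as the caches refill. The class and method names are hypothetical, not Slack's implementation.

# Sketch of admission control for client "boot" requests at the API layer:
# a token bucket whose rate operators raise gradually as caches warm up.
# Hypothetical illustration, not Slack's actual throttling mechanism.
import time
import threading

class BootThrottle:
    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        """Return True if a boot request may proceed, False to shed it."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def raise_limit(self, new_rate_per_second):
        """Operators call this to admit more boots as the caches fill."""
        with self.lock:
            self.rate = new_rate_per_second

# Usage: shed boot requests beyond the limit; normal API requests from
# already booted clients bypass this throttle entirely.
throttle = BootThrottle(rate_per_second=50, burst=100)
if not throttle.allow():
    pass  # e.g. return HTTP 429 so the booting client retries later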