Slack’s Incident on 2-22-22

“What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.”
Incident #43 at Slack on 2022/02/22 by Laura Nolan, Glen D. Sanford, Jamie Scheinblum, Chris Sullivan
Full report: https://slack.engineering/slacks-incident-on-2-22-22/
How it happened: A deliberate restart of 25% of the cache fleet to upgrade monitoring software on the hosts.
Architecture: Messaging clients connecting to the backend; MySQL + Vitess database cluster; Memcached caching fleet managed with Mcrouter (a read-path sketch follows below).
Technologies: MySQL, Vitess clustering system, Memcached, Mcrouter
Root cause: A reduced cache fleet (and therefore a high miss rate) combined with an expensive query (see the load estimate below).
Failure: Resource exhaustion in the database tier.
Impact: Many slow and failed requests to the API layer; newly started messaging clients could not boot.
Mitigation: Responders throttled client boot operations at the API layer so that requests from already booted clients could succeed; modified the query to be more efficient; and gradually increased the limit for boot operations, allowing the caches to fill (a throttling sketch follows below).
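
Illustration of the read path implied by the architecture above: a cache-aside lookup that falls through to the MySQL/Vitess tier on a Memcached miss, which is why a partly emptied cache fleet translates directly into extra database load. This is a minimal sketch with hypothetical client objects and key names, not Slack's actual code.

# Illustrative cache-aside read path: a Memcached miss falls through
# to the MySQL/Vitess tier, so every cold cache node adds database load.
# The cache/database objects and key scheme are hypothetical, not Slack's code.

CACHE_TTL_SECONDS = 300

def get_channel_membership(cache, database, channel_id):
    key = f"channel_membership:{channel_id}"
    value = cache.get(key)          # Mcrouter routes this to a Memcached host
    if value is not None:
        return value                # cache hit: no database work
    # Cache miss: run the (expensive) query against MySQL/Vitess,
    # then repopulate the cache so later readers hit it.
    value = database.query(
        "SELECT user_id FROM channel_members WHERE channel_id = %s",
        (channel_id,),
    )
    cache.set(key, value, expire=CACHE_TTL_SECONDS)
    return value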
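
A rough back-of-envelope estimate of the amplification described in the root cause; the request rate and miss rates here are made up for illustration and are not figures from the report.

# Back-of-envelope: database load scales with the cache miss rate.
# All numbers below are illustrative, not taken from Slack's report.
requests_per_second = 100_000          # reads hitting the cache tier
normal_miss_rate = 0.01                # warm cache fleet
degraded_miss_rate = 0.25              # ~25% of cache hosts restarted cold

normal_db_qps = requests_per_second * normal_miss_rate      # 1,000 qps
degraded_db_qps = requests_per_second * degraded_miss_rate  # 25,000 qps

print(f"DB load amplification: {degraded_db_qps / normal_db_qps:.0f}x")
# With an expensive query behind each miss, a jump of this size in query
# rate is enough to exhaust resources in the database tier.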
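
A minimal sketch of the kind of admission control described in the mitigation: a token bucket in front of boot operations at the API layer, with a rate that operators can raise gradually as the caches refill. The class and method names are hypothetical, not Slack's implementation.

# Sketch of admission control for client "boot" requests at the API layer:
# a token bucket whose rate operators raise gradually as caches warm up.
# Hypothetical illustration, not Slack's actual throttling mechanism.
import time
import threading

class BootThrottle:
    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        """Return True if a boot request may proceed, False to shed it."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def raise_limit(self, new_rate_per_second):
        """Operators call this to admit more boots as the caches fill."""
        with self.lock:
            self.rate = new_rate_per_second

# Usage: shed boot requests beyond the limit; normal API requests from
# already booted clients bypass this throttle entirely.
throttle = BootThrottle(rate_per_second=50, burst=100)
if not throttle.allow():
    pass  # e.g. return HTTP 429 so the booting client retries later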