| Field | Details |
| --- | --- |
| Incident | #43 at Slack on 2022/02/22 by Laura Nolan, Glen D. Sanford, Jamie Scheinblum, Chris Sullivan |
| Full report | https://slack.engineering/slacks-incident-on-2-22-22/ |
| How it happened | A deliberate restart of 25% of the cache fleet to upgrade monitoring software on the hosts. |
| Architecture | Messaging clients connecting to the backend; a MySQL + Vitess database cluster; a Memcached caching fleet managed with Mcrouter. |
| Technologies | MySQL, Vitess (clustering layer), Memcached, Mcrouter |
| Root cause | A reduced cache fleet (and therefore a high miss rate) combined with an expensive query; the cache-miss dynamic is sketched below. |
| Failure | Resource exhaustion in the database tier. |
| Impact | Many slow and failed requests at the API layer; newly started messaging clients could not boot. |
| Mitigation | Responders throttled client boot operations at the API layer so that requests from already-booted clients could succeed, modified the query to be more efficient, and gradually raised the boot-operation limit as the caches refilled (a throttling sketch follows the cache example below). |
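
The root cause hinges on how a read-through (cache-aside) fleet behaves when a quarter of it restarts cold: every key a restarted host held becomes a miss, and each miss falls through to the expensive database query. The following is a minimal sketch of that pattern, assuming standard cache-aside reads; all names and the query itself are illustrative stand-ins, not Slack's code.

```python
# Illustrative cache-aside sketch: a miss falls through to an expensive
# database query, so losing cache capacity translates directly into
# database load. StubCache and expensive_db_query are hypothetical.

import time


class StubCache:
    """Stands in for a Memcached host (reached via Mcrouter)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def restart(self):
        # A restarted cache host comes back empty.
        self._data.clear()


def expensive_db_query(user_id):
    """Stands in for the costly query against the Vitess/MySQL tier."""
    time.sleep(0.01)  # simulate query cost
    return [f"channel-{i}" for i in range(3)]


def get_user_channels(cache, user_id):
    key = f"user_channels:{user_id}"
    value = cache.get(key)
    if value is None:
        # Cache miss: the expensive query runs. With 25% of the fleet
        # cold, this branch fires far more often than usual.
        value = expensive_db_query(user_id)
        cache.set(key, value)
    return value
```

A restarted host returns nothing until its working set is re-read from the database, so with an expensive query behind every miss, the database tier is the first thing to saturate.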
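The mitigation amounts to admission control at the API layer: cap the expensive "boot" operations so already-connected clients keep working, then raise the cap as caches warm. Below is one common way to implement such a throttle, a token bucket with an adjustable rate; the report does not describe Slack's actual mechanism at this level, so this is a sketch under that assumption.

```python
# Hypothetical token-bucket throttle for client boot requests. Only new
# boots are rejected; normal API traffic from booted clients passes.

import threading
import time


class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, up to capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def set_rate(self, rate_per_sec):
        # Responders can raise the limit gradually as caches refill.
        with self.lock:
            self.rate = rate_per_sec


boot_limiter = TokenBucket(rate_per_sec=5, burst=10)


def handle_request(kind):
    if kind == "boot" and not boot_limiter.allow():
        return 429  # shed new boots while caches are cold
    return 200  # already-booted clients are unaffected
```

Rejecting only boots sheds the most cache-hungry traffic, and `set_rate` mirrors the gradual ramp-up the responders used as hit rates recovered.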