Incident
|
#5 at Mailchimp on 2019/02/04 by Eric Muntz (SVP of Technology)
|
Full report
|
https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/
|
How it happened
|
Due to higher-than-normal traffic to one database (i.e., one shard), the autovacuum process failed or fell behind, so the database went into safety shutdown mode to prevent transaction ID wraparound. Jobs failed and were queued on the application servers, causing disk space to run low.
|
Architecture
|
A job-processing application that uses several PostgreSQL databases as a shared key-value store, sharded by key.
|
Technologies
|
PostgreSQL
|
Root cause
|
The sharding algorithm caused one database to receive higher-than-normal writes, and the autovacuum process failed or fell behind.
|
Failure
|
The affected database went into safety shutdown mode, causing database writes to fail.
|
Impact
|
20% of jobs were delayed.
|
Mitigation
|
Dumped and restored the database while the vacuum process was in progress, leaving out non-essential data tables to speed up the process.
|
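
The failure mode described above (autovacuum falling behind until the database refuses writes) can be watched for by tracking how old a database's oldest unfrozen transaction ID is. The following is a minimal sketch, not taken from the report: the function names, the `alert_fraction` parameter, and the status labels are hypothetical, and the safety margin is an assumed value that differs between PostgreSQL versions. The input `xid_age` is assumed to come from a query such as `SELECT datname, age(datfrozenxid) FROM pg_database;` run against each shard.

```python
# Sketch: estimate how close a PostgreSQL database is to transaction ID
# wraparound, given age(datfrozenxid) for that database.

# Transaction IDs are 32-bit; roughly 2**31 of them can be "in the past".
XID_WRAPAROUND_LIMIT = 2**31

# Default value of the autovacuum_freeze_max_age setting.
DEFAULT_FREEZE_MAX_AGE = 200_000_000

# ASSUMPTION: margin before wraparound at which the server stops accepting
# new transactions. The exact figure depends on the PostgreSQL version.
ASSUMED_SAFETY_MARGIN = 1_000_000


def xid_headroom(xid_age: int, safety_margin: int = ASSUMED_SAFETY_MARGIN) -> int:
    """Return the transactions left before the database refuses writes."""
    return max(0, XID_WRAPAROUND_LIMIT - safety_margin - xid_age)


def classify(
    xid_age: int,
    freeze_max_age: int = DEFAULT_FREEZE_MAX_AGE,
    alert_fraction: float = 0.5,  # hypothetical alerting threshold
) -> str:
    """Map an XID age to a coarse status label (labels are illustrative)."""
    if xid_headroom(xid_age) == 0:
        return "shutdown"
    if xid_age >= alert_fraction * XID_WRAPAROUND_LIMIT:
        return "critical"
    if xid_age > freeze_max_age:
        # Anti-wraparound autovacuum should already be running; an age that
        # keeps growing past this point suggests vacuum is falling behind.
        return "vacuum-behind"
    return "ok"


if __name__ == "__main__":
    # Example ages for three hypothetical shards.
    for shard, age in {"shard-a": 90_000_000, "shard-b": 450_000_000,
                       "shard-c": 1_500_000_000}.items():
        print(shard, classify(age), xid_headroom(age))
```

Checking every shard separately matters here because, as in this incident, a single hot shard can run out of headroom while the others remain healthy.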