What We Learned from the Recent Mandrill Outage

“In November of 2018, engineers on our Mandrill team identified the potential to reach wraparound, as the XIDs were climbing to approximately half their total limit during peak load. Our team determined wraparound was not an immediate threat, but we added a ticket to the backlog to set up additional monitoring.”
Incident #5 at Mailchimp on 2019/02/04 by Eric Muntz (SVP of Technology)
Full report: https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/
How it happened: Higher than normal traffic to one database (i.e., one shard) caused the autovacuum process to fail or fall behind, so the database went into safety shutdown mode to prevent transaction ID wraparound. Jobs failed and queued up on the application servers, causing disk space to run low. (A minimal XID-age monitoring sketch appears after this summary.)
Architecture: A job-processing application using several PostgreSQL databases as a shared key-value store, sharded by key.
Technologies: PostgreSQL
Root cause: The sharding algorithm gave one database a disproportionately high write load, and its autovacuum process failed or fell behind (see the sharding sketch below).
Failure: The database went into safety shutdown mode, leading to failed database writes.
Impact: 20% of jobs were delayed.
Mitigation: Dumped and restored the affected database while the vacuum process was in progress, leaving out non-essential data tables to speed up the process (see the dump-and-restore sketch below).
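
The quoted backlog item is the crux: transaction ID age is directly observable in PostgreSQL, so wraparound risk can be alerted on long before safety shutdown. Below is a minimal monitoring sketch in Python with psycopg2 using the standard age(datfrozenxid) query; the connection string and alert threshold are illustrative assumptions, not values from the report.

```python
import psycopg2

# Hypothetical DSN and threshold -- not from the report. PostgreSQL
# refuses new transactions as XID age nears the ~2 billion limit, so
# alerting at a fraction of that leaves autovacuum time to catch up.
DSN = "dbname=jobs host=shard1.example.internal"
ALERT_AT = 1_000_000_000  # roughly half the usable XID space

conn = psycopg2.connect(DSN)
try:
    with conn.cursor() as cur:
        # age(datfrozenxid): transactions elapsed since each database
        # was last fully frozen by (auto)vacuum.
        cur.execute(
            "SELECT datname, age(datfrozenxid) AS xid_age "
            "FROM pg_database ORDER BY xid_age DESC"
        )
        for datname, xid_age in cur.fetchall():
            if xid_age > ALERT_AT:
                print(f"ALERT: {datname} xid_age={xid_age:,}")
finally:
    conn.close()
```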
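
The report does not describe the sharding algorithm itself, so the sketch below is only a generic illustration of how deterministic key-based sharding can concentrate one busy key space on a single shard, which then takes far more writes (and far more autovacuum work) than its peers. The shard count, key format, and tenant-based hashing are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical; the report gives no shard count

def shard_for(key: str) -> int:
    # Hash only the tenant prefix, so all of a tenant's rows land on
    # the same shard -- a common choice, but one that inherits any
    # skew in per-tenant traffic.
    tenant = key.split(":", 1)[0]
    digest = hashlib.md5(tenant.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# One unusually busy tenant dominates the write volume...
keys = [f"tenant-42:job-{i}" for i in range(10_000)]
keys += [f"tenant-{t}:job-0" for t in range(100)]
# ...and every one of its writes lands on the same hot shard.
print(Counter(shard_for(k) for k in keys).most_common(3))
```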
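
The mitigation relieves wraparound pressure because a dump and restore reloads every row under fresh transaction IDs, and leaving out non-essential tables shrinks the data that has to move. A sketch of that procedure with standard pg_dump/pg_restore options follows; the database and table names are hypothetical, since the report does not name them.

```python
import subprocess

# Hypothetical database and table names -- the report only says
# "non-essential data tables" were left out to speed things up.
dump = [
    "pg_dump",
    "--format=custom",               # archive format pg_restore can read
    "--exclude-table=search_index",  # assumed non-essential table
    "--exclude-table=delivery_log",  # assumed non-essential table
    "--file=/backups/shard.dump",
    "shard_db",
]
restore = [
    "pg_restore",
    "--jobs=4",  # parallel restore to shorten the outage
    "--dbname=shard_db_fresh",
    "/backups/shard.dump",
]

subprocess.run(dump, check=True)
subprocess.run(restore, check=True)
```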