What We Learned from the Recent Mandrill Outage

“In November of 2018, engineers on our Mandrill team identified the potential to reach wraparound, as the XIDs were climbing to approximately half their total limit during peak load. Our team determined wraparound was not an immediate threat, but we added a ticket to the backlog to set up additional monitoring.”
Incident #5 at Mailchimp on 2019/02/04 by Eric Muntz (SVP of Technology)
Full report: https://mailchimp.com/what-we-learned-from-the-recent-mandrill-outage/
How it happened: Higher than normal traffic to one database (i.e., one shard) caused the autovacuum process to fail or fall behind, so the database went into safety shutdown mode to prevent transaction ID wraparound. Jobs failed and queued up on the application servers, causing disk space to run low. (A minimal XID-age monitoring sketch appears after this summary.)
Architecture: A job-processing application using several PostgreSQL databases as a shared key-value store, sharded by key.
Technologies: PostgreSQL
Root cause: The sharding algorithm gave one database a disproportionately high write load, and its autovacuum process failed or fell behind (see the sharding sketch below).
Failure: The database went into safety shutdown mode, leading to failed database writes.
Impact: 20% of jobs were delayed.
Mitigation: Dumped and restored the affected database while the vacuum process was in progress, leaving out non-essential data tables to speed up the process (see the dump-and-restore sketch below).
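
The quoted backlog item is the crux: transaction ID age is directly observable in PostgreSQL, so wraparound risk can be alerted on long before safety shutdown. Below is a minimal monitoring sketch in Python with psycopg2 using the standard age(datfrozenxid) query; the connection string and alert threshold are illustrative assumptions, not values from the report.

```python
import psycopg2

# Hypothetical DSN and threshold -- not from the report. PostgreSQL
# refuses new transactions as XID age nears the ~2 billion limit, so
# alerting at a fraction of that leaves autovacuum time to catch up.
DSN = "dbname=jobs host=shard1.example.internal"
ALERT_AT = 1_000_000_000  # roughly half the usable XID space

conn = psycopg2.connect(DSN)
try:
    with conn.cursor() as cur:
        # age(datfrozenxid): transactions elapsed since each database
        # was last fully frozen by (auto)vacuum.
        cur.execute(
            "SELECT datname, age(datfrozenxid) AS xid_age "
            "FROM pg_database ORDER BY xid_age DESC"
        )
        for datname, xid_age in cur.fetchall():
            if xid_age > ALERT_AT:
                print(f"ALERT: {datname} xid_age={xid_age:,}")
finally:
    conn.close()
```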
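
The report does not describe the sharding algorithm itself, so the sketch below is only a generic illustration of how deterministic key-based sharding can concentrate one busy key space on a single shard, which then takes far more writes (and far more autovacuum work) than its peers. The shard count, key format, and tenant-based hashing are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical; the report gives no shard count

def shard_for(key: str) -> int:
    # Hash only the tenant prefix, so all of a tenant's rows land on
    # the same shard -- a common choice, but one that inherits any
    # skew in per-tenant traffic.
    tenant = key.split(":", 1)[0]
    digest = hashlib.md5(tenant.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# One unusually busy tenant dominates the write volume...
keys = [f"tenant-42:job-{i}" for i in range(10_000)]
keys += [f"tenant-{t}:job-0" for t in range(100)]
# ...and every one of its writes lands on the same hot shard.
print(Counter(shard_for(k) for k in keys).most_common(3))
```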
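
The mitigation relieves wraparound pressure because a dump and restore reloads every row under fresh transaction IDs, and leaving out non-essential tables shrinks the data that has to move. A sketch of that procedure with standard pg_dump/pg_restore options follows; the database and table names are hypothetical, since the report does not name them.

```python
import subprocess

# Hypothetical database and table names -- the report only says
# "non-essential data tables" were left out to speed things up.
dump = [
    "pg_dump",
    "--format=custom",               # archive format pg_restore can read
    "--exclude-table=search_index",  # assumed non-essential table
    "--exclude-table=delivery_log",  # assumed non-essential table
    "--file=/backups/shard.dump",
    "shard_db",
]
restore = [
    "pg_restore",
    "--jobs=4",  # parallel restore to shorten the outage
    "--dbname=shard_db_fresh",
    "/backups/shard.dump",
]

subprocess.run(dump, check=True)
subprocess.run(restore, check=True)
```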