Postmortem for July 27 outage of the Manta service

“There was a single 'DROP TRIGGER' query that was attempting to take an exclusive lock on the whole table. It appears that PostgreSQL blocks new attempts to take a shared lock while an exclusive lock is wanted.”
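
The queueing behaviour described in the quote can be sketched with a toy lock manager. This is a minimal illustration, not PostgreSQL's implementation: the three modes ("shared", "maintenance", "exclusive"), the `TableLock` class, and the owner names are all hypothetical simplifications of PostgreSQL's larger set of table-level lock modes. The only rule it models is that a new request must wait if it conflicts with a lock already held or with a request queued ahead of it.

```python
from dataclasses import dataclass, field

# Simplified, symmetric conflict table (an assumption for illustration;
# PostgreSQL has more table-level lock modes than these three).
CONFLICTS = {
    "shared": {"exclusive"},
    "maintenance": {"maintenance", "exclusive"},
    "exclusive": {"shared", "maintenance", "exclusive"},
}


@dataclass
class TableLock:
    held: list = field(default_factory=list)     # granted (owner, mode) pairs
    waiting: list = field(default_factory=list)  # queued (owner, mode) pairs, FIFO

    def request(self, owner, mode):
        """Grant the lock if nothing held or queued conflicts; else queue it."""
        blockers = self.held + self.waiting
        if any(other in CONFLICTS[mode] for _, other in blockers):
            self.waiting.append((owner, mode))
            return False
        self.held.append((owner, mode))
        return True

    def release(self, owner):
        """Drop the owner's lock, then retry the waiters in arrival order."""
        self.held = [(o, m) for o, m in self.held if o != owner]
        queue, self.waiting = self.waiting, []
        for o, m in queue:
            self.request(o, m)


lock = TableLock()
lock.request("autovacuum", "maintenance")  # granted: nothing else holds the table
lock.request("drop-trigger", "exclusive")  # queued: conflicts with the running vacuum
lock.request("api-query", "shared")        # queued: stuck behind the pending exclusive
print(lock.held, lock.waiting)
```

In this model the shared request would have been granted immediately had the exclusive request not been waiting, which is why one stalled 'DROP TRIGGER' was enough to stall ordinary traffic on the table.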

Incident	#21 at Joyent on 2015/07/27 by The Joyent Team
Full report	https://www.joyent.com/blog/manta-postmortem-7-27-2015
How it happened	Vacuuming to prevent transaction id wraparound was automatically initiated by the PostgreSQL autovacuuming process. While that was running, a 'drop trigger' transaction requested an exclusive lock and blocked until the autovacuuming process completed. Subsequent transactions, each requesting a shared lock, blocked behind the 'drop trigger' request, causing failures and high latency.
Architecture	An API layer that calls multiple sharded PostgreSQL databases. Each shard is a three-node PostgreSQL cluster using synchronous replication.
Technologies	PostgreSQL
Root cause	While a particular database maintenance operation (vacuuming to prevent transaction id wraparound) is running, any transaction that requests an exclusive lock is blocked until the operation finishes, and subsequent requests for a shared lock are blocked behind that pending exclusive request.
Failure	All transactions on one table on one shard were blocked.
Impact	API clients experienced high-latency failures (500-level responses) for between 19% and 27% of requests.
Mitigation	Retune the threshold at which "vacuuming to prevent transaction id wraparound" occurs; in the meantime, monitor transaction metrics and manually initiate the database maintenance process.
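
As a rough sketch of the monitoring half of that mitigation, the helper below flags tables whose transaction id age is approaching the wraparound-vacuum threshold, so that an operator could schedule a manual vacuum at a quiet time. The function name, the `warn_fraction` margin, and the sample table names are assumptions for illustration and do not come from the report. The default of 200,000,000 matches PostgreSQL's default `autovacuum_freeze_max_age`; a retuned deployment would pass its own value. The SQL string is a simplified form of the kind of query the PostgreSQL documentation suggests for inspecting `relfrozenxid` age, and running it needs a database connection that this sketch leaves out.

```python
# Simplified per-table age query (assumption: ignores TOAST tables).
XID_AGE_QUERY = """
SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm');
"""


def tables_to_vacuum(xid_ages, freeze_max_age=200_000_000, warn_fraction=0.8):
    """Return (table, age) pairs at or above the warning level, oldest first.

    xid_ages maps table name -> age(relfrozenxid), e.g. rows fetched with
    XID_AGE_QUERY. warn_fraction is a hypothetical safety margin.
    """
    limit = freeze_max_age * warn_fraction
    flagged = [(name, age) for name, age in xid_ages.items() if age >= limit]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: one table close to the threshold, one far from it.
sample = {"manta": 170_000_000, "manta_delete_log": 50_000_000}
print(tables_to_vacuum(sample))
```

A check like this only reports candidates; deciding when to run the manual vacuum, and avoiding schema changes such as 'drop trigger' while it runs, remains an operational decision.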