Postmortem for July 27 outage of the Manta service

“There was a single 'DROP TRIGGER' query that was attempting to take an exclusive lock on the whole table. It appears that PostgreSQL blocks new attempts to take a shared lock while an exclusive lock is wanted.”
Incident #21 at Joyent on 2015/07/27 by The Joyent Team
Full report https://www.joyent.com/blog/manta-postmortem-7-27-2015
How it happened The PostgreSQL autovacuum process automatically started a vacuum to prevent transaction ID wraparound. While it was running, a 'DROP TRIGGER' transaction requested an exclusive lock on the table and blocked until the vacuum completed. Subsequent transactions, each requesting a shared lock, queued behind the 'DROP TRIGGER' request, causing failures and high latency.
Architecture An API layer that calls multiple sharded PostgreSQL databases. Each shard is a three-node PostgreSQL cluster using synchronous replication.
Technologies PostgreSQL
Root cause While a particular database maintenance operation (vacuuming to prevent transaction ID wraparound) is running on a table, any transaction that requests an exclusive lock on that table blocks until the vacuum finishes, and every later request for a shared lock blocks behind the waiting exclusive request.
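The queueing behaviour described in the root cause can be illustrated with a small model. This is a simplified sketch, not PostgreSQL's lock manager: it assumes only two lock modes ("shared" and "exclusive"), treats the vacuum as holding a shared-compatible lock, and uses hypothetical names (TableLock, simulate, the owner strings) chosen for illustration.

```python
from collections import deque

EXCLUSIVE = "exclusive"
SHARED = "shared"


def conflicts(a, b):
    """In this simplified model, two modes conflict if either is exclusive."""
    return a == EXCLUSIVE or b == EXCLUSIVE


class TableLock:
    """Toy lock queue: a request must not conflict with held OR queued locks."""

    def __init__(self):
        self.held = []          # (owner, mode) pairs currently granted
        self.waiting = deque()  # (owner, mode) pairs queued in arrival order

    def request(self, owner, mode):
        """Grant the lock and return True, or enqueue it and return False."""
        blocked_by_held = any(conflicts(mode, m) for _, m in self.held)
        # A pending exclusive request also blocks later shared requests.
        blocked_by_queue = any(conflicts(mode, m) for _, m in self.waiting)
        if blocked_by_held or blocked_by_queue:
            self.waiting.append((owner, mode))
            return False
        self.held.append((owner, mode))
        return True

    def release(self, owner):
        """Drop the owner's lock, then grant waiters from the head of the queue."""
        self.held = [(o, m) for o, m in self.held if o != owner]
        while self.waiting:
            o, m = self.waiting[0]
            if any(conflicts(m, held_mode) for _, held_mode in self.held):
                break
            self.waiting.popleft()
            self.held.append((o, m))


def simulate():
    """Replay the incident's ordering of lock requests on one table."""
    lock = TableLock()
    granted = {
        "autovacuum": lock.request("autovacuum", SHARED),
        "drop_trigger": lock.request("drop_trigger", EXCLUSIVE),
        "api_query": lock.request("api_query", SHARED),
    }
    return lock, granted
```

In this model the API query is compatible with the lock the vacuum holds, yet it still waits, because the exclusive request ahead of it in the queue has not been granted.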
Failure All transactions on one table on one shard were blocked.
Impact API clients experienced high-latency failures (500-level responses) for between 19% and 27% of requests.
Mitigation Retune the threshold at which "vacuuming to prevent transaction id wraparound" occurs, temporarily opting to monitor transaction metrics and manually initiate the database maintenance process.
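A monitoring check of the kind the mitigation describes might look like the sketch below. It is an assumption-laden illustration, not Joyent's tooling: the function name and the 80% warning level are hypothetical, the transaction ID age is assumed to be collected separately (for example with a query such as SELECT age(relfrozenxid) FROM pg_class), and 200,000,000 is PostgreSQL's default for autovacuum_freeze_max_age rather than the value the team chose.

```python
def should_vacuum_manually(xid_age, freeze_max_age=200_000_000, warn_percent=80):
    """Return True once xid_age reaches warn_percent of freeze_max_age.

    xid_age:        age of the table's oldest unfrozen transaction ID
                    (assumed to be gathered by an external metrics job).
    freeze_max_age: the threshold at which PostgreSQL forces an
                    anti-wraparound vacuum (autovacuum_freeze_max_age).
    warn_percent:   hypothetical alerting level, chosen for illustration.
    """
    if xid_age < 0 or freeze_max_age <= 0:
        raise ValueError("xid_age must be >= 0 and freeze_max_age must be > 0")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return xid_age * 100 >= warn_percent * freeze_max_age
```

Alerting before the forced threshold is reached leaves operators time to run the vacuum during a quiet period, when no schema change is likely to queue an exclusive lock behind it.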