Incident
|
#21 at Joyent on 2015/07/27 by The Joyent Team
|
Full report
|
https://www.joyent.com/blog/manta-postmortem-7-27-2015
|
How it happened
|
Vacuuming to prevent transaction ID wraparound was automatically initiated by the PostgreSQL autovacuum process. While it was running, a 'drop trigger' transaction requested an exclusive lock and blocked until the autovacuum completed. Subsequent transactions, each requesting a shared lock, queued behind the 'drop trigger' request, causing failures and high latency.
|
Architecture
|
An API layer that calls multiple sharded PostgreSQL databases. Each shard is a three-node PostgreSQL cluster using synchronous replication.
|
Technologies
|
PostgreSQL
|
Root cause
|
During a particular database maintenance operation (vacuuming to prevent transaction ID wraparound), any transaction that requests an exclusive lock is blocked, and subsequent requests for a shared lock are blocked behind it.
|
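The queueing behaviour described under "Root cause" can be illustrated with a toy model. The sketch below is not PostgreSQL's lock manager. It assumes a single FIFO wait queue per table and a three-mode subset of the conflict table. The class `TableLock` and the owner names are hypothetical, and mapping the report's "exclusive" and "shared" locks to `ACCESS EXCLUSIVE` and `ACCESS SHARE` is an assumption.

```python
from collections import deque

# Simplified subset of table-level lock conflicts (assumed for illustration).
CONFLICTS = {
    "ACCESS SHARE": {"ACCESS EXCLUSIVE"},
    "SHARE UPDATE EXCLUSIVE": {"SHARE UPDATE EXCLUSIVE", "ACCESS EXCLUSIVE"},
    "ACCESS EXCLUSIVE": {"ACCESS SHARE", "SHARE UPDATE EXCLUSIVE", "ACCESS EXCLUSIVE"},
}


class TableLock:
    """Toy lock on one table: holders plus a FIFO queue of waiters."""

    def __init__(self):
        self.held = []          # list of (owner, mode) currently granted
        self.waiting = deque()  # (owner, mode) requests waiting in arrival order

    def request(self, owner, mode):
        """Grant the lock and return True, or enqueue the request and return False."""
        # A request waits if it conflicts with a holder OR with an earlier waiter.
        for _, other_mode in self.held + list(self.waiting):
            if other_mode in CONFLICTS[mode]:
                self.waiting.append((owner, mode))
                return False
        self.held.append((owner, mode))
        return True

    def release(self, owner):
        """Drop the owner's locks, then grant waiters in order until one conflicts."""
        self.held = [(o, m) for o, m in self.held if o != owner]
        while self.waiting:
            _, mode = self.waiting[0]
            if any(held_mode in CONFLICTS[mode] for _, held_mode in self.held):
                break
            self.held.append(self.waiting.popleft())


lock = TableLock()
lock.request("autovacuum", "SHARE UPDATE EXCLUSIVE")   # granted
lock.request("drop trigger", "ACCESS EXCLUSIVE")       # queued behind autovacuum
lock.request("api read", "ACCESS SHARE")               # queued behind 'drop trigger'
```

In this model the read does not conflict with the vacuum itself. It stalls only because the exclusive request ahead of it in the queue cannot be granted yet, which matches the pile-up the report describes.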
Failure
|
All transactions on one table on one shard were blocked.
|
Impact
|
API clients experienced high-latency failures (500-level responses) for between 19% and 27% of requests.
|
Mitigation
|
Retune the threshold at which "vacuuming to prevent transaction id wraparound" occurs, temporarily opting to monitor transaction metrics and manually initiate the database maintenance process.
|
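One way the "monitor transaction metrics" step might look is sketched below. The report does not say which metrics Joyent tracked. This example assumes the per-table transaction ID age from `age(relfrozenxid)` is compared against `autovacuum_freeze_max_age` (200 million by default in PostgreSQL). The function name, the `warn_fraction` parameter, and the table names are hypothetical, and the query is shown only as a string so the block runs without a database.

```python
# Query an operator could run to collect per-table transaction ID ages.
XID_AGE_QUERY = """
SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY xid_age DESC;
"""


def tables_needing_vacuum(xid_ages, freeze_max_age=200_000_000, warn_fraction=0.7):
    """Return (table, age) pairs whose age has crossed the warning threshold.

    xid_ages maps table name to transaction ID age. warn_fraction is a
    hypothetical tuning knob: the share of freeze_max_age at which an
    operator wants to schedule a manual vacuum ahead of the automatic one.
    """
    threshold = freeze_max_age * warn_fraction
    flagged = [(name, age) for name, age in xid_ages.items() if age >= threshold]
    # Oldest first, so the most urgent table is handled first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Made-up ages for two hypothetical tables.
sample = {"objects": 150_000_000, "buckets": 12_000_000}
flagged = tables_needing_vacuum(sample)
```

Flagging tables before they reach the automatic threshold lets the vacuum be scheduled for a quiet period instead of starting on its own under load.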