Incident
|
#26 at GoCardless in 2015 by Chris Sinjakli (Senior Site Reliability Engineer)
|
Full report
|
https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/
|
How it happened
|
Schema changes to a database were deployed, modifying tables that were empty and unused. One change added a foreign key constraint, which attempted to take an exclusive lock on both the empty table and an in-use table. That lock request was queued until it could be granted. In the meantime, all other operations that wanted a conflicting lock on the in-use table were blocked until the exclusive lock was granted and the migration completed.
|
Architecture
|
Web API backed by a PostgreSQL database
|
Technologies
|
PostgreSQL
|
Root cause
|
While adding a foreign key constraint (and the associated enforcing trigger) to a table, PostgreSQL attempts to take an exclusive lock on both tables involved in the constraint. If it cannot get that lock immediately, the request is queued. Later requests for conflicting locks then queue up behind it, blocking other operations on the table.
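
The queueing behaviour described here can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: it uses two lock modes ("shared" and "exclusive") rather than PostgreSQL's full set, and the names `TableLock`, `simulate`, `long_query`, `migration`, and `api_request` are hypothetical, introduced purely for illustration. It is not PostgreSQL's lock manager. The `long_query` holder is also an assumption, standing in for whatever prevented the exclusive lock from being granted immediately.

```python
from collections import deque
from dataclasses import dataclass, field


def conflicts(a: str, b: str) -> bool:
    """In this toy model, two lock modes conflict if either is exclusive."""
    return a == "exclusive" or b == "exclusive"


@dataclass
class TableLock:
    held: list = field(default_factory=list)       # (owner, mode) pairs currently granted
    waiting: deque = field(default_factory=deque)  # (owner, mode) pairs queued in arrival order

    def request(self, owner: str, mode: str) -> bool:
        """Grant only if the mode conflicts with nothing held and nothing already queued."""
        blockers = list(self.held) + list(self.waiting)
        if any(conflicts(mode, m) for _, m in blockers):
            self.waiting.append((owner, mode))
            return False
        self.held.append((owner, mode))
        return True

    def release(self, owner: str) -> list:
        """Drop the owner's lock, then grant queued requests in order while they fit."""
        self.held = [(o, m) for o, m in self.held if o != owner]
        granted = []
        while self.waiting:
            o, m = self.waiting[0]
            if any(conflicts(m, held_mode) for _, held_mode in self.held):
                break
            self.waiting.popleft()
            self.held.append((o, m))
            granted.append(o)
        return granted


def simulate() -> dict:
    lock = TableLock()
    result = {}
    # Assumed: some existing work already holds a non-exclusive lock on the in-use table.
    result["long_query"] = lock.request("long_query", "shared")
    # The migration wants an exclusive lock, so it has to queue.
    result["migration"] = lock.request("migration", "exclusive")
    # An ordinary request would be compatible with long_query, but it queues
    # behind the waiting exclusive request.
    result["api_request"] = lock.request("api_request", "shared")
    result["after_long_query_ends"] = lock.release("long_query")
    result["after_migration_ends"] = lock.release("migration")
    return result


if __name__ == "__main__":
    for step, outcome in simulate().items():
        print(step, outcome)
```

In this model the `api_request` is not blocked by the lock that is actually held. It is blocked by the exclusive request waiting ahead of it, and it proceeds only after the migration has acquired and released its lock.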
|
Failure
|
Database transactions were blocked until the schema migration completed (specifically, the part of the migration that required adding a foreign key constraint).
|
Impact
|
Client requests to the (payments) API timed out and failed for around 15 seconds.
|
Mitigation
|
The incident resolved on its own once the database change completed.
|