Zero-downtime Postgres migrations - the hard parts

“We deployed the changes, and all of our assumptions got blown out of the water. Just after the schema migration started, we started getting alerts about API requests timing out.”
Incident #26 at GoCardless in 2015 by Chris Sinjakli (Senior Site Reliability Engineer)
Full report https://gocardless.com/blog/zero-downtime-postgres-migrations-the-hard-parts/
How it happened Schema changes to a database were deployed, modifying tables that were empty and unused. The change added a foreign key constraint, which attempted to take an exclusive lock on both the empty table and an in-use table. That lock request was queued until it could be granted. In the meantime, all other operations that needed a conflicting lock on the in-use table were blocked until the exclusive lock was granted and the migration completed.
Architecture Web API backed by a PostgreSQL database
Technologies PostgreSQL
Root cause While adding a foreign key constraint (and the associated enforcing trigger) to a table, PostgreSQL attempts to take an exclusive lock on both tables involved in the constraint; if it cannot get that lock immediately, the request is queued. Other conflicting lock requests then queue up behind it, blocking other operations on the table.
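The queueing behaviour described under Root cause can be illustrated with a small toy model. This is a simplified sketch, not PostgreSQL's actual lock manager: it assumes only two lock modes (SHARED and EXCLUSIVE), and the names TableLock, request, release, and the session labels are hypothetical, chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


def conflicts(a, b):
    # Simplified rule (assumption): SHARED is compatible with SHARED,
    # EXCLUSIVE conflicts with everything.
    return a == "EXCLUSIVE" or b == "EXCLUSIVE"


@dataclass
class TableLock:
    holders: list = field(default_factory=list)    # (session, mode) pairs currently granted
    waiters: deque = field(default_factory=deque)  # (session, mode) pairs queued in arrival order

    def request(self, session, mode):
        """Grant the lock if nothing conflicts; otherwise queue it. Returns True if granted."""
        blocked_by_holder = any(conflicts(mode, held) for _, held in self.holders)
        # A request also waits if it conflicts with anything already queued,
        # which is how one waiting EXCLUSIVE request stalls later traffic.
        blocked_by_waiter = any(conflicts(mode, waiting) for _, waiting in self.waiters)
        if blocked_by_holder or blocked_by_waiter:
            self.waiters.append((session, mode))
            return False
        self.holders.append((session, mode))
        return True

    def release(self, session):
        """Drop a session's lock, then grant queued requests in order until one must wait."""
        self.holders = [(s, m) for s, m in self.holders if s != session]
        while self.waiters:
            _, mode = self.waiters[0]
            if any(conflicts(mode, held) for _, held in self.holders):
                break
            self.holders.append(self.waiters.popleft())


# Scenario: a long-running reader holds the in-use table when the migration arrives.
lock = TableLock()
lock.request("long_query", "SHARED")    # granted
lock.request("migration", "EXCLUSIVE")  # queued behind long_query
lock.request("api_request", "SHARED")   # queued behind migration, although compatible with long_query
lock.release("long_query")              # migration is granted; api_request still waits
lock.release("migration")               # api_request is finally granted
```

In this model the API request is stalled for the whole time the migration spends waiting plus the time it spends running, even though the table it touches is otherwise only being read.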
Failure Database transactions were blocked until the schema migration completed (specifically, the part of the migration that required adding a foreign key constraint).
Impact Client requests to the payments API timed out and failed for around 15 seconds.
Mitigation The incident resolved on its own once the database change completed.