October 21 post-incident analysis

“Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.”
Incident #13 at GitHub on 2018/10/21 by Jason Warner (CTO)
Full report https://github.blog/2018-10-30-oct21-post-incident-analysis/
How it happened Routine maintenance work to replace failing network equipment caused 43 seconds of lost connectivity between regional data centers (i.e., a network partition). The cluster management software then elected a new primary in a west coast data center for multiple clusters, directing all writes from the east coast data center to the west coast one. This left un-replicated writes in both data centers, so the primary could not be failed back over to the east coast data center, and the resulting cluster topology was not supported.
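
The promotion itself is ordinary quorum behavior. Below is a minimal Python sketch of the idea, with assumed node names and simplified vote logic (this is not Orchestrator's actual election code): once a majority of the management quorum can no longer see the primary, any reachable replica is a promotion candidate, regardless of region.

```python
# Minimal sketch (assumed names, simplified logic; not Orchestrator's real
# implementation): a Raft-style quorum stops seeing the East Coast primary
# during the partition and promotes a reachable replica on the West Coast.

from dataclasses import dataclass

@dataclass
class DatabaseNode:
    name: str
    region: str
    reachable: bool = True

def elect_primary(current: DatabaseNode, replicas: list[DatabaseNode],
                  primary_down_votes: list[bool]) -> DatabaseNode:
    """Promote a reachable replica if a majority of the quorum sees the primary as down."""
    if sum(primary_down_votes) > len(primary_down_votes) // 2:
        candidates = [r for r in replicas if r.reachable]
        if candidates:
            return candidates[0]   # region is not considered: hence a cross-country primary
    return current

east_primary = DatabaseNode("mysql-east-1", "us-east", reachable=False)  # behind the partition
replicas = [DatabaseNode("mysql-east-2", "us-east", reachable=False),
            DatabaseNode("mysql-west-1", "us-west")]

new_primary = elect_primary(east_primary, replicas, primary_down_votes=[True, True, False])
print(new_primary.name, new_primary.region)   # mysql-west-1 us-west
```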
Architecture Multiple connected regional data centers. MySQL database clusters (storing metadata), each with one primary and dozens of read replicas. Data is sharded across clusters, which are managed using Orchestrator and Raft.
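
To make the topology concrete, here is a small illustrative model (shard names and replica counts are assumptions, not GitHub's real layout) of how writes are pinned to each cluster's single primary while reads can be served by a nearby replica:

```python
# Illustrative model of the sharded topology. Each MySQL cluster has exactly
# one primary that accepts writes; reads can be served from a local replica.

from dataclasses import dataclass, field

@dataclass
class MySQLCluster:
    shard: str
    primary_region: str
    replicas_per_region: dict[str, int] = field(default_factory=dict)

    def route(self, operation: str, client_region: str) -> str:
        """Writes must travel to the primary's region; reads stay local when a replica exists."""
        if operation == "write":
            return self.primary_region
        return client_region if self.replicas_per_region.get(client_region) else self.primary_region

clusters = {
    "repositories": MySQLCluster("repositories", "us-east", {"us-east": 20, "us-west": 20}),
    "issues":       MySQLCluster("issues",       "us-east", {"us-east": 15, "us-west": 15}),
}

# Normally an East Coast app server writes locally; after the failover moved
# the primary west, every one of its writes crossed the country instead.
clusters["repositories"].primary_region = "us-west"
print(clusters["repositories"].route("write", client_region="us-east"))   # us-west
```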
Technologies MySQL, Orchestrator, Raft
Root cause Routine maintenance work to replace failing network equipment led to 43 seconds of lost connectivity between data centers, which resulted in an unsupported cross-data-center topology for several clusters.
Failure Writes made to the old primary nodes were not replicated to the new primary node, which in turn began accepting writes of its own that were also un-replicated. Applications writing from one data center to the other experienced latency and timeouts.
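
A toy illustration of why this is hard to unwind (the transaction IDs are made up): once both primaries have accepted writes the other has not seen, each side's history has a suffix the other lacks, so neither can be re-attached as a replica without discarding confirmed writes.

```python
# Toy illustration of split-brain divergence (transaction IDs are made up).
# The old East Coast primary committed writes that never replicated, while the
# newly promoted West Coast primary accepted writes of its own.

old_primary_log = ["tx-100", "tx-101", "tx-102"]            # us-east, pre-failover
new_primary_log = ["tx-100", "tx-101", "tx-103", "tx-104"]  # us-west, post-failover

# Find where the two histories diverge.
common = 0
for a, b in zip(old_primary_log, new_primary_log):
    if a != b:
        break
    common += 1

east_only = old_primary_log[common:]   # ['tx-102']            confirmed only in us-east
west_only = new_primary_log[common:]   # ['tx-103', 'tx-104']  confirmed only in us-west

# Both suffixes are non-empty, so simply failing back would silently drop
# someone's acknowledged writes; the conflicting rows need manual reconciliation.
print(east_only, west_only)
```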
Impact 24 hours of degraded service, including the display of out-of-date and inconsistent data, with some features unavailable.
Mitigation Restored data from backups, synchronized replicas from both sites (logging all conflicting writes for later manual reconciliation), moved the primary node back to the appropriate data center, and resumed queued jobs.
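
A hedged sketch of the recovery idea, with made-up function, key, and file names rather than GitHub's actual tooling: rebuild state from a backup, replay the replication stream on top of it, and set aside any write that conflicts with what is already there for manual reconciliation.

```python
# Hedged sketch of the recovery approach (names are illustrative assumptions):
# restore from a backup snapshot, replay replicated writes, and log any write
# that conflicts with the current value for later manual reconciliation.

import json

def rebuild_from_backup(snapshot: dict, replication_stream: list[dict],
                        conflict_log: str = "conflicts.json") -> dict:
    state = dict(snapshot)
    conflicts = []
    for event in replication_stream:
        key, new_value = event["key"], event["value"]
        expected = event.get("expected")           # value the writer believed it was replacing
        if expected is not None and state.get(key) != expected:
            conflicts.append(event)                # don't guess; a human resolves these later
            continue
        state[key] = new_value
    with open(conflict_log, "w") as f:
        json.dump(conflicts, f, indent=2)
    return state

snapshot = {"issue:42:title": "Fix login bug", "issue:7:state": "closed"}
stream = [
    {"key": "issue:42:title", "value": "Fix login bug (urgent)", "expected": "Fix login bug"},
    {"key": "issue:7:state", "value": "reopened", "expected": "open"},   # diverged on the other side
]
restored = rebuild_from_backup(snapshot, stream)
print(restored, "-> conflicting write logged to conflicts.json")
```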