Incident | #13 at GitHub on 2018/10/21 by Jason Warner (CTO)
Full report | https://github.blog/2018-10-30-oct21-post-incident-analysis/
How it happened | Routine maintenance work to replace failing network equipment caused 43 seconds of lost connectivity between regional data centers (i.e., a network partition). The cluster management software then elected new primaries in a west coast data center for multiple clusters, directing all writes from the east coast data center to the west coast one. Because un-replicated writes remained in both data centers, the primaries could not simply be failed back to the east coast data center, and the resulting cluster topology was not supported.
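A rough way to see why a 43-second partition was enough to trigger a cross-data-center failover: if the failure-detection window is shorter than the partition, the cluster management layer promotes a replica it can still reach. The sketch below is illustrative only; the timing constants, node names, and the election function are assumptions, not values from GitHub's report or Orchestrator's actual logic.

```python
# Illustrative sketch (not Orchestrator's real logic): a coordinator that can
# no longer reach the east-coast primary within its detection window promotes
# a reachable west-coast replica, even though the partition is transient.

from dataclasses import dataclass

DETECTION_WINDOW_SECONDS = 10   # assumed failure-detection threshold
PARTITION_SECONDS = 43          # duration of the connectivity loss in the incident


@dataclass
class Node:
    name: str
    datacenter: str
    is_primary: bool = False


def elect_new_primary(nodes, unreachable_dc, partition_seconds):
    """Promote a reachable replica if the primary stays unreachable longer
    than the detection window; return the (possibly new) primary."""
    primary = next(n for n in nodes if n.is_primary)
    if primary.datacenter != unreachable_dc:
        return primary  # primary still reachable, nothing to do
    if partition_seconds < DETECTION_WINDOW_SECONDS:
        return primary  # blip shorter than the detection window, no failover
    # The partition outlasts the detection window: promote any reachable
    # replica, even one that lives in a different data center.
    candidate = next(n for n in nodes if n.datacenter != unreachable_dc)
    primary.is_primary = False
    candidate.is_primary = True
    return candidate


nodes = [
    Node("east-1", "us-east", is_primary=True),
    Node("west-1", "us-west"),
]
new_primary = elect_new_primary(nodes, unreachable_dc="us-east",
                                partition_seconds=PARTITION_SECONDS)
print(new_primary)  # Node(name='west-1', datacenter='us-west', is_primary=True)
```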
Architecture | Multiple connected regional data centers. MySQL database clusters (storing metadata), each with one primary and dozens of read replicas. Data is sharded across clusters, whose topology is managed by Orchestrator (which uses Raft for consensus).
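As a rough picture of that topology (one primary plus many read replicas per cluster, data sharded across clusters), the following sketch routes writes for a shard to that shard's primary and reads to its replicas. The cluster names, host names, and routing scheme are invented for illustration; they are not GitHub's.

```python
# Hypothetical model of the topology: several MySQL clusters, each with one
# primary and many read replicas, with data sharded across the clusters.

import zlib

clusters = {
    "cluster-a": {"primary": "mysql-a-1", "replicas": [f"mysql-a-{i}" for i in range(2, 30)]},
    "cluster-b": {"primary": "mysql-b-1", "replicas": [f"mysql-b-{i}" for i in range(2, 30)]},
}


def pick_cluster(shard_key: str) -> str:
    """Deterministically map a shard key to one of the clusters."""
    names = sorted(clusters)
    return names[zlib.crc32(shard_key.encode()) % len(names)]


def route_write(shard_key: str) -> str:
    """All writes for a shard go to its cluster's single primary."""
    return clusters[pick_cluster(shard_key)]["primary"]


def route_read(shard_key: str, i: int = 0) -> str:
    """Reads can be served from any of the cluster's read replicas."""
    replicas = clusters[pick_cluster(shard_key)]["replicas"]
    return replicas[i % len(replicas)]


print(route_write("repository:1234"), route_read("repository:1234"))
```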
Technologies | MySQL, Orchestrator, Raft
Root cause | Routine maintenance work to replace failing network equipment caused 43 seconds of lost connectivity between data centers, after which automated failover left clusters with an unsupported cross-data-center topology.
Failure | Writes accepted by the old primary nodes were never replicated to the newly elected primaries, which in turn began accepting writes of their own that were not replicated back. Applications writing from one data center to the other experienced latency and timeouts.
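Another way to state this failure mode: during the partition both the old and the new primaries accepted writes, so each side ended up with transactions the other had never seen, and neither could simply become a replica of the other. The sketch below models that split-brain condition with GTID-like transaction sets; it is a simplification for illustration, not GitHub's tooling or MySQL's actual GTID format.

```python
# Simplified model of the split-brain condition: each primary's executed
# transaction set contains entries the other is missing, so neither side can
# be fast-forwarded onto the other without losing writes.

def divergence(executed_a: set, executed_b: set):
    """Return the transactions each side has that the other lacks
    (conceptually similar to comparing MySQL GTID sets)."""
    only_a = executed_a - executed_b
    only_b = executed_b - executed_a
    return only_a, only_b


east = {"srv-east:1-1000", "srv-east:1001-1040"}   # writes accepted before and during the partition
west = {"srv-east:1-1000", "srv-west:1-57"}        # replicated history plus writes after promotion

only_east, only_west = divergence(east, west)
if only_east and only_west:
    print("Histories have diverged; un-replicated writes exist on both sides:")
    print("  only on the old (east) primaries:", only_east)
    print("  only on the new (west) primaries:", only_west)
```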
Impact | 24 hours of degraded service: out-of-date and inconsistent data was displayed, and some features were unavailable.
Mitigation | Restored data from backups, synchronized replicas in both sites (logging all conflicting writes for later manual reconciliation), moved the primary nodes back to the appropriate data center, and resumed queued jobs.
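The mitigation amounts to picking one history as authoritative, rebuilding replicas from backup to match it, and setting aside the writes that exist only on the other side for manual reconciliation. Below is a hedged sketch of that decision; the function, file path, and step wording are invented, not GitHub's actual recovery tooling.

```python
# Hypothetical sketch of the recovery procedure: pick the authoritative side,
# set aside writes that exist only on the other side for manual review, and
# list the remaining high-level steps.

import json


def plan_recovery(authoritative_dc: str, writes_only_on_other_side: set,
                  conflict_log_path: str = "conflicting_writes.jsonl") -> list:
    """Log conflicting writes for manual reconciliation and return the
    high-level recovery steps."""
    with open(conflict_log_path, "a", encoding="utf-8") as log:
        for txn in sorted(writes_only_on_other_side):
            log.write(json.dumps({"transaction": txn,
                                  "status": "needs_manual_review"}) + "\n")
    return [
        "restore replicas from backups taken before the partition",
        f"replay replication from the {authoritative_dc} primaries to catch up",
        f"review {len(writes_only_on_other_side)} conflicting writes in {conflict_log_path}",
        f"fail primaries back to {authoritative_dc} once replicas are in sync",
        "resume queued background jobs",
    ]


for step in plan_recovery("us-east", {"srv-west:1-57"}):
    print("-", step)
```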