#13 at GitHub on 2018/10/21 by Jason Warner (CTO)
How it happened
Routine maintenance work to replace failing network equipment led to 43 seconds of lost connectivity between regional data centers (i.e., a network partition). The cluster management software then elected a new primary in a west coast data center (for multiple clusters), directing all writes from the east coast data center to the west coast data center. This left some un-replicated writes in both data centers, so the primary could not be failed back over to the east coast data center. The resulting cluster topology was not supported.
Multiple connected regional data centers. MySQL database clusters (storing metadata), each with one primary and dozens of read replicas. Data is sharded across clusters, which are managed using Orchestrator and Raft.
MySQL, Orchestrator, Raft
Root cause
Routine maintenance work to replace failing network equipment led to 43 seconds of lost connectivity and a cross-data-center topology for clusters.
Writes to the old primary nodes were not replicated to the new primary nodes, which in turn began accepting writes that were not replicated to the old primaries. Applications writing from one data center to the other experienced latency and timeouts.
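The divergence described here can be illustrated with a toy model. The sketch below is not GitHub's tooling or Orchestrator's logic; it assumes each site's write history is a simple ordered list, and the function name `split_logs` and the write labels `w1` to `w6` are hypothetical. It shows why neither side can simply replicate from the other once both hold writes the other lacks.

```python
from typing import List, Tuple


def split_logs(east: List[str], west: List[str]) -> Tuple[int, List[str], List[str]]:
    """Return (length of the shared prefix, writes only in east, writes only in west)."""
    shared = 0
    for a, b in zip(east, west):
        if a != b:
            break
        shared += 1
    return shared, east[shared:], west[shared:]


# Hypothetical histories: w1 and w2 were replicated before the partition.
east_log = ["w1", "w2", "w3", "w4"]  # w3, w4 reached the old primary but were never replicated
west_log = ["w1", "w2", "w5", "w6"]  # w5, w6 were accepted by the newly elected primary

shared, east_only, west_only = split_logs(east_log, west_log)

# In this toy model the histories have diverged when BOTH sides hold unique writes:
# neither log is a prefix of the other, so neither site can just catch up from the other.
diverged = bool(east_only) and bool(west_only)
```

With these inputs, both `east_only` and `west_only` are non-empty, which corresponds to the situation where failing the primary back would have meant discarding one side's writes.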
24 hours of degraded service, during which out-of-date and inconsistent data was displayed and some features were unavailable.
Restored data from backups, synchronized replicas from both sites (logging all conflicting writes for later manual reconciliation), moved the primary node to the appropriate data center, and resumed queued jobs.
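The "log conflicting writes for later manual reconciliation" step can be sketched as follows. This is an illustrative assumption, not the procedure GitHub actually used: it models each site's post-partition writes as a key-to-value mapping, and the function `reconcile` and the `repo:N` keys are hypothetical. Non-conflicting writes are merged automatically, while keys written differently on both sides are left at their restored value and logged.

```python
from typing import Dict, List, Tuple

Conflict = Tuple[str, str, str]  # (key, east value, west value)


def reconcile(
    base: Dict[str, str],
    east_writes: Dict[str, str],
    west_writes: Dict[str, str],
) -> Tuple[Dict[str, str], List[Conflict]]:
    """Merge writes from both sites onto a restored base, logging conflicts."""
    merged = dict(base)
    conflicts: List[Conflict] = []
    for key in sorted(set(east_writes) | set(west_writes)):
        in_east, in_west = key in east_writes, key in west_writes
        if in_east and in_west and east_writes[key] != west_writes[key]:
            # Both sites changed the key differently: keep the base value
            # and record the pair for a human to resolve later.
            conflicts.append((key, east_writes[key], west_writes[key]))
        elif in_east:
            merged[key] = east_writes[key]
        else:
            merged[key] = west_writes[key]
    return merged, conflicts


# Hypothetical data: state restored from backup plus each site's unreplicated writes.
base = {"repo:1": "a", "repo:2": "b"}
east = {"repo:1": "a-east", "repo:3": "c"}
west = {"repo:1": "a-west", "repo:4": "d"}

merged, conflicts = reconcile(base, east, west)
```

Here `repo:3` and `repo:4` merge cleanly, while `repo:1` stays at its backup value and appears in `conflicts`. Deferring such keys to manual review trades recovery speed for not silently picking a winner.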