Incident
|
#32 at
Github on
2016/01/27 by Scott Sanders (Senior Operations Engineer)
|
Full report
|
https://github.blog/2016-02-03-january-28th-incident-report/
|
How it happened
|
The datacenter experienced a brief disruption in the systems that supply power to the servers and equipment. 25% of servers and several network devices rebooted, leaving the infrastructure in a partially operational state. Reboots of some machine types failed (due to physical drives not being recognized) and some application processes would not start because Redis clusters were unavailable.
|
Architecture
|
Datacenter with servers running applications with dependencies on Redis clusters.
|
Technologies
|
Redis
|
Root cause
|
Power disruption to database; a known firmware issue prevented machines from recognizing their own drives after power-cycle; and some application processes (unnecessarily) depend on Redis for starting up.
|
Failure
|
Sever machines would not boot after power-cycle; Redis clusters were unavailable; and application processes failed to (re)start.
|
Impact
|
Applications began serving HTTP 503 repsonses to users.
|
Mitigation
|
Repaired servers that would not boot by removing residual static electricity (aka "flea power"); rebuilt Redis clusters on alternate hardware (ie, restoring data onto standby equipment).
|