GitHub January 28th Incident Report

“Slightly over 25% of our servers and several network devices rebooted as a result. This left our infrastructure in a partially operational state and generated alerts to multiple on-call engineers.”
Incident #32 at Github on 2016/01/27 by Scott Sanders (Senior Operations Engineer)
Full report https://github.blog/2016-02-03-january-28th-incident-report/
How it happened The datacenter experienced a brief disruption in the systems that supply power to the servers and equipment. 25% of servers and several network devices rebooted, leaving the infrastructure in a partially operational state. Reboots of some machine types failed (due to physical drives not being recognized) and some application processes would not start because Redis clusters were unavailable.
Architecture Datacenter with servers running applications with dependencies on Redis clusters.
Technologies Redis
Root cause Power disruption to database; a known firmware issue prevented machines from recognizing their own drives after power-cycle; and some application processes (unnecessarily) depend on Redis for starting up.
Failure Sever machines would not boot after power-cycle; Redis clusters were unavailable; and application processes failed to (re)start.
Impact Applications began serving HTTP 503 repsonses to users.
Mitigation Repaired servers that would not boot by removing residual static electricity (aka "flea power"); rebuilt Redis clusters on alternate hardware (ie, restoring data onto standby equipment).