GitHub January 28th Incident Report

“Slightly over 25% of our servers and several network devices rebooted as a result. This left our infrastructure in a partially operational state and generated alerts to multiple on-call engineers.”

Incident	#32 at GitHub on 2016/01/27 by Scott Sanders (Senior Operations Engineer)
Full report	https://github.blog/2016-02-03-january-28th-incident-report/
How it happened	The datacenter experienced a brief disruption in the systems that supply power to the servers and equipment. Slightly over 25% of servers and several network devices rebooted, leaving the infrastructure in a partially operational state. Some machine types failed to reboot (their physical drives were not recognized), and some application processes would not start because Redis clusters were unavailable.
Architecture	Datacenter with servers running applications with dependencies on Redis clusters.
Technologies	Redis
Root cause	Power disruption to the datacenter; a known firmware issue prevented machines from recognizing their own drives after a power cycle; and some application processes (unnecessarily) depended on Redis to start up.
Failure	Server machines would not boot after the power cycle; Redis clusters were unavailable; and application processes failed to (re)start.
Impact	Applications began serving HTTP 503 responses to users.
Mitigation	Repaired servers that would not boot by draining residual electrical charge (aka "flea power"); rebuilt Redis clusters on alternate hardware (i.e., restoring data onto standby equipment).
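
The root cause notes that some application processes needed Redis just to start. One common way to avoid that kind of hard boot-time dependency is to connect lazily, on first use, and degrade when the service is unreachable. The sketch below is illustrative only and is not taken from the report: `LazyDependency`, `connect`, and `retry_after` are hypothetical names, and `connect` stands in for whatever factory would build a real Redis client.

```python
import time
from typing import Any, Callable, Optional


class LazyDependency:
    """Defer connecting to a backing service until first use.

    `connect` is a hypothetical factory (for example, one returning a
    Redis client). It is NOT called in __init__, so the process can
    boot even while the service is down.
    """

    def __init__(self, connect: Callable[[], Any], retry_after: float = 5.0):
        self._connect = connect
        self._retry_after = retry_after  # seconds to wait between attempts
        self._client: Optional[Any] = None
        self._last_failure: Optional[float] = None

    def get(self) -> Optional[Any]:
        """Return a client, or None if the service is currently unreachable."""
        if self._client is not None:
            return self._client
        now = time.monotonic()
        # Back off after a failure instead of retrying on every call.
        if (self._last_failure is not None
                and now - self._last_failure < self._retry_after):
            return None
        try:
            self._client = self._connect()
        except OSError:  # covers ConnectionError and its subclasses
            self._last_failure = now
            return None
        return self._client


def handle_request(cache: LazyDependency) -> str:
    """Serve from the cache when it is reachable, otherwise degrade."""
    return "cached" if cache.get() is not None else "degraded"
```

With this shape, an outage of the backing service shows up as degraded requests rather than as a process that refuses to start; whether degrading is acceptable depends on what the application actually stores in Redis.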