Incident
|
#29 at Stack Exchange on 2017/01/24
|
Full report
|
https://stackstatus.net/post/156407746074/outage-postmortem-january-24-2017
|
How it happened
|
A bugcheck in the primary SQL Server put the primary into a read-only state. Application-level failover was disabled because of a code defect, so the applications could not switch to a standby, and the network of sites went offline
|
Architecture
|
Applications that depend on a SQL Server primary with multiple standby servers
|
Technologies
|
SQL Server
|
Root cause
|
A SQL Server bugcheck caused by a (suspected) bad memory chip, combined with a failover-related code defect
|
Failure
|
The service went into read-only mode for approximately 5 minutes and offline for 12 minutes
|
Impact
|
Service outage and 3.5 seconds of data loss
|
Mitigation
|
After a sanity check of SQL Server health was completed, the sites were put back into read-write mode
|
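
The interaction between a read-only primary and a disabled application-level failover can be illustrated with a small sketch. This is not Stack Exchange's code: the `Server` type, the `Mode` enum, the `choose_mode` function, and the `failover_enabled` flag are hypothetical names introduced only to show how an application might pick a serving mode, under the assumption that each server reports whether it is reachable and whether it accepts writes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    OFFLINE = "offline"


@dataclass(frozen=True)
class Server:
    # Hypothetical health snapshot of one database server.
    name: str
    reachable: bool
    writable: bool


def choose_mode(
    primary: Server,
    standbys: Sequence[Server],
    failover_enabled: bool,
) -> Tuple[Mode, Optional[Server]]:
    """Pick a serving mode and the server to use (illustrative only)."""
    # Normal case: the primary accepts writes.
    if primary.reachable and primary.writable:
        return Mode.READ_WRITE, primary

    # Application-level failover: use the first standby that accepts writes.
    # If this path is disabled, a healthy standby is never considered.
    if failover_enabled:
        for standby in standbys:
            if standby.reachable and standby.writable:
                return Mode.READ_WRITE, standby

    # No writable target: fall back to serving reads from any reachable server.
    for server in (primary, *standbys):
        if server.reachable:
            return Mode.READ_ONLY, server

    # Nothing reachable at all.
    return Mode.OFFLINE, None


if __name__ == "__main__":
    primary = Server("primary", reachable=True, writable=False)
    standby = Server("standby-1", reachable=True, writable=True)
    for enabled in (False, True):
        mode, target = choose_mode(primary, [standby], failover_enabled=enabled)
        print(enabled, mode.value, target.name if target else None)
```

In this sketch, a read-only primary with failover disabled leaves the application in read-only mode even though a writable standby exists, which is the kind of gap a failover-related defect can open.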