Incident
|
#15 at
Discord on
2017/03/20
|
Full report
|
https://status.discord.com/incidents/dj3l6lw926kl
|
How it happened
|
CPU soft-locks stalled the network stack and caused one cluster node to disconnect, leaving the cluster in a split state. Nodes from a dependent service attempted to reconnect to the cluster, increasing the load on existing nodes, causing failures in the service and unhandled events queuing up in memory, eventually consuming available memory and crashing the service. The same incident occured a second time about 1 hour later.
|
Architecture
|
Multiple clusters with data dependencies between them.
|
Technologies
|
Google Compute Engine (GCE)
|
Root cause
|
CPU soft lockups; a known defect that prevented the cluster from properly handling a lost node; and unbounded in memory event queues.
|
Failure
|
Event queue exceeded memory limits crashing the (sessions) service.
|
Impact
|
One third of all clients could not send messages. Messaging sending was then disabled for all clients during mitigation and then clients were disconnected and reconnected, also as part of the mitigation.
|
Mitigation
|
The disconnected node was rebooted, which forced the virtual machine to land on another physical host and resolved the soft-lock issues. A reboot of (generallly) all services, disconnecting and reconnecting all clients in the process.
|