Connectivity Issues

“Shortly thereafter the nodes of another service [...] attempted a reconnection, triggering a massive 'thundering herd' towards the existing members of the presence cluster.”
Incident #15 at Discord on 2017/03/20
Full report https://status.discord.com/incidents/dj3l6lw926kl
How it happened CPU soft lockups stalled the network stack and caused one cluster node to disconnect, leaving the cluster in a split state. Nodes from a dependent service then attempted to reconnect to the cluster, increasing the load on the existing nodes. The resulting failures left unhandled events queuing up in memory until available memory was exhausted and the service crashed (a reconnection back-off sketch follows this entry). The same incident occurred a second time about an hour later.
Architecture Multiple clusters with data dependencies between them.
Technologies Google Compute Engine (GCE)
Root cause CPU soft lockups; a known defect that prevented the cluster from properly handling a lost node; and unbounded in-memory event queues (see the bounded-queue sketch after this entry).
Failure The event queue exceeded memory limits, crashing the sessions service.
Impact One third of all clients could not send messages. Message sending was then disabled for all clients during mitigation, and all clients were subsequently disconnected and reconnected as part of the mitigation.
Mitigation The disconnected node was rebooted, forcing the virtual machine onto another physical host and resolving the soft lockups. (Nearly) all services were then rebooted, disconnecting and reconnecting all clients in the process.
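
The reconnection storm described under "How it happened" is the classic thundering-herd pattern: every node of the dependent service retried at once against the surviving cluster members. Below is a minimal, illustrative sketch of one common countermeasure, client-side exponential backoff with full jitter. The report does not say which fix Discord applied; the connect callable, base delay, cap, and attempt limit are assumptions made for the example.

```python
import random
import time


def reconnect_with_backoff(connect, base=0.5, cap=60.0, max_attempts=10):
    """Retry `connect` with exponential backoff and full jitter.

    Spreading retries over a randomized, growing window keeps a mass
    disconnect from turning into synchronized waves of reconnect
    attempts against the surviving cluster nodes.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Exponential window capped at `cap`, then sleep for a
            # random point inside it ("full jitter").
            window = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, window))
    raise ConnectionError("gave up after %d attempts" % max_attempts)
```

Full jitter spreads retries roughly uniformly across each backoff window, so even if thousands of clients lose their connection at the same instant, their reconnect attempts arrive staggered rather than as a single spike.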
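The other contributing cause, unbounded in-memory event queues, can be contrasted with a queue that enforces a hard capacity and sheds load once the consumer falls behind, rather than growing until memory is exhausted. The class below is a hypothetical sketch, not Discord's sessions service; the capacity, drop policy, and names are arbitrary illustrations.

```python
from collections import deque


class BoundedEventQueue:
    """In-memory event queue with a hard capacity.

    When the consumer falls behind, new events are rejected or the
    oldest are dropped instead of the queue growing without bound
    and exhausting memory, as happened in this incident.
    """

    def __init__(self, capacity=10_000, drop_oldest=True):
        self.capacity = capacity
        self.drop_oldest = drop_oldest
        self._events = deque()
        self.dropped = 0  # counter to surface back-pressure in metrics

    def push(self, event) -> bool:
        """Enqueue an event; return False if load had to be shed."""
        if len(self._events) < self.capacity:
            self._events.append(event)
            return True
        if self.drop_oldest:
            self._events.popleft()
            self._events.append(event)
        self.dropped += 1
        return False

    def pop(self):
        """Dequeue the oldest event, or return None if the queue is empty."""
        return self._events.popleft() if self._events else None
```

Whether to drop the oldest events, reject new ones, or block the producer depends on the semantics of the events; the point of the sketch is only that the bound, and a visible counter for shed load, exist at all.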