Unavailable Guilds & Connection Issues

“These issues caused enough critical impact that Discord's engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.”
Incident #3 at Discord on 2017/10/13
Full report https://status.discord.com/incidents/qk9cdgnqnhcn
How it happened Automatic migration changed the Redis secondary instance to be the primary instance. Due to a defect in failover code, the cluster nodes did not correctly failover to the new primary instance. This failure cascaded through many dependent systems.
Architecture Multiple highly available clusters with Redis primary and secondary instances.
Technologies Redis, Google Cloud Platform (GCP)
Root cause Previously known defect preventing proper failover behavior.
Failure Memory exhaustion and failing cluster nodes.
Impact Partial outage and reduced performance.
Mitigation A full restart of service necessitating reconnecting millions of clients.