Unavailable Guilds & Connection Issues

“These issues caused enough critical impact that Discord's engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.”

Incident	#3 at Discord on 2017/10/13
Full report	https://status.discord.com/incidents/qk9cdgnqnhcn
How it happened	Automatic migration changed the Redis secondary instance to be the primary instance. Due to a defect in failover code, the cluster nodes did not correctly failover to the new primary instance. This failure cascaded through many dependent systems.
Architecture	Multiple highly available clusters with Redis primary and secondary instances.
Technologies	Redis, Google Cloud Platform (GCP)
Root cause	Previously known defect preventing proper failover behavior.
Failure	Memory exhaustion and failing cluster nodes.
Impact	Partial outage and reduced performance.
Mitigation	A full restart of service necessitating reconnecting millions of clients.