Incident
|
#3 at
Discord on
2017/10/13
|
Full report
|
https://status.discord.com/incidents/qk9cdgnqnhcn
|
How it happened
|
Automatic migration changed the Redis secondary instance to be the primary instance. Due to a defect in failover code, the cluster nodes did not correctly failover to the new primary instance. This failure cascaded through many dependent systems.
|
Architecture
|
Multiple highly available clusters with Redis primary and secondary instances.
|
Technologies
|
Redis, Google Cloud Platform (GCP)
|
Root cause
|
Previously known defect preventing proper failover behavior.
|
Failure
|
Memory exhaustion and failing cluster nodes.
|
Impact
|
Partial outage and reduced performance.
|
Mitigation
|
A full restart of service necessitating reconnecting millions of clients.
|