Incident
|
#24 at Cloudflare on 2020/07/17 by John Graham-Cumming (CTO)
|
Full report
|
https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/
|
How it happened
|
Responders mitigating an unrelated congestion issue updated the configuration on a router in one location (Atlanta). That configuration contained a defect that caused all traffic across the global network to be routed through that location, overwhelming the router and causing failures for all locations on the network.
|
Architecture
|
A private backbone that carries traffic between different data centers, without going over the public internet.
|
Technologies
|
|
Root cause
|
A configuration error on a router in one location inadvertently rerouted all traffic on the private network ("backbone") through that router.
|
Failure
|
The router was overwhelmed, and traffic to other locations was lost.
|
Impact
|
A 27-minute outage of the company's services; logs and metrics were lost at the data centers processing logs.
|
Mitigation
|
The router with the defective configuration was disabled, shutting down the backbone in that location.
|
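The failure mode described under "How it happened" and "Root cause" can be illustrated with a toy model. This is only a sketch under stated assumptions, not the configuration or routing logic from the report: it assumes route selection by a simple numeric preference (higher wins), and the preference values, prefix names, and every location other than Atlanta are hypothetical.

```python
from collections import Counter

# Hypothetical preference values; higher wins. Not taken from the report.
NORMAL_PREF = 100
LEAKED_PREF = 200


def select_exit(candidates):
    """Return the location of the candidate route with the highest preference."""
    return max(candidates, key=lambda c: c[1])[0]


def backbone_load(prefix_homes, leaking_location=None):
    """Count how many prefixes each location ends up carrying.

    prefix_homes maps a prefix name to the location that normally serves it.
    If leaking_location is set, that location also offers a route for every
    prefix at a higher preference, modelling the defective configuration.
    """
    load = Counter()
    for prefix, home in prefix_homes.items():
        candidates = [(home, NORMAL_PREF)]
        if leaking_location is not None:
            candidates.append((leaking_location, LEAKED_PREF))
        load[select_exit(candidates)] += 1
    return load


# Illustrative topology: only Atlanta appears in the report.
prefix_homes = {
    "prefix-a": "Atlanta",
    "prefix-b": "Chicago",
    "prefix-c": "London",
    "prefix-d": "Tokyo",
}

healthy = backbone_load(prefix_homes)
faulty = backbone_load(prefix_homes, leaking_location="Atlanta")

print("healthy:", dict(healthy))
print("faulty: ", dict(faulty))
```

In the healthy case each location carries its own prefix. With the leak, every prefix selects Atlanta, which is the concentration of traffic on one router that the record describes.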