Incident
|
#16 at
Google on
2019/06/02
|
Full report
|
https://status.cloud.google.com/incident/cloud-networking/19009
|
How it happened
|
Maintenance event begins in a single location. Due to a misconfiguration automation software then descheduled the logical clusters running network control jobs (which should not have been configured to be stopped during such a maintenance event) in multiple locations (not just the location of the event). The network continued to run normally until BGP routing (between particular locations) was withdrawn, significantly reducing network capacity.
|
Architecture
|
Google Cloud regional datacenters each segregated into multiple logical clusters which each have their own dedicated cluster management software (for redundancy). The network control plane is managed by different instances of the same management software.
|
Technologies
|
Google Cloud Platform (GCP)
|
Root cause
|
Two latent misconfigurations: network control plan jobs and associated infrastructure were configured to stop during maintenance events, and network control plan management software were configured to be included in a rare maintenance event type; and a piece of maintenance software had a bug which led to it de-scheduling multiple software clusters at once.
|
Failure
|
Network congession and packet loss.
|
Impact
|
Customers experienced increased latency, intermittent errors, and connectivity loss to instances in multiple datacenters (leading to outages of services in those datacenters, unless they could redirect users to unaffected data centers).
|
Mitigation
|
Stopped the automation software which precipitated the event and restarted the network control plane and its infrastructure. Network configuration data was rebuilt and redistributed. In the meantime, responders redirected traffic to unaffected datacenters.
|