Google cloud Fails during maintenance

“Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage.”
Incident #16 at Google on 2019/06/02
Full report https://status.cloud.google.com/incident/cloud-networking/19009
How it happened Maintenance event begins in a single location. Due to a misconfiguration automation software then descheduled the logical clusters running network control jobs (which should not have been configured to be stopped during such a maintenance event) in multiple locations (not just the location of the event). The network continued to run normally until BGP routing (between particular locations) was withdrawn, significantly reducing network capacity.
Architecture Google Cloud regional datacenters each segregated into multiple logical clusters which each have their own dedicated cluster management software (for redundancy). The network control plane is managed by different instances of the same management software.
Technologies Google Cloud Platform (GCP)
Root cause Two latent misconfigurations: network control plan jobs and associated infrastructure were configured to stop during maintenance events, and network control plan management software were configured to be included in a rare maintenance event type; and a piece of maintenance software had a bug which led to it de-scheduling multiple software clusters at once.
Failure Network congession and packet loss.
Impact Customers experienced increased latency, intermittent errors, and connectivity loss to instances in multiple datacenters (leading to outages of services in those datacenters, unless they could redirect users to unaffected data centers).
Mitigation Stopped the automation software which precipitated the event and restarted the network control plane and its infrastructure. Network configuration data was rebuilt and redistributed. In the meantime, responders redirected traffic to unaffected datacenters.