Incident: #42 at Google on 2019/03/12
Full report: https://status.cloud.google.com/incident/storage/19002
How it happened: The data store system saw an increase (unexplained in the report) in storage resources used, so responders made a configuration change that had the side effect of overloading the subsystem used to look up the location of stored data. The increased load led to a cascading failure.
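The overload-to-cascade dynamic described above can be illustrated with a toy model. This is not taken from the report; the capacity and retry numbers are invented, and it only shows the generic mechanism by which failed lookups that get retried add load, causing more failures:

```python
# Toy model (numbers hypothetical) of cascading overload: a lookup
# subsystem with fixed capacity, whose failed requests are retried by
# clients, amplifying the offered load each time step.

CAPACITY = 100           # requests/tick the lookup subsystem can serve
RETRIES_PER_FAILURE = 2  # assumed client retry amplification factor

def tick(offered_load: int) -> tuple[int, int]:
    """Return (served, failed) request counts for one time step."""
    served = min(offered_load, CAPACITY)
    failed = offered_load - served
    return served, failed

def simulate(baseline_load: int, ticks: int) -> list[int]:
    """Offered load over time, with failures fed back as retries."""
    load, history = baseline_load, []
    for _ in range(ticks):
        _, failed = tick(load)
        history.append(load)
        # Next tick: baseline demand plus retries of this tick's failures.
        load = baseline_load + failed * RETRIES_PER_FAILURE
    return history

# Load just over capacity grows without bound (a cascade); load under
# capacity stays flat.
print(simulate(110, 5))  # → [110, 130, 170, 250, 410]
print(simulate(90, 5))   # → [90, 90, 90, 90, 90]
```

The point of the sketch: once offered load exceeds capacity, retries make the overload self-reinforcing, which is why the mitigation below required actively reducing traffic rather than waiting for recovery.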
Architecture: A data store system (storing "blob" data) with multiple dependent systems, most of which cache the data.
Technologies: Google's internal blob store, Google Cloud Platform, Google Cloud Storage, Stackdriver Monitoring, App Engine Blobstore API, Google services (Gmail, Photos, Google Drive, etc.)
Root cause: A configuration change to the data store system that overloaded one of its subsystems.
Failure: Elevated error rates (20% on average, peaking at 31%) from the data store system.
Impact: Services that depend on the storage system (e.g., Gmail, Photos, Google Drive) experienced failures (minimized by data caching) and increased long-tail latency.
Mitigation: Engineers halted the configuration change rollout and manually reduced traffic levels until tasks restarted (tasks would otherwise crash on startup due to the load).
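The mitigation amounts to traffic shedding: admit only a fraction of requests so restarting tasks see a survivable load, then ramp back up. A minimal sketch of that idea, with all names and numbers hypothetical (the report does not describe Google's actual tooling):

```python
# Hedged sketch of manual traffic reduction: a shedder that admits only a
# fraction of requests, which operators ramp up as restarted tasks stay
# healthy. Class and parameter names are invented for illustration.

import random

class TrafficShedder:
    """Probabilistically admit `admit_fraction` of requests; reject the rest fast."""

    def __init__(self, admit_fraction: float, seed: int = 0):
        self.admit_fraction = admit_fraction
        self._rng = random.Random(seed)  # seeded for reproducibility

    def allow(self) -> bool:
        """Decide whether one incoming request is admitted."""
        return self._rng.random() < self.admit_fraction

    def ramp_up(self, step: float = 0.1) -> None:
        """Raise the admitted fraction gradually, capped at 100%."""
        self.admit_fraction = min(1.0, self.admit_fraction + step)

# Start by admitting ~20% of traffic while tasks restart.
shedder = TrafficShedder(admit_fraction=0.2)
admitted = sum(shedder.allow() for _ in range(10_000))
print(f"admitted ~{admitted} of 10000 requests")
```

Rejecting excess requests cheaply at admission, instead of letting them pile onto a recovering subsystem, is what breaks the crash-on-startup loop the report describes.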