Incident: #42 at Google on 2019/03/12
Full report: https://status.cloud.google.com/incident/storage/19002
How it happened: The data store system saw an increase (unexplained in the report) in storage resources used, so responders made a configuration change that had the side effect of overloading the subsystem used to look up the location of stored data. The increased load led to a cascading failure.
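The overload-to-cascade dynamic described above can be illustrated with a toy model. This is not taken from the report; the capacity and retry numbers are invented, and it only shows the generic mechanism by which failed lookups that get retried add load, causing more failures:

```python
# Toy model (numbers hypothetical) of cascading overload: a lookup
# subsystem with fixed capacity, whose failed requests are retried by
# clients, amplifying the offered load each time step.

CAPACITY = 100           # requests/tick the lookup subsystem can serve
RETRIES_PER_FAILURE = 2  # assumed client retry amplification factor

def tick(offered_load: int) -> tuple[int, int]:
    """Return (served, failed) request counts for one time step."""
    served = min(offered_load, CAPACITY)
    failed = offered_load - served
    return served, failed

def simulate(baseline_load: int, ticks: int) -> list[int]:
    """Offered load over time, with failures fed back as retries."""
    load, history = baseline_load, []
    for _ in range(ticks):
        _, failed = tick(load)
        history.append(load)
        # Next tick: baseline demand plus retries of this tick's failures.
        load = baseline_load + failed * RETRIES_PER_FAILURE
    return history

# Load just over capacity grows without bound (a cascade); load under
# capacity stays flat.
print(simulate(110, 5))  # → [110, 130, 170, 250, 410]
print(simulate(90, 5))   # → [90, 90, 90, 90, 90]
```

The point of the sketch: once offered load exceeds capacity, retries make the overload self-reinforcing, which is why the mitigation below required actively reducing traffic rather than waiting for recovery.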
Architecture: A data store system (storing "blob" data) with multiple dependent systems, most of which cache the data.
Technologies: Google's internal blob store, Google Cloud Platform, Google Cloud Storage, Stackdriver Monitoring, App Engine Blobstore API, Google services (Gmail, Photos, Google Drive, etc.)
Root cause: A configuration change to the data store system that overloaded one of its subsystems.
Failure: Elevated error rates (20% on average, peaking at 31%) from the data store system.
Impact: Services that depend on the storage system (e.g., Gmail, Photos, Google Drive) experienced failures (minimized by data caching) and increased long-tail latency.
Mitigation: Engineers halted the configuration change rollout and manually reduced traffic levels until tasks restarted (tasks would otherwise crash on startup due to the load).
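The mitigation amounts to traffic shedding: admit only a fraction of requests so restarting tasks see a survivable load, then ramp back up. A minimal sketch of that idea, with all names and numbers hypothetical (the report does not describe Google's actual tooling):

```python
# Hedged sketch of manual traffic reduction: a shedder that admits only a
# fraction of requests, which operators ramp up as restarted tasks stay
# healthy. Class and parameter names are invented for illustration.

import random

class TrafficShedder:
    """Probabilistically admit `admit_fraction` of requests; reject the rest fast."""

    def __init__(self, admit_fraction: float, seed: int = 0):
        self.admit_fraction = admit_fraction
        self._rng = random.Random(seed)  # seeded for reproducibility

    def allow(self) -> bool:
        """Decide whether one incoming request is admitted."""
        return self._rng.random() < self.admit_fraction

    def ramp_up(self, step: float = 0.1) -> None:
        """Raise the admitted fraction gradually, capped at 100%."""
        self.admit_fraction = min(1.0, self.admit_fraction + step)

# Start by admitting ~20% of traffic while tasks restart.
shedder = TrafficShedder(admit_fraction=0.2)
admitted = sum(shedder.allow() for _ in range(10_000))
print(f"admitted ~{admitted} of 10000 requests")
```

Rejecting excess requests cheaply at admission, instead of letting them pile onto a recovering subsystem, is what breaks the crash-on-startup loop the report describes.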