Incident
|
#30 at Heroku on 2017/04/03
|
Full report
|
https://status.heroku.com/incidents/1091
|
How it happened
|
During scheduled maintenance, a configuration procedure timed out, leaving one instance configured incorrectly; the timeout was not visible to the engineers performing the maintenance. As a result, the routing fleet could miss the state change of a newly stopped or started container (i.e., treat a stopped container as running or a running container as stopped).
|
Architecture
|
A fleet for creating hosted application containers (called Dynos) and a separate fleet that routes traffic to those containers.
|
Technologies
|
|
Root cause
|
A recently added timeout for manual system-administration activities unexpectedly affected automated activities as well, allowing updates to fail silently. Specifically, a remote procedure call that updates the configuration of instances in the container-creating fleet timed out, leaving one instance with an incorrect configuration (sketched below).
|
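The report contains no code; the block below is a minimal, hypothetical Python sketch of the failure pattern named in the root cause, assuming a timeout originally added for interactive manual commands is reused by an automated rollout that swallows the resulting error. The names (`RPC_TIMEOUT_SECONDS`, `push_config`, `rollout`) are illustrative, not Heroku's actual implementation.

```python
# Hypothetical sketch (not Heroku's code): a timeout added with manual admin
# use in mind silently breaks an automated configuration rollout.
import socket

RPC_TIMEOUT_SECONDS = 5  # assumed value; chosen with interactive, manual use in mind


def push_config(instance_addr, config_blob):
    """Send a new configuration to one fleet instance over a plain TCP RPC."""
    with socket.create_connection(instance_addr, timeout=RPC_TIMEOUT_SECONDS) as conn:
        conn.sendall(config_blob)
        ack = conn.recv(2)  # instance replies b"OK" once the config is applied
        return ack == b"OK"


def rollout(instances, config_blob):
    """Automated rollout that reuses push_config.

    The bug pattern: a timeout is treated as "skip and continue" instead of
    being surfaced, so one instance can stay on the old configuration without
    the engineers running the maintenance ever seeing an error.
    """
    for addr in instances:
        try:
            push_config(addr, config_blob)
        except socket.timeout:
            continue  # silent failure: nothing is logged, raised, or retried
```

Logging and re-raising in that `except` clause (or failing the rollout outright) is the kind of change that turns this silent failure into a visible one.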
Failure
|
Traffic was routed to some stopped containers and not routed to some started containers.
|
Impact
|
Elevated error rates (including 'backend connection timeout', 'backend connection refused', 'app crashed', and 'request error') for applications whose containers were stopped or started over the course of the incident.
|
Mitigation
|
Flushing the routing caches during the investigation helped mitigate the issue and minimize impact; the misconfigured instance was then updated with the correct configuration (see the sketch after this record).
|
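For illustration only, here is a minimal Python sketch, under the assumption that the routing fleet caches each application's backend containers, of why flushing that cache mitigates the failure: it discards entries that missed a stop/start event and forces re-resolution from the container fleet. The `RoutingCache` class and `fetch_running_containers` parameter are hypothetical.

```python
# Hypothetical sketch (not Heroku's code): why flushing the routing caches
# helps. The router caches each app's backends; if it missed a stop/start
# event, the cached entry stays stale until it is flushed and re-resolved.


class RoutingCache:
    def __init__(self, fetch_running_containers):
        # fetch_running_containers(app) is assumed to query the container
        # fleet (the source of truth) for the app's running containers.
        self._fetch = fetch_running_containers
        self._backends = {}

    def backends_for(self, app):
        # Serve from cache; a stale entry here is what routes traffic to a
        # stopped container or skips a newly started one.
        if app not in self._backends:
            self._backends[app] = self._fetch(app)
        return self._backends[app]

    def flush(self):
        # Mitigation step: drop every cached entry so the next request for
        # each app re-resolves its backends from the source of truth.
        self._backends.clear()
```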