Heroku April 2017 App Crashes

“These missed state updates were very hard for us to discover because our routing fleet only maintains a connection to the affected class of instance for 30 minutes. After this time the connection is terminated and cycled to another server.”
Incident #30 at Heroku on 2017/04/03
Full report https://status.heroku.com/incidents/1091
How it happened During scheduled maintenance a configuration procedure timed-out leaving one instance configured incorrectly and this timeout was not visible to the engineers performing the maintenance. As a result, for newly stopped or started containers it was possible for the routing fleet to miss that state change (ie, treat a stopped container as running or a running container as stopped)
Architecture A fleet for creating hosted application containers (called Dynos) and a separate fleet that routes traffic to those containers.
Technologies
Root cause A recently added timeout for manual system administration activities that unexpectedly affected automated activities leading to potentially silent failures during updates. Specifically, a remote procedure to update instances of the fleet (the fleet that creates containers) configurations timed out leaving one instance with an incorrect configuration.
Failure Traffic was routed to some stopped containers and not routed to some started containers.
Impact Elevated error rate (including 'backend connection timeout', 'backend connetion refused', 'app crashed' and 'request error') for applications with containers that were stopped or started over the course of the incident.
Mitigation Flushed the routing caches helped mitigate the issue during the investigation to minimize impact; updated instance with correct configuration.