Incident: #27 at CircleCI on 2015/07/25

Full report: https://circleci.statuspage.io/incidents/hr0mm9xmm3x6

How it happened:
There was an interruption in receiving external events (push hooks from GitHub), followed by a burst of the events that had built up during the interruption, causing an arrival rate several multiples of normal peak. The queue backed up and event processing dropped to one event per minute because the database had become unresponsive due to resource contention. Engineers throttled incoming traffic at the load balancer in an attempt to let the queue drain, but this made the entire site unresponsive because customer and event traffic came through the same load balancers.
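
The throttling backfired because both traffic classes shared one choke point. A minimal Python sketch of that coupling (a toy token bucket; the report does not describe CircleCI's actual load balancer or its configuration, so all names here are illustrative):

```python
import time


class SharedRateLimiter:
    """Token bucket shared by ALL traffic entering the load balancer."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Throttle hard to let the event queue drain...
limiter = SharedRateLimiter(rate_per_sec=5, burst=10)

# ...but webhook events and customer page loads draw from the same bucket,
# so the event flood consumes every token and customers are rejected too.
for kind in ["event"] * 50 + ["customer"]:
    accepted = limiter.allow()
print("customer request accepted?", accepted)  # False: site unreachable
```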

Architecture:
An event queue receiving external events (push hooks from GitHub), and a service that reads from the queue and queries a database, re-queueing each event if its query fails.
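
The re-queue-on-failure loop in this architecture is what turns a slow database into a growing backlog: every failed query becomes fresh queue load. A minimal consumer sketch of the architecture described above (function names such as `query_db` are illustrative, not CircleCI's code):

```python
import queue

# Events arriving from GitHub push hooks queue here.
events: "queue.Queue[dict]" = queue.Queue()


def query_db(event: dict) -> None:
    """Stand-in for the per-event database query; fails while the DB is contended."""
    raise TimeoutError("database unresponsive")


def drain(max_attempts: int) -> None:
    """One service pass over the queue, re-queueing failures as described above."""
    for _ in range(max_attempts):
        event = events.get()
        try:
            query_db(event)
        except TimeoutError:
            # Re-queue on failure: with no backoff or retry cap, a slow
            # database converts every failure into fresh queue load, so
            # the backlog can only grow while contention lasts.
            events.put(event)


events.put({"push_hook": "github"})
drain(max_attempts=3)
print("backlog after 3 attempts:", events.qsize())  # still 1: nothing drains
```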

Technologies: GitHub

Root cause: A surge of events exceeded the database's capacity.

Failure: Database became unresponsive due to resource contention.

Impact: External events were not processed and customers could not reach the site.

Mitigation:
Turned off automatic re-queueing of builds, optimized several slow-running queries, and then killed many active jobs (i.e., the work of the event processing service) by clearing various queues.
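
Two of these mitigations act directly on the queueing layer. A hedged sketch of what disabling re-queueing and clearing a queue could look like for the toy consumer above (a module flag and a purge loop; purely illustrative, not CircleCI's actual tooling):

```python
import queue

events: "queue.Queue[dict]" = queue.Queue()
REQUEUE_ON_FAILURE = False  # flipped off during the incident (a flag like this is an assumption)


def query_db(event: dict) -> None:
    """Stand-in for the slow query; assume the database is still struggling."""
    raise TimeoutError("database unresponsive")


def drain(max_attempts: int) -> None:
    for _ in range(max_attempts):
        if events.empty():
            return
        event = events.get()
        try:
            query_db(event)
        except TimeoutError:
            if REQUEUE_ON_FAILURE:
                events.put(event)
            # else: drop the event, trading lost builds for a chance
            # to let the database recover


def clear_queue() -> None:
    """Kill pending jobs wholesale by discarding the backlog."""
    while not events.empty():
        events.get_nowait()


events.put({"push_hook": "github"})
drain(max_attempts=3)
print("backlog with re-queueing off:", events.qsize())  # 0: the queue drains
```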