Incident: #27 at CircleCI on 2015/07/25

Full report: https://circleci.statuspage.io/incidents/hr0mm9xmm3x6

How it happened:
There was an interruption in receiving external events (push hooks from GitHub), followed by a burst of the events that had built up during the interruption, causing an arrival rate several multiples of normal peak. The queue backed up and event processing dropped to one event per minute because the database had become unresponsive due to resource contention. Engineers throttled incoming traffic at the load balancer in an attempt to let the queue drain, but this made the entire site unresponsive because customer and event traffic came through the same load balancers.
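
The throttling backfired because both traffic classes shared one choke point. A minimal Python sketch of that coupling (a toy token bucket; the report does not describe CircleCI's actual load balancer or its configuration, so all names here are illustrative):

```python
import time


class SharedRateLimiter:
    """Token bucket shared by ALL traffic entering the load balancer."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Throttle hard to let the event queue drain...
limiter = SharedRateLimiter(rate_per_sec=5, burst=10)

# ...but webhook events and customer page loads draw from the same bucket,
# so the event flood consumes every token and customers are rejected too.
for kind in ["event"] * 50 + ["customer"]:
    accepted = limiter.allow()
print("customer request accepted?", accepted)  # False: site unreachable
```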

Architecture:
An event queue receiving external events (push hooks from GitHub), and a service that reads from the queue and queries a database, re-queueing each event if its query fails.
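
The re-queue-on-failure loop in this architecture is what turns a slow database into a growing backlog: every failed query becomes fresh queue load. A minimal consumer sketch of the architecture described above (function names such as `query_db` are illustrative, not CircleCI's code):

```python
import queue

# Events arriving from GitHub push hooks queue here.
events: "queue.Queue[dict]" = queue.Queue()


def query_db(event: dict) -> None:
    """Stand-in for the per-event database query; fails while the DB is contended."""
    raise TimeoutError("database unresponsive")


def drain(max_attempts: int) -> None:
    """One service pass over the queue, re-queueing failures as described above."""
    for _ in range(max_attempts):
        event = events.get()
        try:
            query_db(event)
        except TimeoutError:
            # Re-queue on failure: with no backoff or retry cap, a slow
            # database converts every failure into fresh queue load, so
            # the backlog can only grow while contention lasts.
            events.put(event)


events.put({"push_hook": "github"})
drain(max_attempts=3)
print("backlog after 3 attempts:", events.qsize())  # still 1: nothing drains
```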

Technologies: GitHub

Root cause: A surge of events exceeded the database's capacity.

Failure: Database became unresponsive due to resource contention.

Impact: External events were not processed and customers could not reach the site.

Mitigation:
Turned off automatic re-queueing of builds, optimized several slow-running queries, and then killed many active jobs (i.e., the work of the event processing service) by clearing various queues.
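
Two of these mitigations act directly on the queueing layer. A hedged sketch of what disabling re-queueing and clearing a queue could look like for the toy consumer above (a module flag and a purge loop; purely illustrative, not CircleCI's actual tooling):

```python
import queue

events: "queue.Queue[dict]" = queue.Queue()
REQUEUE_ON_FAILURE = False  # flipped off during the incident (a flag like this is an assumption)


def query_db(event: dict) -> None:
    """Stand-in for the slow query; assume the database is still struggling."""
    raise TimeoutError("database unresponsive")


def drain(max_attempts: int) -> None:
    for _ in range(max_attempts):
        if events.empty():
            return
        event = events.get()
        try:
            query_db(event)
        except TimeoutError:
            if REQUEUE_ON_FAILURE:
                events.put(event)
            # else: drop the event, trading lost builds for a chance
            # to let the database recover


def clear_queue() -> None:
    """Kill pending jobs wholesale by discarding the backlog."""
    while not events.empty():
        events.get_nowait()


events.put({"push_hook": "github"})
drain(max_attempts=3)
print("backlog with re-queueing off:", events.qsize())  # 0: the queue drains
```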