Incident
|
#34 at Travis CI on 2016/08/09 by Konstantin Haase (Chief Technology Officer)
|
Full report
|
https://blog.travis-ci.com/2016-09-30-the-day-we-deleted-our-vm-images/
|
How it happened
|
To troubleshoot a bug, the cleanup process was turned off. In the meantime, the organization began creating more virtual machine images than before, including images that had not yet been fully tested. When the cleanup process was turned back on, it retrieved only a partial list of valid images from the database (because of a limit of 100 on the query) and then deleted older images that were still in use, including the most heavily used ones (see the sketch below).
|
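A minimal, self-contained sketch of the failure mode, not Travis CI's actual code; the table layout, image names, and the 150-image count are assumptions for illustration. The point is that a cleanup routine building its keep-list from a query capped at 100 rows treats every valid image beyond that cap as deletable.

```python
import sqlite3

# Hypothetical stand-in for the image database: 150 images, all still valid.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE images (name TEXT, valid INTEGER)")
db.executemany(
    "INSERT INTO images VALUES (?, 1)",
    [(f"travis-ci-image-{i}",) for i in range(150)],
)

# The cleanup script asked the database which images were still valid,
# but the query carried a limit of 100, so only a partial list came back.
valid = {row[0] for row in db.execute(
    "SELECT name FROM images WHERE valid = 1 LIMIT 100"
)}

# Everything not on the (partial) keep-list was treated as stale and deleted,
# including the remaining 50 images that were still in use.
all_images = {row[0] for row in db.execute("SELECT name FROM images")}
to_delete = all_images - valid
print(f"{len(to_delete)} still-valid images would be deleted")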
Architecture
|
Continuous integration tool running build jobs on Linux virtual machines in Google Compute Engine.
|
Technologies
|
Google Compute Engine (GCE)
|
Root cause
|
The virtual machine image cleanup script queried a database for the list of valid images (so it knew what not to delete), but the query had a limit of 100, so the list it received was incomplete (a safer query is sketched below).
|
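A hedged sketch of a safer approach, under the same assumptions as the sketch above (an SQL-backed valid-image list with hypothetical table and column names): page through the full result set instead of taking a single capped page, and refuse to delete anything if the returned list looks implausibly small.

```python
import sqlite3

def fetch_all_valid_images(db: sqlite3.Connection, page_size: int = 100) -> set[str]:
    """Page through every valid image instead of stopping after one capped page."""
    names: set[str] = set()
    offset = 0
    while True:
        rows = db.execute(
            "SELECT name FROM images WHERE valid = 1 LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            return names
        names.update(row[0] for row in rows)
        offset += page_size

def images_safe_to_delete(db: sqlite3.Connection, minimum_expected: int = 100) -> set[str]:
    """Only compute a deletion set when the valid-image list is plausibly complete."""
    valid = fetch_all_valid_images(db)
    if len(valid) < minimum_expected:
        raise RuntimeError("valid-image list suspiciously small; refusing to delete")
    all_images = {row[0] for row in db.execute("SELECT name FROM images")}
    return all_images - valid
```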
Failure
|
Deletion of virtual machine images being used in client build jobs.
|
Impact
|
Extended outage and permanent loss of virtual machine images, breaking many customers' build jobs.
|
Mitigation
|
Recovering the deleted virtual machine images was not an option, so they rolled all jobs forward to the new (not yet fully tested) images. The engineering team spent more than a week fixing the issues that arose.
|