The day we deleted our VM images

“To avoid running out of space, we have an automated cleanup service in place to delete images that have been removed from our internal image catalog service. You may already see where this is going.”
Incident #34 at Travis CI on 2016/08/09 by Konstantin Haase (Chief Technology Officer)
Full report https://blog.travis-ci.com/2016-09-30-the-day-we-deleted-our-vm-images/
How it happened To troubleshoot a bug, the cleanup process was turned off. In the meantime the organization began creating more virtual machine images than before, including images that had not yet been fully tested. When the cleanup process was turned back on, it retrieved a partial list of valid images from the database (due to a limit of 100 on the query) and then deleted older images that were still in use, including the most heavily used ones.
Architecture Continuous integration tool running build jobs on Linux virtual machines in Google Compute Engine.
Technologies Google Compute Engine (GCE)
Root cause The virtual machine image cleanup script queried a database for the list of valid images (so it knew what not to delete), but the query had a limit of 100 and therefore returned only a partial list.
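
The query-limit failure mode can be sketched as follows. This is a minimal illustration, not the actual Travis CI code: the schema, image names, and function names are hypothetical, and it only shows how a LIMIT 100 on the "valid images" query turns the keep list into a partial list, so the cleanup loop deletes images that are still in use.

"""Minimal sketch of the failure mode (hypothetical schema and names)."""
import sqlite3

def fetch_valid_image_names(conn):
    # BUG: the LIMIT caps the keep list at 100 rows; any valid image past
    # the first 100 is silently missing from the result.
    rows = conn.execute(
        "SELECT name FROM images WHERE state = 'valid' LIMIT 100"
    ).fetchall()
    return {name for (name,) in rows}

def cleanup(conn, existing_gce_images):
    keep = fetch_valid_image_names(conn)
    for image in existing_gce_images:
        if image not in keep:
            # Anything absent from the truncated keep list is deleted,
            # even if it is still in use by running build jobs.
            print(f"deleting {image}")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE images (name TEXT, state TEXT)")
    # 150 valid images: the query above only ever returns 100 of them,
    # so the remaining 50 are treated as deletable.
    conn.executemany(
        "INSERT INTO images VALUES (?, 'valid')",
        [(f"ci-image-{i}",) for i in range(150)],
    )
    cleanup(conn, [f"ci-image-{i}" for i in range(150)])
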
Failure Deletion of virtual machine images being used in client build jobs.
Impact Extended outage and permanent loss of virtual machine images, breaking many customers' build jobs.
Mitigation Recovering the deleted virtual machine images was not an option, so all jobs were rolled forward to the new (not yet fully tested) images. The engineering team spent more than a week fixing the issues that arose.