High queue times on OSX builds (.com and .org)

“When the [passwords] rotation happened, the configuration for the vsphere-janitor service did not get updated.”
Incident #18 at Travis CI on 2015/08/04
Full report https://www.traviscistatus.com/incidents/khzk8bg4p9sy
How it happened Passwords for the vSphere API were rotated as required, but the resource cleanup service was not reconfigured with the new password. The cleanup service could then no longer remove virtual machines after use, leaving more than 6000 virtual machines on a cluster that typically has around 200. Due to a defect, the cleanup service kept reporting the last known number of virtual machines to the metrics system, which delayed notification. Once the initial problem was mitigated, the build service worked to catch up on queued jobs but exceeded the capacity of the cluster.
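The stale-metric defect described above can be illustrated with a minimal sketch, assuming a polling cleanup service. The names `VmCountGauge`, `poll_once`, and `list_vms`, and the 60-second age limit in the test, are hypothetical and are not taken from the report or from the vsphere-janitor code. The idea is that a gauge which stops reporting once its last successful poll is too old lets the monitoring side alert on missing data instead of trusting a frozen value.

```python
import time
from typing import Callable, List, Optional


class VmCountGauge:
    """Hypothetical gauge that refuses to report a stale VM count."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self._value: Optional[int] = None
        self._updated_at: Optional[float] = None

    def record(self, count: int) -> None:
        # Called only after a successful API poll.
        self._value = count
        self._updated_at = self.clock()

    def read(self) -> Optional[int]:
        # Return None (no data) rather than repeating an old value forever.
        if self._updated_at is None:
            return None
        if self.clock() - self._updated_at > self.max_age_seconds:
            return None
        return self._value


def poll_once(gauge: VmCountGauge, list_vms: Callable[[], List[str]]) -> bool:
    """Poll the (assumed) VM listing call; update the gauge only on success."""
    try:
        vms = list_vms()
    except Exception:
        # For example, an authentication failure after a password rotation.
        # The gauge is left untouched so that it ages out.
        return False
    gauge.record(len(vms))
    return True
```

With this shape, a failed login leaves the last value readable only until `max_age_seconds` passes, after which `read()` returns `None` and an alert on absent data can fire.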
Architecture Build services (creation, build and cleanup) running on Xserve hosts virtualized using a vSphere cluster and hosted by MacStadium.
Technologies vSphere
Root cause A resource cleanup service did not get updated with the new credentials for the (vSphere) API; a configuration defect approved more virtual machines than could be supported by the underlying cluster.
Failure Creating new virtual machines on the cluster failed, which led to elevated requeue rates and a backlog of work.
Impact A period of instability and an outage for the service.
Mitigation Paused work, reconfigured the cleanup service to use the appropriate password and restarted the service. Paused work again, fixed the "CPU reservation" configuration and restarted the build service.
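The report does not spell out how the "CPU reservation" setting was wrong, so the following is only a sketch of the general relationship, assuming a fixed per-VM CPU reservation in MHz. The function names and all numbers are hypothetical. It shows why a reservation that is too small (or zero) lets a scheduler approve more virtual machines than the cluster can actually run.

```python
def max_admissible_vms(cluster_cpu_mhz: int, reservation_mhz_per_vm: int) -> int:
    """Upper bound on concurrently running VMs implied by a CPU reservation.

    Assumption: each VM reserves the same amount of CPU.
    """
    if reservation_mhz_per_vm <= 0:
        # A zero reservation places no bound at all on admitted VMs.
        raise ValueError("reservation must be positive to bound VM count")
    return cluster_cpu_mhz // reservation_mhz_per_vm


def should_start_job(running_vms: int, cluster_cpu_mhz: int,
                     reservation_mhz_per_vm: int) -> bool:
    """Admit a new build VM only while the cluster is below its bound."""
    return running_vms < max_admissible_vms(cluster_cpu_mhz, reservation_mhz_per_vm)


# Hypothetical figures: the same cluster admits ten times as many VMs
# when the per-VM reservation is set ten times too low.
correct_bound = max_admissible_vms(400_000, 2_000)
too_low_bound = max_admissible_vms(400_000, 200)
```

A check of this kind belongs wherever jobs are dequeued, so that a backlog drains at the rate the cluster can sustain.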