Incident
|
#18 at
Travis CI on
2015/08/04
|
Full report
|
https://www.traviscistatus.com/incidents/khzk8bg4p9sy
|
How it happened
|
Passwords were rotated for the vSphere API as required and the resource clean up service was not reconfigured with the new password. The clean up service could then no longer clean up virtual machines after use, leading to more than 6000 virtual machines on a cluster that typically has around 200. Due to a defect, the clean up service continued to report the last known number of virtual machines to the metrics system (which delayed notification). Once the initial problem was mitigated, the build service worked to catch up on queued jobs but exceeded capacity of the cluster.
|
Architecture
|
Build services (creation, build and cleanup) running on Xserve hosts virtualized using a vSphere cluster and hosted by MacStadium.
|
Technologies
|
vSphere
|
Root cause
|
A resource clean up service did not get updated with the new credentials for the (vSphere) API; a configuration defect approved more virtual machines than could be supprted by the underlying cluster.
|
Failure
|
Creating new virtual machines on the cluster failed, which led to elevated requeue rates and a backlog of work.
|
Impact
|
A period of instability and and outage for the service.
|
Mitigation
|
Paused work, reconfigured the clean up service to use the appropriate password and restarted the service. Again paused work, fixed the "CPU reservation" configuration and restarted the build service.
|