Incident
|
#39 at
Travis CI on
2017/02/02
|
Full report
|
https://www.traviscistatus.com/incidents/sxrh0l46czqn
|
How it happened
|
A new version of the service was deployed and builds were being incorrectly marked as failed, so the deployment was rolled back, but the rollback was unsuccessful as the previous version (ie, the version to rollback to) was not marked ("tagged") correctly in the source (Docker Hub).
|
Architecture
|
A service that provisions virtual machines (VMs) or containers (for running software builds) and monitors those VMs over their lifetime. The service has multiple backends so the provisioned VM/container can be Docker, Google Compute Engine, vSphere for macOS, etc. The service is autoscaled using many EC2 instances (with solid state drives), with each running a finite number of concurrent jobs.
|
Technologies
|
Amazon Elastic Compute Cloud (EC2), Docker, Google Compute Engine (GCE), vSphere
|
Root cause
|
A new version of the service was deployed with a defect in how it determines a bash script failed or succeeded (ie in how it handles exit codes)
|
Failure
|
The service marked successful jobs as failures; A rollback of the service failed.
|
Impact
|
Cutsomer jobs run on VMs provisioned by the service were marked as failed even if they succeeded in some cases.
|
Mitigation
|
Responders correctly marked the target verion in the source (Docker Hub) and forced the rollback, rather than wait for the normal cycle.
|