Incident
|
#1 at Buildkite on 2016/09/23 by Keith Pitt (Founder and CTO)
|
Full report
|
https://building.buildkite.com/outage-post-mortem-for-august-23rd-82b619a3679b
|
How it happened
|
The database was downgraded to a lower-capacity instance. At the next daily peak, load exceeded database capacity and database connections failed. The load balancer then removed the EC2 instances from service because their health checks called the database.
|
Architecture
|
A service running on multiple EC2 instances in an Auto Scaling group (behind an Elastic Load Balancer), accessing a PostgreSQL database hosted on an Amazon Relational Database Service (RDS) instance.
|
Technologies
|
Amazon Elastic Compute Cloud (EC2), Amazon Relational Database Service (RDS), PostgreSQL, Elastic Load Balancing (ELB)
|
Root cause
|
Database was under-scaled for peak load. The service's health checks called the database.
|
Failure
|
Database connections failed; all EC2 instances were removed from the load balancer, and no new EC2 instances were successfully activated.
|
Impact
|
Users could not log in or access other system features.
|
Mitigation
|
Upgraded database and added new EC2 instances to Elastic Load Balancer.
|
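The root cause above turns on health checks that called the database: once the database refused connections, every instance looked unhealthy to the load balancer. The sketch below shows one common way to decouple the two. It is an illustration only, not the fix described in the full report. The function names (`liveness`, `readiness`, `failing_ping`) and the choice of status codes are assumptions made for this example.

```python
from typing import Callable, Tuple

# Hypothetical names throughout; this is not Buildkite's actual code.


def liveness() -> Tuple[int, str]:
    """Shallow check intended for the load balancer.

    It only proves that the process is up and can answer a request,
    so a database outage does not remove the instance from service.
    """
    return 200, "ok"


def readiness(ping_database: Callable[[], None]) -> Tuple[int, str]:
    """Deep check intended for monitoring and alerting.

    It calls the database and reports failure, but it is assumed here
    that the load balancer does not use this result to evict instances.
    """
    try:
        ping_database()
    except Exception as exc:
        return 503, f"database unavailable: {exc}"
    return 200, "ok"


def failing_ping() -> None:
    """Stand-in for a database call made while the database is overloaded."""
    raise ConnectionError("too many connections")


if __name__ == "__main__":
    print(liveness())               # shallow check ignores database state
    print(readiness(failing_ping))  # deep check surfaces the database failure
```

With this split, an overloaded database degrades the requests that need it, while instances stay registered and can still serve error pages or requests that do not touch the database. The trade-off is that the load balancer no longer detects an instance whose own database connectivity is broken, so the deep check needs to feed alerting instead.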