Buildkite Outage

“We woke up at 21:00 UTC almost 4 hours after we went offline to see our phones full of emails, tweets and Slack messages letting us know Buildkite was down. Many expletives were yelled as we all raced out of bed, opened laptops, and started figuring out what was going on.”

Incident	#1 at Buildkite on 2016/08/23 by Keith Pitt (Founder and CTO)
Full report	https://building.buildkite.com/outage-post-mortem-for-august-23rd-82b619a3679b
How it happened	The database was downgraded to a lower-capacity instance. At the next daily peak, load exceeded database capacity and database connections failed. Because the service's health checks called the database, they failed too, and the load balancer removed the EC2 instances.
Architecture	A service running on multiple EC2 instances in an Auto Scaling group (behind an Elastic Load Balancer), accessing a PostgreSQL instance on Amazon Relational Database Service (RDS).
Technologies	Amazon Elastic Compute Cloud (EC2), Amazon Relational Database Service (RDS), PostgreSQL, Elastic Load Balancing (ELB)
Root cause	Database was under-scaled for peak load. The service's health checks called the database.
Failure	Failed connections to database; all EC2 instances removed and no new EC2 instances successfully activated.
Impact	Users could not log in or access other system features.
Mitigation	Upgraded database and added new EC2 instances to Elastic Load Balancer.
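
The root cause above hinges on health checks that called the database: once the database refused connections, every instance looked unhealthy and was removed. One common way to avoid that coupling is to give the load balancer a shallow check that never touches the database, and to expose the database-dependent check separately for monitoring. The sketch below illustrates the idea only; it is not Buildkite's code or their documented fix, and the paths `/healthz` and `/healthz/deep`, the `route` function, and the `check_database` probe are all hypothetical names chosen for illustration.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def shallow_health() -> Response:
    """Liveness check: answered by the process alone, never calls the database.

    Assumption: this is the endpoint the load balancer would be pointed at.
    """
    return 200, "ok"


def deep_health(check_database: Callable[[], None]) -> Response:
    """Dependency check: reports database trouble without raising.

    Assumption: intended for dashboards and alerting, not for the load balancer.
    """
    try:
        check_database()  # hypothetical probe, e.g. a trivial query
    except Exception as exc:
        return 503, f"database unavailable: {exc}"
    return 200, "ok"


def route(path: str, check_database: Callable[[], None]) -> Response:
    """Dispatch a request path to the matching health check (hypothetical paths)."""
    if path == "/healthz":
        return shallow_health()
    if path == "/healthz/deep":
        return deep_health(check_database)
    return 404, "not found"


def failing_database_probe() -> None:
    """Simulates the outage: the database refuses new connections."""
    raise ConnectionError("too many connections")


if __name__ == "__main__":
    print(route("/healthz", failing_database_probe))
    print(route("/healthz/deep", failing_database_probe))
```

With the simulated database failure, the shallow check still returns 200 while the deep check returns 503, so instances would stay in rotation while the database problem remains visible to monitoring. The trade-off is that a shallow check cannot detect an instance whose own database connectivity is uniquely broken, so the deep check still needs an alert attached to it.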