| # | Org | Year | Report |
|---|-----|------|--------|
| 1 | Buildkite | 2016 | Buildkite Outage: “We woke up at 21:00 UTC almost 4 hours after we went offline to see our phones full of emails, tweets and Slack messages letting us know Buildkite was down. Many expletives were yelled as we all raced out of bed, opened laptops, and started figuring out what was going on.” |
| 2 | Square | 2017 | Always Be Closing: The Tale of a Go Resource Leak: “This root cause was tickled by a configuration change in another service, which inadvertently set its client request timeout to 60,000 seconds instead of the intended 60,000 milliseconds.” (see the sketch after this table) |
| 3 | Discord | 2017 | Unavailable Guilds & Connection Issues: “These issues caused enough critical impact that Discord's engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.” |
| 4 | Cloudflare | 2017 | Incident report on memory leak caused by Cloudflare parser bug: “So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.” |
| 5 | Mailchimp | 2019 | What We Learned from the Recent Mandrill Outage: “In November of 2018, engineers on our Mandrill team identified the potential to reach wraparound, as the XIDs were climbing to approximately half their total limit during peak load. Our team determined wraparound was not an immediate threat, but we added a ticket to the backlog to set up additional monitoring.” |
| 6 | GitLab | 2017 | Postmortem of database outage of January 31: “Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead.” |
| 7 | Salesforce | 2016 | RCM for NA14 Disruptions of Service - May 2016: “Each attempt to restore service resulted in errors or failures that prevented these approaches from continuing.” |
| 8 | Epic | 2018 | Postmortem of Service Outage at 3.4M CCU: “Fortnite hit a new peak of 3.4 million concurrent players last Sunday... and that didn't come without issues!” |
| 9 | Parse.ly | 2015 | Kafkapocalypse: a postmortem on our service outage: “The real problem here isn't failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes.” |
| 10 | GoCardless | 2017 | Incident review: API and Dashboard outage on 10 October 2017: “The Pacemaker cluster correctly observed that Postgres was unhealthy on the primary node. It repeatedly attempted to promote a new primary, but each time it couldn't decide where that primary should run.” |
| 11 | Stripe | 2019 | Root cause analysis: significantly elevated error rates on 2019-07-10: “The new version also introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes. On the day of the events, one shard was in the specific state that triggered this fault.” |
| 12 | Cloudflare | 2019 | Details of the Cloudflare outage on July 2, 2019: “The real story of how the Cloudflare service went down for 27 minutes is much more complex than 'a regular expression went bad'.” |
| 13 | GitHub | 2018 | October 21 post-incident analysis: “Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.” |
| 14 | Tarsnap | 2016 | Tarsnap Outage: “I'm happy that the particular failure mode -- 'something weird happened; shut down all the things' -- ran exactly as I hoped.” |
| 15 | Discord | 2017 | Connectivity Issues: “Shortly thereafter the nodes of another service [...] attempted a reconnection, triggering a massive 'thundering herd' towards the existing members of the presence cluster.” |
| 16 | Google | 2019 | Google Cloud fails during maintenance: “Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage.” |
| 17 | Stack Exchange | 2014 | Stack Exchange Configuration Error: “While attempting to make a change enabling streamlined access for our web servers to internal API endpoints [...] a misleading comment in the iptables configuration led us to make a harmful change.” |
| 18 | Travis CI | 2015 | High queue times on OSX builds (.com and .org): “When the [passwords] rotation happened, the configuration for the vsphere-janitor service did not get updated.” |
| 19 | Reddit | 2017 | Why Reddit was down on Aug 11: “Autoscaler read the partially migrated Zookeeper data and terminated many of our application servers, which serve our website and API, and our caching servers, in 16 seconds.” |
| 20 | Dropbox | 2014 | Outage post-mortem: “For the past couple of days, we’ve been working around the clock to restore full access as soon as possible.” |
| 21 | Joyent | 2015 | Postmortem for July 27 outage of the Manta service: “There was a single 'DROP TRIGGER' query that was attempting to take an exclusive lock on the whole table. It appears that PostgreSQL blocks new attempts to take a shared lock while an exclusive lock is wanted.” |
| 22 | Amazon Web Services | 2011 | Amazon EC2 and Amazon RDS Service Disruption in the US East Region: “As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.” |
| 23 | Foursquare | 2010 | Foursquare outage post mortem: “Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks.” |
| 24 | Cloudflare | 2020 | Cloudflare outage on July 17, 2020: “This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.” |
| 25 | Amazon Web Services | 2012 | Amazon ELB Service Event in the US-East Region: “This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.” |
| 26 | GoCardless | 2015 | Zero-downtime Postgres migrations - the hard parts: “We deployed the changes, and all of our assumptions got blown out of the water. Just after the schema migration started, we started getting alerts about API requests timing out.” |
| 27 | CircleCI | 2015 | CircleCI DB performance issue: “At this point, we were in extended failure mode: the original cause of the outage was no longer the fire to be fought. We were suffering a cascading effect, and that was now where we needed to put our focus.” |
| 28 | Twilio | 2013 | Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause: “This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.” |
| 29 | Stack Exchange | 2017 | Outage Postmortem - January 24 2017: “It took us 2 minutes to notice the issue, 5 minutes to locate the source of the issue and 10 minutes to get service restored.” |
| 30 | Heroku | 2017 | Heroku April 2017 App Crashes: “These missed state updates were very hard for us to discover because our routing fleet only maintains a connection to the affected class of instance for 30 minutes. After this time the connection is terminated and cycled to another server.” |
| 31 | Google | 2015 | Google Compute Engine Persistent Disk issue in europe-west1-b: “Four successive lightning strikes on the local utilities grid that powers our European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone.” |
| 32 | GitHub | 2016 | GitHub January 28th Incident Report: “Slightly over 25% of our servers and several network devices rebooted as a result. This left our infrastructure in a partially operational state and generated alerts to multiple on-call engineers.” |
| 33 | Amazon Web Services | 2017 | Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” |
| 34 | Travis CI | 2016 | The day we deleted our VM images: “To avoid running out of space, we have an automated cleanup service in place to delete images that have been removed from our internal image catalog service. You may already see where this is going.” |
| 35 | Elastic | 2019 | Elastic Cloud January 18, 2019 Incident Report: “So, if the ZooKeeper server is loaded and causes heartbeat timeouts because of GC pauses, TreeCache will start flooding ZooKeeper with requests, making the situation worse and leading to a chain reaction that prevents the ZooKeeper servers from recovering, and can also kill client services.” |
| 36 | Amazon Web Services | 2016 | Summary of the AWS Service Event in the Sydney Region: “The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage).” |
| 37 | Duo | 2018 | Authentication Latency on DUO1 Deployment: “Once this problem was identified, these queues were flushed on each application server and things immediately began to stabilize. In hindsight, this is effectively what the software rollback did as part of the issue on August 20th, which is why the rollback solved that prior issue.” |
| 38 | Stack Exchange | 2016 | A Post-Mortem on the Recent Developer Story Information Leak: “A bug that caused the user’s phone number and email address to render in the HTML source for people that weren’t the user or an employer attempting to contact the user went unnoticed, because the information wasn’t actually rendered on the page.” |
| 39 | Travis CI | 2017 | Travis CI Container-based Linux Precise infrastructure emergency maintenance: “This change appears to have effects on how bash handles exit codes, in a manner that we have not fully investigated yet. This change was not detected by our staging environment tests and revealed insufficient diversity in how our tests reflect the variety of builds our users are running.” |
| 40 | Elastic | 2019 | Elastic Cloud Incident Report: February 4, 2019: “Service metrics had reported the hosts as healthy, thus signaling that it was safe to proceed with the maintenance; however, the metrics proved to be insufficient in conveying the state of individual hosts and of the coordination layer as a whole.” |
| 41 | Amazon Web Services | 2020 | Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region: “The trigger, though not root cause, for the event was a relatively small addition of capacity” |
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |
| 43 | Slack | 2022 | Slack’s Incident on 2-22-22: “What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.” |
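
The seconds-versus-milliseconds mix-up in entry 2 is easy to reproduce. Below is a minimal, hypothetical Go sketch (not Square's actual code) showing how the same numeric literal combined with the wrong `time` unit constant turns an intended one-minute client timeout into roughly 16.7 hours.

```go
// Sketch of the unit mix-up from entry 2: time.Duration is an integer count
// of nanoseconds, so a literal only means what its unit constant says, and
// picking the wrong constant silently scales the timeout by 1,000.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	intended := 60000 * time.Millisecond // one minute, as intended
	mistaken := 60000 * time.Second      // ~16.7 hours, as misconfigured

	fmt.Println(intended) // 1m0s
	fmt.Println(mistaken) // 16h40m0s

	// With the mistaken value, a stalled request can hold its connection
	// (and any resources tied to it) for hours instead of a minute.
	client := &http.Client{Timeout: mistaken}
	_ = client
}
```

Writing the value with an explicit unit at the configuration boundary (or simply as `time.Minute`) removes the ambiguity that made this misconfiguration possible.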