AWS Lambda

| # | Org | Year | Report |
|---|-----|------|--------|
| 33 | Amazon Web Services | 2017 | Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” |

Amazon Elastic Block Store (EBS)

| # | Org | Year | Report |
|---|-----|------|--------|
| 22 | Amazon Web Services | 2011 | Amazon EC2 and Amazon RDS Service Disruption in the US East Region: “As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.” |
| 23 | Foursquare | 2010 | Foursquare outage post mortem: “Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks.” |
| 33 | Amazon Web Services | 2017 | Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” |
| 36 | Amazon Web Services | 2016 | Summary of the AWS Service Event in the Sydney Region: “The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage).” |

Amazon Elastic Compute Cloud (EC2)

| # | Org | Year | Report |
|---|-----|------|--------|
| 1 | Buildkite | 2016 | Buildkite Outage: “We woke up at 21:00 UTC almost 4 hours after we went offline to see our phones full of emails, tweets and Slack messages letting us know Buildkite was down. Many expletives were yelled as we all raced out of bed, opened laptops, and started figuring out what was going on.” |
| 8 | Epic | 2018 | Postmortem of Service Outage at 3.4M CCU: “Fortnite hit a new peak of 3.4 million concurrent players last Sunday... and that didn't come without issues!” |
| 9 | Parse.ly | 2015 | Kafkapocalypse: a postmortem on our service outage: “The real problem here isn't failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes.” |
| 22 | Amazon Web Services | 2011 | Amazon EC2 and Amazon RDS Service Disruption in the US East Region: “As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.” |
| 23 | Foursquare | 2010 | Foursquare outage post mortem: “Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks.” |
| 33 | Amazon Web Services | 2017 | Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” |
| 36 | Amazon Web Services | 2016 | Summary of the AWS Service Event in the Sydney Region: “The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage).” |
| 39 | Travis CI | 2017 | Travis CI Container-based Linux Precise infrastructure emergency maintenance: “This change appears to have effects on how bash handles exit codes, in a manner that we have [not] fully investigated yet. This change was not detected by our staging environment tests and revealed insufficient diversity in how our tests reflect the variety of builds ou[r] users are running.” |

Amazon Kinesis

| # | Org | Year | Report |
|---|-----|------|--------|
| 41 | Amazon Web Services | 2020 | Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region: “The trigger, though not root cause, for the event was a relatively small addition of capacity” |

Amazon Relational Database Service (RDS)

| # | Org | Year | Report |
|---|-----|------|--------|
| 1 | Buildkite | 2016 | Buildkite Outage: “We woke up at 21:00 UTC almost 4 hours after we went offline to see our phones full of emails, tweets and Slack messages letting us know Buildkite was down. Many expletives were yelled as we all raced out of bed, opened laptops, and started figuring out what was going on.” |

Amazon Simple Storage Service (S3)

| # | Org | Year | Report |
|---|-----|------|--------|
| 14 | Tarsnap | 2016 | Tarsnap Outage: “I'm happy that the particular failure mode -- 'something weird happened; shut down all the things' -- ran exactly as I hoped.” |
| 33 | Amazon Web Services | 2017 | Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region: “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.” |

Amazon Virtual Private Cloud (VPC) Regions

| # | Org | Year | Report |
|---|-----|------|--------|
| 22 | Amazon Web Services | 2011 | Amazon EC2 and Amazon RDS Service Disruption in the US East Region: “As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.” |

Apache Kafka

| # | Org | Year | Report |
|---|-----|------|--------|
| 9 | Parse.ly | 2015 | Kafkapocalypse: a postmortem on our service outage: “The real problem here isn't failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes.” |

Apache ZooKeeper

| # | Org | Year | Report |
|---|-----|------|--------|
| 9 | Parse.ly | 2015 | Kafkapocalypse: a postmortem on our service outage: “The real problem here isn't failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes.” |
| 19 | Reddit | 2017 | Why Reddit was down on Aug 11: “Autoscaler read the partially migrated Zookeeper data and terminated many of our application servers, which serve our website and API, and our caching servers, in 16 seconds.” |
| 35 | Elastic | 2019 | Elastic Cloud January 18, 2019 Incident Report: “So, if the ZooKeeper server is loaded and causes heartbeat timeouts because of GC pauses, TreeCache will start flooding ZooKeeper with requests, making the situation worse and leading to a chain reaction that prevents the ZooKeeper servers from recovering, and can also kill client services.” |
| 40 | Elastic | 2019 | Elastic Cloud Incident Report: February 4, 2019: “Service metrics had reported the hosts as healthy, thus signaling that it was safe to proceed with the maintenance; however, the metrics proved to be insufficient in conveying the state of individual hosts and of the coordination layer as a whole.” |

App Engine Blobstore API

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

Azure Disk Snapshots

| # | Org | Year | Report |
|---|-----|------|--------|
| 6 | GitLab | 2017 | Postmortem of database outage of January 31: “Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead.” |

Diesel Rotary Uninterruptible Power Supply (DRUPS)

| # | Org | Year | Report |
|---|-----|------|--------|
| 36 | Amazon Web Services | 2016 | Summary of the AWS Service Event in the Sydney Region: “The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage).” |

Docker

| # | Org | Year | Report |
|---|-----|------|--------|
| 39 | Travis CI | 2017 | Travis CI Container-based Linux Precise infrastructure emergency maintenance: “This change appears to have effects on how bash handles exit codes, in a manner that we have [not] fully investigated yet. This change was not detected by our staging environment tests and revealed insufficient diversity in how our tests reflect the variety of builds ou[r] users are running.” |

Elastic Load Balancing (ELB)

| # | Org | Year | Report |
|---|-----|------|--------|
| 1 | Buildkite | 2016 | Buildkite Outage: “We woke up at 21:00 UTC almost 4 hours after we went offline to see our phones full of emails, tweets and Slack messages letting us know Buildkite was down. Many expletives were yelled as we all raced out of bed, opened laptops, and started figuring out what was going on.” |
| 25 | Amazon Web Services | 2012 | Amazon ELB Service Event in the US-East Region: “This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.” |

Elasticsearch

| # | Org | Year | Report |
|---|-----|------|--------|
| 40 | Elastic | 2019 | Elastic Cloud Incident Report: February 4, 2019: “Service metrics had reported the hosts as healthy, thus signaling that it was safe to proceed with the maintenance; however, the metrics proved to be insufficient in conveying the state of individual hosts and of the coordination layer as a whole.” |

GitHub

| # | Org | Year | Report |
|---|-----|------|--------|
| 27 | CircleCI | 2015 | CircleCI DB performance issue: “At this point, we were in extended failure mode: the original cause of the outage was no longer the fire to be fought. We were suffering a cascading effect, and that was now where we needed to put our focus.” |

Go

| # | Org | Year | Report |
|---|-----|------|--------|
| 2 | Square | 2017 | Always Be Closing: The Tale of a Go Resource Leak: “This root cause was tickled by a configuration change in another service, which inadvertently set its client request timeout to 60,000 seconds instead of the intended 60,000 milliseconds.” |

Google Cloud Platform

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

Google Cloud Platform (GCP)

| # | Org | Year | Report |
|---|-----|------|--------|
| 3 | Discord | 2017 | Unavailable Guilds & Connection Issues: “These issues caused enough critical impact that Discord's engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.” |
| 16 | Google | 2019 | Google Cloud fails during maintenance: “Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage.” |

Google Cloud Storage

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

Google Compute Engine (GCE)

| # | Org | Year | Report |
|---|-----|------|--------|
| 15 | Discord | 2017 | Connectivity Issues: “Shortly thereafter the nodes of another service [...] attempted a reconnection, triggering a massive 'thundering herd' towards the existing members of the presence cluster.” |
| 31 | Google | 2015 | Google Compute Engine Persistent Disk issue in europe-west1-b: “Four successive lightning strikes on the local utilities grid that powers our European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone.” |
| 34 | Travis CI | 2016 | The day we deleted our VM images: “To avoid running out of space, we have an automated cleanup service in place to delete images that have been removed from our internal image catalog service. You may already see where this is going.” |
| 39 | Travis CI | 2017 | Travis CI Container-based Linux Precise infrastructure emergency maintenance: “This change appears to have effects on how bash handles exit codes, in a manner that we have [not] fully investigated yet. This change was not detected by our staging environment tests and revealed insufficient diversity in how our tests reflect the variety of builds ou[r] users are running.” |

Google Drive

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

Gmail

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

Google’s internal blob

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

HAProxy Load Balancer

| # | Org | Year | Report |
|---|-----|------|--------|
| 17 | Stack Exchange | 2014 | Stack Exchange Configuration Error: “While attempting to make a change enabling streamlined access for our web servers to internal API endpoints [...] a misleading comment in the iptables configuration led us to make a harmful change.” |

Internet Information Services (IIS)

| # | Org | Year | Report |
|---|-----|------|--------|
| 17 | Stack Exchange | 2014 | Stack Exchange Configuration Error: “While attempting to make a change enabling streamlined access for our web servers to internal API endpoints [...] a misleading comment in the iptables configuration led us to make a harmful change.” |

Kibana

| # | Org | Year | Report |
|---|-----|------|--------|
| 40 | Elastic | 2019 | Elastic Cloud Incident Report: February 4, 2019: “Service metrics had reported the hosts as healthy, thus signaling that it was safe to proceed with the maintenance; however, the metrics proved to be insufficient in conveying the state of individual hosts and of the coordination layer as a whole.” |

Mcrouter

| # | Org | Year | Report |
|---|-----|------|--------|
| 43 | Slack | 2022 | Slack’s Incident on 2-22-22: “What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.” |

Memcached

| # | Org | Year | Report |
|---|-----|------|--------|
| 43 | Slack | 2022 | Slack’s Incident on 2-22-22: “What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.” |

MongoDB

| # | Org | Year | Report |
|---|-----|------|--------|
| 8 | Epic | 2018 | Postmortem of Service Outage at 3.4M CCU: “Fortnite hit a new peak of 3.4 million concurrent players last Sunday... and that didn't come without issues!” |
| 23 | Foursquare | 2010 | Foursquare outage post mortem: “Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks.” |

MySQL

| # | Org | Year | Report |
|---|-----|------|--------|
| 13 | GitHub | 2018 | October 21 post-incident analysis: “Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.” |
| 20 | Dropbox | 2014 | Outage post-mortem: “For the past couple of days, we’ve been working around the clock to restore full access as soon as possible.” |
| 43 | Slack | 2022 | Slack’s Incident on 2-22-22: “What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.” |

NGINX

| # | Org | Year | Report |
|---|-----|------|--------|
| 4 | Cloudflare | 2017 | Incident report on memory leak caused by Cloudflare parser bug: “So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.” |
| 8 | Epic | 2018 | Postmortem of Service Outage at 3.4M CCU: “Fortnite hit a new peak of 3.4 million concurrent players last Sunday... and that didn't come without issues!” |

Opscode Chef

| # | Org | Year | Report |
|---|-----|------|--------|
| 9 | Parse.ly | 2015 | Kafkapocalypse: a postmortem on our service outage: “The real problem here isn't failure, but correlated cluster-wide failure. Because we were close to network limits on all of our Kafka nodes, when one failed, the remaining nodes would have to serve more consumers, which would, in turn, lead to more network traffic on those remaining nodes.” |

Orchestrator

| # | Org | Year | Report |
|---|-----|------|--------|
| 13 | GitHub | 2018 | October 21 post-incident analysis: “Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.” |

Pacemaker

| # | Org | Year | Report |
|---|-----|------|--------|
| 10 | GoCardless | 2017 | Incident review: API and Dashboard outage on 10 October 2017: “The Pacemaker cluster correctly observed that Postgres was unhealthy on the primary node. It repeatedly attempted to promote a new primary, but each time it couldn't decide where that primary should run.” |

Photos

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

PostgreSQL

| # | Org | Year | Report |
|---|-----|------|--------|
| 1 | Buildkite | 2016 | Buildkite Outage: “We woke up at 21:00 UTC almost 4 hours after we went offline to see our phones full of emails, tweets and Slack messages letting us know Buildkite was down. Many expletives were yelled as we all raced out of bed, opened laptops, and started figuring out what was going on.” |
| 5 | Mailchimp | 2019 | What We Learned from the Recent Mandrill Outage: “In November of 2018, engineers on our Mandrill team identified the potential to reach wraparound, as the XIDs were climbing to approximately half their total limit during peak load. Our team determined wraparound was not an immediate threat, but we added a ticket to the backlog to set up additional monitoring.” |
| 6 | GitLab | 2017 | Postmortem of database outage of January 31: “Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead.” |
| 10 | GoCardless | 2017 | Incident review: API and Dashboard outage on 10 October 2017: “The Pacemaker cluster correctly observed that Postgres was unhealthy on the primary node. It repeatedly attempted to promote a new primary, but each time it couldn't decide where that primary should run.” |
| 21 | Joyent | 2015 | Postmortem for July 27 outage of the Manta service: “There was a single 'DROP TRIGGER' query that was attempting to take an exclusive lock on the whole table. It appears that PostgreSQL blocks new attempts to take a shared lock while an exclusive lock is wanted.” |
| 26 | GoCardless | 2015 | Zero-downtime Postgres migrations - the hard parts: “We deployed the changes, and all of our assumptions got blown out of the water. Just after the schema migration started, we started getting alerts about API requests timing out.” |

Puppet

| # | Org | Year | Report |
|---|-----|------|--------|
| 17 | Stack Exchange | 2014 | Stack Exchange Configuration Error: “While attempting to make a change enabling streamlined access for our web servers to internal API endpoints [...] a misleading comment in the iptables configuration led us to make a harmful change.” |

Raft

| # | Org | Year | Report |
|---|-----|------|--------|
| 13 | GitHub | 2018 | October 21 post-incident analysis: “Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.” |

Ragel

| # | Org | Year | Report |
|---|-----|------|--------|
| 4 | Cloudflare | 2017 | Incident report on memory leak caused by Cloudflare parser bug: “So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.” |

Redis

| # | Org | Year | Report |
|---|-----|------|--------|
| 3 | Discord | 2017 | Unavailable Guilds & Connection Issues: “These issues caused enough critical impact that Discord's engineering team was forced to fully restart the service, reconnecting millions of clients over a period of 20 minutes.” |
| 28 | Twilio | 2013 | Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause: “This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.” |
| 32 | GitHub | 2016 | GitHub January 28th Incident Report: “Slightly over 25% of our servers and several network devices rebooted as a result. This left our infrastructure in a partially operational state and generated alerts to multiple on-call engineers.” |

Relational Database Service (RDS)

| # | Org | Year | Report |
|---|-----|------|--------|
| 22 | Amazon Web Services | 2011 | Amazon EC2 and Amazon RDS Service Disruption in the US East Region: “As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.” |

SQL Server

| # | Org | Year | Report |
|---|-----|------|--------|
| 29 | Stack Exchange | 2017 | Outage Postmortem - January 24 2017: “It took us 2 minutes to notice the issue, 5 minutes to locate the source of the issue and 10 minutes to get service restored.” |

Stackdriver Monitoring

| # | Org | Year | Report |
|---|-----|------|--------|
| 42 | Google | 2019 | Elevated error rate with Google Cloud Storage: “User-visible [services that use the failing service] also saw elevated error rates, although the user impact was greatly reduced by caching and redundancy built into those services.” |

Standard Persistent Disks

| # | Org | Year | Report |
|---|-----|------|--------|
| 31 | Google | 2015 | Google Compute Engine Persistent Disk issue in europe-west1-b: “Four successive lightning strikes on the local utilities grid that powers our European datacenter caused a brief loss of power to storage systems which host disk capacity for GCE instances in the europe-west1-b zone.” |

TreeCache

| # | Org | Year | Report |
|---|-----|------|--------|
| 35 | Elastic | 2019 | Elastic Cloud January 18, 2019 Incident Report: “So, if the ZooKeeper server is loaded and causes heartbeat timeouts because of GC pauses, TreeCache will start flooding ZooKeeper with requests, making the situation worse and leading to a chain reaction that prevents the ZooKeeper servers from recovering, and can also kill client services.” |

Vitess clustering system

| # | Org | Year | Report |
|---|-----|------|--------|
| 43 | Slack | 2022 | Slack’s Incident on 2-22-22: “What was not obvious early on was why we were seeing so much database load on this keyspace and how we might get to a normal serving state.” |

vSphere

| # | Org | Year | Report |
|---|-----|------|--------|
| 18 | Travis CI | 2015 | High queue times on OSX builds (.com and .org): “When the [passwords] rotation happened, the configuration for the vsphere-janitor service did not get updated.” |
| 39 | Travis CI | 2017 | Travis CI Container-based Linux Precise infrastructure emergency maintenance: “This change appears to have effects on how bash handles exit codes, in a manner that we have [not] fully investigated yet. This change was not detected by our staging environment tests and revealed insufficient diversity in how our tests reflect the variety of builds ou[r] users are running.” |