Incident
|
#14 at Tarsnap on 2016/07/24 by Colin Percival (Owner)
|
Full report
|
http://mail.tarsnap.com/tarsnap-announce/msg00035.html
|
How it happened
|
Write requests to a third-party dependency (S3) began to time out for a subset of requests, likely due to a change on the third party's side. These requests were retried until retry limits were hit. The process writing to the dependency then aborted and was automatically restarted, logging redundantly on each restart. The filesystem filled up, and the primary service experienced disk write failures and shut down.
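The chain above turns on log volume that was unbounded across restarts. As a rough illustration only (not Tarsnap's actual code, which this summary does not describe), here is a minimal Python sketch of capping on-disk log size with the standard library's `RotatingFileHandler`. The logger name, file name, and size limits are hypothetical.

```python
import logging
import logging.handlers
import os


def make_bounded_logger(path, max_bytes=1_000_000, backups=2):
    """Return (logger, handler) writing to `path` with a hard size cap.

    Hypothetical sketch: the handler rotates before a file would reach
    `max_bytes` and keeps at most `backups` old files, so total usage
    stays at or below (backups + 1) * max_bytes, provided each single
    message is smaller than max_bytes.
    """
    logger = logging.getLogger("s3_writer_sketch")  # assumed name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output in the bounded file only
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    logger.addHandler(handler)
    return logger, handler


def log_dir_bytes(directory):
    """Total size in bytes of all files directly inside `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )
```

Because the cap lives in the files on disk rather than in process memory, it still holds when the writing process aborts and restarts repeatedly.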
|
Architecture
|
A fleet of servers running an archiving service, and several supporting background jobs.
|
Technologies
|
Amazon Simple Storage Service (S3)
|
Root cause
|
The service experienced an increase in correlated timeout failures from a third-party dependency (Amazon S3).
|
Failure
|
Timeout failures for requests to the third-party dependency; the filesystem then reached 100% capacity, which led to the service shutting down.
|
Impact
|
Service functionality was unavailable.
|
Mitigation
|
Deleted the log file that was filling up the filesystem, and restarted the service.
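The mitigation was manual and applied after the disk was already full. As a hedged sketch of an earlier warning signal (an assumption for illustration, not something the report says was deployed), a periodic job could compare filesystem usage against a threshold using Python's `shutil.disk_usage`. The function names and the 90% threshold are hypothetical.

```python
import shutil


def fraction_used(path):
    """Fraction (0.0 to 1.0) of the filesystem containing `path` in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on pseudo-filesystems
    return usage.used / usage.total


def over_threshold(fraction, threshold=0.90):
    """True when usage is at or above the (assumed) alert threshold."""
    return fraction >= threshold


if __name__ == "__main__":
    used = fraction_used("/")
    if over_threshold(used):
        print(f"WARNING: filesystem {used:.0%} full")
```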
|