Incident
|
#2 at Square on 2017/01/19 by Alec Homes (Software Engineer)
|
Full report
|
https://medium.com/square-corner-blog/always-be-closing-3d5fda0e00da
|
How it happened
|
A configuration change that set a client timeout to 60,000 seconds (rather than the intended 60,000 ms) was deployed to one of the server's clients. Due to a latent defect in the server, requests from that client were held in memory for 60,000 seconds (about 16.7 hours). Memory and CPU utilization grew steadily until the server was unable to serve requests.
|
Architecture
|
Multiple clients and a shared server.
|
Technologies
|
Go
|
Root cause
|
Two factors combined: one client's timeout was set too high (seconds were confused with milliseconds), and a latent code defect in the server failed to clean up per-request context.
|
Failure
|
High memory and CPU utilization on the server.
|
Impact
|
Degraded performance followed by an outage for all clients of the server.
|
Mitigation
|
Deployed fixes for both the configuration error and the code defect.
|