Always Be Closing: The Tale of a Go Resource Leak

“This root cause was tickled by a configuration change in another service, which inadvertently set its client request timeout to 60,000 seconds instead of the intended 60,000 milliseconds.”
Incident #2 at Square on 2017/01/19 by Alec Homes (Software Engineer)
Full report
How it happened: A configuration change that set the client request timeout to 60,000 seconds (rather than the intended 60,000 milliseconds) was deployed to one of the server's clients. Due to a latent defect in the server, requests from that client were held in memory for 60,000 seconds. Memory and CPU utilization grew steadily until the server was unable to serve requests.
Architecture: Multiple clients and a shared server.
Technologies: Go
Root cause: One client's timeout was set too high (seconds were confused with milliseconds), which exposed a latent code defect in the server (failure to clean up per-request context).
Failure: High memory and CPU utilization on the server.
Impact: Degraded performance and then an outage for all clients of the server.
Mitigation: Deployed fixes for the configuration and code defects.