Authentication Latency on DUO1 Deployment

“Once this problem was identified, these queues were flushed on each application server and things immediately began to stabilize. In hindsight, this is effectively what the software rollback did as part of the issue on August 20th, which is why the rollback solved that prior issue.”
Incident #37 at Duo on 2018/08/29
Full report https://status.duo.com/incidents/4w07bmvnt359
How it happened A surge of inbound requests exceeded database capacity and were queued by the application. Even after the traffic subsided the queued requests prevented the database from recovering (ie, capacity was still exceeded).
Architecture A fleet of application servers (running an authentication service) backed by a relational database.
Technologies
Root cause Traffic exceeded database capacity; and a queuing strategy for failed database requests.
Failure Database requests failed and were were queued on the application servers.
Impact Increased authentication latency and intermittent request timeouts for all customer applications.
Mitigation Application request queues were flushed.