Incident
|
#23 at
Foursquare on
2010/10/05 by Eliot Horowitz (CTO 10gen)
|
Full report
|
https://groups.google.com/forum/#!topic/mongodb-user/UoqU8ofp134
|
How it happened
|
Data grew unevenly between two database shards, consuming available memory (66GB) in one shard leading to data reads and writes going to the disk, increasing latency by an order of magnitude for key queries. Requests began backing up and the site crashed.
|
Architecture
|
A MongoDB database on a two-shard cluster, each replicating to a slave for redundancy. Frequently used data is stored in RAM. Sharding is based on user id.
|
Technologies
|
MongoDB, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS)
|
Root cause
|
Data in two shards grew unevenly eventually consuming all available RAM in one of the shards, requring reads and writes to hit EBS volumes.
|
Failure
|
Key queries had high latency leading to a backlog of requests.
|
Impact
|
Multiple days of outages for the site.
|
Mitigation
|
Created a third MongoDB shard and moved 5% of the data to the new shard, ran a command (on primary and replica) to compact the database (repairDatabase) to free up enough memory to bring the system online.
|