Forsquare outage post mortem

“Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks.”

Incident	#23 at Foursquare on 2010/10/05 by Eliot Horowitz (CTO 10gen)
Full report	https://groups.google.com/forum/#!topic/mongodb-user/UoqU8ofp134
How it happened	Data grew unevenly between two database shards, consuming available memory (66GB) in one shard leading to data reads and writes going to the disk, increasing latency by an order of magnitude for key queries. Requests began backing up and the site crashed.
Architecture	A MongoDB database on a two-shard cluster, each replicating to a slave for redundancy. Frequently used data is stored in RAM. Sharding is based on user id.
Technologies	MongoDB, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS)
Root cause	Data in two shards grew unevenly eventually consuming all available RAM in one of the shards, requring reads and writes to hit EBS volumes.
Failure	Key queries had high latency leading to a backlog of requests.
Impact	Multiple days of outages for the site.
Mitigation	Created a third MongoDB shard and moved 5% of the data to the new shard, ran a command (on primary and replica) to compact the database (repairDatabase) to free up enough memory to bring the system online.