Forsquare outage post mortem

“Over these two months, check-ins were being written continually to each shard. Unfortunately, these check-ins did not grow evenly across chunks.”
Incident #23 at Foursquare on 2010/10/05 by Eliot Horowitz (CTO 10gen)
Full report https://groups.google.com/forum/#!topic/mongodb-user/UoqU8ofp134
How it happened Data grew unevenly between two database shards, consuming available memory (66GB) in one shard leading to data reads and writes going to the disk, increasing latency by an order of magnitude for key queries. Requests began backing up and the site crashed.
Architecture A MongoDB database on a two-shard cluster, each replicating to a slave for redundancy. Frequently used data is stored in RAM. Sharding is based on user id.
Technologies MongoDB, Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS)
Root cause Data in two shards grew unevenly eventually consuming all available RAM in one of the shards, requring reads and writes to hit EBS volumes.
Failure Key queries had high latency leading to a backlog of requests.
Impact Multiple days of outages for the site.
Mitigation Created a third MongoDB shard and moved 5% of the data to the new shard, ran a command (on primary and replica) to compact the database (repairDatabase) to free up enough memory to bring the system online.