Previously in Jepsen, we discussed Redis. In this post, we’ll see MongoDB drop a phenomenal amount of data. See also: followup analyses of 2.6.7 and 3.4.0-rc3.

MongoDB is a document-oriented database with a similar distribution design to Redis. In a replica set, there exists a single writable primary node which accepts writes, and asynchronously replicates those writes as an oplog to N secondaries. However, there are a few key differences.

First, Mongo builds in its leader election and replicated state machine. There’s no separate system which tries to observe a replica set in order to make decisions about what it should do. The replica set decides among itself which node should be primary, when to step down, how to replicate, etc. This is operationally simpler and eliminates whole classes of topology problems.

Second, Mongo allows you to ask that the primary confirm successful replication of a write by its disk log, or by secondary nodes. At the cost of latency, we can get stronger guarantees about whether or not a write was successful.
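
Here's roughly what those options look like from a driver's perspective. This is a pymongo sketch, not the Clojure client these tests actually use, and the hostnames are made up; the knobs are the write concern's w (how many nodes must acknowledge the write) and j (whether it must hit the journal, i.e. the disk log, first).

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Hypothetical replica-set connection string.
client = MongoClient("mongodb://n1,n2,n3,n4,n5/?replicaSet=rs0")
db = client.get_database("test")

# Confirm each write against the primary's journal (its disk log).
journaled = db.get_collection("jepsen",
                              write_concern=WriteConcern(w=1, j=True))

# Confirm each write against a majority of the replica set's members.
majority = db.get_collection("jepsen",
                             write_concern=WriteConcern(w="majority"))

majority.insert_one({"a": 1})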

What happens when a primary becomes inaccessible?

The remaining secondaries will gradually detect the failed connection and attempt to come to a consensus about what to do. If they have a majority (and remember, there can be only one majority in a cluster, so this suggests we’re heading towards a CP system), they’ll select the node with the highest optime (a monotonic clock maintained by each node) and promote it to be a new primary. Simultaneously, the minority nodes will detect that they no longer have a quorum, and demote the primary to a secondary so it can’t accept writes.

If our primary is on n1, and we cut off n1 and n2 from the rest of the cluster, we expect either n3, n4, or n5 to become the new primary. Because this architecture demotes the original primary on n1, we won’t find ourselves in the same split-brain problem we saw with Redis.

Consistency

So is MongoDB CP? There’s a popular notion that MongoDB is a CP system, including exchanges like this, where all kinds of nuanced technical assertions about strong consistency are thrown around. At the same time, Mongo’s documentation for replica sets explains carefully that Mongo may “revert operations”:

In some failover situations primaries will have accepted write operations that have not replicated to the secondaries after a failover occurs. This case is rare and typically occurs as a result of a network partition with replication lag. When this member (the former primary) rejoins the replica set and attempts to continue replication as a secondary the former primary must revert these operations or “roll back” these operations to maintain database consistency across the replica set.

“Revert” certainly doesn’t sound like linearizability to me, but that bit about “maintain[ing] database consistency” doesn’t sound so bad. What actually happens? Let’s find out!

For this example, we’ll be adding integers to a list in a MongoDB document by using the update command in a CaS loop, just like you’d use with any transactionally isolated database. Yes, we could use $addToSet, but I’m using this app as an example of atomic updates in general, and they have different oplog dynamics.
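
In outline, each client reads the current list, then issues an update conditional on exactly the value it read, retrying until the update takes. Here’s a sketch of that loop in pymongo (the real test is a Clojure program; the document id and field name are invented):

def cas_add(coll, n):
    # Append n to the document's list via an optimistic compare-and-set loop.
    # Assumes an acknowledged write concern; with w=0 there'd be nothing to check.
    while True:
        doc = coll.find_one({"_id": "jepsen"})
        old = doc["elements"]
        # The update matches only if the list is still exactly what we read;
        # if another client got there first, nothing matches and we retry.
        res = coll.update_one({"_id": "jepsen", "elements": old},
                              {"$set": {"elements": old + [n]}})
        if res.modified_count == 1:
            return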

Unacknowledged

Up until recently, MongoDB clients didn’t bother to check whether or not their writes succeeded by default: they just sent them and assumed everything went fine. This goes about as well as you’d expect.
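
In driver terms, the old default is a fire-and-forget write: the call returns before the server says anything, so there’s no result to inspect. A pymongo sketch of the same behavior, with w=0 standing in for the old default (hostname invented):

from pymongo import MongoClient
from pymongo.errors import InvalidOperation

# w=0: send the write and return immediately, without waiting for a reply.
client = MongoClient("mongodb://n1/?replicaSet=rs0", w=0)
coll = client.test.jepsen

res = coll.update_one({"_id": "jepsen"}, {"$push": {"elements": 42}})
print(res.acknowledged)  # False: we have no idea whether this write applied.
try:
    res.modified_count   # There is no server response to ask.
except InvalidOperation:
    pass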

lein run mongo-unsafe -n 6000
salticid jepsen.partition

For a while, writes continue to complete against n1. Then we see errors as the replica set fails over, like

3186	No replica set members available in [ { address:'n3/10.10.3.101:27017', ok:true, ping:0.8954104, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216, },{ address:'n4/10.10.3.95:27017', ok:true, ping:0.681164, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216, },{ address:'n5/10.10.3.32:27017', ok:true, ping:0.6231328, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216, },{ address:'n2/10.10.3.52:27017', ok:true, ping:0.51316977, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216, },{ address:'n1/10.10.3.242:27017', ok:true, ping:0.37008655, isMaster:false, isSecondary:true, setName:rs0, maxBsonObjectSize:16777216, } ] for { "mode" : "primary"}

During this time, the majority nodes (n3, n4, n5) are still secondaries, but they’ve agreed that the old primary is inaccessible. They compare optimes and race to elect a leader:

$ salticid mongo.rs_stat
22:09:08 Starting...
22:09:08 MongoDB shell version: 2.4.1
22:09:08 connecting to: test
22:09:08 n1:27017 (not reachable/healthy)       1368940104/56
22:09:08 n2:27017 (not reachable/healthy)       1368940103/458
22:09:08 n3:27017 SECONDARY     1368940104/89
22:09:08 n4:27017 SECONDARY     1368940104/89
22:09:08 n5:27017 SECONDARY     1368940104/102
22:09:08 true
22:09:08 Finished
22:09:23 n1:27017 (not reachable/healthy)     1368941926/66
22:09:23 n2:27017 (not reachable/healthy)     1368941961/70
22:09:23 n3:27017 SECONDARY     1368941962/9
22:09:23 n4:27017 SECONDARY     1368941961/45
22:09:23 n5:27017 PRIMARY       1368941963/11
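
That rs_stat output comes from a helper script, but any driver can ask a member for the same view via the replSetGetStatus admin command. A pymongo sketch (hostname invented):

from pymongo import MongoClient

# Connect to one member directly and ask for its view of the replica set.
client = MongoClient("mongodb://n5:27017", directConnection=True)
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"], member.get("optime"))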

N5 wins the race, and proceeds to accept writes. If we heal the partition with salticid jepsen.heal, and wait a few seconds, the nodes will detect the fully connected cluster and the new primary will step down, to allow n1 to resume its place. Now that the cluster has stabilized, we hit enter to check how many of our writes survived:

Hit enter when ready to collect results.
Writes completed in 93.608 seconds

6000 total
5700 acknowledged
3319 survivors
2381 acknowledged writes lost! (╯°□°)╯︵ ┻━┻
469 474 479 484 489 494 ... 3166 3168 3171 3173 3178 3183
0.95 ack rate
0.4177193 loss rate
0.0 unacknowledged but successful rate
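
To unpack those numbers: the loss rate is measured against acknowledged writes only, and since no unacknowledged write happened to survive here, the arithmetic is just:

total        = 6000
acknowledged = 5700   # writes the client believed succeeded
survivors    = 3319   # integers present in the final document

lost = acknowledged - survivors     # 2381; every survivor was acknowledged,
                                    # per the 0.0 unacknowledged-success rate
print(acknowledged / total)         # 0.95   ack rate
print(lost / acknowledged)          # 0.4177 loss rate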

42% write loss. Well, to some extent, this shouldn’t be surprising, because we weren’t checking to see whether the server was successful in applying our writes. Those 300 errors only came about when we tried to write to a secondary. But we never actually crashed a node, and we didn’t see any signs of a split-brain condition with two simultaneous primaries–so why did Mongo drop data?

Remember those writes that completed on n1 just after the partition started? Those writes are still on n1, but never made it to n5. N5 proceeded without them. Now n1 and n5 are comparing notes, and n1 realizes that n5’s optime is higher. N1 figures out the last point where the two agreed on the oplog, and rolls back to that point.

22:09:33 Sun May 19 05:09:33.032 [rsHealthPoll] replSet member n5:27017 is now in state PRIMARY
22:09:33 Sun May 19 05:09:33.207 [initandlisten] connection accepted from 10.10.3.95:37718 #6154 (23 connections now open)
22:09:33 Sun May 19 05:09:33.417 [rsBackgroundSync] replSet syncing to: n5:27017
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replSet our last op time fetched: May 19 05:08:37:2
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replset source's GTE: May 19 05:09:26:1
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replSet rollback 0
22:09:33 Sun May 19 05:09:33.438 [rsBackgroundSync] replSet ROLLBACK
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet rollback 1
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet info rollback our last optime:   May 19 05:08:37:2
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet info rollback their last optime: May 19 05:09:33:32
22:09:33 Sun May 19 05:09:33.439 [rsBackgroundSync] replSet info rollback diff in end of log times: -56 seconds
22:09:35 Sun May 19 05:09:33.621 [initandlisten] connection accepted from 10.10.3.32:59066 #6155 (24 connections now open)
22:09:35 Sun May 19 05:09:35.221 [rsBackgroundSync] replSet rollback found matching events at May 19 05:08:24:66
22:09:35 Sun May 19 05:09:35.221 [rsBackgroundSync] replSet rollback findcommonpoint scanned : 3798
22:09:35 Sun May 19 05:09:35.221 [rsBackgroundSync] replSet replSet rollback 3 fixup
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet rollback 3.5
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet rollback 4 n:1
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet minvalid=May 19 05:09:35 51985e8f:19
22:09:35 Sun May 19 05:09:35.222 [rsBackgroundSync] replSet rollback 4.6
22:09:35 Sun May 19 05:09:35.223 [rsBackgroundSync] replSet rollback 4.7
22:09:35 Sun May 19 05:09:35.223 [rsBackgroundSync] replSet rollback 5 d:0 u:1
22:09:35 Sun May 19 05:09:35.224 [rsBackgroundSync] replSet rollback 6
22:09:35 Sun May 19 05:09:35.236 [rsBackgroundSync] replSet rollback 7
22:09:35 Sun May 19 05:09:35.238 [rsBackgroundSync] replSet rollback done
22:09:35 Sun May 19 05:09:35.238 [rsBackgroundSync] replSet RECOVERING

During a rollback, all the writes the old primary accepted after the common point in the oplog are removed from the database and written to a BSON file in Mongo’s rollback directory. If you’re a sysadmin, you could go look at the rollback files to try and reconstruct the writes that the database dropped.
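
Those rollback files are plain BSON, so any BSON reader can dump them. A Python sketch using the bson package that ships with pymongo; the path here is hypothetical, but the real files live under the dbpath’s rollback directory and are named after the collection and a timestamp:

import bson  # ships with pymongo

# Hypothetical rollback file path under the dbpath.
path = "/var/lib/mongodb/rollback/test.jepsen.2013-05-19T05-09-35.0.bson"

with open(path, "rb") as f:
    for doc in bson.decode_all(f.read()):
        print(doc)  # each entry is a document version the rollback removed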

Well, theoretically. In my tests, it only does this in 1 out of 5 runs or so. Mostly, it just throws those writes away entirely: no rollback files, no nothing. I don’t really know why.
