MongoDB副本集同步原理

MongoDB的同步原理，官方文檔介紹的比較少，網上資料也不是太多，下面是結合官方文檔、網上資料和測試時候的日志，整理出來的一點東西。

因為MongoDB的每個分片也是副本集，是以隻需要搞副本集的同步原理即可。

一、Initial Sync

大體來說，MongoDB副本集同步主要包含兩個步驟：

1. Initial Sync，全量同步
2. Replication，即sync oplog

先通過init sync同步全量資料，再通過replication不斷重放Primary上的oplog同步增量資料。全量同步完成後，成員從轉換 STARTUP2為SECONDARY

1.1 初始化同步過程

1) 全量同步開始，擷取同步源上的最新時間戳t1
2) 全量同步集合資料，建立索引（比較耗時）
3) 擷取同步源上最新的時間戳t2
4) 重放t1到t2之間所有的oplog
5) 全量同步結束

簡單來說，就是周遊Primary上的所有DB的所有集合，将資料拷貝到自身節點，然後讀取全量同步開始到結束時間段内的oplog并重放。

initial sync結束後，Secondary會建立到Primary上local.oplog.rs的tailable cursor，不斷從Primary上擷取新寫入的oplog，并應用到自身。

1.2 初始化同步場景

Secondary節點當出現如下狀況時，需要先進⾏全量同步

1) oplog為空
2) local.replset.minvalid集合⾥_initialSyncFlag字段設定為true（用于init sync失敗處理）
3) 記憶體标記initialSyncRequested設定為true（用于resync指令，resync指令隻用于master/slave架構，副本集無法使用）

這3個場景分别對應(場景2和場景3沒看到官網文檔有寫，參考張友東大神部落格)

1) 新節點加⼊，⽆任何oplog，此時需先進性initial sync
2) initial sync開始時，會主動将_initialSyncFlag字段設定為true，正常結束後再設定為false；如果節點重新開機時，發現_initialSyncFlag為true，說明上次全量同步中途失敗了，此時應該重新進⾏initial sync
3)當⽤戶發送resync指令時，initialSyncRequested會設定為true，此時會強制重新開始⼀次initial sync

1.3 疑問點解釋

1.3.1 全量同步資料的時候，會不會源資料的oplog被覆寫了導緻全量同步失敗？

在3.4版本及以後，不會。

下面這張圖說明了3.4對全量同步的改進（圖來自張友東部落格）：

官方文檔是：

initial sync會在為每個集合複制文檔時構所有集合索引。在早期版本（3.4之前）的MongoDB中，僅_id在此階段建構索引。
Initial sync複制資料的時候會将新增的oplog記錄存到本地（3.4新增）。

1.3.2 如果一個成員脫離了副本集，然後過了很長的時間，oplog已經被多次覆寫的時候，該節點再次加入進來，會直接進行replication嗎？

不會的。

先看一個脫節成員重新加入日志：

2019-08-30T10:12:21.585+0800 I REPL     [replication-0] could not find member to sync from
2019-08-30T10:13:21.591+0800 I REPL     [rsBackgroundSync] sync source candidate: 127.0.0.1:40002
2019-08-30T10:13:21.592+0800 I REPL     [replication-0] We are too stale to use 127.0.0.1:40002 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1567098178, 2144) is before their earliest timestamp: Timestamp(1567123884, 133) for 1min until: 2019-08-30T10:14:21.592+0800
2019-08-30T10:13:21.592+0800 I REPL     [replication-0] sync source candidate: 127.0.0.1:40000
2019-08-30T10:13:21.592+0800 I REPL     [replication-0] We are too stale to use 127.0.0.1:40000 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1567098178, 2144) is before their earliest timestamp: Timestamp(1567123883, 1248) for 1min until: 2019-08-30T10:14:21.592+0800
2019-08-30T10:13:21.592+0800 I REPL     [replication-0] could not find member to sync from

這個副本集是三個節店（40000、40001、40002），當40001啟動的時候，發現自己的最新資料比較副本集oplog的最舊的還要舊，就進入recovering狀态。

那麼這種情況如何解決呢？

官方提供了兩種方法：

1. 删除該節點全部資料，重新進行initial sync
1. 複制（備份恢複）其他節點資料，然後進行同步（顯然如果複制過來的資料恢複的時候副本集的oplog又被覆寫了，那麼還是會進入recovering狀态）

二、Replication

2.1 sync oplog的過程

全量同步結束後，Secondary就開始從結束時間點建立tailable cursor，不斷的從同步源拉取oplog并重放應用到自身，這個過程并不是由一個線程來完成的，mongodb為了提升同步效率，将拉取oplog以及重放oplog分到了不同的線程來執行。

具體線程和作用如下（這部分暫時沒有在官方文檔找到，來自張友東大神部落格）：

producer thread：這個線程不斷的從同步源上拉取oplog，并加入到一個BlockQueue的隊列裡儲存着，BlockQueue最大存儲240MB的oplog資料，當超過這個門檻值時，就必須等到oplog被replBatcher消費掉才能繼續拉取。
replBatcher thread：這個線程負責逐個從producer thread的隊列裡取出oplog，并放到自己維護的隊列裡，這個隊列最多允許5000個元素，并且元素總大小不超過512MB，當隊列滿了時，就需要等待oplogApplication消費掉
oplogApplication會取出replBatch thread目前隊列的所有元素，并将元素根據docId（如果存儲引擎不支援文檔鎖，則根據集合名稱）分散到不同的replWriter線程，replWriter線程将所有的oplog應用到自身；等待所有oplog都應用完畢，oplogApplication線程将所有的oplog順序寫入到local.oplog.rs集合。

針對上面的叙述，畫了一個圖友善了解：

producer的buffer和apply線程的統計資訊都可以通過db.serverStatus().metrics.repl來查詢到。

2.2 對過程疑問點的解釋

2.2.1 為什麼oplog的回放要弄這麼多的線程？

和mysql一樣，一個線程做一個事情，拉取oplog是單線程，其他線程進行回放；多個回放線程加快速度。

2.2.2 為什麼需要replBatcher線程來中轉？

oplog重放時，要保持順序性，⽽且遇到create、drop等DDL指令時，這些指令與其他的增删改查指令是不能并⾏執⾏的，⽽這些控制就是由replBatcher來完成的。

2.2.3 如何解決secondary節點oplog重放追不上primary問題？

方法一：設定更大的回放線程數

* mongod指令行指定：mongod --setParameter replWriterThreadCount=32
  * 配置檔案中指定

setParameter:
  replWriterThreadCount: 32

方法二：增大oplog的大小

方法三：将writeOpsToOplog步驟分散到多個replWriter線程來并發執行，看官方開發者日志已經實作了這個（在3.4.0-rc2版本）

2.3 注意事項

initial sync單線程複制資料，效率比較低，生産環境應該盡量避免initial sync出現，需合理配置oplog。
新加⼊節點時，可以通過實體複制的⽅式來避免initial sync，将Primary上的dbpath拷⻉到新的節點，然後直接啟動。
當Secondary同步滞後是因為主上并發寫入太高導緻，db.serverStatus().metrics.repl.buffer的 sizeBytes值持續接近maxSizeBytes的時候，可通過調整Secondary上replWriter并發線程數來提升。

三、日志分析

3.1 初始化同步日志

将日志級别 verbosity設定為 1，然後過濾日志

cat mg36000.log |egrep "clone|index|oplog" >b.log

最後拿出過濾後的部分日志。

3.4.21新加入節點日志

因為日志太多，貼太多出來也沒什麼意義，下面貼出了對db01庫的某個
集合的日志。
可以發現是先建立collection索引，然後clone集合資料和索引資料，這樣就完成了該集合的clone。最後将配置改為下一個集合。

2019-08-21T16:50:10.880+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-27-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.test2" }),
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.test2" }
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.882+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-28-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.test2" }),
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.test2" }
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.901+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: num_1
2019-08-21T16:50:10.906+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: _id_
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11] collection clone finished: db01.test2
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11]     collection: db01.test2, stats: { ns: "db01.test2", documentsToCopy: 2000, documentsCopied: 2000, indexes: 2, fetchedBatches: 1, start: new Date(1566377410875), end: new Date(1566377410913), elapsedMillis: 38 }
2019-08-21T16:50:10.920+0800 D STORAGE  [InitialSyncInserters-db01.collection10] create uri: table:db01/index-30-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),

3.6.12加入新節點日志

3.6較3.4的差別是，複制資料庫的線程明确了是：repl writer worker 進行重放（看文檔其實3.4已經是如此了）
還有就是明确是用cursors來進行。
其他和3.4沒有差別，也是建立索引，然後clone資料。

2019-08-22T13:59:39.444+0800 D STORAGE  [repl writer worker 9] create uri: table:db01/index-32-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=true)
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T13:59:39.447+0800 D REPL     [replication-1] Collection cloner running with 1 cursors established.
2019-08-22T13:59:39.681+0800 D INDEX    [repl writer worker 7]      bulk commit starting for index: _id_
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7] collection clone finished: db01.collection1
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     database: db01, stats: { dbname: "db01", collections: 1, clonedCollections: 1, start: new Date(1566453579439), end: new Date(1566453579725), elapsedMillis: 286 }
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     collection: db01.collection1, stats: { ns: "db01.collection1", documentsToCopy: 50000, documentsCopied: 50000, indexes: 1, fetchedBatches: 1, start: new Date(1566453579440), end: new Date(1566453579725), elapsedMillis: 285 }
2019-08-22T13:59:39.731+0800 D STORAGE  [repl writer worker 8] create uri: table:test/index-34-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "test.user1" }),log=(enabled=true)

4.0.11加入新節點日志

使用cursors，和3.6基本一緻

2019-08-22T15:02:13.806+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-30--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.collection1" }
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.816+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-31--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.820+0800 D REPL     [replication-0] Collection cloner running with 1 cursors established.

3.2 複制日志

2019-08-22T15:15:17.566+0800 D STORAGE  [repl writer worker 2] create collection db01.collection2 { uuid: UUID("8e61a14e-280c-4da7-ad8c-f6fd086d9481") }
2019-08-22T15:15:17.567+0800 I STORAGE  [repl writer worker 2] createCollection: db01.collection2 with provided UUID: 8e61a14e-280c-4da7-ad8c-f6fd086d9481
2019-08-22T15:15:17.567+0800 D STORAGE  [repl writer worker 2] stored meta data for db01.collection2 @ RecordId(22)
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] db01.collection2: clearing plan cache - collection info cache reset
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] create uri: table:db01/index-43--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection2" }),log=(enabled=false)

參考：

https://docs.mongodb.com/v4.0/core/replica-set-sync/ https://docs.mongodb.com/v4.0/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member http://www.mongoing.com/archives/2369

MongoDB副本集同步原理

一、Initial Sync

1.1 初始化同步過程

1.2 初始化同步場景

1.3 疑問點解釋

1.3.1 全量同步資料的時候，會不會源資料的oplog被覆寫了導緻全量同步失敗？

1.3.2 如果一個成員脫離了副本集，然後過了很長的時間，oplog已經被多次覆寫的時候，該節點再次加入進來，會直接進行replication嗎？

二、Replication

2.1 sync oplog的過程

2.2 對過程疑問點的解釋

2.2.1 為什麼oplog的回放要弄這麼多的線程？

2.2.2 為什麼需要replBatcher線程來中轉？

2.2.3 如何解決secondary節點oplog重放追不上primary問題？

2.3 注意事項

三、日志分析

3.1 初始化同步日志

3.2 複制日志

繼續閱讀

啟動mongo warning

MongoDB學習筆記一ID自增長

hadoop實戰多表關聯

詳解SQL中幾種常用的表連接配接方式

oracle 中不使用已有的索引解決辦法

MongoDB：資料庫角色權限1. 了解Admin資料庫2. 角色與權限3. 建立角色

redis主從複制與哨兵模式

對first_name建立唯一索引uniq_idx_firstname問題描述Sql語句

303、副本集如何擴容

Mongodb4.0操作指令

記一次因MySQL編碼問題導緻的慢查詢排查

Redis的快速上手與基本使用

Error: couldn‘t connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error

Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Erro

couldn‘t connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error conne

Ubuntu14.04 LTS下安裝mongodb