52. MongoDB replica sets in practice and an analysis of their principles

MongoDB replica set architecture

High availability

In a production environment, running a standalone MongoDB server is not recommended.

A MongoDB replica set (Replica Set) consists of a group of mongod instances (processes), including one primary node and multiple secondary nodes. All data from the MongoDB Driver (client) is written to the primary, and the secondaries synchronize data from the primary. This ensures that all members of the replica set store the same data set, providing high data availability. Replica sets provide redundancy and high availability and are the foundation of all production deployments. This capability relies on two features:

  • Quickly copy data to another independent node as it is written

  • Automatically elect a new replacement node when the node receiving writes fails

Three-node replica set mode

PSS mode (officially recommended)

The PSS mode consists of one primary node and two secondary nodes, i.e. Primary + Secondary + Secondary.
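
For reference, a minimal mongosh sketch of initiating a three-node PSS replica set; the replica set name rs0 and the three member addresses are assumptions (they match the hosts used in the connection string later in this article):

# Run in mongosh against one of the three mongod instances
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "192.168.139.135:27017" },
    { _id: 1, host: "192.168.139.136:27017" },
    { _id: 2, host: "192.168.139.137:27017" }
  ]
})
# Verify the state of the members
rs.status()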

PSA Mode

The PSA mode consists of one primary node, one secondary node, and one arbiter node, i.e. Primary + Secondary + Arbiter.

Security authentication

# mongo.key is generated with a random algorithm and serves as the key file for internal communication between nodes
openssl rand -base64 756 > /data/mongo.key
# Permissions must be 600
chmod 600 /data/mongo.key
# Start mongod with the key file
mongod -f /data/db1/mongod.conf --keyFile /data/mongo.key
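
With a keyfile in place, client authentication is enforced as well, so an administrative user must exist. A minimal sketch, assuming the fox/fox credentials used by the connection string below are to be created as a root user in the admin database (run via the localhost exception on the primary):

# Create an administrative account for the replica set
use admin
db.createUser({
  user: "fox",
  pwd: "fox",
  roles: [ { role: "root", db: "admin" } ]
})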

Replica set connection method

Connect to MongoDB through a high-availability URI. When the Primary fails over, the MongoDB Driver can automatically detect the change and route traffic to the new Primary node.

mongosh "mongodb://fox:fox@192.168.139.135:27017,192.168.139.136:27017,192.168.139.137:27017/admin?replicaSet=rs0"

Replica set member roles

Member role attributes

Priority = 0: when a member's Priority is 0, it cannot be elected primary. The higher the Priority value, the greater the probability of being elected primary. This attribute is typically useful when deploying a replica set across data centers. Suppose there are data centers A and B and the main business is closer to data center A; setting the Priority of the replica set members in data center B to 0 guarantees that the primary is always a member located in data center A.

Vote = 0: the member cannot participate in election voting, and its Priority must also be 0, i.e. it cannot be elected primary. Since a replica set can have at most 7 voting members, any additional members must have their vote attribute set to 0, i.e. those members cannot cast votes.
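
For example, a mongosh sketch that makes a member non-voting (the member index 2 is only illustrative); note that the configuration field is named votes:

cfg = rs.conf()
# A non-voting member must also have priority 0
cfg.members[2].priority = 0
cfg.members[2].votes = 0
rs.reconfig(cfg)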

Member roles

  • Primary: The primary node receives all write requests and then synchronizes the modifications to all secondary nodes. A replica set can have only one primary node.

  • Secondary: A secondary node maintains the same data set as the primary node. When the primary node goes down, it takes part in electing a new primary. Secondary nodes come in three different types:

    • Hidden = false: a normal read-only node; whether it can be elected primary and whether it can vote depends on the values of Priority and Vote;
    • Hidden = true: a hidden node, invisible to clients. It can participate in elections, but its Priority must be 0, i.e. it cannot be promoted to primary. Since hidden nodes do not take business traffic, tasks such as data backups and offline computation can be run on them without affecting the rest of the replica set.
    • Delayed: a delayed node, which must also be hidden with Priority 0. It delays replicating increments from its upstream node for a configurable period (determined by secondaryDelaySecs) and is often used in fast rollback scenarios.
  • Arbiter: An arbiter node only participates in election voting; it carries no data and acts purely as a voting member. For example, in a replica set with 2 nodes, 1 primary and 1 secondary, if either node goes down the replica set can no longer provide service because a primary cannot be elected. Adding an Arbiter node to such a replica set allows a Primary to still be elected even when one node is down. The Arbiter itself stores no data and is a very lightweight service. When the number of replica set members is even, it is best to add one Arbiter node to improve the availability of the replica set.

#Configure hidden nodes
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[1].hidden = true
rs.reconfig(cfg)

#Configure delay node
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[1].hidden = true
#Delay 1 minute
cfg.members[1].secondaryDelaySecs = 60
rs.reconfig(cfg)

# Add an arbiter node
# Create a data directory for the arbiter node; it stores configuration data only, not the data set
mkdir /data/arb
# Start the arbiter node, specifying its data directory and the replica set name
mongod --port 30000 --dbpath /data/arb --replSet rs0
# In mongosh, add the arbiter node to the replica set
rs.addArb("ip:30000")
# In MongoDB 5.0+, adding an arbiter may be rejected until a cluster-wide default write concern is set explicitly
db.adminCommand({ "setDefaultRWConcern": 1, "defaultWriteConcern": { "w": 2 } })


#Remove replica set node
rs.remove("ip:port")
#Remove nodes through rs.reconfig()
cfg = rs.conf()
cfg.members.splice(2,1) # Remove 1 element starting at index 2
rs.reconfig(cfg)

#Change replication set nodes
cfg = rs.conf()
cfg.members[0].host = "ip:port"
rs.reconfig(cfg)


MongoDB replica set principle

Replica set elections

MongoDB’s replica set election is implemented using the Raft algorithm (https://raft.github.io/). The necessary condition for successful election is that the majority of voting nodes survive.

MongoDB adds some extensions of its own to the Raft protocol:

  • Supports chained replication (chainingAllowed): a secondary node does not have to synchronize data from the primary; it can instead select the node closest to itself (the one with the smallest heartbeat delay) to copy data from.
  • A pre-voting phase, preVote, is added, which is mainly used to avoid the problem of Term value surge when the network is partitioned.
  • Supports voting priority. If the standby node finds that its priority is higher than that of the primary node, it will actively initiate a vote and try to become the new primary node.

A replica set can have up to 50 members, but only 7 voting members.
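
A quick mongosh sketch to check how many members currently hold a vote (each member's votes field is either 0 or 1):

# Count the voting members in the current configuration
rs.conf().members.filter(function (m) { return m.votes === 1 }).length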

Automatic failover

One factor that affects the detection mechanism is the heartbeat. After the replica set is established, each member node starts a timer and keeps sending heartbeats to the other members. The parameter involved here is heartbeatIntervalMillis, the heartbeat interval, whose default value is 2s. If a heartbeat succeeds, heartbeats continue to be sent every 2 seconds; if a heartbeat fails, it is retried immediately until the heartbeat succeeds again. Another relevant parameter is electionTimeoutMillis, the election timeout, which defaults to 10s.

To trigger an election in the electionTimeout task, the following conditions must be met:

(1) The current node is the standby node.

(2) The current node has election authority.

(3) No successful heartbeat has been received from the primary node within the detection period (the election timeout).
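
Both intervals live in the settings section of the replica set configuration and can be tuned with rs.reconfig(); a sketch, assuming the heartbeatIntervalMillis field is present in the settings document returned by rs.conf(), with the default values shown:

cfg = rs.conf()
# Current heartbeat interval in milliseconds (default 2000)
print(cfg.settings.heartbeatIntervalMillis)
# How long a member waits without a successful heartbeat from the primary before calling an election (default 10000)
cfg.settings.electionTimeoutMillis = 10000
rs.reconfig(cfg)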

Business Impact Assessment

  • When a replica set switches between primary and secondary, there is a brief window with no primary node, during which business write operations cannot be accepted.
  • For very important businesses, it is recommended to implement some protection strategies at the business level, such as designing a retry mechanism.
# MongoDB Drivers enable retryable writes
mongodb://localhost/?retryWrites=true
#mongo shell
mongosh --retryWrites

How to gracefully restart a replica set

  • Restart all Secondary nodes in the replica set one by one
  • Send the rs.stepDown() command to the primary and wait for it to be downgraded to a secondary (see the sketch after this list).
  • Restart the downgraded Primary
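
A sketch of the step-down step in mongosh; the two arguments are the number of seconds the node refuses to become primary again and the time allowed for a secondary to catch up, both illustrative values:

# Ask the primary to step down for 60s, waiting up to 10s for a secondary to catch up
rs.stepDown(60, 10)
# Confirm the node is now a secondary (state 2 means SECONDARY)
rs.status().myState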

Replica set data synchronization mechanism

The MongoDB oplog is a collection in the local database that stores the incremental logs generated by write operations (similar to the binlog in MySQL).

primary --write--> local.oplog.rs --read--> secondary --write--> local.oplog.rs --read--> secondary --write--> local.oplog.rs

The ts field in the oplog is the key to incremental log synchronization on a secondary.

Each secondary node maintains its own offset, namely the optime of the last log entry it pulled from the primary node. When synchronizing, it uses this optime to issue a query against the primary node's oplog collection.
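
Conceptually, the pull works like the following mongosh query; here lastOptime is simply taken from the newest local entry to illustrate the query shape, whereas a real secondary would use the optime of the last entry it has applied:

use local
# The ts of the newest oplog entry on this node
var lastOptime = db.oplog.rs.find({}, { ts: 1 }).sort({ $natural: -1 }).limit(1).next().ts
# Pull every entry written after that point
db.oplog.rs.find({ ts: { $gt: lastOptime } })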

Since version 4.0, MongoDB provides the replSetResizeOplog command, which can dynamically change the oplog size without restarting the server.

# Change the oplog size of a replica set member to 60GB (the size is given in MB)
db.adminCommand({ replSetResizeOplog: 1, size: 60000 })
# Check the oplog size
use local
db.oplog.rs.stats().maxSize

Idempotence

Suppose the current value of the x field in a document is 100 and the client sends an {$inc: {x: 1}} operation to the Primary. When it is recorded in the oplog, it is converted into a {$set: {x: 101}} operation, which guarantees idempotence.

The cost of idempotence: oplog writes are amplified, which can cause replication to fail to keep up.

When using arrays, try to pay attention to:

  1. The number of elements in the array should not be too many, and the total size should not be too large.
  2. Try to avoid updating the array
  3. If you must update, try to only insert elements at the end (see the sketch after this list). Complex logic can be supported at the business level.
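
For instance, appending a single element to the tail of an array keeps each update small; the emp collection and tags field are hypothetical:

# Append one element to the end of the array instead of rewriting the whole array
db.emp.updateOne({ _id: 1 }, { $push: { tags: "new-tag" } })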

Replication delay

To mitigate the risks caused by replication delay, the following measures can be taken:

  • Increase the oplog capacity and keep monitoring the replication window (see the sketch after this list).
  • Reduce the write speed of the master node through some expansion methods.
  • Optimize the network between active and standby nodes.
  • Avoid using arrays that are too large for fields (may cause oplog bloat).
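
mongosh ships with helpers for watching the replication window and the lag of each secondary; a sketch:

# Oplog size and the time range (replication window) it currently covers
rs.printReplicationInfo()
# How far each secondary is behind the primary
rs.printSecondaryReplicationInfo()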

Data rollback

When a node is rolled back, the rolled-back data is written to BSON files in the rollback directory and can be re-imported with mongorestore:

mongorestore --host 192.168.192:27018 --db test --collection emp -ufox -pfox \
  --authenticationDatabase=admin rollback/emp_rollback.bson

Sync source selection

When settings.chainingAllowed is enabled, a secondary automatically selects the nearest node (the one with the smallest ping delay) to synchronize from.

# By default a secondary does not necessarily sync from the primary. The side effect of chaining is increased replication delay; it can be disabled with the following commands.
cfg = rs.config()
cfg.settings.chainingAllowed = false
rs.reconfig(cfg)

#Use the replSetSyncFrom command to temporarily change the synchronization source of the current node
db.adminCommand({ replSetSyncFrom: "hostname:port" })