[RocketMQ] Automatic failover cluster practice based on RocketMQ 5.1.0 (Controller embedded mode)

Article directory

  • Requirements
  • Preparation
  • nameserver
  • dashboard
  • Broker
  • exporter
  • Problems encountered
  • Written at the end

Requirements

  • RocketMQ version is 5.1.0;

  • Build a cluster with 3 masters and 3 slaves, cross-deployed so that no two machines are each other's master and slave, which also saves machine resources;

  • 3 nameservers, 1 exporter, 1 dashboard;

  • Support automatic failover, with the controller embedded in the nameserver;

  • Asynchronous disk flushing (ASYNC_FLUSH);

  • No messages may be lost during a master-slave switch.

Preparation

Machine preparation:

  • 172.24.30.192
  • 172.24.30.193
  • 172.24.30.194

Deployment planning:

service name                IP              port
nameserver(controller-n0)   172.24.30.192   19876 (controller: 19878)
nameserver(controller-n1)   172.24.30.193   19876 (controller: 19878)
nameserver(controller-n2)   172.24.30.194   19876 (controller: 19878)
broker-a                    172.24.30.192   13210
broker-a-s                  172.24.30.193   13210
broker-b                    172.24.30.193   13220
broker-b-s                  172.24.30.194   13220
broker-c                    172.24.30.194   13230
broker-c-s                  172.24.30.192   13230
dashboard                   172.24.30.193   18281
exporter                    172.24.30.192   18282

Download the binary package: https://rocketmq.apache.org/download/
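
For reference, a rough sketch of fetching the package and laying out the per-service directories on one machine (the archive URL follows the Apache release archive naming convention and is an assumption, and the directory names simply follow the deployment plan above; adjust both to your own environment):

#!/bin/bash
cd /neworiental/rocketmq-5.1.0

# Fetch and unpack the 5.1.0 binary release (URL is an assumption; use whatever
# link the official download page gives you).
wget https://archive.apache.org/dist/rocketmq/5.1.0/rocketmq-all-5.1.0-bin-release.zip
unzip rocketmq-all-5.1.0-bin-release.zip

# One copy per service on this machine (172.24.30.192 in the plan above),
# matching the rocketmqHome paths used in the configs below.
cp -r rocketmq-all-5.1.0-bin-release rocketmq-nameserver
cp -r rocketmq-all-5.1.0-bin-release broker-a
cp -r rocketmq-all-5.1.0-bin-release broker-c-s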

nameserver

Configuration file:

This only demonstrates the configuration, so the config of a single nameserver is shown; on the other nodes, change controllerDLegerSelfId to the ID of that node. Note: this parameter must be unique across nodes.

# nameserver related
listenPort = 19876
rocketmqHome = /neworiental/rocketmq-5.1.0/rocketmq-nameserver
useEpollNativeSelector = true
orderMessageEnable = true
serverPooledByteBufAllocatorEnable = true
kvConfigPath = /neworiental/rocketmq-5.1.0/rocketmq-nameserver/store/namesrv/kvConfig.json
configStorePath = /neworiental/rocketmq-5.1.0/rocketmq-nameserver/conf/nameserver.conf

# controller related
enableControllerInNamesrv = true
controllerDLegerGroup = littleCat-Controller
controllerDLegerPeers = n0-172.24.30.192:19878;n1-172.24.30.193:19878;n2-172.24.30.194:19878
controllerDLegerSelfId = n0
controllerStorePath = /neworiental/rocketmq-5.1.0/rocketmq-controller/store
enableElectUncleanMaster = false
notifyBrokerRoleChanged = true

  • enableElectUncleanMaster: whether a node outside the in-sync state set may be elected master; if set to true, messages may be lost;

  • Modify rmq.namesrv.logback.xml in the conf directory, mainly to change the log paths in batch; skip this if the default paths are acceptable;

    • Tip: the log directory can use a relative path, which solves this once and for all; just make sure each service uses a different directory;

  • Modify runserver.sh in the bin directory, mainly the GC log file path and the JVM heap parameters (-Xmx, -Xms); skip this if the defaults are acceptable (the default is 8G), but if you are building a pseudo-cluster for testing, be careful that the machine may not be able to handle it. A sketch of both edits is shown below.
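
A minimal sketch of both edits, assuming the logback files use the default ${user.home}-based log root and that runserver.sh still contains the usual JAVA_OPT heap line (both are assumptions; open the files and confirm before running sed against them):

#!/bin/bash
NS_HOME=/neworiental/rocketmq-5.1.0/rocketmq-nameserver

# Point all nameserver log output under this service's own directory
# (the ${user.home} placeholder is what the stock logback config usually uses).
sed -i "s#\${user.home}#${NS_HOME}#g" ${NS_HOME}/conf/rmq.namesrv.logback.xml

# Shrink the heap for a test / pseudo-cluster; the exact default values in
# runserver.sh differ between versions, so adjust the pattern to match yours.
sed -i 's/-Xms8g -Xmx8g/-Xms1g -Xmx1g/' ${NS_HOME}/bin/runserver.sh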

Startup script:

#!/bin/sh
. /etc/profile

nohup sh /neworiental/rocketmq-5.1.0/rocketmq-nameserver/bin/mqnamesrv -c /neworiental/rocketmq-5.1.0/rocketmq-nameserver/conf/nameserver.conf >/dev/null 2>&1 &
echo "startup nameserver..."

Stop the script:

If multiple nameservers are deployed on one machine, do not stop a nameserver with sh /neworiental/rocketmq-5.1.0/rocketmq-nameserver/bin/mqshutdown namesrv; that command stops all nameserver processes on the machine.

#!/bin/bash
. /etc/profile

PID=`ps -ef | grep '/neworiental/rocketmq-5.1.0/rocketmq-nameserver' | grep -v grep | awk '{print $2}'`
if [[ "" != "$PID" ]]; then
  echo "killing rocketmq-nameserver : $PID"
  kill $PID
fi

Make sure all nameservers have started successfully before starting any broker.
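
Before moving on, it is worth confirming that each nameserver is listening and that the embedded controllers have formed a group. A small check along these lines (the getControllerMetaData subcommand and its -a flag are assumptions about the 5.1.0 mqadmin tool; run "mqadmin help" first to confirm what your build ships):

#!/bin/bash
# The nameserver port and the controller port should both be listening.
ss -lnt | grep -E ':19876|:19878'

# Ask the controller group for its metadata (leader, peers); the subcommand name
# and flag are assumptions -- verify with "mqadmin help" before relying on them.
sh /neworiental/rocketmq-5.1.0/rocketmq-nameserver/bin/mqadmin getControllerMetaData -a 172.24.30.192:19878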

dashboard

Omitted here; see another article: https://blog.csdn.net/sinat_14840559/article/details/129737390?spm=1001.2014.3001.5501

Broker

The official documentation mentions that in this mode you do not need to specify brokerId and brokerRole: you can set brokerRole of all nodes to SLAVE and brokerId to -1 (roles switch back and forth between master and slave anyway, so manually configuring 0 has no real effect). The broker that registers successfully first becomes the master.

For verification, it is best to start one group of brokers first, and only start the rest after confirming that everything works as expected.

Thinking: in fact there is no master-slave asynchronous write (ASYNC_MASTER) in this mode; master-slave synchronization is implemented according to the Raft protocol.

broker-a master node:

brokerClusterName = littleCat
brokerName = broker-a
brokerId = -1
listenPort = 13210
namesrvAddr = 172.24.30.192:19876;172.24.30.193:19876;172.24.30.194:19876;
# Enable controller support
enableControllerMode = true
controllerAddr = 172.24.30.192:19878;172.24.30.193:19878;172.24.30.194:19878;
deleteWhen = 04
fileReservedTime = 48
brokerRole = SLAVE
flushDiskType = ASYNC_FLUSH
autoCreateTopicEnable = false
autoCreateSubscriptionGroup = false
maxTransferBytesOnMessageInDisk = 65536
rocketmqHome = /neworiental/rocketmq-5.1.0/broker-a
storePathConsumerQueue = /neworiental/rocketmq-5.1.0/broker-a/store/consumequeue
brokerIP2 = 172.24.30.192
brokerIP1 = 172.24.30.192
aclEnable = false
storePathRootDir = /neworiental/rocketmq-5.1.0/broker-a/store
storePathCommitLog = /neworiental/rocketmq-5.1.0/broker-a/store/commitlog
# 3000 days: 3600*24*3000
timerMaxDelaySec = 259200000
traceTopicEnable = true
timerPrecisionMs = 1000
timerEnableDisruptor = true

Startup script:

#!/bin/bash
. /etc/profile

nohup sh /neworiental/rocketmq-5.1.0/broker-a/bin/mqbroker -c /neworiental/rocketmq-5.1.0/broker-a/conf/broker.conf >/dev/null 2>&1 &
echo "deploying broker-a..."

Stop the script:

#!/bin/bash
. /etc/profile

PID=`ps -ef | grep '/neworiental/rocketmq-5.1.0/broker-a/' | grep -v grep | awk '{print $2}'`
if [[ "" != "$PID" ]]; then
  echo "killing rocketmq-5-broker-a : $PID"
  kill $PID
fi

broker-a slave node:

brokerClusterName = littleCat
brokerName = broker-a
brokerId = -1
listenPort = 13210
namesrvAddr=172.24.30.192:19876;172.24.30.193:19876;172.24.30.194:19876;
# Enable controller support
enableControllerMode = true
controllerAddr = 172.24.30.192:19878;172.24.30.193:19878;172.24.30.194:19878;
deleteWhen = 04
fileReservedTime = 48
brokerRole = SLAVE
flushDiskType = ASYNC_FLUSH
autoCreateTopicEnable = false
autoCreateSubscriptionGroup = false
maxTransferBytesOnMessageInDisk = 65536
rocketmqHome=/neworiental/rocketmq-5.1.0/broker-a-s1
storePathConsumerQueue=/neworiental/rocketmq-5.1.0/broker-a-s1/store/consumequeue
brokerIP2=172.24.30.193
brokerIP1=172.24.30.193
aclEnable=false
storePathRootDir=/neworiental/rocketmq-5.1.0/broker-a-s1/store
storePathCommitLog=/neworiental/rocketmq-5.1.0/broker-a-s1/store/commitlog
# 3000 days: 3600*24*3000
timerMaxDelaySec=259200000
traceTopicEnable=true
timerPrecisionMs=1000
timerEnableDisruptor=true

After starting these two brokers, check the dashboard: the master and slave of broker-a have been recognized:
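
If you prefer the command line to the dashboard, a clusterList call against the nameservers shows the same thing: the broker-a entry with BID 0 is the current master and the entry with a non-zero BID is the slave (the mqadmin path below simply reuses the nameserver package from this article):

#!/bin/bash
# List the cluster; look at the BID column of the two broker-a rows.
sh /neworiental/rocketmq-5.1.0/rocketmq-nameserver/bin/mqadmin clusterList \
  -n '172.24.30.192:19876;172.24.30.193:19876;172.24.30.194:19876'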

After killing the master node, the effect is as follows:

The slave has been switched to master successfully; now restart the node that was just killed:

It rejoins as the slave of the current master, as expected. Then start broker-b, broker-b-s, broker-c, and broker-c-s normally to complete the cluster setup.
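
To exercise the "no message loss on master-slave switch" requirement, one option is to keep a test producer running while killing the master and then compare counts on the consumer side. A rough sketch using the quick-start example classes shipped in the binary package (the example classes and tools.sh are part of the standard distribution, but confirm they exist in your package; TopicTest is the topic hard-coded in the example and must be created first because autoCreateTopicEnable is false):

#!/bin/bash
export NAMESRV_ADDR='172.24.30.192:19876;172.24.30.193:19876;172.24.30.194:19876'
BIN=/neworiental/rocketmq-5.1.0/rocketmq-nameserver/bin

# Create the topic used by the quick-start example (auto topic creation is disabled).
sh ${BIN}/mqadmin updateTopic -c littleCat -t TopicTest -n "${NAMESRV_ADDR}"

# Produce while you kill the broker-a master in another terminal,
# then consume and check that nothing sent successfully is missing.
sh ${BIN}/tools.sh org.apache.rocketmq.example.quickstart.Producer
sh ${BIN}/tools.sh org.apache.rocketmq.example.quickstart.Consumer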

exporter

Omitted here; see another article: https://blog.csdn.net/sinat_14840559/article/details/119782996

For the latest cluster version, it is best to use the latest version of the exporter: https://github.com/apache/rocketmq-exporter

Problems encountered

  1. When multiple brokers are deployed on the same machine, the broker started later fails to start and crashes:

Reason: on the same machine, the listen ports of the two brokers were set too close together. Besides the configured port, a broker opens several more ports for internal communication (by default 1 and 2 below the configured port, and newer versions may take even more), so keep the configured ports as far apart as possible. This problem held me up for a long time, and the log never says that a port is already in use, which is very unfriendly. A quick port check is sketched below.
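
A quick way to see exactly which ports an already-running broker has opened, so that the next broker's listenPort can be chosen far enough away (assumes ss is available; netstat -lntp works as well):

#!/bin/bash
# Show every TCP port the broker-a process is listening on.
PID=`ps -ef | grep '/neworiental/rocketmq-5.1.0/broker-a/' | grep -v grep | awk '{print $2}'`
ss -lntp | grep "pid=${PID}"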

  2. In one broker group, two slaves appear at the same time; the log of one is normal, while the other keeps reporting: Error happens when change sync state set

Reason: the internally maintained SyncStateSet got into a bad state during repeated master-slave switching. Stop both brokers, then start the broker whose log is normal first and the broker that reported the error second. If the broker with the error is started first, the master election fails with: CODE: 2012 DESC: The broker has not master, and this new registered broker can't be elected as master. The recovery sequence is sketched below.
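
A sketch of that recovery sequence, reusing the start and stop scripts shown earlier (start.sh / stop.sh are placeholder names for those scripts; here broker-a is assumed to be the node with the normal log and broker-a-s1 the one reporting the error, so swap them to match your situation):

#!/bin/bash
# 1. Stop both brokers in the group (run each stop script on its own machine).
sh /neworiental/rocketmq-5.1.0/broker-a/stop.sh
sh /neworiental/rocketmq-5.1.0/broker-a-s1/stop.sh

# 2. Start the broker whose log is normal first and let it be elected master.
sh /neworiental/rocketmq-5.1.0/broker-a/start.sh

# 3. Only after that, start the broker that was reporting the SyncStateSet error.
sh /neworiental/rocketmq-5.1.0/broker-a-s1/start.sh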

Written at the end

I have been using K8s to manage clusters for two years. This time, because the Operator I wrote earlier does not support the new cluster version, I had to build one by hand temporarily. Although I have built clusters many times before, I have to say that K8s really is far more convenient than a manual build: no matter how careful you are, mistakes will happen, and you still have to handle systemd, resource coordination, and so on.

Manually building a middleware cluster is delicate work. A small configuration error can drop you into a bottomless pit that takes ages to troubleshoot. Preparation is therefore essential: get all files and configurations ready first, check them carefully, and then roll everything out in one go.

Relatively speaking, this article only walks through the main process, mainly to verify the new Controller mode; you will need to copy and adapt the detailed configuration files yourself. For the complete build process, see: https://blog.csdn.net/sinat_14840559/article/details/108391651