Zabbix monitoring keepalived split brain

1. Introduction to split-brain

In a high availability (HA) system, when the "heartbeat" link connecting the two nodes goes down, what used to be a single, coordinated cluster splits into two independent nodes. Having lost contact with each other, each node assumes the other has failed. The HA software on the two nodes then behaves like a patient with a split brain, fighting over the shared resources and application services, with serious consequences: either the shared resources are carved up and neither side can provide the service, or both sides bring the service up and read and write the shared storage at the same time, corrupting the data (a typical symptom is errors in the database's online logs).

2. Strategies to prevent split-brain

  1. Add redundant heartbeat links, for example a second, independent line (making the heartbeat path itself HA), to minimize the chance of split-brain.
  2. Enable a disk lock. The serving node locks the shared disk, so that when split-brain occurs the other side cannot take the shared disk away. The big problem with disk locks is that if the node holding the shared disk does not actively "unlock" it, the other side never gets the disk. In practice, if the serving node suddenly freezes or crashes, it cannot run the unlock command, and the backup node cannot take over the shared resources and application services. So a "smart" lock was designed for HA: the serving node only activates the disk lock when it detects that all heartbeat links are down (it can no longer see the peer); under normal conditions the disk stays unlocked.
  3. Set up an arbitration mechanism. For example, configure a reference IP (such as the gateway IP). When the heartbeat links are completely down, both nodes ping the reference IP. If a node cannot reach it, the break is on that node's side: not only the heartbeat but also the local network link used for external services is broken, so starting (or continuing) the application service there is pointless. That node should give up the contest and let the node that can ping the reference IP run the service. To be safer, the node that cannot ping the reference IP can simply reboot itself to fully release any shared resources it may still hold (a sketch of such a check follows this list).
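
A minimal sketch of such an arbitration check, written as a shell script that keepalived could call from a vrrp_script block. The reference IP 192.168.10.2 is a placeholder for the gateway in this environment, not a value taken from the configuration used later in this article:

#!/bin/bash
# check_gateway.sh - arbitration check: can this node still reach the reference IP?
REF_IP=192.168.10.2    # placeholder: use the real gateway / reference IP

if ping -c 3 -W 1 "$REF_IP" > /dev/null 2>&1; then
    exit 0    # reference IP reachable: keep competing for the VIP
else
    exit 1    # reference IP unreachable: the local link is broken, give up the VIP
fi

In keepalived this kind of check is typically hooked in through a vrrp_script block with a weight, so that the node which loses the reference IP lowers its priority and releases the VIP.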

3. Causes of split brain

Generally speaking, split brain occurs for the following reasons:

  • The heartbeat link between the pair of high-availability servers fails, so the nodes can no longer communicate normally:
    • the heartbeat cable itself is broken (cut, aged, loose);
    • the network card or its driver is faulty, or there are IP configuration or address-conflict problems (direct NIC-to-NIC connection);
    • the equipment between the heartbeat cables fails (network cards, switches);
    • the arbitration machine has a problem (when an arbitration solution is used).
  • The iptables firewall is enabled on a high-availability server and blocks the heartbeat traffic (see the example after this list).
  • The heartbeat NIC address or other heartbeat settings on a high-availability server are configured incorrectly, so heartbeats cannot be sent.
  • Other configuration problems, such as mismatched heartbeat methods, heartbeat insertion conflicts, software bugs, etc.
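
If the firewall is the suspect, the heartbeat can be let through explicitly. VRRP advertisements use IP protocol 112 (multicast address 224.0.0.18); the commands below are a sketch for the CentOS-8 hosts used in this article and would be run on both nodes:

# Classic iptables rule allowing VRRP (IP protocol 112)
iptables -A INPUT -p vrrp -j ACCEPT

# Or, if firewalld is managing the firewall:
firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
firewall-cmd --reload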

Note:

If the virtual_router_id parameter of the same VRRP instance is configured inconsistently on the two ends of a Keepalived pair, it will also cause a split-brain problem.
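
To illustrate the point above, the vrrp_instance on both nodes must carry the same virtual_router_id. The fragments below are an example only (the interface and VIP match the environment used later in this article, while the ID 51 and the priorities are assumed values, not the actual H1/H2 configuration):

# H1 (/etc/keepalived/keepalived.conf)
vrrp_instance VI_1 {
    state MASTER
    interface ens160
    virtual_router_id 51        # must be identical on both nodes
    priority 100
    virtual_ipaddress {
        192.168.10.250
    }
}

# H2 (/etc/keepalived/keepalived.conf)
vrrp_instance VI_1 {
    state BACKUP
    interface ens160
    virtual_router_id 51        # same ID as on H1; a mismatch splits the instance in two
    priority 90
    virtual_ipaddress {
        192.168.10.250
    }
}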

4. Split-brain solution

  1. Heartbeat mechanism: establish heartbeat connections between nodes and send heartbeat signals periodically to detect node availability. If a node has not received heartbeats from its peers for a long time, a split brain can be assumed and the appropriate measures taken.
  2. Majority voting mechanism: in a multi-node system, majority voting decides which nodes are healthy. A node is only allowed to keep serving if a majority of nodes consider it healthy, which prevents split-brain.
  3. Cluster management software: use dedicated cluster software, such as Keepalived or Pacemaker, to monitor and manage the state of cluster nodes. Such software handles split-brain automatically through heartbeat detection, resource management, fault recovery, and similar mechanisms.

Of course, when implementing a high-availability solution, you must decide based on the actual business whether such a loss can be tolerated. For a typical website, this loss is usually acceptable.

5. Use zabbix to monitor split-brain

Environment preparation:

Host name          IP              Role                    Installed software                    Operating system
zabbix.server.com  192.168.10.130  zabbix server           zabbix_server + zabbix_agent          CentOS-8
H1                 192.168.10.131  Primary load balancer   haproxy + keepalived                  CentOS-8
H2                 192.168.10.132  Backup load balancer    haproxy + keepalived + zabbix_agent   CentOS-8

Note: the zabbix agent is deployed on the backup load balancer (H2) so that this host can be monitored. If the VIP appears on the backup load balancer, either the primary load balancer has failed or a split brain has occurred; the next step is to check whether the VIP has disappeared from the primary. If the VIP exists on both the primary and the backup at the same time, a split brain has definitely occurred.

Before continuing, you need a zabbix server host, two hosts configured with keepalived high availability, and the zabbix agent deployed on the backup node (H2).

For keepalived high availability deployment steps, please refer to:
For zabbix deployment steps, please refer to: zabbix monitoring deployment-CSDN blog
For the zabbix monitoring configuration process, please refer to: zabbix custom monitoring-CSDN blog

zabbix monitoring keepalived

Write a custom monitoring script

[root@H2 scripts]# vim /scripts/keep.sh    # This script checks whether the VIP exists: it prints 1 if the VIP is present and 0 if it is not
#!/bin/bash

# Count how many times the VIP (192.168.10.250) appears on interface ens160
if [ $(ip a show ens160 | grep 192.168.10.250 | wc -l) -ne 0 ]; then
    echo "1"
else
    echo "0"
fi
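
Before wiring the script into zabbix, it is worth making it executable and running it once by hand on H2. At this point the VIP should still be held by H1, so the expected output is 0:

[root@H2 scripts]# chmod +x /scripts/keep.sh
[root@H2 scripts]# bash /scripts/keep.sh
0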

Modify the zabbix agent configuration file

[root@H2 ~]# vim /usr/local/etc/zabbix_agentd.conf
...
# Mandatory: no
# Default:
# HostnameItem=system.hostname

# Custom key "keepalived" that runs the VIP check script
# (everything after the comma is treated as the command, so the comment goes on its own line)
UserParameter=keepalived,/bin/bash /scripts/keep.sh
...

Restart the service

[root@H2 ~]# systemctl restart zabbix_agentd
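
Optionally, the new key can also be checked locally on H2 before moving to the server side. zabbix_agentd's -t option evaluates a single item and exits (the path to the binary may differ depending on how zabbix was installed); while the VIP is still on H1, the returned value should be 0:

[root@H2 ~]# zabbix_agentd -t keepalived
keepalived                                    [t|0]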

First, stop the haproxy service on the primary server so that the backup server takes over the VIP, for use in the tests that follow.

[root@H1 ~]# systemctl stop haproxy
[root@H1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0c:29:e4:b6:ad brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.131/24 brd 192.168.10.255 scope global dynamic noprefixroute ens160
       valid_lft 1319sec preferred_lft 1319sec
    inet6 fe80::20c:29ff:fee4:b6ad/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@H1 ~]#
[root@H2 scripts]# ip a #VIP has been taken over
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0c:29:6d:a9:3e brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.132/24 brd 192.168.10.255 scope global dynamic noprefixroute ens160
       valid_lft 1376sec preferred_lft 1376sec
    inet 192.168.10.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe6d:a93e/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@H2 scripts]#

Test from the zabbix server; the result shows that the custom key works correctly.

[root@zabbix_service ~]# zabbix_get -s 192.168.10.132 -k keepalived
1

Add a monitoring item for H2 on the zabbix web page
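
The essential fields when creating the item on host H2 are roughly the following (the item name and interval are illustrative, but the key must match the UserParameter defined above):

Name: keepalived VIP check
Type: Zabbix agent
Key: keepalived
Type of information: Numeric (unsigned)
Update interval: 30s (adjust to the environment)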

Add a trigger
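
The trigger should fire when the keepalived item on H2 returns 1, i.e. when the VIP shows up on the backup node. The exact expression depends on the Zabbix version; both variants below assume the host is registered in Zabbix as H2 and the item key is keepalived:

last(/H2/keepalived)=1        # Zabbix 5.4 and later expression syntax
{H2:keepalived.last()}=1      # older expression syntax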

Simulate a split brain to trigger the alarm
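
One way to reproduce a genuine split brain in this lab (as opposed to the clean failover above) is to block VRRP advertisements on the backup node while keepalived keeps running on both sides; the commands below are a sketch of that approach, assuming iptables is available on H2:

# H2 stops hearing H1's advertisements, times out and promotes itself,
# so both nodes end up holding 192.168.10.250 at the same time
[root@H2 ~]# iptables -A INPUT -p vrrp -j DROP

# Remove the rule again once the test is finished
[root@H2 ~]# iptables -D INPUT -p vrrp -j DROP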

[root@H2 scripts]# ip a #VIP appears on H2
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0c:29:6d:a9:3e brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.132/24 brd 192.168.10.255 scope global dynamic noprefixroute ens160
       valid_lft 1083sec preferred_lft 1083sec
    inet 192.168.10.250/32 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe6d:a93e/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[root@H2 scripts]#

Alarm successful