InnoDB Cluster Troubleshooting Manual

1. Cluster environment inspection and recovery

1.1 Cluster status check

1) Log in to the cluster with mysqlsh, using the host name of any member of the cluster
 /data/mysql/mysql3306/mysql-shell/mysql-shell/bin/mysqlsh --uri [email protected]:3306

2) Check cluster status
var cluster = dba.getCluster()
cluster.status()

3) Member status interpretation
1.ONLINE The member is online and working normally
2.OFFLINE The member is completely offline; group replication is stopped
3.RECOVERING The member has been added back to the cluster and is catching up on data
4.ERROR The member hit an error and was expelled from the cluster
5.UNREACHABLE The member cannot be reached over the network
6.(MISSING) The member is missing from the group; try reconnecting it
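The member states above can also be inspected offline from a saved copy of the `cluster.status()` output. The sketch below uses a fabricated JSON excerpt (the node names and values are illustrative, not from a real cluster) and pulls out every status field with grep:

```bash
#!/bin/sh
# Hypothetical cluster.status() excerpt saved to a file for offline reading.
# Field values are made up; the JSON shape follows the usual mysqlsh layout.
cat > /tmp/cluster_status.json <<'EOF'
{
  "clusterName": "myCluster",
  "defaultReplicaSet": {
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
    "topology": {
      "node1:3306": {"status": "ONLINE", "mode": "R/W"},
      "node2:3306": {"status": "ONLINE", "mode": "R/O"},
      "node3:3306": {"status": "(MISSING)", "mode": "n/a"}
    }
  }
}
EOF

# Print every status field: first the overall cluster state, then each member.
grep -o '"status": "[^"]*"' /tmp/cluster_status.json
```

This is handy when a colleague sends you the raw status JSON and you only need the member states at a glance.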

4) Overall cluster status interpretation
1.OK The cluster is healthy. A 3-node cluster tolerates 1 node failure.
2.OK_NO_TOLERANCE The cluster cannot tolerate any further failure, for example because one node of a 3-node cluster is already down. If another node fails, quorum is lost and so is the primary.
3.NO_QUORUM The cluster has lost quorum; it cannot execute write transactions and becomes read-only.
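The tolerance figures above follow from simple majority arithmetic: a group of N members needs floor(N/2)+1 members for quorum, so it tolerates N minus that majority failing. A quick sketch:

```bash
#!/bin/sh
# Quorum arithmetic behind OK vs OK_NO_TOLERANCE: a group of N members needs
# a majority of floor(N/2)+1, so it tolerates N minus that majority failing.
for n in 3 5 7; do
  majority=$(( n / 2 + 1 ))
  tolerated=$(( n - majority ))
  echo "nodes=$n majority=$majority tolerated_failures=$tolerated"
done
```

For 3 nodes the majority is 2 and one failure is tolerated, which is why losing a second node drops the cluster into NO_QUORUM.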

1.2 Disposal operations

1) Abnormal node automatically rejoins the cluster
    -- On a healthy host, rescan the cluster to refresh its metadata
    [root@docker ~]# /data/mysql/mysql3306/mysql-shell/mysql-shell/bin/mysqlsh --uri [email protected]:3306
    var cluster = dba.getCluster()
    cluster.rescan()
    -- rescan() self-checks the cluster state and prompts you either to
    -- re-add an instance (cluster.rejoinInstance) or to remove it from the
    -- cluster metadata

2) Log in to the damaged node and clean up the cluster information
     var cluster = dba.getCluster()
     dba.dropMetadataSchema()

3) Re-add the damaged node to the cluster
     var cluster = dba.getCluster()
     cluster.addInstance("user@hostname/ip:3306")

-- Manually rejoining an abnormal node to the cluster
1) Log in to the cluster
[root@docker ~]# /data/mysql/mysql3306/mysql-shell/mysql-shell/bin/mysqlsh --uri [email protected]:3306

2) Refresh cluster status
 var cluster = dba.getCluster()
 cluster.rescan()

3) Log in to MySQL on the damaged node and clean up the old cluster information. (This applies when the abnormal node's MySQL still accepts logins.)
1. Log in to the damaged node's standalone MySQL.
mysql -u user -p -h x.x.x.x

2. Clean up the original cluster information.
mysql> stop group_replication;
mysql> reset slave all;
mysql> set global super_read_only=off;
mysql> drop database mysql_innodb_cluster_metadata;

3. Restart the mysql service
service mysqld restart


--- Use a backup to restore data and the cluster
1. With the backup data under /mysqldata/backup/, prepare the backup and copy it back (add --apply-log-only to the prepare step only while incremental backups still remain to be applied):
/> xtrabackup --defaults-file=/data/mysql_[port]/etc/my.cnf --prepare --target-dir=<path to backup>
/> xtrabackup --defaults-file=/data/mysql_[port]/etc/my.cnf --copy-back --target-dir=<path to backup>

2. Fix the ownership of the restored files and start the MySQL service:
chown mysql:mysql /data/mysql_[port]/dbdata/*
chown mysql:mysql /data/mysql_[port]/log/*
service mysqld start
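The restore steps above can be wrapped in a small script with a dry-run switch, so the exact commands can be reviewed before they touch the datadir. PORT and BACKUP_DIR below are illustrative placeholders, not values from this environment:

```bash
#!/bin/sh
# Sketch of the restore sequence with a dry-run switch. PORT and BACKUP_DIR
# are placeholders; adjust them to the real instance before running for real.
DRY_RUN=${DRY_RUN:-1}
PORT=3306
BACKUP_DIR=/mysqldata/backup
CNF=/data/mysql_${PORT}/etc/my.cnf

# Echo the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" -eq 1 ]; then echo "WOULD RUN: $*"; else "$@"; fi
}

run xtrabackup --defaults-file="$CNF" --prepare --target-dir="$BACKUP_DIR"
run xtrabackup --defaults-file="$CNF" --copy-back --target-dir="$BACKUP_DIR"
run chown -R mysql:mysql /data/mysql_${PORT}/dbdata /data/mysql_${PORT}/log
run service mysqld start
```

Run it once with DRY_RUN=1 (the default) to audit the command lines, then with DRY_RUN=0 to execute.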

2. Cluster environment emergency scenarios

2.1 Split-brain

When some nodes in the cluster are UNREACHABLE and the remaining members cannot form a majority, the cluster cannot reach consensus and stops making decisions. A surviving node can then only serve queries; write operations hang. This failure to form a majority is the split-brain problem.

Solution: Remove the faulty node and rejoin the cluster

1. Log in to the faulty node, run stop group_replication, and close the group replication relationship.
SQL > stop group_replication;

2. Log in to the primary node and rejoin the failed node to the cluster.
JS > cluster.rejoinInstance('clusteruser@hostname/ip:port')

2.2 Cluster single node error

-- Check the group replication status of all three cluster nodes and review the faulty node's error log for anomalies. If nothing obvious is wrong, rejoin the faulty node to the cluster using one of the options below.

1. Confirm that the /etc/hosts mappings are correct.
2. Choose one of the following two options.

Option 1: Restart group replication on the faulty node. If this does not solve the problem, use Option 2.
MySQL > stop group_replication;
MySQL > start group_replication;

Option 2: A data difference between primary and replica has broken the replication link; repair the node with the clone method.
MySQL JS > cluster.removeInstance("clusteruser@hostname/ip:port", {force: true})
MySQL JS > cluster.rescan()
MySQL JS > cluster.addInstance("clusteruser@hostname/ip:port")

Specify clone mode when adding the node:
MySQL JS > cluster.addInstance("clusteruser@hostname/ip:port", {recoveryMethod: "clone"})

Problem 3: Error when inserting data

The table does not comply with the requirements by an external plugin

Solution ideas:
Tables written through MGR must have a primary key; add one to the affected table.
How to avoid:
Require a primary key on every table in the MGR cluster.
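As a quick offline check, a schema dump can be scanned for tables that lack a primary key before they ever reach the cluster. The dump below is a fabricated one-statement-per-line sample:

```bash
#!/bin/sh
# Offline check for tables without a PRIMARY KEY, the condition that makes
# MGR reject writes. The dump below is a made-up sample, one CREATE per line.
cat > /tmp/schema.sql <<'EOF'
CREATE TABLE good_tbl (id INT NOT NULL, v INT, PRIMARY KEY (id));
CREATE TABLE bad_tbl (id INT NOT NULL, v INT);
EOF

# Flag any CREATE TABLE line that does not mention PRIMARY KEY.
awk '/CREATE TABLE/ && !/PRIMARY KEY/ { print $3 " has no PRIMARY KEY" }' /tmp/schema.sql
```

Against a live server the same question can be answered from information_schema, but the dump-based check works without any connection.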

Problem 4: Memory usage is too high

High memory usage on a MySQL database node.

Solution ideas:
Troubleshooting:
Check the output of top;
cat /proc/meminfo;
check the memory instruments exposed in performance_schema;
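The first two triage steps can be condensed into a small snapshot script (assuming a Linux host, since it reads /proc/meminfo):

```bash
#!/bin/sh
# Snapshot of the figures used in the triage above: total vs available memory
# from /proc/meminfo, and the resident size of mysqld if it is running.
awk '/^MemTotal:|^MemAvailable:/ { printf "%s %.1f GiB\n", $1, $2 / 1048576 }' /proc/meminfo
# mysqld resident set size in KiB (prints nothing when mysqld is not running)
ps -o rss= -C mysqld 2>/dev/null || true
```

If mysqld's resident size approaches MemTotal, continue into performance_schema (e.g. the memory summary tables) to find which instrument is consuming it.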

Problem 5: Error when deleting data

delete from user;
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

Solution ideas:
The error indicates that the transaction is too large and exceeds group_replication_transaction_size_limit. Reduce the transaction size and operate in batches; to empty a table, use truncate directly, or delete the data in batches.

How to avoid:
It is recommended to add where conditions to delete data in batches, reduce large transactions, split them, and formulate appropriate SQL rules.
If the large transaction comes from a bulk data import, use mysqlsh's import-table utility instead:
mysqlsh [email protected]:port/databasename --ssl-mode=DISABLED -- util import-table /tmp/sbtest1.txt --schema=databasename --table=sbtest1 --bytesPerChunk=2M --threads=2 --maxRate=2M
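The batched-delete advice above can be sketched as a loop that issues bounded DELETE ... LIMIT statements. Table name, predicate, and row count here are made up, and with no live server the statements are only printed:

```bash
#!/bin/sh
# Sketch of batched deletes that keep each transaction under
# group_replication_transaction_size_limit. Values are illustrative only.
BATCH=10000
TOTAL_ROWS=25000          # assumed row count for the illustration
deleted=0
batches=0
while [ "$deleted" -lt "$TOTAL_ROWS" ]; do
  # In a real run, pipe this statement into the mysql client instead of echo.
  echo "DELETE FROM user WHERE created_at < '2023-01-01' LIMIT $BATCH;"
  deleted=$(( deleted + BATCH ))
  batches=$(( batches + 1 ))
done
echo "executed $batches batches"
```

Each iteration commits a transaction of at most BATCH rows, so no single transaction approaches the group replication size limit.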

Problem 6: Unable to rejoin the cluster after executing DDL/DML

In a multi-primary MGR cluster, a node acting as a primary kept executing DDL and DML after the cluster state became abnormal and it was no longer a member. The extra transactions (GTIDs) prevent it from rejoining the cluster.

Solution:
1. Stop group replication on the cluster nodes
mysql> stop group_replication;

2. Use the node with the extra GTIDs as the new primary and let the other nodes rejoin the cluster.
MySQL Shell commands:
JS> dba.dropMetadataSchema({force: true})
JS> var cluster = dba.createCluster('myCluster')
JS> cluster.addInstance('clusteruser@hostname/ip:port', {recoveryMethod: "clone"})
JS> cluster.addInstance('clusteruser@hostname/ip:port', {recoveryMethod: "clone"})

3. Check the status of the new cluster:
JS> cluster.status()

Problem 7: Duplicate server UUID

Cannot add an instance with the same server UUID (xxxxxxxx-xxxxxxx-). Please change the server UUID of the instance to add

Solution ideas:
If two instances share the same server UUID, delete auto.cnf in the datadir of the instance being added and restart it; a new UUID is generated on startup.
If joining still fails after the deletion with a 3096 error, the host name mapping (cat /etc/hosts) may be written incorrectly and needs to be checked carefully.

How to avoid:
When building a new database host by cloning an existing machine image, remove instance-specific files such as auto.cnf in advance.
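The duplicate-UUID condition can be checked offline by comparing the auto.cnf files of two datadirs. Both files below are fabricated with the same UUID to demonstrate the warning path:

```bash
#!/bin/sh
# Offline duplicate-UUID check across two datadirs. The auto.cnf files here
# are fabricated samples sharing one UUID, to exercise the warning branch.
mkdir -p /tmp/node1 /tmp/node2
printf '[auto]\nserver-uuid=aaaaaaaa-0000-0000-0000-000000000001\n' > /tmp/node1/auto.cnf
printf '[auto]\nserver-uuid=aaaaaaaa-0000-0000-0000-000000000001\n' > /tmp/node2/auto.cnf

u1=$(sed -n 's/^server-uuid=//p' /tmp/node1/auto.cnf)
u2=$(sed -n 's/^server-uuid=//p' /tmp/node2/auto.cnf)
if [ "$u1" = "$u2" ]; then
  echo "duplicate UUID $u1: delete one node's auto.cnf and restart it"
fi
```

On real hosts, point the two sed commands at each instance's actual datadir path instead of /tmp.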

Problem 8: Failed to add a cluster member

Solution ideas:
The host names and IP addresses of the member nodes were not added to /etc/hosts on each node, so the host name could not be resolved when joining the cluster.

Solution:
Add the IP and host name of the node being added to /etc/hosts on every member, then retry.
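A small pre-flight loop can confirm that every member hostname resolves before calling addInstance(). The member list below is a placeholder; localhost is used so the loop has one name that resolves on any machine:

```bash
#!/bin/sh
# Pre-flight check: confirm each cluster member hostname resolves via
# /etc/hosts or DNS. Replace the list with the real member host names.
for h in localhost; do
  if getent hosts "$h" > /dev/null 2>&1; then
    echo "$h resolves"
  else
    echo "$h is missing from /etc/hosts or DNS"
  fi
done
```

Run this on every node, not just the one joining: group replication members resolve each other in both directions.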

Problem 9: File damage and single-node damage

When disk damage, link problems, operating system problems, or database bugs corrupt the data disk and produce large numbers of bad blocks, the node cannot serve the application.
When the same causes damage the log file currently in use, the node cannot serve the application.
When disk damage, network problems, operating system problems, or database bugs interrupt or shut down a server, damage confined to a single node does not affect normal application access.

1. Check the cluster
/> cat /etc/hosts
/> mysqlsh --uri user@hostname/ip:3306
/> var cluster = dba.getCluster()
/> cluster.status()


2. Emergency response
/> var cluster = dba.getCluster()
/> cluster.rescan()
PS: The rescan command self-checks the cluster state; if it finds an abnormal node, it prompts you to take action. Follow the prompts.


3. Rebuild the host
Install a replacement host with the same configuration, IP address, host name, and other settings as the damaged server.


4. Add the rebuild host to the cluster
Log in to the cluster (on the primary node).
1. Add new nodes to the cluster.
/> var cluster = dba.getCluster()
/> cluster.addInstance('user@hostname/ip:3306')

2. Enter C when prompted to use clone to add the host to the cluster.

3. After the addition is successful, repeat step 1 to check the cluster environment.

Problem 10: Most nodes in the cluster are damaged

When disk damage, network problems, operating system problems, or database bugs take down the majority of the nodes, the cluster loses quorum and can no longer accept writes.

Emergency measures:
When most nodes in the cluster are damaged, for example, if 2 nodes are damaged in a 3-node cluster, the cluster will lose the primary. The surviving nodes will not be able to accept write operations. The primary needs to be restored as soon as possible to provide services, and then the damaged nodes will be repaired.

1. Check the cluster
2. Use mysqlsh to log in to a surviving node and force the cluster back to a working quorum:
JS > cluster.forceQuorumUsingPartitionOf("clusteruser@hostname/ip:3306")
3. Recover the failed nodes
Restore the data and the cluster using backups, as in section 1.2.

Problem 11: All nodes are damaged

1. Check the cluster
2. Emergency response
Restore every node from backup, then recreate the cluster and rejoin the members.