Dr-Autosync TiDB cluster: planned and unplanned switchover verification steps

Author: pepezzzz Original source: https://tidb.net/blog/0cdd8bd0

Environment preparation

Cluster name and version

tidb cluster: tidb-h

Version: v6.1.0

Cluster topology: Dr-Autosync cluster deployed in two centers

Replicas: five Voter replicas + one Learner replica

Check storage node topology

On the Dashboard page, the TiKV nodes are correctly configured with the four-level label hierarchy (dc / logic / rack / host).

The primary data center dc1 uses the 192.168.1.x network segment with labels logic1/2/3, and the secondary data center dc2 uses the 192.168.4.x network segment with labels logic4/5/6.
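For reference, these labels are assigned per TiKV instance in the tiup topology file. A minimal sketch (the host names and label values below are illustrative, not taken from the original topology):

tikv_servers:
  - host: 192.168.1.11            # hypothetical dc1 TiKV host
    config:
      server.labels: { dc: "dc1", logic: "logic1", rack: "rack1", host: "host111" }
  - host: 192.168.4.11            # hypothetical dc2 TiKV host
    config:
      server.labels: { dc: "dc2", logic: "logic4", rack: "rack1", host: "host411" }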

Check Dr-Autosync placement-rules configuration

View placement-rules configuration

# tiup ctl:v<CLUSTER_VERSION> pd -u {pd-ip}:2379 -i
>config placement-rules show

The rule output is as follows. dc1 uses the 192.168.1.x network segment with labels logic1/2/3, and the secondary data center dc2 uses the 192.168.4.x network segment with labels logic4/5/6.

[
  {
    "group_id": "pd",
    "id": "dc1-logic1",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic1"
        ]
      }
    ],
    "location_labels": [
      "dc",
      "logic",
      "rack",
      "host"
    ]
  },
  {
    "group_id": "pd",
    "id": "dc1-logic2",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic2"
        ]
      }
    ],
    "location_labels": ["dc", "logic", "rack", "host"]
  },
  {
    "group_id": "pd",
    "id": "dc1-logic3",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic3"
        ]
      }
    ],
    "location_labels": ["dc", "logic", "rack", "host"]
  },
  {
    "group_id": "pd",
    "id": "dc2-logic4",
    "start_key": "",
    "end_key": "",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic4"
        ]
      }
    ],
    "location_labels": ["dc", "logic", "rack", "host"]
  },
  {
    "group_id": "pd",
    "id": "dc2-logic5",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic5"
        ]
      }
    ],
    "location_labels": ["dc", "logic", "rack", "host"]
  },
  {
    "group_id": "pd",
    "id": "dc2-logic6",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic6"
        ]
      }
    ],
    "location_labels": ["dc", "logic", "rack", "host"]
  }
]

Check Dr-Autosync replication-mode configuration

View the replication-mode configuration. The output below shows that the replication mode is dr-auto-sync, the data-center-level label key is dc, the primary data center is dc1, the secondary data center is dc2, the primary data center holds 3 replicas, the secondary data center holds 2 replicas, and wait-store-timeout keeps its default of 1 minute.

>config show all
...
"replication-mode": {
  "replication-mode": "dr-auto-sync",
  "dr-auto-sync": {
    "label-key": "dc",
    "primary": "dc1",
    "dr": "dc2",
    "primary-replicas": 3,
    "dr-replicas": 2,
    "wait-store-timeout": "1m0s"
...
  }
}
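For reference, the same values correspond to the following PD configuration block (TOML format, as used in the official two-data-center deployment guide); in a tiup deployment they would normally be expressed under the PD section of server_configs. This is a sketch of the equivalent settings, not the original topology file:

[replication-mode]
replication-mode = "dr-auto-sync"
[replication-mode.dr-auto-sync]
label-key = "dc"
primary = "dc1"
dr = "dc2"
primary-replicas = 3
dr-replicas = 2
wait-store-timeout = "1m"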

Check the sync status of Dr-Autosync cluster

Confirm the replication status through the DR_STATE file stored in the Data Dir path of the PD node of dc2:

#cat {tidb-datadir}/pd_data/DR_STATE
{"state":"sync","state_id":2158}

Check the Leader distribution of all TiKV server nodes

On the Leader-distribution monitoring panel, Leaders are evenly distributed across all TiKV nodes except the learner instances (192.168.4.105 / 192.168.4.106) under the dc2-logic4 label.
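If the monitoring panel is not at hand, roughly the same check can be done through INFORMATION_SCHEMA (a sketch; the learner stores are expected to show a LEADER_COUNT of 0):

SELECT ADDRESS, LEADER_COUNT, REGION_COUNT
FROM INFORMATION_SCHEMA.TIKV_STORE_STATUS
ORDER BY ADDRESS;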

Check PD Server node and Leader distribution

# tiup ctl:v<CLUSTER_VERSION> pd -u {pd-ip}:2379 -i
>member
  1. PD members should be configured as an odd number of nodes, with the majority located in the dc1 data center.
  2. The Leader should be on a PD node in the 192.168.1.x network segment of the dc1 data center; the status output is as follows:
...
  "leader": {
  "name":"pd-192.168.1.x-2379",
  "member_id": 11430528617142211933,
  "peer urls": [
    "http://192.168.1.x:2380",
  ],
  "client_urls": [
    "http://192.168.1.x:2379"
  ],
...

Check the load balancing configuration of all tidb server nodes

The two data centers each have their own load balancer, each routing to the TiDB servers in its own data center.
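The original post does not specify the load balancer product. Purely as an illustration, a TCP load balancer in front of the dc1 TiDB servers could look like this minimal HAProxy sketch (names and addresses are placeholders):

# haproxy.cfg fragment - hypothetical dc1 load balancer for the dc1 TiDB servers
listen tidb-dc1
    bind *:4000
    mode tcp
    balance leastconn
    server tidb-dc1-1 192.168.1.x:4000 check
    server tidb-dc1-2 192.168.1.x:4000 check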

Start upstream application

Simulate business applications: Sysbench, BANK.

The BANK consistency test application runs against the dc1 load balancer on the 192.168.1.x network segment; it is used later to verify RPO = 0 during the unplanned switchover.

./bank_arm64 -dsn 'bankuser:bankuser@tcp({tidb-h-dc1-ip}:4000)/cdcdb' -accounts 10000

After the command starts, it should periodically print INFO[0002] verify success in xxxx-xx-xx xx:xx:xx ..., indicating that the total balance verification succeeded.

Sysbench simulates the business workload against the dc2 load balancer on the 192.168.4.x network segment.

sysbench /usr/local/share/sysbench/oltp_read_write.lua --mysql-host={tidb-h-dc2-ip} --mysql-port=4000 --mysql-db=cdcdb --mysql-user=sbuser --mysql-password=sbuser --tables=20 --threads=20 --time=600 run > /tmp/sysbench.log &
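If the sbtest tables have not been created yet, a prepare step is needed first. A sketch mirroring the run parameters (the --table-size value is an assumption, not from the original post):

sysbench /usr/local/share/sysbench/oltp_read_write.lua --mysql-host={tidb-h-dc2-ip} --mysql-port=4000 --mysql-db=cdcdb --mysql-user=sbuser --mysql-password=sbuser --tables=20 --table-size=100000 prepare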

Simulate a planned primary-secondary switchover of the dr-autosync cluster

Switch PD Leader

Switch the Leader to the designated PD node in the 192.168.4.x network segment of the dc2 data center.

# tiup ctl:v<CLUSTER_VERSION> pd -u {pd-ip}:2379 -i
>member leader transfer "pd-192.168.4.x-2379"

Check the status of the new leader.

>member

Confirm from the Leader status output below that the Leader is now on a PD node in the 192.168.4.x network segment of the dc2 data center.

...
  "leader": {
  "name":"pd-192.168.4.x-2379",
  "member_id": 11430528617142211933,
  "peer urls": [
    "http://192.168.4.x:2380",
  ],
  "client_urls": [
    "http://192.168.4.x:2379"
  ],
  ...

Because the PD Leader is briefly unavailable, sysbench and BANK report errors for about 10 seconds and then recover.

The sysbench log is as follows:

57s thds: 20 tps: 688.00 qps: 13750.93 (r/w/o: 9617.95/2756.99/1375.99) lat (ms,95%): 36.24 err/s: 0.00 reconn/s: 0.00
58s thds: 20 tps: 703.00 qps: 14072.05 (r/w/o: 9864.03/2802.01/1406.00) lat (ms,95%): 33.72 err/s: 0.00 reconn/s: 0.00
59s thds: 20 tps: 105.00 qps: 2104.99 (r/w/o: 1454.99/444.00/206.00) lat (ms,95%): 33.12 err/s: 9.00 reconn/s: 0.00
60s thds: 20 tps: 0.00 qps: 57.00 (r/w/o:42.00/12.00/3.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
61s thds: 20 tps: 0.00 qps: 152.00 (r/w/o: 112.00/32.00/8.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
62s thds: 20 tps: 0.00 qps: 19.00 (r/w/o: 14.00/4.00/1.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
63s thds: 20 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
64s thds: 20 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
65s thds: 20 tps: 0.00 qps: 148.00 (r/w/o: 112.00/28.00/8.00) lat (ms,95%): 0.00 err/s: 8.00 reconn/s: 0.00
66s thds: 20 tps: 0.00 qps: 195.00 (r/w/o: 152.00/32.00/11.00) lat (ms,95%): 0.00 err/s: 11.00 reconn/s: 0.00
67s thds: 20 tps: 0.00 qps: 25.00 (r/w/o: 16.00/8.00/1.00) lat (ms,95%): 0.00 err/s: 1.00 reconn/s: 0.00
68s thds: 20 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
69s thds: 20 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
70s thds: 20 tps: 0.00 qps: 145.00 (r/w/o: 98.00/40.00/7.00) lat (ms,95%): 0.00 err/s: 7.00 reconn/s: 0.00
71s thds: 20 tps: 0.00 qps: 143.00 (r/w/o: 109.00/26.00/8.00) lat (ms,95%): 0.00 err/s: 8.00 reconn/s: 0.00
72s thds: 20 tps: 0.00 qps: 42.00 (r/w/o: 31.00/9.00/2.00) lat (ms,95%): 0.00 err/s: 2.00 reconn/s: 0.00
73s thds: 20 tps: 18.00 qps: 325.00 (r/w/o: 238.00/51.00/36.00) lat (ms,95%): 14827.42 err/s: 0.00 reconn/s: 0.00
74s thds: 20 tps: 580.99 qps: 11558.89 (r/w/o: 8105.92/2291.98/1160.99) lat (ms,95%): 39.65 err/s: 0.00 reconn/s: 0.00
75s thds: 20 tps: 636.00 qps: 12696.02 (r/w/o: 8872.01/2551.00/1273.00) lat (ms,95%): 37.56 err/s: 0.00 reconn/s: 0.00
76s thds: 20 tps: 634.00 qps: 12648.05 (r/w/o: 8872.03/2508.01/1268.00) lat (ms,95%): 37.56 err/s: 0.00 reconn/s: 0.00

The bank’s log is as follows:

INFO[0986] verify success in 2022-09-16 18:09:47.91554303 +0800 CST m=+986.045061901
INFO[0988] verify success in 2022-09-16 18:09:49.914612 +0800 CST m=+988.044130871
ERRO[0992] move money error: Error 8027: Information schema is out of date: schema failed to update in 1 lease, please make sure TiDB can connect to TiKV
...
ERRO[0992] move money error: Error 8027: Information schema is out of date: schema failed to update in 1 lease, please make sure TiDB can connect to TiKV
INFO[0994] verify success in 2022-09-16 18:09:55.91037518 +0800 CST m=+994.039894051
INFO[0996] verify success in 2022-09-16 18:09:57.91316108 +0800 CST m=+996.042679951

Scale out new PD nodes

Scale out PD nodes on the 192.168.4.x network segment of the dc2 data center, as the first step of moving the PD majority to dc2.

vi pd4.yaml
tiup cluster scale-out tidb-h pd4.yaml -u root -p
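A minimal sketch of what pd4.yaml could contain (hosts, ports, and directories below are placeholders rather than the original topology):

# pd4.yaml - scale-out topology for the additional dc2 PD node(s)
pd_servers:
  - host: 192.168.4.x
    client_port: 2379
    peer_port: 2380
    deploy_dir: /tidb-deploy/pd-2379
    data_dir: {tidb-datadir}/pd_data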

The workload should remain unaffected during the scale-out. After it completes, the two data centers should have the same number of PD nodes, which can be confirmed with:

tiup cluster display tidb-h

Scale down PD nodes

Scale in the PD nodes on the 192.168.1.x network segment of the dc1 data center so that dc1 holds the PD minority.

The workload should remain unaffected during the scale-in. After it completes, the dc1 data center has fewer PD nodes than dc2.

tiup cluster scale-in tidb-h -N 192.168.1.x:2379
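One quick way to re-check PD membership after the scale-in (a sketch; assumes jq is installed on the control machine):

tiup ctl:v<CLUSTER_VERSION> pd -u http://{pd-ip}:2379 member \
  | jq -r '.members[].client_urls[]'
# the 192.168.4.x segment should now hold the majority of the listed client URLs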

Switch TiKV roles

Check the placement-rules configuration to be used after the data center switchover: dc1-logic1 should be designated as learner, and dc2-logic4 as voter.

# vi /home/tidb/dr-rules/dr-rules-dc2.json
[
  {
    "group_id": "pd",
    "id": "dc1-logic1",
    "start_key": "",
    "end_key": "",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic1"
        ]
      }
    ],
    "location_labels": [
      "dc",
      "logic",
      "rack",
      "host"
    ]
  },
...
  {
    "group_id": "pd",
    "id": "dc2-logic4",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      {
        "key": "logic",
        "op": "in",
        "values": [
          "logic4"
        ]
      }
    ],
    "location_labels": ["dc", "logic", "rack", "host"]
  }
...

After backing up the current configuration, load the new placement rules to swap the TiKV Leader roles, and modify the replication-mode configuration accordingly.

# tiup ctl:v<CLUSTER_VERSION> pd -u {pd-ip}:2379 -i
>config placement-rules rule-bundle load --out="/home/tidb/dr-rules/dr-rules-dc1_backup.json"
>config placement-rules rule-bundle save --in="/home/tidb/dr-rules/dr-rules-dc2.json"
>config set replication-mode dr-auto-sync primary dc2
>config set replication-mode dr-auto-sync dr dc1
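The swap can be double-checked with the same pd-ctl commands used earlier: config placement-rules show should now list dc1-logic1 with role learner and dc2-logic4 with role voter, and the replication-mode block of config show all should report primary dc2 and dr dc1.

>config placement-rules show
>config show all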

Observe TiKV leader switching

After the data center roles are reversed, the Leaders on the TiKV nodes under the dc1-logic1 / dc2-logic4 labels are switched; the switchover usually takes less than 1 minute.

Check the sync status of Dr-Autosync cluster

Confirm the replication status through the DR_STATE file stored in the Data Dir path of the PD node of dc1:

#cat {tidb-datadir}/pd_data/DR_STATE
{"state":"sync","state_id":2158}

At this point, the planned primary-secondary switchover of the Dr-Autosync cluster is complete.

(Optional) Shut down all instances of dc1 to simulate primary datacenter outage

Method 1: Revert to majority mode before shutting down

Revert to majority mode.

# tiup ctl:v<CLUSTER_VERSION> pd -u {pd-ip}:2379 -i
>config set replication-mode majority

Add evict-leader schedulers to remove all Leaders from the stores in the dc1 data center, reducing the application impact when dc1 is shut down. The command below targets a single store ID; repeat it for every dc1 store, as sketched after the command.

>scheduler add evict-leader-scheduler 1
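Since the command above only covers store ID 1, a small loop can add the same scheduler for every dc1 store (the store IDs below are placeholders; pd-ctl is invoked in single-command mode):

for id in 1 4 5 7 8; do   # replace with the actual dc1 store IDs
  tiup ctl:v<CLUSTER_VERSION> pd -u http://{pd-ip}:2379 scheduler add evict-leader-scheduler ${id}
done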

Shut down all instance nodes in the dc1 data center to simulate an outage of the primary data center. In theory, the application traffic in dc2 should be almost unaffected.

tiup cluster stop tidb-h -N {dc1-node-ip1}:{port},{dc1-node-ip2}:{port},...,{dc1-node-ipN}:{port}

Method 2: Shut down directly

Shut down all instance nodes in the dc1 data center directly to simulate an outage of the primary data center.

tiup cluster stop tidb-h -N {dc1-node-ip1}:{port},{dc1-node-ip2}:{port},...,{dc1-node-ipN}:{port}

The dr-autosync parameter wait-store-timeout is configured as 1m. After the TiKV nodes in the primary data center stop serving, the cluster degrades to the async replication mode, and all reads and writes are blocked during this window.

The sysbench workload resumes after an interruption of about 85 seconds.

The BANK program's connection must be changed from the dc1 load balancer to the dc2 load balancer; the BANK workload resumes once the application is restarted.

Simulate an unplanned dr-autosync cluster switchover: takeover by the secondary data center

Configure a standby tiup node in the secondary data center

  1. Check the meta.yaml file on the tiup node in the secondary data center; it should contain only the node instances of the secondary data center.
  2. The output of tiup cluster display tidb-h should likewise show only the node instances on the 192.168.4.x network segment of the secondary data center.

Check dc2 restore script

Query the store IDs of all TiKV stores in dc1.

select STORE_ID, STORE_STATE, ADDRESS from information_schema.TIKV_STORE_STATUS where ADDRESS like '192.168.1.%' order by ADDRESS;
select group_concat(STORE_ID order by ADDRESS) from information_schema.TIKV_STORE_STATUS where ADDRESS like '192.168.1.%';
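The same list can also be obtained from pd-ctl without going through a TiDB node, for example (a sketch; assumes jq is available):

tiup ctl:v<CLUSTER_VERSION> pd -u http://{pd-ip}:2379 store \
  | jq -r '[.stores[].store | select(.address | startswith("192.168.1.")) | .id | tostring] | join(",")'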

On the dc2 tiup node, check the store ID information in the restore script.

# parameters
CLUSTER_NAME=tidb-h
# unhealthy store_id list
STORE_ID="32,31,33,25,22,23,27,29,20,13,21,14,28,17,26,19,34,37,36,35,24,15,16,18"
CTL_VERSION=v6.1.1
PD_RECOVER_DIR=/home/tidb/tidb-community-toolkit-v6.1.1-linux-arm64/

Check the placement rules file for dc2

cat > rules_dr.json <<EOF
[
  {
    "group_id": "pd",
    "group_index": 0,
    "group_override": false,
    "rules": [
      {
        "group_id": "pd",
        "id": "dc2-logic4",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1,
        "label_constraints": [
          {
            "key": "logic",
            "op": "in",
            "values": [
              "logic4"
            ]
          }
        ],
        "location_labels": [
          "dc",
          "logic",
          "rack",
          "host"
        ]
      },
      {
        "group_id": "pd",
        "id": "dc2-logic5",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1,
        "label_constraints": [
          {
            "key": "logic",
            "op": "in",
            "values": [
              "logic5"
            ]
          }
        ],
        "location_labels": ["dc", "logic", "rack", "host"]
      },
      {
        "group_id": "pd",
        "id": "dc2-logic6",
        "start_key": "",
        "end_key": "",
        "role": "voter",
        "count": 1,
        "label_constraints": [
          {
            "key": "logic",
            "op": "in",
            "values": [
              "logic6"
            ]
          }
        ],
        "location_labels": ["dc", "logic", "rack", "host"]
      }
    ]
  }
]
EOF

Shut down all instances of dc1 to simulate primary data center outage

Shut down all instance nodes of dc1 directly to simulate an outage of the primary data center.

tiup cluster stop tidb-h -N {dc1-node-ip1}:{port},{dc1-node-ip2}:{port},...,{dc1-node-ipN}:{port}

Stop the upstream application

sysbench and BANK should be reporting errors at this point; stop them with pkill.

pkill sysbench
pkill bank_arm64

Perform a forced switchover

Execute the minority-recovery script disaster_recover on the dc2 tiup node.

./disaster_recover

Depending on the number of stores and Regions, execution takes about 5 minutes.

Check that the TiKV Leader distribution is now confined to the dc2 data center.

Re-push the TiCDC configuration files according to the topology.

tiup cluster reload tidb-h -R cdc

Business data verification on the original minority (dc2)

Business applications can be checked manually for consistency. For example, the BANK application can confirm that transactions are consistent from the latest records, and the current TSO should be recorded.

select sum(balance) from accounts;
select * from record where tso=(select max(tso) from record);
select * from accounts where tso=(select max(tso) from accounts);
select tidb_current_ts();

After the data is recorded, the business application can be resumed.

Business data verification on the original majority (dc1)

Start the original majority nodes of the tidb-h cluster in the dc1 data center, excluding TiCDC.

tiup cluster start tidb-h -N 192.168.1.x:2379,192.168.1.x:4000,192.168.1.x:20160

Business applications can be checked manually for consistency. For example, the BANK application can confirm that transactions are consistent from the latest records and record the current TSO. If the query results of the two now-independent clusters match, the dr-autosync solution achieves RPO = 0.

select sum(balance) from accounts;
select * from record where tso=(select max(tso) from record);
select * from accounts where tso=(select max(tso) from accounts);
select tidb_current_ts();
