etcd failure: "recovering backend from snapshot error: failed to find database snapshot file"

1. Problem description

The server crashed unexpectedly and Kubernetes would not start. The kubelet log showed that node "master" was not found, and the etcd log reported the following error, which makes it clear that the database file is damaged:

[root@master01 ~]# journalctl -u etcd -f
...
Oct 08 19:13:40 master01 etcd[66468]: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
Oct 08 19:13:40 master01 etcd[66468]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
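
Before changing anything, it is worth confirming which members actually lost the snapshot file. A minimal check, assuming the data directory is /var/lib/etcd as used throughout this post:

# on each member, list the snapshot and WAL directories;
# an affected member is missing the .snap file that the
# backend expects to find on restart
ls -l /var/lib/etcd/member/snap/
ls -l /var/lib/etcd/member/wal/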

2. Solve the problem

Further inspection showed that both etcd1 and etcd2 reported the above error, while etcd3 did not.

2.1 Backup

On etcd1 and etcd2, make sure etcd is stopped, then move the damaged data aside:

mv /var/lib/etcd/member /opt/
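
It is also cheap insurance to keep an untouched copy of the one intact data directory before it gets copied around. A minimal sketch on etcd3 (the /opt/member.bak destination is arbitrary):

[root@master03 ~]# cp -a /var/lib/etcd/member /opt/member.bak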

2.2 Copy data

Copy etcd3's intact data directory to etcd1 and etcd2 (scp needs -r here, since member is a directory):

scp -r /var/lib/etcd/member 10.0.0.87:/var/lib/etcd/
scp -r /var/lib/etcd/member 10.0.0.97:/var/lib/etcd/
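
A quick sanity check that the copies landed intact is to compare the directory sizes on all three nodes (plain coreutils, nothing etcd-specific):

[root@master03 etcd]# du -sh /var/lib/etcd/member
[root@master01 etcd]# du -sh /var/lib/etcd/member
[root@master02 etcd]# du -sh /var/lib/etcd/member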

2.3 Start etcd1 and etcd2

[root@master01 etcd]# systemctl start etcd
[root@master01 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 71124 (etcd)
    Tasks: 11
   Memory: 84.5M
   CGroup: /system.slice/etcd.service
           └─71124 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml

----------------------------------------
[root@master02 etcd]# systemctl start etcd
[root@master02 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 4643 (etcd)
    Tasks: 12
   Memory: 72.2M
   CGroup: /system.slice/etcd.service
           └─4643 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml

2.4 Start etcd3

However, etcd3 failed to start:

[root@master03 etcd]# systemctl start etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
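
The panic behind the failure can be read from the unit's journal before deciding how to recover (standard journalctl flags):

[root@master03 etcd]# journalctl -u etcd -n 50 --no-pager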

Because etcd1 and etcd2 are now healthy and form a quorum, etcd3's data can simply be deleted, letting the member resynchronize from its peers.

[root@master03 etcd]# pwd
/var/lib/etcd
[root@master03 etcd]# rm -rf ./*

[root@master03 etcd]# systemctl start etcd

# verify
[root@master03 etcd]# journalctl -u etcd -f
-- Logs begin at Sat 2023-10-07 20:57:40 CST. --
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 2706
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 2706 (took 1.3901ms)
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 3302
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 3302 (took 2.115ms)
Oct 08 19:54:41 master03 etcd[12512]: published {Name:master03 ClientURLs:[https://10.0.0.107:2379]} to cluster 514c88d14c1a2aa1
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Oct 08 19:54:41 master03 systemd[1]: Started Etcd Service.
Oct 08 19:54:41 master03 etcd[12512]: serving client requests on 10.0.0.107:2379
Oct 08 19:54:42 master03 etcd[12512]: updated the cluster version from 3.0 to 3.4
Oct 08 19:54:42 master03 etcd[12512]: enabled capabilities for version 3.4

[root@master03 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:54:41 CST; 2min 41s ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 12512 (etcd)
    Tasks: 10
   Memory: 58.7M
   CGroup: /system.slice/etcd.service
           └─12512 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
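
Here a plain restart with an empty data directory was enough, because the cluster still had etcd3 registered under the same member identity. If a wiped member refuses to rejoin, the usual fallback is to drop and re-add it through the membership API from a healthy node. A sketch, reusing the endpoint and certificate paths from section 3.1 below (the 2380 peer port is the etcd default and is an assumption, since the peer URLs are not shown in this post):

[root@master01 ~]# export ETCDCTL_API=3
[root@master01 ~]# export ETCDCTL_ENDPOINTS="10.0.0.87:2379"
[root@master01 ~]# export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/etcd-ca.pem
[root@master01 ~]# export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/etcd.pem
[root@master01 ~]# export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/etcd-key.pem
# look up the stale member's ID, remove it, then re-register master03
[root@master01 ~]# etcdctl member list
[root@master01 ~]# etcdctl member remove 41172b80a9c89e7f
[root@master01 ~]# etcdctl member add master03 --peer-urls=https://10.0.0.107:2380
# on master03: empty /var/lib/etcd, set initial-cluster-state to
# "existing" in /etc/etcd/etcd.config.yml, then systemctl start etcd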

3. Verify the data

3.1 Verify etcd cluster

[root@master01 init]# export ETCDCTL_API=3
[root@master01 init]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint status --write-out=table
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  10.0.0.87:2379 | 949b9ccaa465bea8 |  3.4.13 |  3.9 MB |      true |      false |        14 |       5239 |               5239 |        |
|  10.0.0.97:2379 | 795272eff6c8418e |  3.4.13 |  3.8 MB |     false |      false |        14 |       5239 |               5239 |        |
| 10.0.0.107:2379 | 41172b80a9c89e7f |  3.4.13 |  3.9 MB |     false |      false |        14 |       5239 |               5239 |        |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
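
endpoint status already shows all three members on the same raft term and index; endpoint health is a quick complementary liveness probe using the same flags:

[root@master01 init]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint health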

3.2 Verify k8s cluster

[root@master01 ~]# kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
master01   Ready    <none>   35m   v1.20.0
master02   Ready    <none>   35m   v1.20.0
master03   Ready    <none>   35m   v1.20.0
node02     Ready    <none>   35m   v1.20.0

[root@master01 dashboard]# kubectl get pod -A
NAMESPACE              NAME                                         READY   STATUS    RESTARTS   AGE
kube-system            calico-kube-controllers-5f6d4b864b-jf7sl     1/1     Running   1          50m
kube-system            calico-node-5vkdg                            1/1     Running   2          50m
kube-system            calico-node-k4jtq                            1/1     Running   1          50m
kube-system            calico-node-l27hd                            1/1     Running   1          50m
kube-system            calico-node-vt9jf                            1/1     Running   2          50m
kube-system            calico-node-w7x9b                            1/1     Running   1          50m
kube-system            coredns-867d46bfc6-6rx4r                     1/1     Running   2          38m
kube-system            metrics-server-595f65d8d5-wvp8s              1/1     Running   1          36m
kubernetes-dashboard   dashboard-metrics-scraper-79c5968bdc-ghmzr   1/1     Running   2          31m
kubernetes-dashboard   kubernetes-dashboard-9f9799597-vqzmq         1/1     Running   1          24m