1. Problem description
The server crashed unexpectedly and Kubernetes would not start. The kubelet log showed that node “master” was not found, and the etcd log reported the following error:
[root@master01 ~]# journalctl -u etcd -f
...
Oct 08 19:13:40 master01 etcd[66468]: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
Oct 08 19:13:40 master01 etcd[66468]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)

Clearly, the database snapshot file is damaged or missing.
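The panic means etcd's write-ahead log references a snapshot whose .snap file no longer exists under the data directory's member/snap. A quick way to see what is actually on disk is to list that directory. The sketch below mocks the broken layout in a temp directory so it can run anywhere; on a real node you would point the final `find` at /var/lib/etcd/member instead (the temp paths and file names here are illustrative assumptions):

```shell
# Mock the broken on-disk layout: a WAL file exists, but the snapshot
# it references is gone - the state that triggers the panic above.
data_dir=$(mktemp -d)
mkdir -p "$data_dir/member/snap" "$data_dir/member/wal"
touch "$data_dir/member/wal/0000000000000000-0000000000000000.wal"
# Count .snap files - zero means etcd has nothing to recover from:
find "$data_dir/member" -name '*.snap' | wc -l
# Show everything that is on disk for this member:
find "$data_dir/member" -type f
```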
2. Solving the problem
Further inspection showed that etcd1 and etcd2 both had the above error, while etcd3 did not.
2.1 Backup
Execute on etcd1 and etcd2:
mv /var/lib/etcd/member /opt/
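The plain mv above works, but a timestamped destination avoids clobbering any earlier backup in /opt. A sketch of that variant, demonstrated in a temp sandbox rather than the real /var/lib/etcd (the sandbox paths and member subdirectories are assumptions):

```shell
data_dir=$(mktemp -d)      # stands in for /var/lib/etcd
backup_root=$(mktemp -d)   # stands in for /opt
mkdir -p "$data_dir/member/snap" "$data_dir/member/wal"

# Move the damaged member directory aside under a timestamped name,
# so repeated recovery attempts never overwrite a previous backup.
stamp=$(date +%Y%m%d-%H%M%S)
mv "$data_dir/member" "$backup_root/member.$stamp"
ls "$backup_root"
```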
2.2 Copy data
Copy the healthy data from etcd3 to etcd1 and etcd2 (member is a directory, so scp needs -r):
scp -r /var/lib/etcd/member 10.0.0.87:/var/lib/etcd/
scp -r /var/lib/etcd/member 10.0.0.97:/var/lib/etcd/
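After the copy it is worth confirming that the two trees really match before starting etcd. The sketch below demonstrates the comparison on local temp directories, with cp -r standing in for the scp -r transfer; on the real hosts you would run the same find/md5sum pipeline on etcd3 and on each destination and diff the output (all paths and the sample snapshot name are assumptions):

```shell
src=$(mktemp -d)
dst=$(mktemp -d)
mkdir -p "$src/member/snap" "$src/member/wal"
echo snapshot-bytes > "$src/member/snap/0000000000000002-0000000000000a00.snap"

cp -r "$src/member" "$dst/"   # stands in for: scp -r .../member host:/var/lib/etcd/

# Checksum every file on both sides and compare the sorted lists.
src_sum=$( (cd "$src" && find member -type f -exec md5sum {} +) | sort )
dst_sum=$( (cd "$dst" && find member -type f -exec md5sum {} +) | sort )
[ "$src_sum" = "$dst_sum" ] && echo "trees match"
```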
2.3 Start etcd1 and etcd2
[root@master01 etcd]# systemctl start etcd
[root@master01 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 71124 (etcd)
    Tasks: 11
   Memory: 84.5M
   CGroup: /system.slice/etcd.service
           └─71124 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
----------------------------------------
[root@master02 etcd]# systemctl start etcd
[root@master02 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 4643 (etcd)
    Tasks: 12
   Memory: 72.2M
   CGroup: /system.slice/etcd.service
           └─4643 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
2.4 Start etcd3
However, an error occurred when starting etcd3:
[root@master03 etcd]# systemctl start etcd
Job for etcd.service failed because the control process exited with error code.
See "systemctl status etcd.service" and "journalctl -xe" for details.
Since etcd1 and etcd2 are both healthy again, etcd3's local data can simply be deleted so that the member re-synchronizes from the cluster when it starts.
[root@master03 etcd]# pwd
/var/lib/etcd
[root@master03 etcd]# rm -rf ./*
# verify
[root@master03 etcd]# journalctl -u etcd -f
-- Logs begin at Sat 2023-10-07 20:57:40 CST. --
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 2706
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 2706 (took 1.3901ms)
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 3302
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 3302 (took 2.115ms)
Oct 08 19:54:41 master03 etcd[12512]: published {Name:master03 ClientURLs:[https://10.0.0.107:2379]} to cluster 514c88d14c1a2aa1
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Oct 08 19:54:41 master03 systemd[1]: Started Etcd Service.
Oct 08 19:54:41 master03 etcd[12512]: serving client requests on 10.0.0.107:2379
Oct 08 19:54:42 master03 etcd[12512]: updated the cluster version from 3.0 to 3.4
Oct 08 19:54:42 master03 etcd[12512]: enabled capabilities for version 3.4
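Wiping the data directory and restarting only works because the member rejoins an already-running cluster. If a wiped member refuses to sync, it is worth checking that the config file (referenced above as /etc/etcd/etcd.config.yml) marks the cluster as existing rather than new. A fragment of the relevant keys, with values assumed for this cluster's addresses:

```yaml
# /etc/etcd/etcd.config.yml (fragment; names, URLs, and values are assumptions)
name: 'master03'
data-dir: /var/lib/etcd
initial-cluster: 'master01=https://10.0.0.87:2380,master02=https://10.0.0.97:2380,master03=https://10.0.0.107:2380'
initial-cluster-state: 'existing'   # 'new' would try to bootstrap a fresh cluster
```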
[root@master03 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:54:41 CST; 2min 41s ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 12512 (etcd)
    Tasks: 10
   Memory: 58.7M
   CGroup: /system.slice/etcd.service
           └─12512 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
3. Verifying the data
3.1 Verify etcd cluster
[root@master01 init]# export ETCDCTL_API=3
[root@master01 init]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" \
    --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
    endpoint status --write-out=table
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.0.0.87:2379  | 949b9ccaa465bea8 |  3.4.13 |  3.9 MB |      true |      false |        14 |       5239 |               5239 |        |
| 10.0.0.97:2379  | 795272eff6c8418e |  3.4.13 |  3.8 MB |     false |      false |        14 |       5239 |               5239 |        |
| 10.0.0.107:2379 | 41172b80a9c89e7f |  3.4.13 |  3.9 MB |     false |      false |        14 |       5239 |               5239 |        |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
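The long TLS flags make these etcdctl commands tedious to retype. A small wrapper function helps; the endpoints and certificate paths below are the ones used above, while the ETCD_DRYRUN hook is a hypothetical convenience that lets this sketch print the assembled command instead of requiring a live cluster:

```shell
# Wrapper around etcdctl with this cluster's endpoints and TLS material.
# Set ETCD_DRYRUN=echo to preview the command instead of running it.
etcd_ctl() {
  ${ETCD_DRYRUN:-} etcdctl \
    --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" \
    --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem \
    --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
    "$@"
}

# Preview a health check without contacting the cluster:
ETCD_DRYRUN=echo etcd_ctl endpoint health
```

On the cluster itself you would drop ETC D_DRYRUN and call `etcd_ctl endpoint health` or `etcd_ctl endpoint status --write-out=table` directly.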
3.2 Verify k8s cluster
[root@master01 ~]# kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
master01   Ready    <none>   35m   v1.20.0
master02   Ready    <none>   35m   v1.20.0
master03   Ready    <none>   35m   v1.20.0
node02     Ready    <none>   35m   v1.20.0
[root@master01 dashboard]# kubectl get pod -A
NAMESPACE              NAME                                         READY   STATUS    RESTARTS   AGE
kube-system            calico-kube-controllers-5f6d4b864b-jf7sl     1/1     Running   1          50m
kube-system            calico-node-5vkdg                            1/1     Running   2          50m
kube-system            calico-node-k4jtq                            1/1     Running   1          50m
kube-system            calico-node-l27hd                            1/1     Running   1          50m
kube-system            calico-node-vt9jf                            1/1     Running   2          50m
kube-system            calico-node-w7x9b                            1/1     Running   1          50m
kube-system            coredns-867d46bfc6-6rx4r                     1/1     Running   2          38m
kube-system            metrics-server-595f65d8d5-wvp8s              1/1     Running   1          36m
kubernetes-dashboard   dashboard-metrics-scraper-79c5968bdc-ghmzr   1/1     Running   2          31m
kubernetes-dashboard   kubernetes-dashboard-9f9799597-vqzmq         1/1     Running   1          24m