k8s etcd startup error

Background

An unexpected power outage in the computer’s hibernation state causes the virtual machine to directly enter a shutdown state.

Question

kubectl command error

[root@master01 ~]#kubectl get node
The connection to the server master01.kktb.org:6443 was refused - did you specify the right host or port?

kubelet service error

Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.251784 747 kubelet.go:2424] "Error getting node" err="node "master01.kktb.org\ "not found"
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.329952 747 controller.go:144] failed to ensure lease exists, will retry in 7s, error: Get "https://master01 .kktb.org:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/master01.kktb.org?timeout=10s": dial tcp 10.0.6.5:6443: connect: connection refused
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.352614 747 kubelet.go:2424] "Error getting node" err="node "master01.kktb.org" not found "
Oct 15 08:39:37 master01.kktb.org kubelet[747]: I1015 08:39:37.384900 747 kubelet_node_status.go:70] "Attempting to register node" node="master01.kktb.org"
Oct 15 08:39:37 master01.kktb.org kubelet[747]: E1015 08:39:37.385258 747 kubelet_node_status.go:92] "Unable to register node with API server" err="Post "https://master01 .kktb.org:6443/api/v1/nodes": dial tcp 10.0.6.5:6443: connect: connection refused" node="master01.kktb.org"

There is no program monitoring port 6443. It may be that etcd is faulty.

Containers on the master node

[root@master01 kubelet.service.d]#docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9924571324d2 e3ed7dee73e9 "kube-scheduler --au…" 18 minutes ago Up 18 minutes k8s_kube-scheduler_kube-scheduler-master01.kktb.org_kube-system_bca604f760be36f9a20d9b66b0bf821d_7
9d6171792af2 88784fb4ac2f "kube-controller-man…" 18 minutes ago Up 18 minutes k8s_kube-controller-manager_kube-controller-manager-master01.kktb.org_kube-system_13461bcfe1de9f0e85e957cde36c42d2_10
eb3fdcf0b392 registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_etcd-master01.kktb.org_kube-system_67c8baa117269300c9b5e18d5b0a0e44_5
302e36ffb616 registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-scheduler-master01.kktb.org_kube-system_bca604f760be36f9a20d9b66b0bf821d_5
a8f8f02011c7 registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-controller-manager-master01.kktb.org_kube-system_13461bcfe1de9f0e85e957cde36c42d2_5
3fde7a0d5aae registry.k8s.io/pause:3.6 "/pause" 18 minutes ago Up 18 minutes k8s_POD_kube-apiserver-master01.kktb.org_kube-system_5846def9d7d3ade7311eb0f023db33ff_5

Mirrors exist

[root@master01 kubelet.service.d]#docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
istio/proxyv2 1.19.1 b3547b3ef18b 2 weeks ago 251MB
ubuntu jammy 3565a89d9e81 2 weeks ago 77.8MB
ubuntu latest 3565a89d9e81 2 weeks ago 77.8MB
flannel/flannel v0.22.3 e23f7ca36333 3 weeks ago 70.2MB
flannel/flannel-cni-plugin v1.2.0 a55d1bad692b 2 months ago 8.04MB
hello-world latest 9c7a54a9a43c 5 months ago 13.3kB
k8s.gcr.io/kube-apiserver v1.24.0 529072250ccc 17 months ago 130MB
k8s.gcr.io/kube-proxy v1.24.0 77b49675beae 17 months ago 110MB
k8s.gcr.io/kube-scheduler v1.24.0 e3ed7dee73e9 17 months ago 51MB
k8s.gcr.io/kube-controller-manager v1.24.0 88784fb4ac2f 17 months ago 119MB
k8s.gcr.io/etcd 3.5.3-0 aebe758cef4c 18 months ago 299MB
k8s.gcr.io/pause 3.7 221177c6082a 19 months ago 711kB
ikubernetes/proxy v0.1.1 a4fcedf206e8 23 months ago 62.4MB
k8s.gcr.io/coredns/coredns v1.8.6 a4ca41631cc7 2 years ago 46.8MB
registry.k8s.io/pause 3.6 6270bb605e12 2 years ago 683kB
ikubernetes/demoapp v1.0 3342b7518915 3 years ago 92.7MB

ps -a to see exited containers

[root@master01 kubelet.service.d]#docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
07195c9438d3 aebe758cef4c "etcd --advertise-cl…" 34 seconds ago Exited (2) 33 seconds ago k8s_etcd_etcd-master01.kktb.org_kube-system_67c8baa117269300c9b5e18d5b0a0e44_30
532fabab4479 529072250ccc "kube-apiserver --ad…" 3 minutes ago Exited (1) 3 minutes ago k8s_kube-apiserver_kube-apiserver-master01.kktb.org_kube-system_5846def9d7d3ade7311eb0f023db33ff_27

etcd container error

[root@master01 kubelet.service.d]#docker logs 07195c9438d3
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.713Z","caller":"etcdmain/etcd.go:73","msg" :"Running: ","args":["etcd","--advertise-client-urls=https://10.0.6.5:2379","--cert-file=/etc/kubernetes/pki/etcd /server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-initial-corrupt-check=true","-- initial-advertise-peer-urls=https://10.0.6.5:2380","--initial-cluster=master01.kktb.org=https://10.0.6.5:2380","--key-file= /etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://10.0.6.5:2379","--listen-metrics -urls=http://127.0.0.1:2381","--listen-peer-urls=https://10.0.6.5:2380","--name=master01.kktb.org","--peer -cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd /peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file= /etc/kubernetes/pki/etcd/ca.crt"]}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"etcdmain/etcd.go:116","msg" :"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:131","msg" :"configuring peer listeners","listen-peer-urls":["https://10.0.6.5:2380"]}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:479","msg" :"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:139","msg" :"configuring client listeners","listen-client-urls":["https://10.0.6.5:2379","https://127.0.0.1:2379"]}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.714Z","caller":"embed/etcd.go:308","msg" :"starting an etcd server","etcd-version":"3.5.3","git-sha":"0452feec7","go-version":"go1.16.15","go-os":"linux ","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"master01.kktb.org ","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd /member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count ":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://10.0.6.5:2380"],"listen-peer-urls":["https: //10.0.6.5:2380"],"advertise-client-urls":["https://10.0.6.5:2379"],"listen-client-urls":["https://10.0.6.5: 2379","https://127.0.0.1:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host -whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes": 2147483648,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction -retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:34.716Z","caller":"etcdserver/backend.go:81","msg" :"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"655.189μs"}
{<!-- -->"level":"info","ts":"2023-10-15T09:31:35.189Z","caller":"etcdserver/server.go:508","msg" :"recovered v2 store from snapshot","snapshot-index":450049,"snapshot-size":"11 kB"}
{<!-- -->"level":"warn","ts":"2023-10-15T09:31:35.190Z","caller":"snap/db.go:88","msg" :"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":450049,"snapshot-file-path":"/var/lib/etcd/member/snap/000000000006de01.snap.db" ,"error":"snap: snapshot file doesn't exist"}
{<!-- -->"level":"panic","ts":"2023-10-15T09:31:35.190Z","caller":"etcdserver/server.go:515","msg" :"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3 /etcdserver.NewServer\
\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\
go.etcd.io/etcd/server/v3/embed.StartEtcd \
\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\
go.etcd.io/etcd/server/v3/etcdmain.startEtcd\
\t /go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\
\t/go/src /go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\
go.etcd.io/etcd/server/v3/etcdmain.Main\
\t/go/src/go.etcd .io/etcd/release/etcd/server/etcdmain/main.go:40\
main.main\
\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32 \
runtime.main\
\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

It is due to failure to restore snapshot data. Check whether the other two nodes also report the same error.

Fortunately, etcd on node 2 starts normally.

[root@master02 snap]#docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
daa59123cbbf aebe758cef4c "etcd --advertise-cl…" 50 seconds ago Up 50 seconds k8s_etcd_etcd-master02.kktb.org_kube-system_3564c10547c1c588ab6c79bcad0e90d0_19

some keyword logs

[root@master02 snap]#docker logs -f daa59123cbbf

{<!-- -->"level":"info","ts":"2023-10-15T09:39:06.599Z","caller":"etcdserver/backend.go:81","msg" :"opened backend db","path":"/var/lib/etcd/member/snap/db","took":"2.526769ms"}
{<!-- -->"level":"warn","ts":"2023-10-15T09:39:06.599Z","caller":"wal/util.go:90","msg" :"ignored file in WAL directory","path":"0000000000000004-00000000000685b1.wal.broken"}
{<!-- -->"level":"info","ts":"2023-10-15T09:39:07.119Z","caller":"etcdserver/server.go:508","msg" :"recovered v2 store from snapshot","snapshot-index":450048,"snapshot-size":"11 kB"}
{<!-- -->"level":"info","ts":"2023-10-15T09:39:07.119Z","caller":"etcdserver/server.go:521","msg" :"recovered v3 backend from snapshot","backend-size-bytes":6635520,"backend-size":"6.6 MB","backend-size-in-use-bytes":2555904,"backend-size-in -use":"2.6 MB"}
{<!-- -->"level":"warn","ts":"2023-10-15T09:39:07.119Z","caller":"wal/util.go:90","msg" :"ignored file in WAL directory","path":"0000000000000004-00000000000685b1.wal.broken"}
{<!-- -->"level":"info","ts":"2023-10-15T09:39:07.161Z","caller":"etcdserver/raft.go:483","msg" :"restarting local member","cluster-id":"e3d7c840b56e7ccb","local-member-id":"77c05bfe945af1a1","commit-index":458027}
{<!-- -->"level":"info","ts":"2023-10-15T09:39:07.161Z","logger":"raft","caller":"etcdserver/zap_raft. go:77","msg":"77c05bfe945af1a1 switched to configuration voters=(6311186738797484145 8628998035010679201 11257876798449243399)"}
{<!-- -->"level":"info","ts":"2023-10-15T09:39:07.161Z","logger":"raft","caller":"etcdserver/zap_raft. go:77","msg":"77c05bfe945af1a1 became follower at term 16"}



{<!-- -->"level":"warn","ts":"2023-10-15T09:40:06.425Z","caller":"etcdhttp/metrics.go:86","msg" :"/health error","output":"{<!-- -->"health":"false","reason":"RAFT NO LEADER"}","status -code":503}
{<!-- -->"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg" :"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"5795d6e69d2e1871","rtt":"0s","error":"dial tcp 10.0.6.5 :2380: connect: connection refused"}
{<!-- -->"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg" :"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"9c3c034d28add507","rtt":"0s","error":"dial tcp 10.0.6.7 :2380: connect: connection refused"}
{<!-- -->"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg" :"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"9c3c034d28add507","rtt":"0s","error":"dial tcp 10.0.6.7 :2380: connect: connection refused"}
{<!-- -->"level":"warn","ts":"2023-10-15T09:40:07.179Z","caller":"rafthttp/probing_status.go:68","msg" :"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"5795d6e69d2e1871","rtt":"0s","error":"dial tcp 10.0.6.5 :2380: connect: connection refused"}

Solution

The way to solve the problem is to directly remove the etcd data directory folder of node 1 and try to copy the data on etcd2.

# First move the data file of node 1 to opt.
[root@master01 ~]#mv /var/lib/etcd/member /opt/
# Copy data from node 2 to node 1
[root@master02 snap]#scp -r /var/lib/etcd/member 10.0.6.5:/var/lib/etcd/

etcd, api-server service recovery
Cluster status

sh-5.1# etcdctl --endpoints=https://10.0.6.5:2379,https://10.0.6.6:2379,https://10.0.6.7:2379 --cacert=/etc/kubernetes/ pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list --write-out=table
 + ------------------ + --------- + ------------------ + ----------------------- + ----------------------- + -- ---------- +
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
 + ------------------ + --------- + ------------------ + ----------------------- + ----------------------- + -- ---------- +
| 5795d6e69d2e1871 | started | master01.kktb.org | https://10.0.6.5:2380 | https://10.0.6.5:2379 | false |
| 77c05bfe945af1a1 | started | master02.kktb.org | https://10.0.6.6:2380 | https://10.0.6.6:2379 | false |
| 9c3c034d28add507 | started | master03.kktb.org | https://10.0.6.7:2380 | https://10.0.6.7:2379 | false |
 + ------------------ + --------- + ------------------ + ----------------------- + ----------------------- + -- ---------- +

The kubectl command is used normally

[root@master01 ~]#kubectl get node
NAME STATUS ROLES AGE VERSION
master01.kktb.org Ready control-plane 22d v1.24.3
master02.kktb.org Ready control-plane 22d v1.24.3
master03.kktb.org Ready control-plane 22d v1.24.3