Kubernetes Affinity, Anti-Affinity, Taints, Tolerations, and Maintenance Eviction

Affinity

Official website:
https://kubernetes.io/zh/docs/concepts/scheduling-eviction/assign-pod-node/

(1) Node affinity

pod.spec.nodeAffinity
●preferredDuringSchedulingIgnoredDuringExecution: soft policy
●requiredDuringSchedulingIgnoredDuringExecution: hard policy

(2) Pod affinity

pod.spec.affinity.podAffinity/podAntiAffinity
●preferredDuringSchedulingIgnoredDuringExecution: soft policy
●requiredDuringSchedulingIgnoredDuringExecution: hard policy

Simple understanding:

Suppose you are a traveler planning a trip to an unfamiliar city. Think of yourself as a Pod and the city's attractions as nodes. Node affinity means you prefer to go to the nodes where the attractions you are interested in are located.
If you must visit a specific attraction, such as a museum, that is a hard policy; if you would prefer to visit the museum but are happy to go elsewhere when it is unavailable, that is a soft policy.
Now suppose your good friend Xiao Ming is also planning to travel and you would like to travel with him. That is Pod affinity. If you must travel with Xiao Ming, it is a hard policy; if you would like to travel with him but can travel alone otherwise, it is a soft policy.

//Key-value operator relationships

●In: the value of the label is in the given list
●NotIn: the value of the label is not in the given list
●Gt: the value of the label is greater than a given value
●Lt: the value of the label is less than a given value
●Exists: the label exists
●DoesNotExist: the label does not exist
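
A minimal sketch (not part of the original walkthrough) of how these operators appear inside a nodeAffinity matchExpressions block; the label keys disktype, gpu-count, and maintenance are assumed purely for illustration:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype          #In/NotIn compare the label value against the values list
          operator: In
          values:
          - ssd
        - key: gpu-count         #Gt/Lt treat the single value as an integer
          operator: Gt
          values:
          - "2"
        - key: maintenance       #Exists/DoesNotExist take no values field
          operator: DoesNotExist
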
#View the information of all nodes in the cluster and display the node labels.
kubectl get nodes --show-labels

//requiredDuringSchedulingIgnoredDuringExecution: hard policy

mkdir /opt/affinity
cd /opt/affinity

vim pod1.yaml

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname #Specify the node label to match
            operator: NotIn #The Pod must not run on any node whose kubernetes.io/hostname label value is in the values list
            values:
            - node02
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1

kubectl apply -f pod1.yaml


kubectl get pods -o wide

#Delete all Pods first, then create a new Pod and obtain detailed information of all Pods
kubectl delete pod --all && kubectl apply -f pod1.yaml && kubectl get pods -o wide
#If the hard policy cannot be satisfied, the Pod stays in the Pending state.


//preferredDuringSchedulingIgnoredDuringExecution: soft policy

vim pod2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity-1
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1 #If there are multiple soft strategy options, the greater the weight, the higher the priority.
        preference:
          matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - node03 #If node03 is available, choose it. If not, it’s ok.


kubectl apply -f pod2.yaml

kubectl get pods -o wide


//Change the value of values: to node01, and the Pod will be created on node01 first.

kubectl delete pod --all && kubectl apply -f pod2.yaml && kubectl get pods -o wide

//If a hard policy and a soft policy are used together, the hard policy must be satisfied first; only then is the soft policy considered.
//Example:

vim pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: #First meet the hard policy and exclude nodes with the kubernetes.io/hostname=node02 label
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:
            - node02
      preferredDuringSchedulingIgnoredDuringExecution: #Then apply the soft policy: prefer nodes with the xiaoma=a label
        - weight: 1
          preference:
            matchExpressions:
            - key: xiaoma
              operator: In
              values:
              - a




//Pod affinity and anti-affinity

Scheduling policy    Match target    Operators                                  Topology domain support    Scheduling target
nodeAffinity         node            In, NotIn, Exists, DoesNotExist, Gt, Lt    No                         The specified node
podAffinity          Pod             In, NotIn, Exists, DoesNotExist            Yes                        Same topology domain as the specified Pod
podAntiAffinity      Pod             In, NotIn, Exists, DoesNotExist            Yes                        Not in the same topology domain as the specified Pod


kubectl label nodes node01 kgc=a
kubectl label nodes node02 kgc=a

//Create a Pod with the label app=myapp01

vim pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp01
  labels:
    app: myapp01
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1

kubectl apply -f pod3.yaml
kubectl get pods --show-labels -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
myapp01 1/1 Running 0 37s 10.244.2.3 node01 <none> <none> app=myapp01

//Use Pod affinity scheduling to create multiple Pod resources

vim pod4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp02
  labels:
    app: myapp02
spec:
  containers:
  - name: myapp02
    image: soscscs/myapp:v1
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp01
        topologyKey: kgc
#The Pod can be scheduled onto a node only if that node is in the same topology domain as at least one already-running Pod that has a label with key "app" and value "myapp01". (More precisely, the Pod is eligible to run on node N if node N has a label with key kgc and some value V, and at least one node in the cluster that also carries kgc=V is running a Pod labeled app=myapp01.)
#topologyKey is the key of a node label. If two nodes carry this key with the same value, the scheduler treats them as being in the same topology domain and tries to place a balanced number of Pods in each topology domain.
#If the values of kgc differ, the nodes belong to different topology domains. For example, if Pod1 runs on a node with kgc=a, Pod2 on a node with kgc=b, and Pod3 on a node with kgc=a, then Pod2 is not in the same topology domain as Pod1 and Pod3, but Pod1 and Pod3 are in the same topology domain.
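
To check which topology domain each node currently belongs to, the kgc label can be shown as an extra column (a convenience command added here, not part of the original steps):

kubectl get nodes -L kgc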

kubectl apply -f pod4.yaml
kubectl get pods --show-labels -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
myapp01 1/1 Running 0 15m 10.244.1.3 node01 <none> <none> app=myapp01
myapp02 1/1 Running 0 8s 10.244.1.4 node01 <none> <none> app=myapp02
myapp03 1/1 Running 0 52s 10.244.2.53 node02 <none> <none> app=myapp03
myapp04 1/1 Running 0 44s 10.244.1.51 node01 <none> <none> app=myapp03
myapp05 1/1 Running 0 38s 10.244.2.54 node02 <none> <none> app=myapp03
myapp06 1/1 Running 0 30s 10.244.1.52 node01 <none> <none> app=myapp03
myapp07 1/1 Running 0 24s 10.244.2.55 node02 <none> <none> app=myapp03

//Use Pod anti-affinity scheduling
Example 1:

vim pod5.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp10
  labels:
    app: myapp10
spec:
  containers:
  - name: myapp10
    image: soscscs/myapp:v1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myapp01
          topologyKey: kubernetes.io/hostname

#If a node is in the same topology domain as a Pod that has a label with key "app" and value "myapp01", the new Pod should not be scheduled to that node. (With topologyKey set to kubernetes.io/hostname, the topology domain is the individual node, so the Pod avoids any node already running a Pod labeled app=myapp01.)

kubectl apply -f pod5.yaml

kubectl get pods --show-labels -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
myapp01 1/1 Running 0 44m 10.244.1.3 node01 <none> <none> app=myapp01
myapp02 1/1 Running 0 29m 10.244.1.4 node01 <none> <none> app=myapp02
myapp10 1/1 Running 0 75s 10.244.2.4 node02 <none> <none> app=myapp03

Example 2:

vim pod6.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp20
  labels:
    app: myapp20
spec:
  containers:
  - name: myapp20
    image: soscscs/myapp:v1
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - myapp01
        topologyKey: kgc
//The node01 node where the specified Pod runs has the label kgc=a, and node02 also has kgc=a, so node01 and node02 are in the same topology domain. Anti-affinity requires the new Pod not to be in the same topology domain as the specified Pod, so there is no available node and the new Pod stays in the Pending state.
kubectl get pod --show-labels -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
myapp01 1/1 Running 0 43s 10.244.1.68 node01 <none> <none> app=myapp01
myapp20 0/1 Pending 0 4s <none> <none> <none> <none> app=myapp03

kubectl label nodes node02 kgc=b --overwrite
kubectl get pod --show-labels -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
myapp01 1/1 Running 0 7m40s 10.244.1.68 node01 <none> <none> app=myapp01
myapp21 1/1 Running 0 7m1s 10.244.2.65 node02 <none> <none> app=myapp03

//Taints and Tolerations
//Taints
Node affinity is a property of Pods (a preference or a hard requirement) that attracts them to a particular class of nodes. Taints, by contrast, allow a node to repel a particular class of Pods.
Taints and tolerations work together to keep Pods away from inappropriate nodes. One or more taints can be applied to a node, meaning that the node will not accept any Pod that does not tolerate its taints. Applying tolerations to a Pod means the Pod can (but is not required to) be scheduled onto nodes that carry matching taints.

You can use the kubectl taint command to set a taint on a Node. Once tainted, the Node has an exclusion relationship with Pods: it can refuse to schedule or run Pods, and can even evict Pods that are already running on it.

A taint has the following format:
key=value:effect

Each taint has a key and a value (the value may be empty), which act as the taint's label, and an effect that describes what the taint does.

Currently, the taint effect supports the following three options:
●NoSchedule: Indicates that k8s will not schedule the Pod to the Node with this taint
●PreferNoSchedule: Indicates that k8s will try to avoid scheduling Pods on Nodes with this taint
●NoExecute: Indicates that k8s will not schedule the Pod to a Node with this taint, and will evict Pods already running on the Node.
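
A short sketch of applying each effect; the node name node01 and the keys key1/key2/key3 are placeholders chosen for illustration, and the value may be left empty:

kubectl taint node node01 key1=value1:NoSchedule    #new Pods without a matching toleration are not scheduled here
kubectl taint node node01 key2=:PreferNoSchedule    #empty value; the scheduler only tries to avoid this node
kubectl taint node node01 key3=value3:NoExecute     #additionally evicts running Pods without a matching toleration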

kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 11d v1.20.11
node01 Ready <none> 11d v1.20.11
node02 Ready <none> 11d v1.20.11

//Because the master node carries the NoSchedule taint, k8s will not schedule Pods onto it.
kubectl describe node master

Taints: node-role.kubernetes.io/master:NoSchedule

#Set a taint

kubectl taint node node01 key1=value1:NoSchedule

#In the node description, look for the Taints field

kubectl describe node <NODE_NAME>

#Remove a taint

kubectl taint node node01 key1:NoSchedule-
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp01 1/1 Running 0 4h28m 10.244.2.3 node02 <none> <none>
myapp02 1/1 Running 0 4h13m 10.244.2.4 node02 <none> <none>
myapp03 1/1 Running 0 3h45m 10.244.1.4 node01 <none> <none>
kubectl taint node node02 check=mycheck:NoExecute
//Check the Pod status; you will find that all Pods on node02 have been evicted (Note: for a Deployment or StatefulSet, new Pods are created on other Nodes to maintain the replica count)
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp03 1/1 Running 0 3h48m 10.244.1.4 node01 <none> <none>

//Tolerations
A tainted Node establishes a mutually exclusive relationship with Pods according to the taint effect (NoSchedule, PreferNoSchedule, NoExecute), so Pods are kept off the Node to varying degrees. However, we can set tolerations on a Pod: a Pod with a matching toleration can tolerate the taint and be scheduled onto the tainted Node.

kubectl taint node node01 check=mycheck:NoExecute
vim pod3.yaml
 
\t
kubectl apply -f pod3.yaml

//After setting the taint on both Nodes, the Pod will not be created successfully at this time
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp01 0/1 Pending 0 17s <none> <none> <none> <none>

vim pod3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp01
  labels:
    app: myapp01
spec:
  containers:
  - name: with-node-affinity
    image: soscscs/myapp:v1
  tolerations:
  - key: "check"
    operator: "Equal"
    value: "mycheck"
    effect: "NoExecute"
    tolerationSeconds: 3600

#The key, value, and effect must match the taint set on the Node.
#If operator is Exists, the value is ignored; only the existence of the key (with a matching effect) matters.
#tolerationSeconds describes how long the Pod may keep running on the Node once eviction is required.

kubectl apply -f pod3.yaml

//After setting the tolerance, the Pod is created successfully

kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
myapp01 1/1 Running 0 10m 10.244.1.5 node01 <none> <none>

//Other notes
(1) When no key is specified, all taint keys are tolerated

 tolerations:
  - operator: "Exists"

(2) When the effect value is not specified, it means that all taint effects are tolerated

 tolerations:
  - key: "key"
    operator: "Exists"

(3) When there are multiple Masters, to prevent resource waste, you can set it as follows

kubectl taint node Master-Name node-role.kubernetes.io/master=:PreferNoSchedule

//If a Node needs to update or upgrade system components, you can first set a NoExecute taint on it to evict all of its Pods and avoid a long business interruption.

kubectl taint node node01 check=mycheck:NoExecute

//If the other Nodes do not have enough resources at this time, you can temporarily set a PreferNoSchedule taint on the Master so that Pods can temporarily be created on the Master.

kubectl taint node master node-role.kubernetes.io/master=:PreferNoSchedule

//Once all Node updates are complete, remove the taints

kubectl taint node node01 check=mycheck:NoExecute-

Maintenance operations
//cordon and drain
##Perform maintenance operations on nodes:

kubectl get nodes

//Mark Node as unschedulable so that newly created Pods will not be allowed to run on this Node

kubectl cordon <NODE_NAME> #The node will become SchedulingDisabled state

//kubectl drain causes the Node to evict all of its Pods and to stop accepting new Pods. The word drain literally means to empty out: the Pods on the Node under maintenance are moved to run on other Nodes.

kubectl drain <NODE_NAME> --ignore-daemonsets --delete-local-data --force

--ignore-daemonsets: Ignore Pods managed by a DaemonSet.
--delete-local-data: Evict Pods that use local (emptyDir) volumes even though their local data will be deleted.
--force: Forcibly evict Pods that are not managed by a controller.

Note: Executing the drain command automatically does two things:
(1) Marks the node unschedulable (cordon)
(2) Evicts the Pods on the node (evict)
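
A minimal end-to-end maintenance sketch, assuming node01 is the node being serviced (the node name and the placeholder middle step are illustrative):

kubectl cordon node01
kubectl drain node01 --ignore-daemonsets --delete-local-data --force
#...perform the system update/upgrade on node01...
kubectl uncordon node01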

//kubectl uncordon marks Node as schedulable

kubectl uncordon <NODE_NAME>

//Pod startup phases (phase)
After a Pod is created and before it settles into steady running, it goes through many steps, and errors are possible at each one, so the Pod can be in many different states.
Generally, the Pod lifecycle includes the following steps:
(1) Scheduling to a node: Kubernetes selects a node according to its priority algorithm and uses it to run the Pod.
(2) Pulling the image
(3) Mounting storage, configuration, etc.
(4) Running the containers. If a health check is configured, the Pod's status is set according to the check results.

The possible states of phase are:

●Pending: The API server has created the Pod resource object and stored it in etcd, but the Pod has not yet been scheduled onto a node, or it is still downloading its image from the registry.

●Running: The Pod has been scheduled to a certain node, and all containers in the Pod have been created by kubelet. At least one container is running, or is being started or restarted (that is, Pods in the Running state may not be accessible normally).

●Succeeded: Some Pods are not long-running, for example those created by Jobs and CronJobs. After some time, all containers in the Pod terminate successfully and will not be restarted. This phase reports the result of the task's execution.

●Failed: All containers in the Pod have terminated, and at least one container terminated in failure, i.e. exited with a non-zero status or was killed by the system. A typical cause is an incorrectly written container command.

●Unknown: The state of the Pod cannot be obtained, typically because the API server cannot communicate with the kubelet on the node where the Pod should be running.
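
To read a Pod's phase directly (a convenience command; <POD_NAME> is a placeholder, as elsewhere in this section):

kubectl get pod <POD_NAME> -o jsonpath='{.status.phase}'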

##Troubleshooting steps:
//View Pod events

kubectl describe TYPE NAME_PREFIX

//View Pod log (in Failed state)

kubectl logs <POD_NAME> [-c Container_NAME]

//Enter the Pod (when the status is Running but the service is not responding)

kubectl exec -it <POD_NAME> bash

//View cluster information

kubectl get nodes

//Check the cluster information (here the cluster status is found to be normal)

kubectl cluster-info

//Check the kubelet logs for clues

journalctl -xefu kubelet