Affinity and anti-affinity in Kubernetes scheduling

nodeSelector provides a very simple way to constrain Pods to nodes that carry specific labels. Affinity and anti-affinity greatly expand the kinds of constraints you can express. The main enhancements are:

  1. The language is more expressive (not just an “and” of exact-match requirements)
  2. A rule can be marked as “soft” / “preference” instead of a hard requirement: if the scheduler cannot satisfy the rule, the Pod is still scheduled
  3. You can match against the labels of other Pods running on the node (or in another topology domain), not just the node’s own labels. This lets you define rules such as: two kinds of Pods must not run on the same node (or in the same topology domain)
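
For comparison, a minimal nodeSelector-based Pod might look like the sketch below (the disktype=ssd label is only an illustration):

apiVersion: v1
kind: Pod
metadata:
  name: with-node-selector
spec:
  # Schedule only onto nodes that carry the label disktype=ssd
  nodeSelector:
    disktype: ssd
  containers:
  - name: with-node-selector
    image: k8s.gcr.io/pause:2.0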

Node affinity

The concept of node affinity is similar to nodeSelector: it constrains which nodes a Pod can be scheduled onto based on the node’s labels.

Currently two types of node affinity are supported: requiredDuringSchedulingIgnoredDuringExecution (hard; the target node must meet the condition) and preferredDuringSchedulingIgnoredDuringExecution (soft; the target node should preferably meet the condition). The IgnoredDuringExecution in the names means: if the node’s labels change after the Pod has been scheduled onto it, so that the node no longer matches the affinity rule, the Pod keeps running on that node (the same behaviour as nodeSelector). In the future, Kubernetes will offer requiredDuringSchedulingRequiredDuringExecution, which is like requiredDuringSchedulingIgnoredDuringExecution except that when the node’s labels no longer match the affinity rule, the Pod will be evicted from the node.

An example of requiredDuringSchedulingIgnoredDuringExecution: only run the Pod on nodes with Intel CPUs. An example of preferredDuringSchedulingIgnoredDuringExecution: try to run this Pod in availability zone XYZ, but if that is not possible, run it elsewhere.

Node affinity is defined through the affinity.nodeAffinity field in the PodSpec:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0

The affinity rules here state that the Pod can only be scheduled onto a node whose labels include the key kubernetes.io/e2e-az-name with the value e2e-az1 or e2e-az2. In addition, among the nodes that satisfy this condition, nodes carrying a label with key another-node-label-key and value another-node-label-value are preferred.

The example uses the operator In. Node affinity supports the following operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. Use NotIn and DoesNotExist to achieve node anti-affinity, or use taints to repel certain kinds of Pods from a node.
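
As a sketch of node anti-affinity via NotIn (the zone value e2e-az3 is only an illustration):

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      # Keep the Pod out of zone e2e-az3; NotIn acts as node anti-affinity
      - key: kubernetes.io/e2e-az-name
        operator: NotIn
        values:
        - e2e-az3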

If a Pod specifies both nodeSelector and nodeAffinity, the target node must meet both conditions before the Pod can be scheduled to the node.

If multiple nodeSelectorTerms are specified for nodeAffinity, the target node only needs to satisfy any one of the nodeSelectorTerms for the Pod to be scheduled onto it.

If multiple matchExpressions are specified within a single nodeSelectorTerm, the target node must satisfy all of the matchExpressions before the Pod can be scheduled onto it.
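
The sketch below illustrates both behaviours (the disktype and node-pool labels are only illustrations): the two nodeSelectorTerms are ORed, while the matchExpressions inside the first term are ANDed:

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    # A node qualifies if it matches EITHER of the two terms below
    - matchExpressions:
      # Within this term, BOTH expressions must match
      - key: failure-domain.beta.kubernetes.io/zone
        operator: In
        values:
        - e2e-az1
      - key: disktype
        operator: In
        values:
        - ssd
    - matchExpressions:
      - key: node-pool
        operator: In
        values:
        - batch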

After a Pod is scheduled to a node, if the node’s label is removed or modified, the Pod will continue to run on the node. In other words, node affinity rules only take effect when scheduling the Pod.

The weight field in preferredDuringSchedulingIgnoredDuringExecution takes a value from 1 to 100. For each node that satisfies the scheduling requirements (resource requests, required affinity/anti-affinity rules, and so on), the scheduler iterates over the preferredDuringSchedulingIgnoredDuringExecution terms that the node matches and sums their weights. This sum is then combined with the scores from the scheduler’s other priority functions, and the node with the highest total score is selected.
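
As a sketch with two weighted preferences (the disktype label is only an illustration): a node matching only the first term receives 80 points from this step, a node matching only the second receives 20, a node matching both receives 80 + 20 = 100, and that score is then combined with the scheduler’s other priority scores:

nodeAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  # A node with only disktype=ssd scores 80 here; a node only in e2e-az1 scores 20;
  # a node matching both scores 100
  - weight: 80
    preference:
      matchExpressions:
      - key: disktype
        operator: In
        values:
        - ssd
  - weight: 20
    preference:
      matchExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        operator: In
        values:
        - e2e-az1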

Pod affinity and anti-affinity

Inter-Pod affinity and anti-affinity constrain which nodes a Pod can be scheduled onto based on the labels of Pods already running on the node (rather than the node’s own labels). Such rules take the form:

  • When X is already running one or more Pods that satisfy rule Y, the Pod being scheduled should (or, for anti-affinity, should not) run on X

    • Rule Y is expressed as a LabelSelector, with an optional list of namespaces

      Unlike nodes, Pods are namespaced (and therefore so are their labels), so a LabelSelector over Pods must also specify which namespaces it applies to.

    • X is a topology domain, such as a node, a rack, a cloud provider availability zone, a cloud provider region, and so on. X is expressed as a topologyKey, which is the key of a node label that identifies the topology domain.

An example of Pod affinity and anti-affinity:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0

The Pod’s affinity section defines one Pod affinity rule and one Pod anti-affinity rule. In this example, podAffinity uses requiredDuringSchedulingIgnoredDuringExecution, while podAntiAffinity uses preferredDuringSchedulingIgnoredDuringExecution.

  • The Pod affinity rule requires that the Pod be scheduled only into an availability zone that already runs a Pod carrying the label security=S1. More precisely, the target node must meet the following conditions:

    • The node has a label with key failure-domain.beta.kubernetes.io/zone; assume the value of that label is V
    • At least one node whose failure-domain.beta.kubernetes.io/zone label also has the value V is already running a Pod carrying the label security=S1
  • The Pod anti-affinity rule prefers that the Pod not be scheduled onto a node in the same topology domain as a Pod carrying the label security=S2. More precisely:

    • Since topologyKey is failure-domain.beta.kubernetes.io/zone, the Pod should preferably not be scheduled onto any node in a zone that is already running a Pod carrying the label security=S2

In principle, topologyKey can be any legal label key. However, for performance and security reasons, topologyKey is subject to the following restrictions:

  1. For Pod affinity and for requiredDuringSchedulingIgnoredDuringExecution Pod anti-affinity, topologyKey cannot be empty
  2. For requiredDuringSchedulingIgnoredDuringExecution Pod anti-affinity, the admission controller LimitPodHardAntiAffinityTopology restricts topologyKey to kubernetes.io/hostname. If you want to use another custom topology domain, you must modify this admission controller or disable it
  3. For preferredDuringSchedulingIgnoredDuringExecution Pod anti-affinity, an empty topologyKey means “all topologies” (which here is limited to the combination of kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region)
  4. Apart from the cases above, topologyKey can be any legal label key

In addition to labelSelector and topologyKey, you can also specify a list of namespaces that scopes the labelSelector (the namespaces field is defined at the same level as labelSelector and topologyKey). If it is omitted or empty, it defaults to the namespace of the Pod on which the affinity/anti-affinity rule is defined.
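
For instance, a sketch of a Pod affinity term scoped to two namespaces (frontend and backend are illustrative namespace names):

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - store
    # labelSelector is evaluated only against Pods in these namespaces
    namespaces:
    - frontend
    - backend
    topologyKey: kubernetes.io/hostname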

All matchExpressions associated with the requiredDuringSchedulingIgnoredDuringExecution affinity and anti-affinity must be satisfied before the Pod can be scheduled to the target node.

More practical examples

Pod affinity and anti-affinity are particularly useful when combined with higher-level controllers such as ReplicaSet, StatefulSet, and Deployment. They make it easy to co-locate a group of workloads in the same topology domain, for example on the same node.

Always on the same node

In a three-node cluster, deploy a web application that uses redis, and try to keep the web-server Pods on the same nodes as the redis Pods as much as possible.

Below is the YAML snippet for the redis Deployment, with three replicas and the label selector app=store. podAntiAffinity is configured in the Deployment to ensure that the scheduler does not place the three replicas on a single node:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine

Configuration details:

This YAML file describes a Kubernetes Deployment object that deploys an application named “redis-cache”. A detailed walkthrough:

  1. apiVersion: apps/v1: This field specifies the Kubernetes API version in use; apps/v1 indicates that we are defining a Deployment object.
  2. kind: Deployment: This field specifies the type of Kubernetes resource to create. A Deployment is used to deploy and manage replicas of an application.
  3. metadata: This field contains metadata about the Deployment object, such as its name and labels.
    • name: redis-cache: This field specifies the name of the Deployment object, which will be used to identify and reference this Deployment.
  4. spec: This field contains the Deployment specification, which defines the rules for how to create and run Pods.
    • selector: This section defines which Pods the Deployment manages.
      • matchLabels: This label selector requires the managed Pods to carry the label “app: store”.
    • replicas: 3: This field specifies the number of Pod replicas to be created, here it is 3.
    • template: This section defines the template of the Pod to be created.
      • metadata: This specifies the labels of the Pod template, which must match the Deployment’s label selector.
        • labels: The label “app: store” is defined here.
      • spec: This section defines the specification of the Pod.
        • affinity: Pod scheduling affinity rules are defined here to control how Pods are distributed across nodes.
          • podAntiAffinity: This field defines the Pod anti-affinity rule, which requires that Pods carrying the same label “app: store” are not scheduled onto the same node.
            • requiredDuringSchedulingIgnoredDuringExecution: This field specifies the anti-affinity rules that must be satisfied during scheduling.
              • labelSelector: A label selector that matches Pods carrying the “app: store” label.
              • topologyKey: "kubernetes.io/hostname": This field specifies the key of the topology domain, meaning the anti-affinity rule is evaluated per node hostname.
        • containers: This defines the list of containers to be run in the Pod.
          • name: redis-server: This field specifies the name of the container, redis-server.
          • image: redis:3.2-alpine: This field specifies the container image to use, here the Alpine-based image of Redis 3.2.

The result is a Deployment named “redis-cache” with 3 replicas. Each replica runs a Redis container named “redis-server”, and the anti-affinity rule ensures that these Pods run on different nodes, improving availability and fault tolerance. The Deployment uses a label selector to choose which Pods it manages and defines Pod anti-affinity rules to control their distribution across nodes.

The following is the YAML snippet for the web-server Deployment, which configures both podAntiAffinity and podAffinity. podAffinity requires each replica to be placed on the same node as a Pod carrying the label app=store; podAntiAffinity requires that no two web-server replicas are scheduled onto the same node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              -key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine

If you create the above two deployments, the cluster will look like this:

Node-1         Node-2         Node-3
web-server-1   web-server-2   web-server-3
cache-1        cache-2        cache-3

All three replicas of web-server automatically run on the same nodes as the replicas of the cache.

kubectl get pods -o wide
 

The output is as follows:

NAME READY STATUS RESTARTS AGE IP NODE
redis-cache-1450370735-6dzlj 1/1 Running 0 8m 10.192.4.2 kube-node-3
redis-cache-1450370735-j2j96 1/1 Running 0 8m 10.192.2.2 kube-node-1
redis-cache-1450370735-z73mh 1/1 Running 0 8m 10.192.3.1 kube-node-2
web-server-1287567482-5d4dz 1/1 Running 0 7m 10.192.2.3 kube-node-1
web-server-1287567482-6f7v5 1/1 Running 0 7m 10.192.4.3 kube-node-3
web-server-1287567482-s330j 1/1 Running 0 7m 10.192.3.2 kube-node-2

Never on the same node

The example above uses a podAntiAffinity rule with topologyKey: "kubernetes.io/hostname" to deploy the redis cluster so that no two replicas are scheduled onto the same node. Refer to the ZooKeeper tutorial to see how anti-affinity is configured for a StatefulSet in the same way to achieve high availability.
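
As a rough sketch (the names and labels are illustrative and the container image is only a placeholder, not the exact tutorial manifest), the anti-affinity portion of such a StatefulSet might look like this:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-hs
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          # No two Pods carrying the label app=zk may share a node (hostname)
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - zk
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: zookeeper
        image: k8s.gcr.io/pause:2.0   # placeholder image for illustration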