`nodeSelector` provides a very simple way to constrain Pods to nodes carrying specific labels. Affinity and anti-affinity greatly expand the kinds of constraints you can express. The main enhancements are:
- The expression language is more powerful (not just an "and" of exact-match requirements)
- A rule can be marked as "soft" / "preference" rather than a hard requirement, so that if the scheduler cannot satisfy it, the Pod is still scheduled
- You can match against the labels of other Pods running on the node (or in another topology domain), not just the node's own labels. This lets you express rules such as "these two kinds of Pods must not share a node (or topology domain)"
Node affinity
The concept of node affinity is similar to `nodeSelector`: it constrains which nodes a Pod can be scheduled to based on node labels.
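For comparison, here is a minimal `nodeSelector` sketch (the label `disktype: ssd` is a hypothetical example, not from this document). The Pod can only be scheduled to nodes carrying exactly that label; there are no soft preferences and no richer operators:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-selector
spec:
  nodeSelector:        # exact-match labels only; every entry must match the node
    disktype: ssd      # hypothetical label for illustration
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
```

Node affinity expresses this same constraint and more, as shown below.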
Currently two types of node affinity are supported: `requiredDuringSchedulingIgnoredDuringExecution` (hard: the target node must satisfy the condition) and `preferredDuringSchedulingIgnoredDuringExecution` (soft: the target node should preferably satisfy the condition). The `IgnoredDuringExecution` part of the name means that if a node's labels change after the Pod has been scheduled onto it, so that the node no longer matches the affinity rule, the Pod continues to run on that node (this matches the behavior of `nodeSelector`). In the future, Kubernetes plans to provide the option `requiredDuringSchedulingRequiredDuringExecution`, which is like `requiredDuringSchedulingIgnoredDuringExecution` except that when the node's labels no longer match the rule, the Pod is evicted from the node.
An example of `requiredDuringSchedulingIgnoredDuringExecution` is "only run the Pod on nodes with Intel CPUs". An example of `preferredDuringSchedulingIgnoredDuringExecution` is "try to run this Pod in availability zone XYZ, but if that is not possible, run it elsewhere".
Node affinity is defined through the `affinity.nodeAffinity` field in the `PodSpec`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
```
The affinity rules here state that the Pod can only be scheduled to nodes whose label has key `kubernetes.io/e2e-az-name` and value `e2e-az1` or `e2e-az2`. In addition, among the nodes that satisfy that condition, nodes carrying a label with key `another-node-label-key` and value `another-node-label-value` are preferred.
The example uses the operator `In`. Node affinity supports the following operators: `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, `Lt`. You can use `NotIn` and `DoesNotExist` to achieve node anti-affinity, or you can use taints to repel certain classes of Pods from a node.
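As a sketch of the anti-affinity operators (the label keys `disktype` and `cpu-count` are hypothetical, chosen only for illustration), the Pod below avoids nodes labeled `disktype=hdd` and requires a numeric label greater than 4:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: avoid-hdd
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: NotIn    # anti-affinity effect: avoid nodes labeled disktype=hdd
            values:
            - hdd
          - key: cpu-count
            operator: Gt       # numeric comparison: only nodes with cpu-count > 4
            values:
            - "4"
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
```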
If a Pod specifies both `nodeSelector` and `nodeAffinity`, a node must satisfy both before the Pod can be scheduled to it.
If multiple `nodeSelectorTerms` are specified under `nodeAffinity`, a node only needs to satisfy any one of the `nodeSelectorTerms` for the Pod to be schedulable there.
If a `nodeSelectorTerms` entry contains multiple `matchExpressions`, a node must satisfy all of the `matchExpressions` before the Pod can be scheduled to it.
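To make the OR/AND semantics concrete, here is a sketch (the label keys `zone` and `disktype` are hypothetical): the Pod below can land on any node that either has `zone=z1`, or has both `zone=z2` and `disktype=ssd`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: or-and-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:        # the terms in this list are ORed together
        - matchExpressions:       # first alternative: zone=z1
          - key: zone
            operator: In
            values: ["z1"]
        - matchExpressions:       # second alternative: zone=z2 AND disktype=ssd
          - key: zone
            operator: In
            values: ["z2"]
          - key: disktype
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: k8s.gcr.io/pause:2.0
```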
After a Pod is scheduled to a node, if the node’s label is removed or modified, the Pod will continue to run on the node. In other words, node affinity rules only take effect when scheduling the Pod.
The `weight` field in `preferredDuringSchedulingIgnoredDuringExecution` takes values from 1 to 100. For each node that satisfies the scheduling requirements (resource requests, required affinity/anti-affinity rules, etc.), the scheduler iterates over the `preferredDuringSchedulingIgnoredDuringExecution` terms that the node matches and sums their `weight` values. That sum is then combined with the node's scores from the other priority functions, and the node with the highest total score is chosen first.
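As a sketch of the scoring, assume the fragment below sits under a Pod's `spec.affinity.nodeAffinity` (the label keys `disktype` and `zone` are hypothetical). A node matching both terms contributes 80 + 20 = 100 from this section; a node matching only `disktype=ssd` contributes 80:

```yaml
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80                 # strong preference: SSD-backed nodes
  preference:
    matchExpressions:
    - key: disktype
      operator: In
      values: ["ssd"]
- weight: 20                 # weak preference: zone z1
  preference:
    matchExpressions:
    - key: zone
      operator: In
      values: ["z1"]
```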
Pod affinity and anti-affinity
Inter-pod affinity and anti-affinity constrain which nodes a Pod can be scheduled to based on the labels of Pods already running on a node (rather than the node's own labels). Such rules take the form:

- The Pod should (or, for anti-affinity, should not) run in X when X is already running one or more Pods that satisfy rule Y.
- Rule Y is expressed as a LabelSelector with an optional list of namespaces. Unlike nodes, Pods are namespaced (and therefore so are their labels), so a label selector over Pods must also specify the corresponding namespaces.
- X is a topology domain, such as a node, a rack, a cloud provider availability zone, a cloud provider region, and so on. X is expressed via a `topologyKey`, the key of a node label that denotes the topology domain.
An example of Pod affinity:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0
```
The Pod's `affinity` section defines one Pod affinity rule and one Pod anti-affinity rule. In the example, `podAffinity` uses `requiredDuringSchedulingIgnoredDuringExecution`, while `podAntiAffinity` uses `preferredDuringSchedulingIgnoredDuringExecution`.
- The Pod affinity rule requires that the Pod be scheduled only into an availability zone (`zone`) that already has a running Pod with the label `security=S1`. More precisely, a candidate node must satisfy both of the following:
  - The node carries a label with key `failure-domain.beta.kubernetes.io/zone`; call its value V.
  - At least one node whose `failure-domain.beta.kubernetes.io/zone` label also has value V is already running a Pod with a label whose key is `security` and whose value is `S1`.
- The Pod anti-affinity rule states that the Pod should not be scheduled onto a node whose topology domain is already running a Pod with a label whose key is `security` and whose value is `S2`. More precisely: since `topologyKey` is `failure-domain.beta.kubernetes.io/zone`, the Pod cannot be scheduled onto any node in a zone that is already running a Pod labeled `security: S2`.
In principle, `topologyKey` can be any legal label key. However, for performance and security reasons, `topologyKey` is subject to the following restrictions:

- For affinity, and for `requiredDuringSchedulingIgnoredDuringExecution` Pod anti-affinity, `topologyKey` cannot be empty.
- For `requiredDuringSchedulingIgnoredDuringExecution` Pod anti-affinity, the admission controller `LimitPodHardAntiAffinityTopology` restricts `topologyKey` to `kubernetes.io/hostname`. If you want to use a custom topology, you must modify the admission controller or disable it.
- For `preferredDuringSchedulingIgnoredDuringExecution` Pod anti-affinity, an empty `topologyKey` denotes all topologies (at this time, limited to the combination of `kubernetes.io/hostname`, `failure-domain.beta.kubernetes.io/zone`, and `failure-domain.beta.kubernetes.io/region`).
- Apart from the above cases, `topologyKey` can be any legal label key.
In addition to `labelSelector` and `topologyKey`, you can specify a list of `namespaces` to act as the scope of the `labelSelector` (the `namespaces` field is defined at the same level as `labelSelector` and `topologyKey`). If it is omitted or empty, it defaults to the namespace of the Pod that declares the affinity rule.
All `matchExpressions` of a `requiredDuringSchedulingIgnoredDuringExecution` affinity or anti-affinity term must be satisfied before the Pod can be scheduled to the target node.
More practical examples
Pod affinity and anti-affinity become especially useful when combined with higher-level controllers such as ReplicaSet, StatefulSet, and Deployment. They make it easy to co-locate a group of workloads in the same topology domain, for example on the same node.
Always on the same node
In a three-node cluster, deploy a web application that uses redis, and keep each web-server on the same node as a redis replica as far as possible.
Below is the YAML snippet for the redis Deployment, with three replicas and the `app=store` label selector. The Deployment configures `podAntiAffinity` to ensure that the scheduler will not place two replicas on the same node:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
```
Configuration details:

This YAML describes a Kubernetes Deployment object named "redis-cache". A field-by-field reading:

- `apiVersion: apps/v1`: the Kubernetes API version used; `apps/v1` indicates we are defining a Deployment object.
- `kind: Deployment`: the type of resource being created, a Deployment that deploys and manages replicas of the application.
- `metadata`: metadata about the Deployment object, such as its name and labels.
  - `name: redis-cache`: the name of the Deployment, used to identify and reference it.
- `spec`: the Deployment specification, defining how Pods are created and run.
  - `selector` / `matchLabels`: the label selector; Pods with the label `app: store` are managed by this Deployment.
  - `replicas: 3`: the number of Pod replicas to create, here 3.
  - `template`: the template for the Pods to be created.
    - `metadata` / `labels`: the Pod template's labels, which must match the Deployment's selector; here the label `app: store` is defined.
    - `spec`: the Pod specification.
      - `affinity` / `podAntiAffinity`: the Pod anti-affinity rule, requiring that Pods with the same label `app: store` not be scheduled onto the same node.
        - `requiredDuringSchedulingIgnoredDuringExecution`: the anti-affinity rule must be satisfied at scheduling time.
        - `labelSelector`: matches Pods carrying the label `app: store`.
        - `topologyKey: "kubernetes.io/hostname"`: the key of the topology domain; the anti-affinity rule is evaluated per node hostname.
      - `containers`: the list of containers to run in the Pod.
        - `name: redis-server`: the name of the container.
        - `image: redis:3.2-alpine`: the container image to use, here the Alpine Linux base image of Redis 3.2.

In short, this Deployment creates three replicas, each running a Redis container named "redis-server", and ensures that the Pods land on different nodes, improving availability and fault tolerance. It uses a label selector to pick the Pods it manages and an anti-affinity rule to control their distribution across nodes.
Below is the YAML snippet for the web-server Deployment, which configures both `podAntiAffinity` and `podAffinity`. It requires each replica to be placed on the same node as a Pod carrying the `app=store` label, and it also requires that no two web-server replicas be scheduled onto the same node:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine
```
If you create both Deployments above, the cluster will look like this:

| Node-1 | Node-2 | Node-3 |
|---|---|---|
| web-server-1 | web-server-2 | web-server-3 |
| cache-1 | cache-2 | cache-3 |

All three `web-server` replicas automatically run on the same nodes as the cache replicas.
```shell
kubectl get pods -o wide
```

The output is as follows:

```
NAME                          READY  STATUS   RESTARTS  AGE  IP          NODE
redis-cache-1450370735-6dzlj  1/1    Running  0         8m   10.192.4.2  kube-node-3
redis-cache-1450370735-j2j96  1/1    Running  0         8m   10.192.2.2  kube-node-1
redis-cache-1450370735-z73mh  1/1    Running  0         8m   10.192.3.1  kube-node-2
web-server-1287567482-5d4dz   1/1    Running  0         7m   10.192.2.3  kube-node-1
web-server-1287567482-6f7v5   1/1    Running  0         7m   10.192.4.3  kube-node-3
web-server-1287567482-s330j   1/1    Running  0         7m   10.192.3.2  kube-node-2
```
Always not on the same node
The example above uses a `podAntiAffinity` rule with `topologyKey: "kubernetes.io/hostname"` to deploy the redis cluster so that no two replicas are scheduled onto the same node. Refer to the ZooKeeper tutorial to learn how to configure anti-affinity for a StatefulSet in the same way for high availability.
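As a hedged sketch of that pattern (the name `zk`, the labels, and the placeholder image are illustrative assumptions, not the tutorial's exact manifest), a StatefulSet spread across nodes might look like:

```yaml
# Illustrative fragment: spread a StatefulSet's replicas across nodes.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk
  replicas: 3
  selector:
    matchLabels:
      app: zk          # hypothetical label, for illustration only
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["zk"]
            topologyKey: "kubernetes.io/hostname"   # at most one replica per node
      containers:
      - name: zk
        image: k8s.gcr.io/pause:2.0   # placeholder image for illustration
```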