One of the most common mistakes that people make when increasing the number of replicas of their deployments is not checking whether their replicas are (evenly) distributed across all the nodes in their Kubernetes cluster. Most of the time, the reason why this happens is that people assume that increasing replicas means each replica will be running on a different node. In this post, I will go over why distributing pods across nodes is important and how to achieve it.
Why should you distribute pods across nodes?
If all the pods are running on one node and that node fails, then the failed pods will be rescheduled on the remaining nodes in your cluster. However, if there aren’t sufficient CPU or memory resources on the remaining nodes, then your deployment will be running at a reduced capacity or, worse, not at all until the failed node is restored.
Realistically, node failures happen infrequently. What occurs more often is an unexpected spike in the usage of resources like CPU, memory, disk, and/or network bandwidth. If all the pods of your deployment are running on the same node, then one bad pod can inflict serious damage on your entire deployment. While this problem can be mitigated by adding resource limits to the pod template for resources like CPU and memory, there is no easy way to limit resources like network bandwidth.
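As a rough illustration of the CPU and memory limits mentioned above, the fragment below sets requests and limits on a container in the pod template. The container name, image, and all values are placeholders chosen for illustration, not recommendations.

```yaml
# Fragment of a Deployment's pod template (spec.template.spec).
# The container name, image, and values are hypothetical placeholders.
containers:
- name: myapp
  image: example.com/myapp:1.0
  resources:
    requests:        # what the scheduler reserves for the container on a node
      cpu: 250m
      memory: 128Mi
    limits:          # the ceiling enforced while the container is running
      cpu: 500m
      memory: 256Mi
```

Note that these fields only cover CPU and memory, which is why they do not help with a pod that saturates the node's network bandwidth.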
How to spread pods across nodes?
There is a whole section in the Kubernetes documentation called Assigning Pods to Nodes. This page covers a wide range of features that allow us to “influence” the scheduling of pods to nodes. My complaint with this page is that too many different features are discussed in a very short span of space, which makes it difficult to determine which of the features described there helps us achieve our goal.
When we want to influence the scheduling of a pod based on the existence of another pod, we need to use the podAffinity or podAntiAffinity feature. Technically, we can even specify both of them at the same time. An example of when you would want that is when you want to schedule the current pod close to the DB pod but not close to another pod of the same deployment.
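A minimal sketch of specifying both at once might look like the fragment below. It assumes the DB pods carry the label “app: mydb” (a hypothetical label used only for illustration) and that the pods of the current deployment are labeled “app: myapp”. Using a soft rule for the affinity and a hard rule for the anti-affinity is just one possible combination; the two rule types are explained later in this post.

```yaml
# Fragment of a pod template (spec.template.spec).
# The label "app: mydb" is a hypothetical label for the DB pods.
affinity:
  podAffinity:
    # Soft rule: prefer a node that already runs a DB pod.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: mydb
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    # Hard rule: never share a node with another pod of this deployment.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: myapp
      topologyKey: kubernetes.io/hostname
```

The affinity is kept soft here because, with a hard rule on both sides and a single DB pod, only one replica could ever be scheduled.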
Hard vs. Soft scheduling of a Pod to a Node
When trying to influence how the Kubernetes scheduler assigns a pod to a node, we need to decide which of the two behaviors below we want when a specific pod already exists on a node:
- If a specific pod exists on a node, then absolutely prevent the scheduling of the current pod on that node. If there are no other nodes where this pod can be scheduled, this pod’s state will be stuck in Pending until there is a node where this pod can be scheduled to. (requiredDuringSchedulingIgnoredDuringExecution)
- If a specific pod exists on a node, then avoid the scheduling of the current pod on that node. If there are no other nodes where this pod can be scheduled, then it’s okay if the current pod is scheduled on that node. (preferredDuringSchedulingIgnoredDuringExecution)
Full documentation of both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution can be found at https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduling/podaffinity.md.
In the following example, the podAntiAffinity is set up such that when a new pod with the label “app: myapp” is being scheduled, that pod will NOT be scheduled on any node that already contains another pod with the label “app: myapp”.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp  # must match spec.selector and the labelSelector below
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp
            topologyKey: kubernetes.io/hostname
      containers:
      # regular declaration
In the following example, the podAntiAffinity is set up such that when a new pod with the label “app: myapp” is being scheduled, the scheduler will first try to place that pod on a node that doesn’t contain another pod with the label “app: myapp”. If no such node is available, then the new pod will be scheduled on any of the nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: myapp
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp  # must match spec.selector and the labelSelector below
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myapp
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      # regular declaration