Recently I got some questions about the availability of applications on Kubernetes, also in relation to Persistent Volumes (PV) and Persistent Volume Claims (PVC), in a Tanzu Kubernetes Grid environment using the vSphere CSI storage plugin. I did some research that I am sharing in this article with the long title “Tanzu Kubernetes Grid Availability – About deployments, statefulsets, daemonsets, PVCs on vSAN and K8S node (un)availability“. I hope it’s useful for you.
In this article I want to share some basics about how deployments, statefulsets and daemonsets work, and how these objects work together with storage and the available access modes. We will specifically look at what happens when a Kubernetes node (in the case of Tanzu Kubernetes Grid, a VM running as a K8S control plane or worker node) is lost (deleted, powered off or unavailable because an ESXi host fails), in the context of vSphere with Tanzu running on vSphere 8.
K8S App Availability Options
First some basics about K8S app availability options. Kubernetes gives us a couple of constructs:
- A deployment manages a set of pods running the same workload, usually one that doesn’t maintain state.
- A statefulset is an object that manages a stateful application.
- A daemonset ensures that each Kubernetes node runs a copy of a pod.
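To make this a bit more concrete, below is a minimal deployment sketch (a hypothetical example, not one of the manifests used later in this article); the availability-related knob is the number of replicas that Kubernetes keeps running across the available nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-example          # hypothetical name, just for illustration
spec:
  replicas: 3                  # Kubernetes keeps 3 identical, stateless pods running
  selector:
    matchLabels:
      app: nginx-example
  template:
    metadata:
      labels:
        app: nginx-example
    spec:
      containers:
      - name: nginx
        image: nginx:1.25      # any stateless container image
        ports:
        - containerPort: 80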
PVC Access Modes & vSphere CSI
When we look at PVCs in Kubernetes we have different types of access modes available:
- RWO – ReadWriteOnce: the volume can be mounted as read-write by a single node. ReadWriteOnce can still allow multiple pods to access the volume, as long as those pods are running on the same node.
- ROX – ReadOnlyMany: the volume can be mounted as read-only by many nodes (and pods).
- RWX – ReadWriteMany: the volume can be mounted as read-write by many nodes (and pods).
- RWOP – ReadWriteOncePod: the volume can be mounted as read-write by a single pod. Use the ReadWriteOncePod access mode if you want to ensure that only one pod across the whole cluster can read from or write to the PVC. This is only supported for CSI volumes and Kubernetes 1.22+.
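To illustrate how an access mode is requested, here is a minimal PVC sketch; the claim and storage class names are placeholders and should be replaced by values that exist in your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example                    # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce                      # use ReadWriteMany for an RWX volume
  resources:
    requests:
      storage: 1Gi
  storageClassName: my-storage-class   # placeholder, use a storage class from your cluster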
The vSphere CSI plugin is a volume plugin that runs in any Kubernetes cluster (not necessarily TKG) and takes care of provisioning persistent volumes on vSphere storage. It makes sure that volumes are attached to the correct Kubernetes node (running as a VM), so the volume can be mounted by the pod or pods running on that node. Read more about the vSphere CSI storage plugin in the documentation.
The vSphere CSI plugin supports RWO, ROX and RWX claims. For ROX and RWX, vSAN with vSAN File Services is required. RWO works on any type of (VMFS/vSAN) storage, including external storage arrays. The vSphere CSI driver is part of Tanzu Kubernetes Grid (TKG), but it can also be installed on other Kubernetes distributions. You can also choose to install additional/other CSI drivers in TKG if you want to use the capabilities of a third-party storage solution.
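For reference, a storage class backed by the vSphere CSI plugin uses the csi.vsphere.vmware.com provisioner. Below is a sketch; the storage class name and storage policy name are placeholders, and in vSphere with Tanzu / TKG these storage classes are normally created for you when a vSphere storage policy is assigned to the namespace:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-vsan-policy                          # placeholder name
provisioner: csi.vsphere.vmware.com                  # the vSphere CSI driver
parameters:
  storagepolicyname: "vSAN Default Storage Policy"   # placeholder vSphere storage policy name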
In this article we will specifically look at deployments and statefulsets running on VMware vSAN using RWO or RWX PVCs.
My preparations
To see what happens in different scenarios, I’ve deployed some workloads that I’m running on my TKG cluster using the following YAML files:
- A deployment without any PVCs (nginx01-deployment.yaml).
- A statefulset with RWO PVCs (nginx02-statefulset.yaml).
- A single pod deployment with a single RWO PVC (nginx04-deployment-single-pod-pvc.yaml) – not discussed in this blogpost.
- A daemonset without any PVCs (nginx05-daemonset.yaml).
- A deployment with a RWX PVC (nginx06-deployment-pvc-rwx.yaml).
The PVCs are (pre-)created with these files: pvc04-rwo.yaml (RWO) and pvc06-rwx.yaml (RWX). The PVC for the statefulset is defined in the statefulset YAML file using volumeClaimTemplates. The services for all the (test) websites are defined here: nginx01-svc.yaml, nginx02-svc.yaml, nginx04-svc.yaml, nginx05-svc.yaml and nginx06-svc.yaml.
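For reference, the volumeClaimTemplates construct in a statefulset looks roughly like the sketch below. This is a simplified extract, not the exact content of nginx02-statefulset.yaml; the names line up with the kubectl output further down (one RWO PVC per replica, named pvc02-nginx02-0 through pvc02-nginx02-4):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx02
spec:
  serviceName: nginx02
  replicas: 5
  selector:
    matchLabels:
      app: nginx02
  template:
    metadata:
      labels:
        app: nginx02
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        volumeMounts:
        - name: pvc02                        # must match the claim template name below
          mountPath: /usr/share/nginx/html   # hypothetical mount path
  volumeClaimTemplates:                      # one PVC is created per replica
  - metadata:
      name: pvc02
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: XYZ                  # the (anonymized) storage class from the output below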
I’m deploying all these apps on my 3 node TKG cluster in vSphere. First the PVCs are created, then the deployments/statefulsets/daemonsets and then the services (of course this could all be combined).
Applying all these YAML results in:
k get deployments,statefulsets,daemonsets
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx01   3/3     3            3           12d
deployment.apps/nginx04   1/1     1            1           12d
deployment.apps/nginx06   3/3     3            3           31h

NAME                       READY   AGE
statefulset.apps/nginx02   5/5     12d

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/nginx05   3         3         3       3            3           <none>          27h
and
k get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE
nginx01-78b6f94d-8hps8     1/1     Running   0          9d    192.0.6.14    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx01-78b6f94d-8r6dw     1/1     Running   0          30h   192.0.7.199   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx01-78b6f94d-rpm7c     1/1     Running   0          30h   192.0.7.207   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx02-0                  1/1     Running   0          30h   192.0.7.210   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx02-1                  1/1     Running   0          9d    192.0.6.23    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx02-2                  1/1     Running   0          30h   192.0.8.10    tkg01-node-pool-1-hjr5m-74685f76ff-l7944
nginx02-3                  1/1     Running   0          9d    192.0.7.9     tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx02-4                  1/1     Running   0          9d    192.0.6.24    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx04-57d5d8fc89-jgqr9   1/1     Running   0          30h   192.0.7.211   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx05-69qg6              1/1     Running   0          27h   192.0.6.30    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx05-9jllq              1/1     Running   0          27h   192.0.8.203   tkg01-node-pool-1-hjr5m-74685f76ff-l7944
nginx05-rtchl              1/1     Running   0          27h   192.0.7.212   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx06-6455594cb9-2qnwr   1/1     Running   0          31h   192.0.6.28    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx06-6455594cb9-jtc75   1/1     Running   0          31h   192.0.7.183   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx06-6455594cb9-zf57b   1/1     Running   0          30h   192.0.7.208   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
and
k get pvc
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc02-nginx02-0   Bound    pvc-1cb6fdf6-bf23-4b82-b9d8-785e34c3affe   1Gi        RWO            XYZ            12d
pvc02-nginx02-1   Bound    pvc-e517aae9-2b52-42b9-ac5c-61ad91f5da9c   1Gi        RWO            XYZ            12d
pvc02-nginx02-2   Bound    pvc-05ecb287-80e5-4879-89aa-a4021e4a9731   1Gi        RWO            XYZ            12d
pvc02-nginx02-3   Bound    pvc-228a8a44-7104-4b56-b60a-c7c2662bd90a   1Gi        RWO            XYZ            12d
pvc02-nginx02-4   Bound    pvc-5c45aad5-b35a-4eaf-8cd7-658d29579fc5   1Gi        RWO            XYZ            12d
pvc04             Bound    pvc-ce1da553-6834-411e-aa9c-ef374c8e3fe2   1Gi        RWO            XYZ            12d
pvc06             Bound    pvc-daaf093e-1ffd-453d-9505-f0198d04f22f   5Gi        RWX            XYZ            3d
The RWO PVCs are created as First Class Disks on vSAN (or VMFS storage); the RWX pvc06 is created as a file share on vSAN File Services.
You can get an overview of PVC volumes on the vSAN cluster under Monitor->Cloud Native Volumes (on the vSAN cluster).
You will see File for RWX (and ROX) volumes, and Block for RWO volumes. RWO volumes are stored as First Class Disks (FCDs) in the FCD folder on vSAN, while the RWX/ROX volumes are stored as file shares on vSAN File Services.
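If you want to correlate what you see in the vSphere client with the Kubernetes side, you can look up the CSI volume handle of a persistent volume. A quick example using one of the volume names from the kubectl get pvc output above; for block (FCD) volumes this returns the FCD UUID, while for RWX file volumes the handle refers to the vSAN file share:

kubectl get pv pvc-ce1da553-6834-411e-aa9c-ef374c8e3fe2 -o jsonpath='{.spec.csi.volumeHandle}'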
Some tests
So let’s do some testing and see what happens.
- Simulate an ESXi node down by powering off a TKG Kubernetes node/VM (using the vSphere webclient).
- Power off and delete a TKG Kubernetes node/VM (using the vSphere webclient).
- Delete a TKG Kubernetes node/VM using kubectl after first draining the node (recommended).
- Delete a TKG Kubernetes node/VM using kubectl without draining the node (not recommended).
Notice that TKG and VMware HA work closely together to guarantee the availability of the TKG nodes. If an ESXi host fails, the TKG VMs (the K8S nodes) are restarted automatically on the remaining ESXi hosts by VMware HA. If a TKG node is (accidentally) powered off, the supervisor cluster will start the VM again. More details are in this blogpost: “What Happens When a Physical Node Fails in VMware vSphere with Tanzu“. Because my lab environment only consists of one ESXi host, I cannot test VMware HA. However, a powered-off TKG node that is restarted by the supervisor cluster simulates a similar scenario (at least from the Kubernetes level).
Power off a TKG Kubernetes node/VM using the vSphere WebClient
This situation doesn’t require any manual intervention to recover. The TKG supervisor cluster will detect the power off and start the TKG node (VM) again. The node will report status “Ready” and Kubernetes will restore the desired state for deployments, statefulsets and daemonsets.
Note: From a Kubernetes perspective this situation is similar to the failure of an ESXi host that runs a TKG node/VM; the difference is that VMware HA will take care of restarting the node/VM on one of the remaining ESXi hosts.
After the power off (unavailability of the TKG node/VM) has been detected:
- The supervisor will try to start the powered-off TKG Kubernetes node; this can take up to 10 minutes. In my experience this action is initiated within 5 minutes.
- All deployments will be redeployed. By default there’s a Kubernetes timeout of 300 seconds before pods are redeployed, but you can change this by configuring tolerations in your YAML file. Check my nginx01-deployment.yaml on how to set up a 10-second delay before pods that are part of a deployment are recreated, using “tolerations” (see the sketch after this list).
- However, with the statefulset nothing will happen. Why? Because a statefulset guarantees that at most one instance of a pod with a given identity is running at any time. So if the statefulset is not 100% sure that a pod has been lost, the pod will not be redeployed. On top of that, data consistency is more important than availability for statefulsets. A PVC used by a pod that’s part of a statefulset can only be attached to another Kubernetes node (running a new instance of the pod) after the volumeattachment has timed out (by default 300 seconds). More information about statefulsets is here.
- After the TKG node/VM is powered on again and reports status “Ready”, the node will rejoin the Kubernetes cluster. Depending on the state of the cluster, statefulset pods will be redeployed to the (restored) TKG node again (if they were not already deployed to another TKG node). The pods that are part of a deployment have probably already been redeployed to another node and will not be scheduled on the restored node again.
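As mentioned in the list above, the 300-second default can be shortened with tolerations on the pod template of a deployment. A sketch of what this looks like (the rest of the deployment spec is omitted, and the exact content of my nginx01-deployment.yaml may differ slightly):

spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10    # recreate pods 10 seconds after the node becomes unreachable
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10    # same for the not-ready condition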
Power off and accidentally delete a TKG Kubernetes node using the vSphere WebClient
The TKG supervisor cluster will detect the deleted TKG node (VM). Cluster API will automatically deploy a new cluster node with an identical name. The issue is that this new TKG node (with the same name) will report status “NotReady,SchedulingDisabled” and will not automatically go to the ready state. Deployments in the environment will be redeployed to the remaining Kubernetes nodes after the timeout has been exceeded.
To solve the issue with the node reporting “NotReady”, I ran through the following process: delete the node from the TKG cluster using kubectl delete node <NODENAME>. This removes the node from the Kubernetes cluster (and also deletes the VM at the vSphere level), and TKG Cluster API will automatically deploy a brand new node. Now we have to wait for the timeout of the PVC volumeattachments, which takes about 5 minutes. After that, the volumeattachments to the original (deleted) node are released, new volumeattachments to the original PVCs can be created and (specifically) the pods that are part of the statefulset are restarted.
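If you want to follow this process, you can watch the volumeattachment objects being released with standard kubectl commands:

# list all CSI volume attachments and the node they are attached to
kubectl get volumeattachments
# watch until the attachments pointing to the deleted node disappear (roughly 5 minutes)
kubectl get volumeattachments -w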
Delete a TKG Kubernetes node using kubectl with draining (recommended)
If you ever need to remove a node from your TKG cluster the recommended process is to:
- Drain the node first using
kubectl drain <NODENAME> --ignore-daemonsets --delete-emptydir-data
- Now delete the node using
kubectl delete node <NODENAME>
The first command will disable scheduling on the node and evict all running workloads. It will also move your PVC-backed workloads (and their volume attachments) to one of the remaining nodes. After the node has been deleted, Cluster API will deploy a new Kubernetes node. The control plane of the TKG cluster will take care of the desired state of the TKG cluster and of course use the new TKG Kubernetes node.
Delete a TKG Kubernetes node using kubectl without draining (not recommended)
If you delete a TKG Kubernetes node without draining it, you have to wait for the timeout of the PVC volumeattachments. This takes about 5 minutes. After that, the PVC volumeattachments are released, new volumeattachments are created and the pods using a PVC will be redeployed and connected to the original PVC.
That’s that…I hope this is useful. Questions? Comments? Please leave a comment below!