Recently I got some questions about the availability of applications on Kubernetes, also in relation to Persistent Volumes (PV) and Persistent Volume Claims (PVC), in a Tanzu Kubernetes Grid environment using the vSphere CSI storage plugin. I did some research that I am sharing in this article with the long title “Tanzu Kubernetes Grid Availability – About deployments, statefulsets, daemonsets, PVCs on vSAN and K8S node (un)availability“. I hope it’s useful for you.
In this article I want to share some basics about how deployments, statefulsets and daemonsets work, and how these objects work together with storage and the available access modes. We will specifically look at what happens when a Kubernetes node (in the case of Tanzu Kubernetes Grid, a VM running as a K8S control plane or worker node) is lost (deleted, powered off or unavailable because an ESXi host fails), in the context of vSphere with Tanzu running on vSphere 8.
K8S App Availability Options
First some basics about K8S app availability options. Kubernetes gives us a couple of constructs:
- A deployment manages a set of pods running the same workload, usually one that doesn’t maintain state.
- A statefulset is an object that manages a stateful application.
- A daemonset ensures that each Kubernetes node runs a copy of a pod.
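To make this a bit more concrete, below is a minimal deployment sketch (a hypothetical example, not one of the manifests used later in this article); the availability-related knob is the number of replicas that Kubernetes keeps running across the available nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-example          # hypothetical name, just for illustration
spec:
  replicas: 3                  # Kubernetes keeps 3 identical, stateless pods running
  selector:
    matchLabels:
      app: nginx-example
  template:
    metadata:
      labels:
        app: nginx-example
    spec:
      containers:
      - name: nginx
        image: nginx:1.25      # any stateless container image
        ports:
        - containerPort: 80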
PVC Access Modes & vSphere CSI
When we look at PVCs in Kubernetes we have different types of access modes available:
- RWO – ReadWriteOnce: the volume can be mounted as read-write by a single node. ReadWriteOnce can still allow multiple pods to access the volume, as long as those pods are running on the same node.
- ROX – ReadOnlyMany: the volume can be mounted as read-only by many nodes (and pods).
- RWX – ReadWriteMany: the volume can be mounted as read-write by many nodes (and pods).
- RWOP – ReadWriteOncePod: the volume can be mounted as read-write by a single pod. Use the ReadWriteOncePod access mode if you want to ensure that only one pod across the whole cluster can read from or write to the PVC. This is only supported for CSI volumes and Kubernetes 1.22+.
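To illustrate how an access mode is requested, here is a minimal PVC sketch; the claim and storage class names are placeholders and should be replaced by values that exist in your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example                    # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce                      # use ReadWriteMany for an RWX volume
  resources:
    requests:
      storage: 1Gi
  storageClassName: my-storage-class   # placeholder, use a storage class from your cluster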
The vSphere CSI plugin is a volume plugin that runs in any Kubernetes cluster (not necessarily TKG) and takes care of provisioning persistent volumes on vSphere storage. It makes sure that volumes are attached to the correct Kubernetes node (running as a VM), so the volume can be mounted by the pod or pods running on that node. Read more about the vSphere CSI storage plugin in the documentation.
The vSphere CSI plugin supports RWO, ROX and RWX claims. For ROX and RWX, vSAN with vSAN File Services is required. RWO works on any type of (VMFS/vSAN) storage, including external storage arrays. The vSphere CSI driver is part of Tanzu Kubernetes Grid (TKG), but it can also be installed on other Kubernetes distributions. You can also choose to install additional/other CSI drivers in TKG if you want to use the capabilities of a third-party storage solution.
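For reference, a storage class backed by the vSphere CSI plugin uses the csi.vsphere.vmware.com provisioner. Below is a sketch; the storage class name and storage policy name are placeholders, and in vSphere with Tanzu / TKG these storage classes are normally created for you when a vSphere storage policy is assigned to the namespace:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-vsan-policy                          # placeholder name
provisioner: csi.vsphere.vmware.com                  # the vSphere CSI driver
parameters:
  storagepolicyname: "vSAN Default Storage Policy"   # placeholder vSphere storage policy name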
In this article we will specifically look at deployments and statefulsets running on VMware vSAN using RWO or RWX PVCs.
My preparations
To see what happens in different scenarios, I’ve deployed some workloads that I’m running on my TKG cluster using the following YAML files:
- A deployment without any PVCs (nginx01-deployment.yaml).
- A statefulset with RWO PVCs (nginx02-statefulset.yaml).
- A single pod deployment with a single RWO PVC (nginx04-deployment-single-pod-pvc.yaml) – not discussed in this blogpost.
- A daemonset without any PVCs (nginx05-daemonset.yaml).
- A deployment with a RWX PVC (nginx06-deployment-pvc-rwx.yaml).
The PVCs are (pre-)created with these files: pvc04-rwo.yaml (RWO) and pvc06-rwx.yaml (RWX). The PVC for the statefulset is defined in the statefulset YAML file using volumeClaimTemplates. The services for all the (test) websites are defined here: nginx01-svc.yaml, nginx02-svc.yaml, nginx04-svc.yaml, nginx05-svc.yaml and nginx06-svc.yaml.
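For reference, the volumeClaimTemplates construct in a statefulset looks roughly like the sketch below. This is a simplified extract, not the exact content of nginx02-statefulset.yaml; the names line up with the kubectl output further down (one RWO PVC per replica, named pvc02-nginx02-0 through pvc02-nginx02-4):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx02
spec:
  serviceName: nginx02
  replicas: 5
  selector:
    matchLabels:
      app: nginx02
  template:
    metadata:
      labels:
        app: nginx02
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        volumeMounts:
        - name: pvc02                        # must match the claim template name below
          mountPath: /usr/share/nginx/html   # hypothetical mount path
  volumeClaimTemplates:                      # one PVC is created per replica
  - metadata:
      name: pvc02
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: XYZ                  # the (anonymized) storage class from the output below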
I’m deploying all these apps on my 3 node TKG cluster in vSphere. First the PVCs are created, then the deployments/statefulsets/daemonsets and then the services (of course this could all be combined).
Applying all these YAML results in:
k get deployments,statefulsets,daemonsets
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx01   3/3     3            3           12d
deployment.apps/nginx04   1/1     1            1           12d
deployment.apps/nginx06   3/3     3            3           31h

NAME                       READY   AGE
statefulset.apps/nginx02   5/5     12d

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/nginx05   3         3         3       3            3           <none>          27h
and
k get pods -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE
nginx01-78b6f94d-8hps8     1/1     Running   0          9d    192.0.6.14    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx01-78b6f94d-8r6dw     1/1     Running   0          30h   192.0.7.199   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx01-78b6f94d-rpm7c     1/1     Running   0          30h   192.0.7.207   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx02-0                  1/1     Running   0          30h   192.0.7.210   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx02-1                  1/1     Running   0          9d    192.0.6.23    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx02-2                  1/1     Running   0          30h   192.0.8.10    tkg01-node-pool-1-hjr5m-74685f76ff-l7944
nginx02-3                  1/1     Running   0          9d    192.0.7.9     tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx02-4                  1/1     Running   0          9d    192.0.6.24    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx04-57d5d8fc89-jgqr9   1/1     Running   0          30h   192.0.7.211   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx05-69qg6              1/1     Running   0          27h   192.0.6.30    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx05-9jllq              1/1     Running   0          27h   192.0.8.203   tkg01-node-pool-1-hjr5m-74685f76ff-l7944
nginx05-rtchl              1/1     Running   0          27h   192.0.7.212   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx06-6455594cb9-2qnwr   1/1     Running   0          31h   192.0.6.28    tkg01-node-pool-1-hjr5m-74685f76ff-tpjvg
nginx06-6455594cb9-jtc75   1/1     Running   0          31h   192.0.7.183   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
nginx06-6455594cb9-zf57b   1/1     Running   0          30h   192.0.7.208   tkg01-node-pool-1-hjr5m-74685f76ff-6tbbz
and
k get pvc
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc02-nginx02-0   Bound    pvc-1cb6fdf6-bf23-4b82-b9d8-785e34c3affe   1Gi        RWO            XYZ            12d
pvc02-nginx02-1   Bound    pvc-e517aae9-2b52-42b9-ac5c-61ad91f5da9c   1Gi        RWO            XYZ            12d
pvc02-nginx02-2   Bound    pvc-05ecb287-80e5-4879-89aa-a4021e4a9731   1Gi        RWO            XYZ            12d
pvc02-nginx02-3   Bound    pvc-228a8a44-7104-4b56-b60a-c7c2662bd90a   1Gi        RWO            XYZ            12d
pvc02-nginx02-4   Bound    pvc-5c45aad5-b35a-4eaf-8cd7-658d29579fc5   1Gi        RWO            XYZ            12d
pvc04             Bound    pvc-ce1da553-6834-411e-aa9c-ef374c8e3fe2   1Gi        RWO            XYZ            12d
pvc06             Bound    pvc-daaf093e-1ffd-453d-9505-f0198d04f22f   5Gi        RWX            XYZ            3d
The RWO PVCs are created as First Class Disks on vSAN (or VMFS storage); the RWX pvc06 is created as a file share on vSAN File Services.
You can get an overview of PVC volumes on the vSAN cluster under Monitor->Cloud Native Volumes (on the vSAN cluster).
You will see File for RWX (and ROX) volumes, and Block for RWO volumes. RWO volumes are stored as First Class Disks (FCDs) in the FCD folder on vSAN, while the RWX/ROX volumes are stored as file shares on vSAN File Services.
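If you want to correlate what you see in the vSphere client with the Kubernetes side, you can look up the CSI volume handle of a persistent volume. A quick example using one of the volume names from the kubectl get pvc output above; for block (FCD) volumes this returns the FCD UUID, while for RWX file volumes the handle refers to the vSAN file share:

kubectl get pv pvc-ce1da553-6834-411e-aa9c-ef374c8e3fe2 -o jsonpath='{.spec.csi.volumeHandle}'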
Some tests
So let’s do some testing and see what happens.
- Simulate an ESXi node down by powering off a TKG Kubernetes node/VM (using the vSphere webclient).
- Power off and delete a TKG Kubernetes node/VM (using the vSphere webclient).
- Delete a TKG Kubernetes node/VM using kubectl after first draining the node (recommended).
- Delete a TKG Kubernetes node/VM using kubectl without draining the node (not recommended).
Notice that TKG and VMware HA work closely together to guarantee the availability of the TKG nodes. If an ESXi host fails, the TKG VMs (the K8S nodes) are restarted automatically on the remaining ESXi hosts by VMware HA. If a TKG node is (accidentally) powered off, the supervisor cluster will start the VM again. More details are in this blogpost: “What Happens When a Physical Node Fails in VMware vSphere with Tanzu“. Because my lab environment only consists of one ESXi host, I cannot test VMware HA. However, a powered-off TKG node that is restarted by the supervisor cluster simulates a similar scenario (at least from the Kubernetes level).
Power off a TKG Kubernetes node/VM using the vSphere WebClient
This situation doesn’t require any manual intervention to recover. The TKG supervisor cluster will detect the power off and start the TKG node (VM) again. The node will report status “Ready” and Kubernetes will restore the desired state for deployments, statefulsets and daemonsets.
Note: From a Kubernetes perspective this situation is similar to the failure of an ESXi host that runs a TKG node/VM; the difference is that VMware HA will take care of restarting the node/VM on one of the remaining ESXi hosts.
After the power off (unavailability of the TKG node/VM) has been detected:
- The supervisor will try to start the powered-off TKG Kubernetes node; this can take up to 10 minutes. In my experience this action is initiated within 5 minutes.
- All deployments will be redeployed. By default there’s a Kubernetes timeout of 300 seconds before pods are redeployed, but you can change this by configuring tolerations in your YAML file. Check my nginx01-deployment.yaml on how to set up a 10-second delay before pods that are part of a deployment are recreated, using “tolerations” (see the sketch after this list).
- However, with the statefulset nothing will happen. Why? Because a statefulset guarantees that at most one instance of a pod with a given identity is running at any time. So if the statefulset is not 100% sure that a pod has been lost, the pod will not be redeployed. On top of that, data consistency is more important than availability for statefulsets. A PVC used by a pod that’s part of a statefulset can only be attached to another Kubernetes node (running a new instance of the pod) after the volumeattachment has timed out (by default 300 seconds). More information about statefulsets is here.
- After the TKG node/VM is powered on again and reports status “Ready”, the node will rejoin the Kubernetes cluster. Depending on the state of the cluster, statefulset pods will be redeployed to the (restored) TKG node again (if they were not already deployed to another TKG node). The pods that are part of a deployment have probably already been redeployed to another node and will not be scheduled on the restored node again.
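As mentioned in the list above, the 300-second default can be shortened with tolerations on the pod template of a deployment. A sketch of what this looks like (the rest of the deployment spec is omitted, and the exact content of my nginx01-deployment.yaml may differ slightly):

spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10    # recreate pods 10 seconds after the node becomes unreachable
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10    # same for the not-ready condition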
Power off and accidentally delete a TKG Kubernetes node using the vSphere WebClient
The TKG supervisor cluster will detect the deleted TKG node (VM). Cluster API will automatically deploy a new cluster node with an identical name. The issue is that this new TKG node (with the same name) will report status “NotReady,SchedulingDisabled” and will not automatically go to the ready state. Deployments in the environment will be redeployed to the remaining Kubernetes nodes after the timeout has been exceeded.
To solve the issue with the node reporting “NotReady”, I ran through the following process: delete the node from the TKG cluster using kubectl delete node <NODENAME>. This removes the node from the Kubernetes cluster (and also deletes the VM at the vSphere level), and TKG Cluster API will automatically deploy a brand new node. Now we have to wait for the timeout of the PVC volumeattachments, which takes about 5 minutes. After that, the volumeattachments to the original (deleted) node are released, new volumeattachments to the original PVCs can be created and (specifically) the pods that are part of the statefulset are restarted.
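If you want to follow this process, you can watch the volumeattachment objects being released with standard kubectl commands:

# list all CSI volume attachments and the node they are attached to
kubectl get volumeattachments
# watch until the attachments pointing to the deleted node disappear (roughly 5 minutes)
kubectl get volumeattachments -w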
Delete a TKG Kubernetes node using kubectl with draining (recommended)
If you ever need to remove a node from your TKG cluster the recommended process is to:
- Drain the node first using
kubectl drain <NODENAME> --ignore-daemonsets --delete-emptydir-data
- Now delete the node using
kubectl delete node <NODENAME>
The first command will disable scheduling on the node and evict all running workloads. It will also move your PVC-backed workloads (and their volume attachments) to one of the remaining nodes. After the node has been deleted, Cluster API will deploy a new Kubernetes node. The control plane of the TKG cluster will take care of the desired state of the TKG cluster and of course use the new TKG Kubernetes node.
Delete a TKG Kubernetes node using kubectl without draining (not recommended)
If you delete a TKG Kubernetes node without draining it, you have to wait for the timeout of the PVC volumeattachments. This takes about 5 minutes. After that, the PVC volumeattachments are released, new volumeattachments are created and the pods using a PVC will be redeployed and connected to the original PVC.
That’s that…I hope this is useful. Questions? Comments? Please leave a comment below!