vSphere troubleshooting

PV/PVC cannot be created (Unable to find VM by UUID)

This may happen if the virtual machines in the cluster were re-created, for example to recover from a VM failure.

The cluster-manager logs on the master will contain a message similar to: Unable to find VM by UUID. VM UUID: 4215cbe6-2a4f-b8a8-9178-6219df59cd40.

The problem is caused by the way kubelet registers nodes with the Kubernetes master.

Once a node is registered with the master, the providerID field of the Kubernetes node object cannot be changed. A re-created VM gets a new UUID, so the providerID stored in the node object no longer matches the VM.

To confirm that you are hitting exactly this issue, run the following command:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID,UUID:.status.nodeInfo.systemUUID

NAME                               PROVIDER_ID                                      UUID
cluster-247-vsp1-group1-worker-0   vsphere://42152218-c14d-f00d-dc13-84176e986471   42152218-C14D-F00D-DC13-84176E986471
cluster-247-vsp1-group1-worker-1   vsphere://42155ab0-e5a1-7bc3-e769-8d8cb56f2c2b   42155AB0-E5A1-7BC3-E769-8D8CB56F2C2B
cluster-247-vsp1-master-0          vsphere://42155d4b-9ee2-344e-e610-b80db41130f0   42155D4B-9EE2-344E-E610-B80DB41130F0
cluster-247-vsp1-master-1          vsphere://4215cbe6-2a4f-b8a8-9178-6219df59cd40   4215DE4E-975F-1DCC-57DD-67330B1653D5
cluster-247-vsp1-master-2          vsphere://4215f891-37f9-8cf3-8bb4-8fde7762db6b   42150441-AB5B-A87E-A2EA-A01AB7560430

Note that for the cluster-247-vsp1-master-1 and cluster-247-vsp1-master-2 nodes the UUID in the .spec.providerID field differs from the one in .status.nodeInfo.systemUUID.
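This comparison can be automated. The following sketch compares the two UUID columns and prints only the nodes that need fixing; the sample data is copied from the table above and stands in for live output (an assumption — in practice, pipe the real kubectl command output into the awk filter instead):

```shell
# Stand-in for the output of:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID,UUID:.status.nodeInfo.systemUUID
sample='NAME PROVIDER_ID UUID
cluster-247-vsp1-master-0 vsphere://42155d4b-9ee2-344e-e610-b80db41130f0 42155D4B-9EE2-344E-E610-B80DB41130F0
cluster-247-vsp1-master-1 vsphere://4215cbe6-2a4f-b8a8-9178-6219df59cd40 4215DE4E-975F-1DCC-57DD-67330B1653D5'

# Print every node whose providerID UUID differs from the systemUUID
# (the comparison is case-insensitive, as the two fields differ in case).
mismatched=$(echo "$sample" | awk 'NR > 1 {
    pid = $2
    sub(/^vsphere:\/\//, "", pid)        # strip the provider scheme prefix
    if (toupper(pid) != toupper($3)) print $1
}')
echo "$mismatched"
```

Any node printed here is affected by the problem and needs the fix described below.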

To fix it:

  1. Remove the Kubernetes nodes with incorrect UUID values by running the following commands:

    $ kubectl delete node cluster-247-vsp1-master-1
    $ kubectl delete node cluster-247-vsp1-master-2
  2. Restart the kubelet on these nodes over SSH by running the following command, or simply reboot the nodes.

    # systemctl restart kublr-kubelet
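The two steps above can be sketched as a single loop. This is a dry run: each command is echoed rather than executed, and the node list is an assumption taken from the example table above; drop the echo (and adjust the SSH user) to apply the fix for real:

```shell
# Nodes with mismatched UUIDs (hypothetical, taken from the example above).
bad_nodes="cluster-247-vsp1-master-1 cluster-247-vsp1-master-2"

fix_cmds=$(for node in $bad_nodes; do
    echo "kubectl delete node $node"                       # step 1: drop the stale node object
    echo "ssh root@$node systemctl restart kublr-kubelet"  # step 2: kubelet re-registers the node
done)
echo "$fix_cmds"
```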

Group is in error state after scaling cluster down or removing a group

When an operation on a vSphere or VCD cluster results in removing nodes - such as scaling a node group down or removing a node group - the corresponding node objects are not removed from the Kubernetes API automatically, and the nodes are reported to be in an error state.

The current workaround is to remove such nodes from the Kubernetes API manually.

To identify the problematic nodes, run the following command:

$ kubectl get nodes

NAME                               STATUS   ROLES    AGE    VERSION
cluster-247-vsp1-group1-worker-0   Ready    <none>   98d    v1.18.15
cluster-247-vsp1-group1-worker-1   NotReady <none>   98d    v1.18.15
cluster-247-vsp1-master-0          Ready    master   27d    v1.18.18
cluster-247-vsp1-master-1          Ready    <none>   259d   v1.18.15
cluster-247-vsp1-master-2          Ready    <none>   259d   v1.18.15

Note that the cluster-247-vsp1-group1-worker-1 node has the NotReady status.

Remove the node from the Kubernetes API:

$ kubectl delete node cluster-247-vsp1-group1-worker-1
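Finding such nodes can also be scripted. A minimal sketch, assuming output in the form shown above (the sample data stands in for the live kubectl command, and the delete command is only echoed as a dry run):

```shell
# Stand-in for live "kubectl get nodes" output.
sample='NAME STATUS ROLES AGE VERSION
cluster-247-vsp1-group1-worker-0 Ready <none> 98d v1.18.15
cluster-247-vsp1-group1-worker-1 NotReady <none> 98d v1.18.15'

# Select every node reported NotReady (column 2 of the default output).
not_ready=$(echo "$sample" | awk 'NR > 1 && $2 == "NotReady" { print $1 }')

for node in $not_ready; do
    echo "kubectl delete node $node"   # dry run; drop the echo to actually delete
done
```

Before deleting, verify that each listed node really was removed from the infrastructure - a node can also be NotReady for transient reasons, in which case deleting it would be wrong.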