Can’t bring up a virtualized vCenter Server after unregistering and re-registering the VM.

So, due to some unrelated disk locking issues (see https://baumaeam.wordpress.com/2015/09/22/unable-to-start-vms-failed-to-lock-the-file/), my vCenter VM failed to start today.

From the aforementioned blog post, the solution for any VM other than vCenter would have been to power-cycle the host responsible for the file lock (or all of the hosts, for good measure) and then restart the affected VM. That doesn’t work when the vCenter VM itself is the victim, though, because you can’t vMotion the other VMs off those hosts without vCenter!

Regardless, I went down a long rabbit hole full of attempted fixes that ultimately required me to restore the vCenter VM from a Veeam backup directly to a host. It worked great, except I couldn’t vMotion the vCenter VM anymore! vSphere kept throwing the following error whenever I attempted to vMotion the VM:

vim.fault.NotFound

“That’s odd”, I thought to myself. Maybe because the VM had been registered directly on the ESXi host and then brought up, vCenter somehow couldn’t see itself there correctly, so when I tried to vMotion it, it couldn’t work out where the VM lived? Not sure. I figured a possible fix would be to shut down the vCenter Server, open the C# client directly against the host it was on, and unregister and re-register the VM. Perhaps doing that would get the registration right, and things would work.

… not so much.

I connected the client to the host the VM was on, unregistered the VM, and then re-registered it on a different host. After re-registering it, the vCenter Server VM’s network adapter could no longer connect to the distributed switch. So the VM would come up, but vCenter couldn’t start because it had no working network adapter with which to talk to the Platform Services Controller.
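For what it’s worth, the same unregister/re-register dance can also be done from the ESXi shell instead of the C# client. A rough sketch, with the datastore path, VM folder, and VM ID as placeholders:

# on the host currently holding the VM, find its ID and unregister it
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/unregister <vmid>
# on the host you want it on, register the .vmx straight from the shared datastore
vim-cmd solo/registervm /vmfs/volumes/<datastore>/<vcenter-vm>/<vcenter-vm>.vmx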

My solution was to create a new port group (with the appropriate VLAN) on an existing vSphere Standard Switch, steal a host NIC away from the LAG in the vDS, add it to the VSS as an uplink, and then power up the vCenter VM. Once it came up, it could reach the PSC and the vCenter Server services started. I then moved the VM’s networking back to the vDS, and things seemed to work okay again!
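In case it helps, here’s roughly what the standard-switch side of that workaround looks like from the ESXi shell. The switch, port group, VLAN ID, and vmnic names below are made up for illustration, and the NIC has to be detached from the vDS LAG first (I did that part in the client):

# build a temporary standard switch with one uplink and a tagged port group
esxcli network vswitch standard add --vswitch-name=vSwitch1
esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic3
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=vCenter-temp
esxcli network vswitch standard portgroup set --portgroup-name=vCenter-temp --vlan-id=20
# point the vCenter VM's network adapter at vCenter-temp (via the client), then power it on
vim-cmd vmsvc/power.on <vmid>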

Hope this helps anyone facing the same issue, where their vCenter Server can’t start because its distributed switch port group is inaccessible.

Unable to start VMs – Failed to lock the file.

Recently, I encountered an issue in my vSphere environment where VMs were randomly dying, and HA was unable to turn them back on. When trying to manually start these failed VMs, I received the following error message:

An error was received from the ESX host while powering on VM vCenter Support Assistant Appliance.
Failed to start the virtual machine.
Module Disk power on failed.
Cannot open the disk '/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000002.vmdk' or one of the snapshot disks it depends on.
Failed to lock the file

(The issue was happening with my vCenter Support Assistant Appliance in this example).

Some investigation revealed that the issue was occurring right after Veeam had backed up the machine in the routine overnight backup job. I opened a support ticket with Veeam, only to have them refer me to VMware, as the issue was occurring after a normal call to the vSphere API.

Doing more digging that day, I uncovered the following messages in the vmware.log file for the VM in question:

2015-08-26T02:49:36.674Z| vcpu-0| W110: Mirror_DisconnectMirrorNode: Failed to send disconnect ioctl for mirror node '28763a-24763d-svmmirror': (Device or resource busy)
2015-08-26T02:49:36.674Z| vcpu-0| W110: Mirror: scsi0:1: MirrorDisconnectDiskMirrorNode: Failed to disconnect mirror node '/vmfs/devices/svm/28763a-24763d-svmmirror'
2015-08-26T02:49:36.674Z| vcpu-0| W110: ConsolidateDiskCloseCB: Failed to destroy mirror node while consolidating disks '/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000001.vmdk' -> '/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1.vmdk'.
2015-08-26T02:49:36.674Z| vcpu-0| I120: NOT_IMPLEMENTED bora/vmx/checkpoint/consolidateESX.c:382
2015-08-26T02:49:40.270Z| vcpu-0| W110: A core file is available in "/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vmx-zdump.000"
2015-08-26T02:49:40.270Z| vcpu-0| W110: Writing monitor corefile "/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vmmcores.gz"
2015-08-26T02:49:40.345Z| vcpu-0| W110: Dumping core for vcpu-0
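If you need to dig these out yourself, the vmware.log sits in the VM’s directory on the datastore; something like the following (paths are placeholders) will surface the mirror-node and consolidation warnings:

# search the VM's current log for the mirror-node and consolidation warnings
grep -i mirror "/vmfs/volumes/<datastore>/<vmname>/vmware.log"
grep -i consolidate "/vmfs/volumes/<datastore>/<vmname>/vmware.log"
# older, rotated logs live alongside it as vmware-*.log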

Odd, I thought. Something with the mirror driver causing problems?

A bit of quick googling yielded this KB article: Investigating virtual machine file locks on ESXi/ESX (10051)

Using the info in that KB article, I went onto an ESXi host and used vmkfstools to try to discover which host was holding the lock on the VMDK(s) in question. For each file (not just the one named in the error, but every VMDK belonging to the machine), no host was reported as holding a lock, yet the VM still refused to power on. I rebooted all of the hosts in the cluster and the VM came back up. At that point, I engaged VMware’s technical support.
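For anyone who hasn’t walked through that KB before, the lock check boils down to running vmkfstools against each of the VM’s disks and reading the owner field. A sketch with placeholder paths:

# dump lock metadata for a disk; the owner field ends in the MAC address of the host holding the lock
vmkfstools -D "/vmfs/volumes/<datastore>/<vmname>/<vmname>_1-000002-delta.vmdk"
# an owner ending in all zeros means no host currently holds a lock on that file;
# on some builds the output lands in /var/log/vmkernel.log rather than on the terminal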

The support representative went through all the steps I had already taken before calling and uncovered the same information. However, they also discovered SCSI device reservation conflicts occurring at the same time as the file-locking issues. Their diagnosis?

Incompatible SCSI HBAs.

Sure enough, after checking the HCL on the VMware website, I found that my HBAs (specifically, the driver version) were not supported for ESXi 6.0. I installed the updated driver on the affected hosts, and I haven’t seen the problem since!
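If you want to check this yourself before opening a ticket, the adapter-to-driver mapping and installed driver version are easy to pull from the host. A rough sketch (the driver name and bundle path are just examples):

# list storage adapters and the driver each one loads
esxcli storage core adapter list
# compare the installed driver VIB version against the HCL entry for your HBA
esxcli software vib list | grep -i <driver-name>
# install the HCL-listed driver from an offline bundle, then reboot the host
esxcli software vib install -d /vmfs/volumes/<datastore>/<driver-offline-bundle>.zip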

Hopefully this helps someone else facing the same issue – make sure you check your HBA driver versions against the HCL, as an unsupported driver can cause exactly this kind of trouble.