Recently, I encountered an issue in my vSphere environment where VMs were randomly dying, and HA was unable to turn them back on. When trying to manually start these failed VMs, I received the following error message:
An error was received from the ESX host while powering on VM vCenter Support Assistant Appliance.
Failed to start the virtual machine.
Module Disk power on failed.
Cannot open the disk ‘/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000002.vmdk’ or one of the snapshot disks it depends on.
Failed to lock the file
(In this example, the affected VM was my vCenter Support Assistant Appliance.)
Some investigation revealed that the failures occurred after Veeam had backed up the machine in the routine overnight backup job. I opened a support ticket with Veeam, only to have them refer me to VMware, since the issue followed a normal call to a vSphere API.
Doing more digging that day, I uncovered the following messages in the vmware.log file for the VM in question:
2015-08-26T02:49:36.674Z| vcpu-0| W110: Mirror_DisconnectMirrorNode: Failed to send disconnect ioctl for mirror node ‘28763a-24763d-svmmirror’: (Device or resource busy)
2015-08-26T02:49:36.674Z| vcpu-0| W110: Mirror: scsi0:1: MirrorDisconnectDiskMirrorNode: Failed to disconnect mirror node ‘/vmfs/devices/svm/28763a-24763d-svmmirror’
2015-08-26T02:49:36.674Z| vcpu-0| W110: ConsolidateDiskCloseCB: Failed to destroy mirror node while consolidating disks ‘/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1-000001.vmdk’ -> ‘/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vCenter Support Assistant Appliance_1.vmdk’.
2015-08-26T02:49:36.674Z| vcpu-0| I120: NOT_IMPLEMENTED bora/vmx/checkpoint/consolidateESX.c:382
2015-08-26T02:49:40.270Z| vcpu-0| W110: A core file is available in “/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vmx-zdump.000”
2015-08-26T02:49:40.270Z| vcpu-0| W110: Writing monitor corefile “/vmfs/volumes/4c0ed2a0-cbb490fe-2645-0018fe2e950a/vCenter Support Assistant Appliance/vmmcores.gz”
2015-08-26T02:49:40.345Z| vcpu-0| W110: Dumping core for vcpu-0
Odd, I thought. Was something in the mirror driver causing problems?
A bit of quick googling yielded this KB article: Investigating virtual machine file locks on ESXi/ESX (10051)
Using the info in that KB article, I logged on to an ESXi host and used vmkfstools to try to discover which host was holding the lock on the VMDK(s) in question. For every file (not just the one named in the error, but all VMDKs for the machine), no host was reported as holding a lock. Yet the VM still refused to power on. I rebooted all of the hosts in the cluster, and the VM came back up. At this point, I contacted VMware's technical support.
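For anyone who wants to script that lock check, here is a minimal sketch in Python. It assumes you have captured the output of `vmkfstools -D` for a VMDK, and it relies on my reading of the KB article: the last segment of the `owner` field is the MAC address of the host holding the lock, and an all-zero owner means no host holds it. The sample line is illustrative and modeled on the format shown in the KB, not taken from my environment, and the function name `lock_owner_mac` is my own.

```python
import re

# Matches the "owner" field of a lock line, for example:
#   ... mode 1, owner 4655cd8b-3c4a19f2-17bc-00145e808070 mtime 114]
# Assumption: the final 12 hex digits are the owning host's MAC address.
OWNER_RE = re.compile(
    r"owner\s+[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})",
    re.IGNORECASE,
)


def lock_owner_mac(vmkfstools_output):
    """Return the lock owner's MAC as aa:bb:cc:dd:ee:ff, or None if no owner is reported."""
    match = OWNER_RE.search(vmkfstools_output)
    if match is None:
        return None  # no owner field found in the text
    mac_hex = match.group(1).lower()
    if set(mac_hex) == {"0"}:
        return None  # all zeros: no host is reported as holding the lock
    # Split the 12 hex digits into colon-separated pairs.
    return ":".join(mac_hex[i:i + 2] for i in range(0, 12, 2))


# Illustrative sample only (not output from the VM in this post).
sample = (
    "Lock [type 10c00001 offset 54009856 v 11, hb offset 3198976 gen 9, "
    "mode 1, owner 4655cd8b-3c4a19f2-17bc-00145e808070 mtime 114]"
)
print(lock_owner_mac(sample))  # prints 00:14:5e:80:80:70
```

In my case every VMDK came back with no owner, which is the `None` branch above, and that is exactly what made the power-on failure so confusing.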
The support representative went through all the steps that I had done prior to calling, and uncovered the same information. However, they also discovered SCSI device reservation conflicts occurring at the same time as the file locking issues. Their diagnosis?
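I did not see exactly how support spotted those conflicts, but a rough way to look for them yourself is to search a copy of the host's vmkernel log for the phrase "reservation conflict". The sketch below is Python and makes two assumptions: that the log contains that phrase (the exact wording may differ between ESXi versions) and that you pass in whatever path you copied the log to. Both function names are my own.

```python
def find_reservation_conflicts(lines):
    """Return (line_number, text) pairs for lines that mention a reservation conflict.

    Assumption: the phrase "reservation conflict" appears in the log text;
    the match is case-insensitive.
    """
    needle = "reservation conflict"
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line.lower()
    ]


def scan_log_file(path):
    """Scan a log file on disk; `path` is wherever you saved a copy of the log."""
    with open(path, errors="replace") as log:
        return find_reservation_conflicts(log)
```

Calling `scan_log_file` on the saved log returns the matching lines with their line numbers, so you can compare their timestamps against the time the VM failed (02:49 UTC in the vmware.log excerpt above).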
Incompatible SCSI HBAs.
Sure enough, after going on the VMware website and checking the HCL, my HBAs (specifically the driver version) were not supported for ESXi 6.0. I installed the updated driver on the affected hosts, and haven’t seen the problem since!
Hopefully this helps someone else facing the same issue. Make sure you check the driver versions for your HBAs against the HCL, as an unsupported driver can cause locking failures like these.