
Troubleshooting Virtual Machine snapshot problems by Ruben Miguelez Garcia is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License
The information in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at their sole risk and liability.
This troubleshooting guide explains basic concepts about Virtual Machine snapshots and different troubleshooting paths depending on the problem. This guide was designed for ESX3.5 and extra considerations have to be taken if working with ESX3.5i or ESX4(i).
The formulas and most of the procedures described in this document were created by the author as part of a continuous troubleshooting improvement process.
Virtual Machine (VM) snapshots allow you to preserve the state of the virtual machine so you can return to the same state repeatedly [1].
A snapshot captures the entire state of the virtual machine at the time you take it. This includes:
When you revert to a snapshot, you return all these items to the state they were in at the time you took that snapshot.
Snapshots are useful when you need to revert repeatedly to the same state but you don't want to create multiple virtual machines, or when you are going to apply changes whose results you are unsure of.
You must power off the virtual machine before taking a snapshot if the virtual machine has multiple disks in different disk modes. Also, memory cannot be included in a snapshot if any of the disks of the Virtual Machine are in independent mode.
Although you can take any number of snapshots, VMware supports only up to 32 levels.
When you take a snapshot, be aware of other activity going on in the virtual machine and the likely effect of reverting to that snapshot. In general, it is best to take a snapshot when no applications in the virtual machine are communicating with other computers.
You can take a snapshot while a virtual machine is powered on, powered off, or suspended.
The Snapshot Manager lets you review all the snapshots for the active virtual machine and act on them directly.
Throughout this document we use delete and commit as synonyms. Do not interpret 'delete' as physically deleting a file. Why do we use delete instead of commit? Because the Snapshot Manager has only two buttons for committing snapshots, and they are named 'Delete' and 'Delete all'.
Every time a snapshot is created several files are created or updated. They are the following:
Each incremental snapshot description is held in the .vmsd file; the settings and (optionally) the virtual memory are kept in the .vmsn file.
The .vmsd file contains the structure of the snapshot tree. It describes the characteristics of every snapshot: with or without memory, UID numbers, names of related files, etc.
In practice this file gets corrupted very easily, and that is when the snapshot problems begin. In this guide we will not explain the characteristics and structure of this file in detail, since it is commonly ignored during the troubleshooting process.
See [2] if you are interested in a deeper knowledge of these files.
The .vmdk is just a descriptor of the -delta.vmdk. It also contains a pointer to the parent .vmdk.
The .vmdk of a snapshot is just like the .vmdk of a Base Disk; the only difference is that the Base Disk has no parent. Below we use the command sgrep to show only the lines with relevant information.
# sgrep VM_Name-000002.vmdk
CID=7d8e30a3
parentCID=8d4aba54
parentFileNameHint="VM_Name-000001.vmdk"
RW 1638400 VMFSSPARSE "VM_Name-000002-delta.vmdk"
Here is the explanation of every line:
| Line | Meaning |
|---|---|
| CID=7d8e30a3 | Content ID: 8-digit hex number identifying the disk |
| parentCID=8d4aba54 | The parent's Content ID |
| parentFileNameHint="VM_Name-000001.vmdk" | Path and name of the parent disk |
| RW 1638400 VMFSSPARSE "VM_Name-000002-delta.vmdk" | Logical size of the snapshot (same as the Base Disk) and name of the delta file |
As you can see, the .vmdk of the Base Disk is pretty similar:
# sgrep VM_Name.vmdk
CID=7d8e30a3
parentCID=ffffffff
RW 1638400 VMFS "VM_Name-flat.vmdk"
The -delta.vmdk is the file that contains the changes made in this snapshot relative to its parent. It is like a -flat.vmdk with a few differences:
If the .vmdk disappears for any reason, you can recreate it. The -delta.vmdk, however, is irreplaceable.
As you may know, this is the VM settings configuration file, and in most cases it is the first place we check in the troubleshooting process.
We will come back to this file later, so for now let's see only what is of interest in the snapshots context.
If you check its content with sgrep you may see something like this:
# sgrep VM_Name.vmx
scsi0:0.present = "true"
scsi0:0.fileName = "VM_Name.vmdk"

| Line | Meaning |
|---|---|
| scsi0:0.present = "true" | The disk in scsi0:0 is currently attached to the VM |
| scsi0:0.fileName = "VM_Name.vmdk" | The name of the disk attached on scsi0:0 |
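The sgrep command used throughout this guide is not a standard Service Console tool; judging by its output, it behaves like a grep filter that keeps only the lines of interest. A minimal sketch of such a helper follows, assuming that behaviour. The name `vmgrep` and the exact pattern are hypothetical and are not taken from the original setup.

```shell
# vmgrep FILE...: show only the descriptor/.vmx lines this guide cares about.
# Hypothetical stand-in for the guide's "sgrep"; the pattern is an assumption.
vmgrep() {
    # CID / parentCID / parent hint / extent line of a .vmdk descriptor,
    # plus the "present" and "fileName" entries of SCSI disks in a .vmx.
    grep -E '^(CID=|parentCID=|parentFileNameHint=|RW |scsi[0-9]+:[0-9]+\.(present|fileName))' "$@"
}
```

When several files are passed, grep prefixes each line with the file name, which matches the multi-file listings shown in this guide.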
The severity of the 'snapshots problem' can go from trivial (customer has no knowledge of how snapshots work) to really critical and very time consuming (only one VM with several huge snapshots in the only Datastore which is full).
Use the flowchart below to help you in the troubleshooting process. Many issues have more than one solution, so you may need to evaluate the options and the criticality of the system. Don't be afraid to tell the customer that you need some time to evaluate the options, and do not rush: check everything before you start implementing your plan. There is no way to stop or revert a 'removesnapshots' operation.
You cannot tell how many snapshots a VM has just by looking at the dates and names of the files in the VM folder. You first need to check the .vmx file and find out how many disks the VM has and whether they are using snapshots. You can also check this using 'Edit Settings' on the VI Client, but that is all you can do in that window, so if the VM has snapshots you will have to go to the Service Console to continue troubleshooting.
On the VI Client -> VM Settings , you can see something like this:
| VM without snapshots | VM with snapshots |
|---|---|
As you may have noticed, the only difference is the file name.
If we sgrep the .vmx we will find in both cases:
| VM without snapshots | VM with snapshots |
|---|---|
| # sgrep VM_Name.vmx | # sgrep VM_Name.vmx |
| scsi0:0.present = "true" | scsi0:0.present = "true" |
| scsi0:0.fileName = "VM_Name.vmdk" | scsi0:0.fileName = "VM_Name-000001.vmdk" |
The result on the left means that this VM has only one virtual disk and it is NOT running on snapshots. No matter what is on the VM folder, right now this VM is not using any snapshot.
But if you have an output like the right one, you know that that virtual disk is running on snapshots.
Here is another example. This VM has 4 virtual disks and the VM is running on snapshots on all of them.
# sgrep VM_Name.vmx
scsi0:0.present = "true"
scsi0:0.fileName = "VM_Name-000003.vmdk"
scsi0:1.present = "true"
scsi0:1.fileName = "VM_Name_1-000003.vmdk"
scsi0:2.present = "true"
scsi0:2.fileName = "VM_Name_2-000003.vmdk"
scsi0:3.present = "true"
scsi0:3.fileName = "VM_Name_3-000003.vmdk"
The question is, how many snapshots does it have? To find out, you need to follow the chain of .vmdk files until you reach the Base Disk, and you have to do the same with the other disks attached to the VM if you want the complete picture. You cannot make assumptions here. The VM may have several disks, virtual or RDMs, each one can have a different number of snapshots, and the Base Disk may be located on a different datastore. The number in the file name means nothing: the numbers along the chain may or may not be in order.
# sgrep VM_Name-000003.vmdk
CID=7d8e30a3
parentCID=7d8e30a3
parentFileNameHint="VM_Name-000008.vmdk"
RW 1638400 VMFSSPARSE "VM_Name-000003-delta.vmdk"
# sgrep VM_Name-000008.vmdk
CID=7d8e30a3
parentCID=7d8e30a3
parentFileNameHint="VM_Name-000002.vmdk"
RW 1638400 VMFSSPARSE "VM_Name-000008-delta.vmdk"
# sgrep VM_Name-000002.vmdk
CID=7d8e30a3
parentCID=7d8e30a3
parentFileNameHint="VM_Name.vmdk"
RW 1638400 VMFSSPARSE "VM_Name-000002-delta.vmdk"
# sgrep VM_Name.vmdk
CID=7d8e30a3
parentCID=ffffffff
RW 1638400 VMFS "VM_Name-flat.vmdk"
You can also sgrep them all and find the chain on the results:
# sgrep VM_Name*[1-9].vmdk
VM_Name-000001.vmdk:CID=7d8e30a3
VM_Name-000001.vmdk:parentCID=7d8e30a3
VM_Name-000001.vmdk:parentFileNameHint="VM_Name.vmdk"
VM_Name-000001.vmdk:RW 1638400 VMFSSPARSE "VM_Name-000001-delta.vmdk"
VM_Name-000002.vmdk:CID=7d8e30a3
VM_Name-000002.vmdk:parentCID=7d8e30a3
VM_Name-000002.vmdk:parentFileNameHint="VM_Name-000001.vmdk"
VM_Name-000002.vmdk:RW 1638400 VMFSSPARSE "VM_Name-000002-delta.vmdk"
VM_Name-000004.vmdk:CID=7d8e30a3
VM_Name-000004.vmdk:parentCID=7d8e30a3
VM_Name-000004.vmdk:parentFileNameHint="VM_Name.vmdk"
VM_Name-000004.vmdk:RW 1638400 VMFSSPARSE "VM_Name-000004-delta.vmdk"
[...]
VM_Name_3-000002.vmdk:CID=ade33caf
VM_Name_3-000002.vmdk:parentCID=ade33caf
VM_Name_3-000002.vmdk:parentFileNameHint="VM_Name_3-000001.vmdk"
VM_Name_3-000002.vmdk:RW 8192 VMFSSPARSE "VM_Name_3-000002-delta.vmdk"
VM_Name_3-000004.vmdk:CID=ade33caf
VM_Name_3-000004.vmdk:parentCID=ade33caf
VM_Name_3-000004.vmdk:parentFileNameHint="/vmfs/volumes/47b31f65-74a19d98-e78b-001a4bb24256/VM_Name/VM_Name_3.vmdk"
VM_Name_3-000004.vmdk:RW 8192 VMFSSPARSE "VM_Name_3-000004-delta.vmdk"
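Following the chain by hand gets tedious with many snapshots. The sketch below automates it under two assumptions: each descriptor stores its parent in a `parentFileNameHint="..."` line, as in the listings above, and a relative hint is resolved against the directory of the child descriptor. The function name `walk_chain` and the 64-link safety limit are hypothetical choices, not part of any VMware tool.

```shell
# walk_chain DESCRIPTOR: print every descriptor from DESCRIPTOR down to the Base Disk.
walk_chain() {
    local current=$1 hint depth=0
    while [ "$depth" -lt 64 ]; do
        if [ ! -f "$current" ]; then
            echo "missing descriptor: $current" >&2
            return 1
        fi
        echo "$current"
        # Extract the parent path; the Base Disk has no such line.
        hint=$(sed -n 's/^parentFileNameHint="\(.*\)".*/\1/p' "$current" | head -n 1)
        if [ -z "$hint" ]; then
            return 0    # reached the Base Disk
        fi
        case $hint in
            /*) current=$hint ;;                        # absolute path, possibly another datastore
            *)  current=$(dirname "$current")/$hint ;;  # relative to the child descriptor
        esac
        depth=$((depth + 1))
    done
    echo "gave up after 64 links (loop in the chain?)" >&2
    return 1
}
```

Run it against the file name found in the .vmx, for example `walk_chain VM_Name-000003.vmdk`. The number of lines printed minus one (the Base Disk) is the number of snapshots on that disk.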
Imagine the following situation:
Base Disk
+ Snap1
+ Snap2
+ ...
+ SnapN
+ You are here
When a snapshot is deleted, its content is merged into its parent file. Often many of the same disk blocks will have changed in both, which means the resulting size of Snap_i will be less than Snap_i + Snap_i+1. It is still better to be conservative and do the calculations for the worst-case scenario; otherwise we could end up with a completely full datastore and an incomplete Delete task, with undetermined data consistency on the snapshot files.
Premises:
[Figure: Process behind 'Delete all' snapshots.]
If the VM is off, the space needed to delete the first N snapshots can be calculated as:
TotalUsed = SUM{i=1 to N}( MIN( SUM{j=i to N}(Sj) , BD ) )
TotalOriginal = SUM{i=1 to N}(Si)
TotalNeeded = TotalUsed - TotalOriginal
Where, as the formulas imply, Si is the current size of snapshot i, N is the number of snapshots in the chain and BD is the size of the Base Disk.
However in most cases
TotalOriginal <= BD
So if the condition above is true, every inner sum stays below BD and we can calculate the space needed easily with:
TotalNeeded = SUM{i=1 to N}(Si * (i-1))
Important: This calculation has to be done for every disk on the VM. The total amount of space needed is the summation of space needed to commit the snapshots of every disk.
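As an illustration, the general formula can be evaluated with a few lines of shell arithmetic. This is only a sketch: the function name `snapshot_space_needed` is hypothetical, all sizes are assumed to be whole numbers in the same unit (for example MB), and S1 is taken to be the snapshot closest to the Base Disk.

```shell
# snapshot_space_needed BD S1 S2 ... SN
# Prints the worst-case TotalNeeded = TotalUsed - TotalOriginal for one disk.
snapshot_space_needed() {
    local bd=$1 n i j s tail_sum total_used=0 total_original=0
    shift
    n=$#
    # TotalOriginal = S1 + ... + SN
    for s in "$@"; do
        total_original=$((total_original + s))
    done
    # TotalUsed = SUM over i of MIN(Si + ... + SN, BD)
    i=1
    while [ "$i" -le "$n" ]; do
        tail_sum=0
        j=0
        for s in "$@"; do
            j=$((j + 1))
            if [ "$j" -ge "$i" ]; then
                tail_sum=$((tail_sum + s))
            fi
        done
        if [ "$tail_sum" -gt "$bd" ]; then
            tail_sum=$bd    # a snapshot cannot grow beyond the Base Disk size
        fi
        total_used=$((total_used + tail_sum))
        i=$((i + 1))
    done
    echo $((total_used - total_original))
}
```

For a 100 MB Base Disk with snapshots of 10, 20 and 30 MB, `snapshot_space_needed 100 10 20 30` prints 80, which agrees with the simplified formula (10*0 + 20*1 + 30*2). With a 40 MB Base Disk the cap applies and the result is 50. Run it once per disk and add up the results.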
This is the source of this procedure: Committing snapshots when there are no snapshot entries in the snapshot manager (http://kb.vmware.com/kb/1002310)
The process is very simple and only requires the vmware-cmd command. Run vmware-cmd -h for more information about this command, or check the documentation.
If you query the VM and it reports to have no snapshots:
# vmware-cmd VM_Name.vmx hassnapshot
hassnapshot() =
Then move the .vmsd file out of the VM folder to avoid further confusion:
# mv VM_Name.vmsd /tmp/
Then create one snapshot:
# vmware-cmd VM_Name.vmx createsnapshot SnapshotName
createsnapshot(SnapshotName) = 1
Now that the VM recognizes that it has snapshots, you can request to delete them all with:
# vmware-cmd VM_Name.vmx removesnapshots
removesnapshots() = 1
If it fails, you will get an error message on the Service Console and an error on the VI Client. To find out why it failed you will need to check several things and correct what is wrong.
First, it is always worth taking a look at the VM logs (vmware.log), the vmkernel logs (/var/log/vmkernel) and the hostd logs (/var/log/vmware/hostd.log) to see whether they show the reason for the failure.
In many cases, even if you see the reason there, you will have to fix the problem yourself. Let's see the most common issues that cause a 'removesnapshots' or a cloning operation to fail:
If the size of the delta file is <= 16 MB, you can consider that it contains no information about changes on the VM; it can be safely ignored as long as it is not situated between two valid snapshots in the snapshot chain.
On the table below you can see that the minimum size of the snapshot doesn't depend on the size of the Base Disk.
-rw------- 1 root root 16M Jan 16 15:36 VM_Name-000001-delta.vmdk
-rw------- 1 root root 219 Jan 16 15:36 VM_Name-000001.vmdk
-rw------- 1 root root 1.0M Jan 15 16:41 VM_Name-flat.vmdk
-rw------- 1 root root 305 Jan 16 15:33 VM_Name.vmdk

-rw------- 1 root root 16M Jan 16 15:38 VM_Name-000001-delta.vmdk
-rw------- 1 root root 221 Jan 16 15:38 VM_Name-000001.vmdk
-rw------- 1 root root 100M Jan 15 16:41 VM_Name-flat.vmdk
-rw------- 1 root root 309 Jan 16 15:37 VM_Name.vmdk

-rw------- 1 root root 16M Jan 16 15:40 VM_Name-000001-delta.vmdk
-rw------- 1 root root 222 Jan 16 15:40 VM_Name-000001.vmdk
-rw------- 1 root root 1.0G Jan 15 16:41 VM_Name-flat.vmdk
-rw------- 1 root root 311 Jan 16 15:39 VM_Name.vmdk

-rw------- 1 root root 16M Jan 16 15:41 VM_Name-000001-delta.vmdk
-rw------- 1 root root 223 Jan 16 15:41 VM_Name-000001.vmdk
-rw------- 1 root root 10G Jan 15 16:41 VM_Name-flat.vmdk
-rw------- 1 root root 313 Jan 16 15:41 VM_Name.vmdk

-rw------- 1 root root 16M Jan 16 15:42 VM_Name-000001-delta.vmdk
-rw------- 1 root root 223 Jan 16 15:42 VM_Name-000001.vmdk
-rw------- 1 root root 20G Jan 15 16:41 VM_Name-flat.vmdk
-rw------- 1 root root 313 Jan 16 15:42 VM_Name.vmdk
So if the .vmx is pointing at the SnapN and you see something like this (logical representation of the chain):
SnapN (<16MB) -> SnapN-1 (>=16MB) -> .... -> BaseDisk
You can safely change the reference in the .vmx from SnapN to SnapN-1 if you are trying to commit them all, or simply point at SnapN-1 if you are creating a clone disk.
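To spot candidate delta files quickly, something like the following sketch can help. It lists the -delta.vmdk files in a folder whose size is at or below a threshold, 16 MB by default as in the rule above. The name `list_small_deltas` is hypothetical, and the byte-exact size of an empty delta may differ between versions, so treat the output as a list of files to inspect rather than a verdict. You still have to check where each file sits in the chain.

```shell
# list_small_deltas [DIR] [LIMIT_MB]: print delta files of at most LIMIT_MB megabytes.
list_small_deltas() {
    local dir=${1:-.} limit_mb=${2:-16} f size
    for f in "$dir"/*-delta.vmdk; do
        [ -f "$f" ] || continue           # the glob matched nothing
        size=$(($(wc -c < "$f")))         # size in bytes
        if [ "$size" -le $((limit_mb * 1024 * 1024)) ]; then
            echo "$f"
        fi
    done
}
```

Called without arguments it checks the current folder against the 16 MB default.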
Every .vmdk file contains its own CID (Content ID) and the CID of its parent. The parent CID of the Base Disk is always 'ffffffff'. If a commit snapshots/clone disk operation fails, you need to verify that the chain of CIDs is correct for every virtual disk on the VM.
Usually, if you find on the vmware.log file something like "The base disk or one of the snapshots it depends on has been modified" then there is a broken CIDs chain.
You can check it in many different ways. You can for example sgrep all the .vmdk descriptors and examine the results.
# sgrep Tony\ test\ VM-00000[1-9].vmdk
Tony test VM-000001.vmdk:CID=4c652f89
Tony test VM-000001.vmdk:parentCID=0112ccc5
Tony test VM-000001.vmdk:parentFileNameHint="Tony test VM.vmdk"
Tony test VM-000001.vmdk:RW 1048576 VMFSSPARSE "Tony test VM-000001-delta.vmdk"
Tony test VM-000002.vmdk:CID=91889aa4
Tony test VM-000002.vmdk:parentCID=4c652f89
Tony test VM-000002.vmdk:parentFileNameHint="Tony test VM-000001.vmdk"
Tony test VM-000002.vmdk:RW 1048576 VMFSSPARSE "Tony test VM-000002-delta.vmdk"
# sgrep -H Tony\ test\ VM.vmdk
Tony test VM.vmdk:CID=0112ccc5
Tony test VM.vmdk:parentCID=ffffffff
Tony test VM.vmdk:RW 1048576 VMFS "Tony test VM-flat.vmdk"
Also, if you know the chain in advance, you can see only the CIDs in order doing something like this:
# grep CID Tony\ test\ VM-00000{2,1}.vmdk
Tony test VM-000002.vmdk:CID=91889aa4
Tony test VM-000002.vmdk:parentCID=4c652f89
Tony test VM-000001.vmdk:CID=4c652f89
Tony test VM-000001.vmdk:parentCID=0112ccc5
# grep -H CID Tony\ test\ VM.vmdk
Tony test VM.vmdk:CID=0112ccc5
Tony test VM.vmdk:parentCID=ffffffff
Or all at once with
# grep CID Tony\ test\ VM-00000{2,1}.vmdk Tony\ test\ VM.vmdk
Tony test VM-000002.vmdk:CID=91889aa4
Tony test VM-000002.vmdk:parentCID=4c652f89
Tony test VM-000001.vmdk:CID=4c652f89
Tony test VM-000001.vmdk:parentCID=0112ccc5
Tony test VM.vmdk:CID=0112ccc5
Tony test VM.vmdk:parentCID=ffffffff
As you may have noticed, Bash wildcards and brace expansion are very useful if you know how to use them.
Warning: If the CID chain is broken, it is probably because the Base Disk or one of the snapshots behind the last one was modified after the last snapshot was created. In that case you have to explain to the customer that you cannot guarantee the consistency of the data if those files really were modified, and that it is better to create a clone of the whole chain and use that clone instead. To be able to create the clone you will have to modify the CID chain manually to make it consistent.
If the CID chain was broken by a failed 'delete all' operation, then it is safe (in most cases) to fix the CID chain and commit the snapshots. However, if you want to be completely sure that the content inside the VM is fine, power it on and verify its content before you commit the snapshots.
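The manual comparison of CIDs can also be scripted. The sketch below walks the chain from a given descriptor and compares each parentCID with the CID stored in the parent descriptor, stopping at the first mismatch. It assumes the descriptor layout shown in the listings above; the function name `check_cid_chain` and its messages are hypothetical. It only reads the files and never changes them.

```shell
# check_cid_chain DESCRIPTOR: verify parentCID -> CID links down to the Base Disk.
check_cid_chain() {
    local child=$1 parent hint child_pcid parent_cid depth=0
    while [ "$depth" -lt 64 ]; do
        if [ ! -f "$child" ]; then
            echo "MISSING: $child"
            return 1
        fi
        child_pcid=$(sed -n 's/^parentCID=\([0-9a-fA-F]*\).*/\1/p' "$child" | head -n 1)
        if [ "$child_pcid" = "ffffffff" ]; then
            echo "OK: reached Base Disk $child"
            return 0
        fi
        hint=$(sed -n 's/^parentFileNameHint="\(.*\)".*/\1/p' "$child" | head -n 1)
        if [ -z "$hint" ]; then
            echo "NO PARENT HINT: $child"
            return 1
        fi
        case $hint in
            /*) parent=$hint ;;
            *)  parent=$(dirname "$child")/$hint ;;
        esac
        if [ ! -f "$parent" ]; then
            echo "MISSING: $parent"
            return 1
        fi
        # The pattern is anchored, so it does not match the parentCID= line.
        parent_cid=$(sed -n 's/^CID=\([0-9a-fA-F]*\).*/\1/p' "$parent" | head -n 1)
        if [ "$child_pcid" != "$parent_cid" ]; then
            echo "BROKEN: $child has parentCID=$child_pcid but $parent has CID=$parent_cid"
            return 1
        fi
        child=$parent
        depth=$((depth + 1))
    done
    echo "gave up after 64 links (loop in the chain?)"
    return 1
}
```

Quote file names that contain spaces, for example `check_cid_chain "Tony test VM-000002.vmdk"`. The exit status is 0 only when every link matches.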
This command is useful for something more than just querying RDMs.
# vmkfstools -q --queryrdm <file.vmdk>
List the attributes of a raw disk mapping. When used with a
`rdm:<device>` specification, it prints out the
vml of the raw disk corresponding to the mapping
referenced by the <device>. It also prints out identification
information for the raw disk (if any).
When you use "vmkfstools -q" against a vmdk file it will walk through all the chain (if it exists) until the Base Disk. It will check that all the vmdk descriptor files are consistent with each other and that there are no missing files. Basically it will check that everything is there and makes sense. However it will not analyze the content inside the flat/delta files, just their existence.
Nothing wrong
# vmkfstools -q VM_Name2-000002.vmdk
VM_Name2-000002.vmdk is not an rdm
CID broken
# vmkfstools -q VM_Name2-000002.vmdk
Failed to open 'VM_Name2-000002.vmdk' : The parent virtual disk has been modified since the child was created (18).
With verbose output
# vmkfstools -q VM_Name2-000002.vmdk -v 10
DISKLIB-VMFS : "./VM_Name2-000002-delta.vmdk" : open successful (23) size = 83886080, hd = 0. Type 8
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-000001-delta.vmdk" : open successful (23) size = 83886080, hd = 0. Type 8
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-flat.vmdk" : open successful (23) size = 83886080, hd = 0. Type 3
DISKLIB-LINK : Attach: Content ID mismatch (6fef7bec != 86fef7be).
DISKLIB-CHAIN : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2.vmdk" : failed to open (The parent virtual disk has been modified since the child was created).
DISKLIB-VMFS : "./VM_Name2-000002-delta.vmdk" : closed.
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-000001-delta.vmdk" : closed.
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-flat.vmdk" : closed.
DISKLIB-LIB : Failed to open 'VM_Name2-000002.vmdk' with flags 0x17 (The parent virtual disk has been modified since the child was created).
Failed to open 'VM_Name2-000002.vmdk' : The parent virtual disk has been modified since the child was created (18).
Explanation
| DISKLIB-LINK : Attach: Content ID mismatch (6fef7bec != 86fef7be). | The CID chain is broken |
| DISKLIB-CHAIN : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2.vmdk" | Between this file and its child |
RW mismatch
# vmkfstools -q VM_Name2-000002.vmdk -v 10
DISKLIB-VMFS : "./VM_Name2-000002-delta.vmdk" : open successful (23) size = 595886080, hd = 0. Type 8
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-000001-delta.vmdk" : open successful (23) size = 83886080, hd = 0. Type 8
DISKLIB-LINK : Attach: the capacity of each link is different (163840 != 1163840).
DISKLIB-CHAIN : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-000001.vmdk" : failed to open (The parent virtual disk has been modified since the child was created).
DISKLIB-VMFS : "./VM_Name2-000002-delta.vmdk" : closed.
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-000001-delta.vmdk" : closed.
DISKLIB-LIB : Failed to open 'VM_Name2-000002.vmdk' with flags 0x17 (The parent virtual disk has been modified since the child was created).
Failed to open 'VM_Name2-000002.vmdk' : The parent virtual disk has been modified since the child was created (18).
Explanation
| DISKLIB-LINK : Attach: the capacity of each link is different (163840 != 1163840). | The RW numbers do not match |
| DISKLIB-CHAIN : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/VM_Name2/VM_Name2-000001.vmdk" | Between this file and its child |
How can this happen? Because someone has expanded the size of the vDisk of the VM when it had snapshots, which is unsupported.
A virtual machine cannot boot after extending a base virtual disk that is part of a snapshot hierarchy (http://kb.vmware.com/kb/1646892).
No issue
# vmkfstools -q test_2-000002.vmdk
test_2-000002.vmdk is not an rdm
One vmdk descriptor in the chain has been deleted
# vmkfstools -q test_2-000002.vmdk
Failed to open 'test_2-000002.vmdk' : The parent of this virtual disk could not be opened (23).
With verbose output
# vmkfstools -q test_2-000002.vmdk -v 1
DISKLIB-VMFS : "./test_2-000002-delta.vmdk" : open successful (23) size = 73400320, hd = 0. Type 8
DISKLIB-LINK : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk" : failed to open (The system cannot find the file specified).
DISKLIB-CHAIN :"test_2-000002.vmdk": Failed to open parent "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk": The system cannot find the file specified.
DISKLIB-CHAIN : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk" : failed to open (The parent of this virtual disk could not be opened).
DISKLIB-VMFS : "./test_2-000002-delta.vmdk" : closed.
DISKLIB-LIB : Failed to open 'test_2-000002.vmdk' with flags 0x17 (The parent of this virtual disk could not be opened).
Failed to open 'test_2-000002.vmdk' : The parent of this virtual disk could not be opened (23).
Explanation
| DISKLIB-LINK : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk" : failed to open (The system cannot find the file specified). | This is the missing file |
Now, a delta vmdk has been deleted
# vmkfstools -q test_2-000002.vmdk -v 1
DISKLIB-VMFS : "./test_2-000002-delta.vmdk" : open successful (23) size = 73400320, hd = 0. Type 8
DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001-delta.vmdk" : failed to open (25): Backing file doesn't exist. Type 8
DISKLIB-DSCPTR: Failed to open extents for descriptor file in normal mode
DISKLIB-LINK : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk" : failed to open (The system cannot find the file specified).
DISKLIB-CHAIN :"test_2-000002.vmdk": Failed to open parent "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk": The system cannot find the file specified.
DISKLIB-CHAIN : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001.vmdk" : failed to open (The parent of this virtual disk could not be opened).
DISKLIB-VMFS : "./test_2-000002-delta.vmdk" : closed.
DISKLIB-LIB : Failed to open 'test_2-000002.vmdk' with flags 0x17 (The parent of this virtual disk could not be opened).
Failed to open 'test_2-000002.vmdk' : The parent of this virtual disk could not be opened (23).
Explanation
| DISKLIB-VMFS : "/vmfs/volumes/4b72a71c-0047198b-03f3-00151725d513/test2/test_2-000001-delta.vmdk" : failed to open (25): Backing file doesn't exist. Type 8 | This is the missing file |
Sometimes what you are trying to do is failing just because the hostd process is in an 'unstable/hung' state. You may need to restart/kill it before you can continue troubleshooting the snapshots problem.
Restarting the Management agents on an ESX or ESXi Server (http://kb.vmware.com/kb/1003490)
If you see that in that host you don't make any progress, you can VMotion (hot or cold migration) the VM to another host and try from there.
If you see strange error messages on the Service Console or in the logs, it is worth searching the VMware Knowledge Base (KB) for them.
If you clone the VM, the clone will have the same information as the source VM but it will have no snapshots, i.e. in the cloning process the snapshots will be committed into the cloned virtual disks.
This option is handy when the VM has several snapshots, its Base Disks are not very big, and there is free space somewhere else to hold the clone.
Note: You need to connect the VI Client to the Virtual Center to be able to clone a VM.
If the VM is running while you clone it, an additional snapshot will be created.
Notice that you don't necessarily need the total space of the VM available in a single Datastore. If on the third step of the cloning wizard you click on "Advanced" you will be able to select where you want to put every piece of the VM. Additionally the cloning process may complete quicker if you separate the disks as the I/O will go to different Datastores in parallel
As well as cloning a VM, cloning a disk with snapshots will commit the snapshots into the destination virtual disk. Just be careful to point at the last snapshot on the chain you want to commit.
Note: If you want to clone the whole chain, the VM must be powered off. You cannot clone a disk that is being used.
The command to clone a virtual disk, with or without snapshots is:
# vmkfstools -i Seed.vmdk /Path/to/destination/Clone_Name.vmdk
Notice that you must always point at the .vmdk descriptor file.
Imagine the following chain of snapshots:
VM_Name-000004.vmdk -> VM_Name-000005.vmdk -> VM_Name-000003.vmdk -> VM_Name-000001.vmdk -> VM_Name.vmdk
If you run:
# vmkfstools -i VM_Name-000004.vmdk VM_Name.Clone.all.vmdk
The result will be:
VM_Name-000004.vmdk +
VM_Name-000005.vmdk +
VM_Name-000003.vmdk +
VM_Name-000001.vmdk +
VM_Name.vmdk +
---------------------
VM_Name.Clone.all.vmdk
And if you do:
# vmkfstools -i VM_Name-000003.vmdk VM_Name.Clone.2.vmdk
The result will be:
VM_Name-000003.vmdk +
VM_Name-000001.vmdk +
VM_Name.vmdk +
---------------------
VM_Name.Clone.2.vmdk
Warning: Try to create the clones with very descriptive names. It is a pain to take a case where someone created some clones before and discover that they all have the same name. You don't know what contains what.
Warning: Try to avoid using spaces or strange symbols on the names. Just use 0..9a..Z_-
If you don't have space to commit all the disks of the VM but you can afford to commit one chain at a time then do the following (example):
# sgrep VMName.vmx
scsi0:0.present = "true"
scsi0:0.fileName = "VMName-000006.vmdk"
scsi0:1.present = "true"
scsi0:1.fileName = "VMName_1-000006.vmdk"
scsi0:2.present = "true"
scsi0:2.fileName = "VMName_2-000006.vmdk"
scsi0:3.present = "true"
scsi0:3.fileName = "VMName_3-000006.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"
Set all the true disks but one to false and verify.
# sgrep VMName.vmx
scsi0:0.present = "true"
scsi0:0.fileName = "VMName-000006.vmdk"
scsi0:1.present = "false"
scsi0:1.fileName = "VMName_1-000006.vmdk"
scsi0:2.present = "false"
scsi0:2.fileName = "VMName_2-000006.vmdk"
scsi0:3.present = "false"
scsi0:3.fileName = "VMName_3-000006.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"

# sgrep VMName.vmx
scsi0:0.present = "true"
scsi0:0.fileName = "VMName.vmdk"
scsi0:1.present = "false"
scsi0:1.fileName = "VMName_1-000006.vmdk"
scsi0:2.present = "false"
scsi0:2.fileName = "VMName_2-000006.vmdk"
scsi0:3.present = "false"
scsi0:3.fileName = "VMName_3-000006.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"
false.
# sgrep VMName.vmx
scsi0:0.present = "false"
scsi0:0.fileName = "VMName.vmdk"
scsi0:1.present = "true"
scsi0:1.fileName = "VMName_1-000007.vmdk"
scsi0:2.present = "false"
scsi0:2.fileName = "VMName_2-000006.vmdk"
scsi0:3.present = "false"
scsi0:3.fileName = "VMName_3-000006.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"

# sgrep VMName.vmx
scsi0:0.present = "false"
scsi0:0.fileName = "VMName.vmdk"
scsi0:1.present = "true"
scsi0:1.fileName = "/vmfs/volumes/482c6a32-da3cdd8a-646a-001a4baf5986/VMName/VMName_1.vmdk"
scsi0:2.present = "false"
scsi0:2.fileName = "VMName_2-000006.vmdk"
scsi0:3.present = "false"
scsi0:3.fileName = "VMName_3-000006.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"
true on the step number 1 until all their chains are committed.
# sgrep VMName.vmx
scsi0:0.present = "false"
scsi0:0.fileName = "VMName.vmdk"
scsi0:1.present = "false"
scsi0:1.fileName = "/vmfs/volumes/482c6a32-da3cdd8a-646a-001a4baf5986/VMName/VMName_1.vmdk"
scsi0:2.present = "false"
scsi0:2.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_2.vmdk"
scsi0:3.present = "true"
scsi0:3.fileName = "/vmfs/volumes/47b0960d-268c7dd0-e4d5-001a4baf5986/VMName/VMName_3.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"

# sgrep VMName.vmx
scsi0:0.present = "true"
scsi0:0.fileName = "VMName.vmdk"
scsi0:1.present = "true"
scsi0:1.fileName = "/vmfs/volumes/482c6a32-da3cdd8a-646a-001a4baf5986/VMName/VMName_1.vmdk"
scsi0:2.present = "true"
scsi0:2.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_2.vmdk"
scsi0:3.present = "true"
scsi0:3.fileName = "/vmfs/volumes/47b0960d-268c7dd0-e4d5-001a4baf5986/VMName/VMName_3.vmdk"
scsi1:0.present = "false"
scsi1:0.fileName = "/vmfs/volumes/4688d0e7-7b5c822c-61d7-00145e808070/VMName/VMName_4.vmdk"
As you have seen before, committing a snapshot directly to the Base Disk requires no additional space. This may be the only solution when you have no space to commit the chain of a single Base Disk and there are no other datastores usable for cloning.
So if you have something like this:
Snap2 -> Snap1 -> BaseDisk
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name-000002.vmdk"
# sgrep VM_Name-000002.vmdk
CID=abd2fcd6
parentCID=abd2fcd6
parentFileNameHint="VM_Name-000001.vmdk"
RW 163840 VMFSSPARSE "VM_Name-000002-delta.vmdk"
# sgrep VM_Name-000001.vmdk
CID=abd2fcd6
parentCID=abd2fcd6
parentFileNameHint="VM_Name.vmdk"
RW 163840 VMFSSPARSE "VM_Name-000001-delta.vmdk"
You can commit the chain doing this very manual process:
### BEFORE
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name-000002.vmdk"
### AFTER
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name-000001.vmdk"

# vmware-cmd VM_Name.vmx createsnapshot ToRemove
createsnapshot(ToRemove) = 1
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name-000003.vmdk"
# vmware-cmd VM_Name.vmx removesnapshots
removesnapshots() = 1
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name.vmdk"

### BEFORE
# sgrep VM_Name-000002.vmdk
CID=abd2fcd6
parentCID=abd2fcd6
parentFileNameHint="VM_Name-000001.vmdk"
RW 163840 VMFSSPARSE "VM_Name-000002-delta.vmdk"
### AFTER
# sgrep VM_Name-000002.vmdk
CID=abd2fcd6
parentCID=abd2fcd6
parentFileNameHint="VM_Name.vmdk"
RW 163840 VMFSSPARSE "VM_Name-000002-delta.vmdk"

### BEFORE
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name.vmdk"
### AFTER
# sgrep VM_Name.vmx
scsi0:0.present = "TRUE"
scsi0:0.fileName = "VM_Name-000002.vmdk"

# vmware-cmd VM_Name.vmx createsnapshot ToRemove
createsnapshot(ToRemove) = 1
# vmware-cmd VM_Name.vmx removesnapshots
removesnapshots() = 1
With this procedure you can remove any number of snapshots. However, it has one requirement: you need at least 16 MB of free space on the datastore in order to create the snapshot that you will use to remove the snapshot above the Base Disk.
In some cases, using VMware Converter is a very good solution. You can 'convert' the VM while it is running into a brand new VM without snapshots. However it has some requirements:
There are several ways; some are easier and faster than others, so you will have to evaluate your options. Here are some:
As mentioned previously, this guide was designed for ESX 3.5. For other versions some things may be slightly different. The details below are not a complete list.
There is no /var/log/vmkernel. Instead you have /var/log/messages, which contains entries for vmkernel and hostd together.
Mar 2 18:00:03 Hostd: ..........
Mar 2 18:00:10 vmkernel: ..........
Be aware that using the Service Console is not supported in this version. The "correct" way to send commands to an ESXi host is to use RCLI (Remote CLI) [6].
If you want to go for the unsupported way, here are the minimum commands you will need.
## Get the list of virtual machines on the host.
## Usage: getallvms
# vim-cmd vmsvc/getallvms
Vmid Name File Guest OS Version Annotation
1184 VM_Name2 [VSA-cs-tse-d15-local] VM_Name2/VM_Name2.vmx rhel5Guest vmx-07

## Gets the snapshot info for the vm.
## Usage: snapshot.get vmid
# vim-cmd vmsvc/snapshot.get 1184
Get Snapshot:

## Creates a snapshot for the vm.
## Usage: snapshot.create vmid [snapshotName] [snapshotDescription] [includeMemory] [quiesced]
# vim-cmd vmsvc/snapshot.create 1184 SnapName SnapDesc 0 0
Create Snapshot:

# vim-cmd vmsvc/snapshot.get 1184
Get Snapshot:
|-ROOT
--Snapshot Name : SnapName
--Snapshot Desciption : SnapDesc
--Snapshot Created On : 3/2/2010 19:53:29
--Snapshot State : powered off

## Removes all the snapshots on the vm.
## Usage: snapshot.removeall vmid
# vim-cmd vmsvc/snapshot.removeall 1184
Remove All Snapshots:
In ESX4 the vmware-cmd createsnapshot command has the same format, but all the arguments are now required.
## FORMAT
/usr/bin/vmware-cmd <cfg> createsnapshot <name> <description> <quiesce> <memory>
## WRONG
# vmware-cmd VM_Name.vmx createsnapshot Name Desc
TypeError: DoIt() takes exactly 7 arguments (5 given)
## CORRECT (no quiesce, no memory)
# vmware-cmd VM_Name.vmx createsnapshot Name Desc 0 0
createsnapshot(Name, Desc, 0, 0) = 1
There is one small detail that could invalidate some of the premises of this guide and therefore make the rest of the content inapplicable: in ESX4 you can use thin vDisks. Their size is not preallocated, which means that they grow with the content they store. When a snapshot commits its content into a thin vDisk, the thin vDisk will grow if the snapshot contains data located in sectors that the Base Disk has not used yet.
Virtual Machine Cannot Boot After Extending a Base Virtual Disk That Is Part of Snapshot Hierarchy (http://kb.vmware.com/kb/1646892)
Delete All Snapshot Operation Results in Consolidate Helper Snapshot When Datastore Has Insufficient Disk Space (http://kb.vmware.com/kb/1003302)
Committing snapshots from within the Service Console (http://kb.vmware.com/kb/1006847)
Committing snapshots when there are no snapshot entries in the snapshot manager (http://kb.vmware.com/kb/1002310)
Why snapshot removal can stop a virtual machine for long time (http://kb.vmware.com/kb/1002836)
Microsoft Exchange Server on a Virtual Machine Can Freeze Under Load When You Take Quiesced Snapshots or Use Custom Quiescing Scripts (http://kb.vmware.com/kb/5962168)
Consolidating snapshots (http://kb.vmware.com/kb/1007849)
If a Virtual Machine Has Two Disks with Identical File Names, Stored in Different Locations, You Cannot Delete a Snapshot of This Virtual Machine (http://kb.vmware.com/kb/5096672)
Virtual Disk Remains on a Datastore After Snapshots Referencing the Disk Are Deleted (http://kb.vmware.com/kb/1003002)
Increasing the Size of the Hard Disk Fails on a Virtual Machine That Has Snapshots (http://kb.vmware.com/kb/1003099)
Snapshot Operations Submitted Directly to an ESX Server Host During Storage VMotion Corrupt Virtual Machine Data (http://kb.vmware.com/kb/1003114)
Deleting Snapshots of Virtual Machines With Heavy Disk I/O Might Cause Host to Be Disconnected from VirtualCenter (http://kb.vmware.com/kb/1003024)
After cold migration on ESX Server, virtual disk with snapshot has the wrong CID (http://kb.vmware.com/kb/1005228)
Committing snapshots generates a content ID mismatch error (http://kb.vmware.com/kb/1007969)
Consolidation of Large or Deeply Nested Snapshots Using VirtualCenter, SDK, or VCB Might Take Longer on ESX Server 3.5 than on ESX Server 3.0.x (http://kb.vmware.com/kb/1003308)
Troubleshooting snapshot commit issue on a virtual machine (http://kb.vmware.com/kb/1004538)
Taking More Than One Snapshot of a Suspended Virtual Machine Produces an Error (http://kb.vmware.com/kb/1424776)
Virtual Machine Changes to Snapshot-Consolidate Mode During Delete-All-Snapshot Operation (http://kb.vmware.com/kb/2222)
Cannot Undo Revert Snapshot (http://kb.vmware.com/kb/1000176)
Taking a Snapshot of a Virtual Machine with Independent, Non-Persistent Disks Fails (http://kb.vmware.com/kb/1001556)
Reverting to Snapshots of Virtual Machines with Hot-Added RDM Virtual Disks in Physical or Virtual Mode Causes Virtual Machines to Hang (http://kb.vmware.com/kb/9729125)
"Unable to save snapshot file" Error Occurs When Taking a Snapshot (http://kb.vmware.com/kb/1000310)
Unable to Delete Snapshots After Restore From Backup (http://kb.vmware.com/kb/1002726)
Verifying ESX virtual machine file integrity (http://kb.vmware.com/kb/1003743)
Verifying sufficient free disk space for an ESX virtual machine (http://kb.vmware.com/kb/1003755)
Adding Space to an ESX Server Virtual Disk (http://kb.vmware.com/kb/994)
Increasing the size of a virtual disk (http://kb.vmware.com/kb/1004047)
Cannot power on a virtual machine because the virtual disk cannot be opened (http://kb.vmware.com/kb/1004232)
Adding and committing redo logs (http://kb.vmware.com/kb/1004458)
Verifying the integrity of the parent disks for an ESX virtual machine (http://kb.vmware.com/kb/1003759)
VMware ESX 3.5, Patch ESX350-200912401-BG: Updates VMkernel, Tools, hostd (http://kb.vmware.com/kb/1016657)
VMware ESX 3.5, Patch ESX350-200901401-SG: Updates VMkernel, VMX, and hostd (http://kb.vmware.com/kb/1006651)
VMware ESX 3.5, Patch ESX350-200804402-BG: Update to VMware-esx-vmx (http://kb.vmware.com/kb/1004162)
VMware ESX 3.5, Patch ESX350-200901401-SG: Updates VMkernel, VMX, and hostd (http://kb.vmware.com/kb/1006651)
VMware ESXi, Patch ESXe350-200901401-I-SG: Firmware Update (http://kb.vmware.com/kb/1006661)
VMware ESX Server 3.5, Patch ESXe350-200804401-O-BG: Firmware Update (http://kb.vmware.com/kb/1004169)
VMware ESX Server 3.5, Patch ESX350-200804402-BG: Update to VMware-esx-vmx (http://kb.vmware.com/kb/1004162)
ESX Server 3.0.2, Patch ESX-1004216: VMkernel LVM driver Might Stop Responding; VMware VIX API Memory Overflow Vulnerabilities; Snapshot Operations Might Fail Under High I/O Stress (http://kb.vmware.com/kb/1004216)
ESX Server 3.0.2, Patch ESX-1004210: Virtual Machine Migration; Thumbnail Files Accumulate; SMTP Data Packets Get Dropped; Virtual Machine Does Not Synchronize; ESX Server Host Might ASSERT or Not Respond During Snapshot Operations (http://kb.vmware.com/kb/1004210)
ESX Server 3.0.2, Patch ESX-1002088; vpxa Upgrade Failure; Internationalized ESX Server 3.x Product Home Page; Update to Allow Snapshot with RDM Disks Attached; Other VMware-hostd-esx.rpm Updates (http://kb.vmware.com/kb/1002088)
ESX Server 3.0.1, Patch ESX-8258730: Security Updates, Updates to Monitor and LSI Logic SCSI Emulator Driver (http://kb.vmware.com/kb/8258730)
ESX Server 3.0.1, Patch ESX-1004822: Snapshot Operations Might Fail Under High I/O Stress (http://kb.vmware.com/kb/1004822)
ESX Server 3.0.1, Patch ESX-1004185: Thumbnail Files Accumulate in ESX Server Host; Virtual Machine Does Not Synchronize With ESX Server Host Time; ESX Server Host Might ASSERT or Not Respond During Snapshot Operations (http://kb.vmware.com/kb/1004185)
ESX Server 3.0.1, Patch ESX-1002083; vpxa Upgrade Failure; Update to Allow Snapshot with RDM Disks Attached; Other VMware-hostd-esx.rpm Updates (http://kb.vmware.com/kb/1002083)
ESX Server 3.0.1, Patch ESX-1000077: Update to Host Agent (http://kb.vmware.com/kb/1000077)
ESX Server 3.0.0, Patch ESX-4809553: Security Updates; Updates to Monitor (http://kb.vmware.com/kb/4809553)
ESX Server 3.0.0, Patch ESX-1000081: Update to Host Agent (http://kb.vmware.com/kb/1000081)
There are several proactive measures you can take to avoid having problems with snapshots.
Increasing the amount of RAM assigned to the ESX Server service console (http://kb.vmware.com/kb/1003501)
You may wonder: how do I know if the Service Console is swapping?
Run the command "free -m":
# free -m
             total       used       free     shared    buffers     cached
Mem:           291        229         62          0         36        136
-/+ buffers/cache:         56        235
Swap:          596         XX        535
The value shown as XX is the number of megabytes of swap space that the Service Console is using or has used. It should be zero.
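If you want to check this value from a script rather than by eye, the following minimal sketch extracts it. The function name swap_used_mb is hypothetical; it assumes the "free -m" layout shown above, where the Swap: line holds total, used and free in that order.

```shell
#!/bin/sh
# swap_used_mb
# Hypothetical helper: reads "free -m" output on stdin and prints the
# "used" column of the Swap: line (the XX position in the listing above).
swap_used_mb() {
    awk '$1 == "Swap:" { print $3 }'
}

# Example:
#   used=$(free -m | swap_used_mb)
#   [ "$used" -eq 0 ] || echo "Service Console swap in use: ${used} MB"
```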
It depends on factors such as:
In any case, there is no definitive answer to this question. You can only watch the progress of the commit operation and, from there, you may be able to make an estimate.
You should see something like this on the Service Console:
# vmware-cmd VM_Name.vmx removesnapshots
removesnapshots() = 1
In the VI Client connected directly to the ESX you should see when the task is completed.
Using watch on the VM folder.
Alternatively you can use vimsh (see [2] for more info) to check if the task is completed.
Here is an example that queries the tasks for a specific VM. One task is listed; however, it has already completed.
# vmware-vim-cmd vmsvc/getallvms
Vmid Name File Guest OS
288 bf_rhel5 [bf_Ubuntu_nfs-1] bf_rhel5/bf_rhel5.vmx rhel5Guest vmx-04
304 VM_Namex [ds-vol2] VM_Namex/VM_Namex.vmx winNetStandardGuest vmx-04
320 AR-RHEL5-01 [h16-local1] AR-RHEL5-01/AR-RHEL5-01.vmx rhel5Guest vmx-04
# vmware-vim-cmd vmsvc/get.tasklist 304
(ManagedObjectReference) [
'vim.Task:haTask-304-vim.VirtualMachine.removeAllSnapshots-39439'
]
# vmware-vim-cmd vimsvc/task_info 'vim.Task:haTask-304-vim.VirtualMachine.removeAllSnapshots-39439'
(vmodl.fault.ManagedObjectNotFound) {
dynamicType = <unset>,
obj = 'vim.Task:vim.Task:haTask-304-vim.VirtualMachine.removeAllSnapshots-39439',
msg = "The request refers to an object that no longer exists or has never existed."
}
Also, in the VI Client connected directly to the ESX you should see the task in progress (probably stuck at 95%, but that is normal).
find /vmfs/volumes/ -name "*delta*" -type f -print0 | xargs -0 du --human-readable --total
alias sgrep='egrep -i --color "scsi[0-9]+:[0-9]+.present|scsi.*filename|vmdk|parent|CID"'
vdf -h
vdf -h .
find /vmfs/volumes/ -iname "*FILENAME*"
watch -d "ls -lt *.vmdk ;date"
PS1='\[\e[0;31m\]\u@\h \W]$ \[\e[m\]\[\e[0;32m\]' ; export PS1
The author would like to thank Darren Burnett and Antonio Allegue Leira for their revisions of this document.
The author would also like to thank Edina Varallyai for her support while mathematically modelling the snapshot consolidation process.