A common vCenter issue is vpxd crashing immediately after it starts up. This usually begins after an initial crash, or even after a plain restart of the service, and it can seem random. But of course, almost nothing is “random” in the tech industry. As such, this is a guide to finding out what went wrong!
Before diving right into it, let’s make sure the service is set up so you can determine why it is crashing.
Turn off Service Recovery Options (Windows)
In a Windows deployment, you might need to set the service to “Take No Action” after a crash. This stops the vpxd log from rolling over and compressing with every recovery attempt. You can change this on the Recovery tab in the properties of the service. Don’t forget to change this back after you have fixed the issue!
Note: You could also just use a program that can handle .gz files to unpack the rolled up vpxd logs, but having the service not recover can save space on your drive.
Set vCenter logging to a higher level
Generally, you only need to do this if normal logging does not give you the information you need. A higher level produces a lot more output to look through, which can make it harder to find what you are after.
Accomplish this by editing vpxd.cfg, located in C:\ProgramData\VMware\VirtualCenter Server in vCenter 5.x or C:\ProgramData\VMware\vCenterServer\cfg\vmware-vpx in 6.0.
- Back up vpxd.cfg
- Open it and locate the <log> section and the <level> line below it. You will generally see two.
- The second one (NOT under alert) controls vpxd.log
- Change this to either verbose (more) or trivia (most)
- Save the file.
- Start vCenter once more to let it log the failure with increased logging.
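The backup-and-edit steps above could be scripted roughly as follows. This is only a sketch: it assumes the log level is stored in a `<level>` element inside a `<log>` element sitting directly under the root of vpxd.cfg, and the function name `set_vpxd_log_level` is made up for illustration. Check the layout of your own file before relying on it.

```python
import shutil
import xml.etree.ElementTree as ET

# Levels discussed in this post, plus "info" so the change can be reverted.
ALLOWED_LEVELS = {"info", "verbose", "trivia"}


def set_vpxd_log_level(cfg_path, level):
    """Back up vpxd.cfg, set the top-level log level, return the old value.

    Assumption: the level lives at <root>/<log>/<level>. The <log> block
    under <alert> is not a direct child of the root, so it is left alone.
    """
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {level}")

    # Keep an untouched copy next to the original before changing anything.
    shutil.copy2(cfg_path, cfg_path + ".bak")

    # insert_comments=True (Python 3.8+) keeps XML comments inside the root.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    tree = ET.parse(cfg_path, parser=parser)

    node = tree.getroot().find("./log/level")
    if node is None:
        raise LookupError("no <log><level> element directly under the root")

    previous = node.text
    node.text = level
    tree.write(cfg_path, encoding="utf-8")
    return previous
```

Calling it again with the returned value puts the level back once the crash has been diagnosed. Whitespace and anything outside the root element may not survive the rewrite exactly, which is one more reason to keep the `.bak` copy.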
Reviewing the vpxd log
OK, let’s get into the problem now. Open up vpxd.log and scroll to the very bottom. Here are a few keywords to search upward for to find where the crash actually happened.
Note: Most common causes are invalid object configuration (KB2084284) or excessive snapshots (KB2065630). A section devoted to these is at the bottom.
Forcing shutdown of VMware Virtual Center
An unrecoverable problem has occurred
Panic: Assert Failed
Unable to create SSO facade (Investigate SSO logs for this one!)
The purpose of this is to find when the failure happened. From here you want to correlate what task was taking place just before the crash. Use the exact errors you get to search the VMware Knowledge Base or Google for a known issue and solution.
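If the log is large, the keyword search can be automated. The following is a minimal sketch that scans an iterable of log lines for the phrases listed above and reports where each one occurs. The function name `find_crash_markers` is hypothetical, and the matching is a plain substring test, so it makes no assumptions about the rest of the log line format.

```python
# Phrases taken from the keyword list above.
CRASH_MARKERS = (
    "Forcing shutdown of VMware Virtual Center",
    "An unrecoverable problem has occurred",
    "Panic: Assert Failed",
    "Unable to create SSO facade",
)


def find_crash_markers(lines):
    """Return (line_number, marker, text) for each line containing a marker.

    `lines` can be an open file or any iterable of strings. Line numbers
    start at 1 so they match what a text editor shows.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        for marker in CRASH_MARKERS:
            if marker in text:
                hits.append((number, marker, text.rstrip("\n")))
                break  # one hit per line is enough
    return hits
```

Open vpxd.log with `errors="replace"` and pass the file object in. The last entry in the returned list is usually the most recent crash, and its line number tells you where to start reading backwards.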
Sometimes, vpxd throws you a bone and will include the opID of the failing process in the error. Using the “Panic” example above, the opID would be SWI-66a88c0e. Use this to follow the progression of the task in the log before it failed.
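Following an opID through the log amounts to filtering for every line that mentions it. A small sketch, assuming only that the opID string appears verbatim on each related line (the helper name `lines_for_opid` is invented here):

```python
def lines_for_opid(lines, op_id):
    """Return (line_number, text) for every line that mentions op_id.

    Uses a plain substring match, so it does not depend on where in the
    line vpxd places the opID.
    """
    needle = op_id.strip()
    if not needle:
        raise ValueError("op_id must not be empty")
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        if needle in text
    ]
```

With the opID from the example, `lines_for_opid(log_file, "SWI-66a88c0e")` would return the task's log lines in order, ending at or just before the failure.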
If vpxd is feeling REALLY nice, it might even tell you exactly what object in your inventory is causing the failure. We’ll use the Panic example above again.
According to this, vpxd failed while validating information about vm-227 on host-100. Great, now how do you correlate that to a VM name and host IP? It’s easy as long as you have access to run a few simple queries on the vCenter DB.
Log in to the vCenter database instance with SQL Server Management Studio or whichever database management tool you have available and run queries similar to the following:
select name from vpxv_vms where vmid='227'
select name from vpxv_hosts where hostid='100'
Using the respective VM and host numbers from the logs, you can easily find out what the VM name is and which host it’s on.
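Pulling those numbers out of the log text by hand gets tedious when several objects are mentioned. Here is a rough sketch that extracts the numeric part of identifiers shaped like `vm-227` and `host-100` from a piece of log text. The function name `extract_object_ids` is hypothetical, and the pattern only covers the two identifier shapes shown in this post.

```python
import re

# Matches identifiers such as "vm-227" or "host-100" as whole tokens.
OBJECT_ID_PATTERN = re.compile(r"\b(vm|host)-(\d+)\b")


def extract_object_ids(text):
    """Return {'vm': [...], 'host': [...]} with the numeric IDs found.

    IDs are listed in order of first appearance, without duplicates,
    ready to be dropped into the queries above.
    """
    found = {"vm": [], "host": []}
    for kind, number in OBJECT_ID_PATTERN.findall(text):
        value = int(number)
        if value not in found[kind]:
            found[kind].append(value)
    return found
```

Feeding it the failing log line yields the values to use in the `vmid` and `hostid` conditions.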
Following this, you’ll want to log directly into the host and remove the VM from inventory. On the next vCenter start, a host sync should remove the bad entries from the VCDB and allow it to stay up.
I can’t stress enough that the root cause of these issues is usually excessive snapshots (32+) or an invalid device in the VM config.
Snapshots and Invalid Device Backing
It may seem strange that snapshots on a virtual machine could cause vCenter to crash. The reasoning behind this is that vpxd is allocated a specific thread stack size when validating objects in its inventory. Excessive snapshots will cause exhaustion of that stack, leading to a crash. Subsequent restarts of vCenter will seem to stay up for a few moments before crashing because the validation of these objects can take a bit to occur. Similar symptoms are seen with invalid virtual hardware devices on a VM. In 5.5, the threadstacksize can be adjusted in vpxd.cfg. Check out how here.
Increasing the threadstacksize is really just masking the problem though.
The worst part about excessive snapshots is that they can crash vpxd with no logged errors. If you run into a scenario like this, increase your vCenter logging level with the instructions at the beginning of the post and search for some of the keywords again. Failing even that, you might need to manually search through your datastores for VMs with snapshots.
SSH into a host that can access as many of your datastores as possible. If some of your hosts see different storage, you will need to SSH into more than one to ensure you cover all of your shared datastores.
Run a find command on /vmfs/volumes to display all virtual machine snapshot files on the presented storage:
find /vmfs/volumes -name "*delta*" | less
Every snapshot vmdk will be displayed on screen. If any single VM has more than 32 of them, that VM is suspect and its snapshots need to be consolidated.
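Counting by eye is error-prone once the list runs to several screens. The sketch below walks a directory tree, counts the files with `delta` in their name per folder, and reports the folders above the 32 mark. It assumes a Python interpreter is available wherever the datastores are mounted and that each VM keeps its files in its own folder. The function names are made up for illustration.

```python
import os
from collections import Counter

SNAPSHOT_LIMIT = 32  # threshold used in this post


def count_delta_files(root):
    """Count files with 'delta' in the name, grouped by containing folder."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        matches = sum(1 for name in filenames if "delta" in name)
        if matches:
            counts[dirpath] = matches
    return counts


def suspects(counts, limit=SNAPSHOT_LIMIT):
    """Return (folder, count) pairs above the limit, largest count first."""
    over = [(path, n) for path, n in counts.items() if n > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)
```

Running `suspects(count_delta_files("/vmfs/volumes"))` lists the folders worth a closer look. A VM with several disks gets one delta file per disk per snapshot, so treat a high count as a prompt to investigate, not as an exact snapshot count.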
Locate the host that the VM resides on using the SQL queries in the section above
select name from vpxv_vms where vmid='227'
select name from vpxv_hosts where hostid='100'
Log in to the host directly and remove the snapshots from the VM using the “Consolidate” feature.
Where did they come from?
Excessive snapshots usually stem from backup software that removes the pointer (seen in Snapshot Manager) but does not actually remove the delta vmdk file. Once this happens, every backup attempt builds up more and more snapshots on the filesystem, even though they won’t be visible in Snapshot Manager.