Troubleshooting a purple screen of death
When a PSOD occurs, the first thing you want to do is note the information displayed on the screen. I suggest using a digital camera or cell phone to take a quick photo. The PSOD message consists of the ESX version and build, the exception type, register dump, what was running on each CPU at the time of the crash, back-trace, server up-time, error messages and memory core dump info. The information won’t be useful to you, but VMware support can decipher it and help determine the cause of the crash.
Unfortunately, other than recording the information on the screen, your only option when experiencing a PSOD is to power the server off and back on. Once the server reboots you should find a vmkernel-zdump-* file in your server /root directory. This file will be valuable for determining the cause. You can use the vmkdump utility to extract the vmkernel log file from the file (vmkdump –l ) and examine it for clues as to what caused the PSOD. VMware support will usually want this file also. One common cause of PSOD’s is defective server memory; the dump file will help identify which memory module caused the problem so it can be replaced.
Checking your RAM for errors
If you suspect your system’s RAM may be at fault you can use a built-in utility to check your RAM in the background without effecting your running virtual machines. The RAM check utility runs in the VMkernel space and can be started by logging into the Service Console and typing Service Ramcheck Start.
While RAM check is running it will log all activity and any errors to the /var/log/vmware directory in files called ramcheck.log and ramcheck-err.log. One drawback, however, is that it’s hard to test all of your RAM with this utility if you have virtual machines (VMs) running, as it will only test unused RAM in the ESX system. A more thorough method of testing your server’s RAM is to shutdown ESX, boot from a CD, and run Memtest86+.
Using the vm-support utility
If you contact VMware support, they will usually ask you to run the vm-support utility that packages all of the ESX server log and configuration files into a single file. To run this utility, simply log in to the service console with root access, and type “vm-support” without any options. The utility will run and create a single Tar file that will be named “esx—..tgz”. You can send it via FTP to VMware support. Make sure you delete the Tar file from the ESX Server once you are done to save disk space.
Alternatively, you can generate the same file by using the VMware Infrastructure Client (VI Client). Select Administration, then Export Diagnostic Data, and select your host (VirtualCenter data optional) and a directory on your local PC to store the file that will be created.
Using log files for troubleshooting
Log files are generally your best tool for troubleshooting any type of problem. ESX has many log files. Which ones you should check depends on the problem you are experiencing. Below is the list of ESX log files that you will commonly use to troubleshoot ESX server problems. The VMkernel and hosted log files are usually the logs you will want to check first.
- VMkernel – /var/log/vmkernel – Records activities related to the virtual machines and ESX server. Rotated with a numeric extension, current log has no extension, most recent has a “.1″ extension.
- VMkernel Warnings – /var/log/vmkwarning – Records activities with the virtual machines, a subset of the VMkernel log and uses the same rotation scheme.
- VMkernel Summary – /var/log/vmksummary – Used to determine uptime and availability statistics for ESX Server; readable summary found in /var/log/vmksummary.txt.
- ESX Server host agent log – /var/log/vmware/hostd.log – Contains information on the agent that manages and configures the ESX Server host and its virtual machines. (Search the file date/time stamps to find the log file it is currently outputting to, or open hostd.log, which is linked to the current log file.)
- ESX Firewall log – /var/log/vmware/esxcfg-firewall.log – Logs all firewall rule events.
- ESX Update log – /var/log/vmware/esxupdate.log – Logs all updates done through the esxupdate tool.
- Service Console – /var/log/messages – Contains all general log messages used to troubleshoot virtual machines or ESX Server.
- Web Access – /var/log/vmware/webAccess – Records information on web-based access to ESX Server.
- Authentication log – /var/log/secure – Contains records of connections that require authentication, such as VMware daemons and actions initiated by the xinetd daemon.
- Vpxa log – /var/log/vmware/vpx – Contains information on the agent that communicates with VirtualCenter. Search the file date/time stamps to find the log file it is currently outputting to or open hostd.log which is linked to the current log file.
As part of the troubleshooting process, often times you’ll need to find out the version of various ESX components and which patches are applied. Below are some commands you can run from the service console to do this:
- Type vmware –v to check ESX Server version, i.e., VMware ESX Server 3.0.1 build-32039
- Type esxupdate –l query to see which patches are installed.
- Type vpxa –v to check the ESX Server management version, i.e. VMware VirtualCenter Agent Daemon 2.0.1 build-40644.
- Type rpm –qa | grep VMware-esx-tools to check the ESX Server VMware Tools installed version – i.e., VMware-esx-tools-3.0.1-32039.
If all else fails, restart the VMware host agent service
Many ESX problems can be resolved by simply restarting the VMware host agent service (vmware-hostd), which is responsible for managing most of the operations on the ESX host. To do this, log into the service console and type service mgmt-vmware restart.
NOTE: ESX 3.0.1 contained a bug that would restart all your VMs if your ESX server was configured to use auto-startups for your VMs. This bug was fixed in a patch for 3.0.1 and also in 3.0.2, but appeared again in ESX 3.5 with another patch released to fix it. It’s best to temporarily disable auto-startups before you run this command.
In some cases restarting the vmware-vpxa service when you restart the host agent will fix problems that occur between ESX and both the VI Client and VirtualCenter. This service is the management agent that handles all communication between ESX and its clients. To restart it, log into the ESX host and type service vmware-vpxa restart. It is important to note that restarting either of these services will not impact the operation of your virtual machines (with the exception of the bug noted above).
Fixing a frozen service console
Another problem that can occur is your Service Console can hang and not allow you to log in locally. This can be caused by hardware lock-ups or a deadlocked condition. Your VMs may continue to operate normally when this occurs, but rebooting ESX is usually the only way to recover. Before you do that, however, try shutting down your guest VMs and/or using VMotion to migrate them to another ESX host. To do this, use the VI Client by connecting remotely via SSH or by using one of alternate/emergency consoles, which you can access by pressing Alt-F2 through Alt-F6. You can also press Alt-F12 to display VMkernel messages on the console screen.
If you are able to shutdown or move your VMs, then you can try rebooting the server by issuing the reboot command through the VI Client or alternate consoles. If not, cold-booting the server is your only option.
Lost network configurations
The problem that can occur is that you may lose part or all of your networking configurations. If this happens, you must rebuild your network by using the ESX local service console, since you will be unable to connect using the VI Client. VMware has published knowledgebase articles that detail how to rebuild your networking using the esxcfg-* service console commands and also how to verify your network settings.