VMware offers a set of technologies that manage resource distribution and availability in a vSphere cluster: High Availability (HA), Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM). In most cases you will use a combination of these technologies, which can result in a lot of different scenarios. The vSphere Clustering Deep Dive books offer good insight into these technologies, and I’ve used some of that information in this article.
Now think of the following scenario (from a theoretical perspective): I’ve configured a 5 node vSphere cluster which is using the “Percentage of Cluster Resources Reserved” policy for HA. I’ve also configured DRS in fully automated mode, and I’m using DPM in fully automated mode as well. The configured failover capacity for this cluster is 20% for both CPU and memory, so I can accommodate one host failure in this cluster.
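To make the 20% figure concrete, here is a minimal sketch of the arithmetic, assuming all hosts in the cluster are equally sized. The function name `failover_percentage` is my own and not part of any VMware tooling.

```python
def failover_percentage(total_hosts: int, host_failures: int = 1) -> float:
    """Percentage of cluster resources to reserve so that `host_failures`
    hosts can fail, assuming every host has the same capacity."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    if not 0 <= host_failures <= total_hosts:
        raise ValueError("host_failures must be between 0 and total_hosts")
    return 100.0 * host_failures / total_hosts


# The 5 node cluster from the scenario, tolerating one host failure.
print(failover_percentage(5, 1))  # 20.0
```

With hosts of different sizes the percentage would have to be based on the largest host instead, so treat this as a rule of thumb only.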
After populating the cluster with virtual machines, DPM kicks in and powers off 3 hosts because no more resources are needed.
Note: As of vSphere 5.0, DPM will always leave 2 ESXi hosts powered on when HA is enabled on the cluster, even when admission control is disabled.
When a host failure occurs, we only have one host left and all virtual machines that were running on the failed host will be restarted on the remaining host. Okay, cool, but what happens if we don’t have enough resources to start all the virtual machines? In that case DPM will power on an additional host and HA will wait until this extra host is available. Once it is, HA will restart the remaining virtual machines. This is a good solution, but it adds some extra time to the HA recovery process.
If you are running your vCenter Server as a virtual machine, you have an interesting challenge: if the vCenter Server virtual machine was running on the failed host, HA will try to restart it on the remaining host. Because of this, always give your vCenter Server (and its database server) a high restart priority.
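Why the restart priority matters can be illustrated with a toy model. This is not how HA actually places virtual machines; it is a simplified sketch that only looks at memory, and the `VM` class, the `plan_restarts` function, the priority scale and all VM names and sizes are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    memory_mb: int
    priority: int  # hypothetical scale: lower number = restarted earlier


def plan_restarts(vms, free_memory_mb):
    """Toy model: restart VMs in priority order on the surviving host.

    Returns (restarted_now, waiting). VMs that do not fit have to wait
    until an extra host has been powered on.
    """
    restarted, waiting = [], []
    for vm in sorted(vms, key=lambda v: v.priority):
        if vm.memory_mb <= free_memory_mb:
            free_memory_mb -= vm.memory_mb
            restarted.append(vm.name)
        else:
            waiting.append(vm.name)
    return restarted, waiting


# VMs that were running on the failed host (illustrative values).
failed_host_vms = [
    VM("app01", 8192, priority=2),
    VM("vcenter", 8192, priority=0),
    VM("vcdb", 8192, priority=0),
    VM("app02", 8192, priority=2),
]

now, later = plan_restarts(failed_host_vms, free_memory_mb=20480)
print(now)    # ['vcenter', 'vcdb']
print(later)  # ['app01', 'app02']
```

In this model the vCenter Server and its database fit on the surviving host because they go first; with a low priority they could end up in the waiting list instead.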
When the vCenter Server (or service) is not restarted successfully, for whatever reason, we have an interesting problem: DPM is a part of DRS and is managed by vCenter. If we need extra hosts, we need the vCenter Server to power on these additional hosts through iLO, IPMI or Wake-on-LAN. No vCenter available means no way to power on extra ESXi hosts, and your vSphere cluster will end up with too few hosts.
What’s the solution for this situation? The first option is not to run vCenter on a DPM-enabled cluster. The second option is to increase the failover capacity for the cluster. But increasing the failover capacity means a larger amount of resources has to be kept available to satisfy this reservation, always.
A third option is to use the advanced DPM parameters MinPoweredOnCpuCapacity and MinPoweredOnMemCapacity. These advanced options let you specify a minimum amount of CPU and memory capacity that DPM must keep powered on in a cluster. MinPoweredOnCpuCapacity is set in MHz, MinPoweredOnMemCapacity is set in MB. If we have 48 GB hosts in our cluster, we can set MinPoweredOnMemCapacity to 98304 (96 GB, the memory of two such hosts) to ensure DPM keeps at least that much memory capacity powered on in our cluster. More information on these parameters in this KB article.
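The 98304 value is simply 2 × 48 GB expressed in MB. A small sketch of that calculation, with the CPU equivalent added: the helper names and the CPU figures (8 cores at 2400 MHz) are assumptions of mine for illustration, and the capacity DPM actually counts per host may differ from the raw hardware numbers.

```python
def min_powered_on_mem_mb(host_memory_gb: int, hosts_to_keep: int) -> int:
    """Value for MinPoweredOnMemCapacity (MB) covering `hosts_to_keep` hosts."""
    return host_memory_gb * 1024 * hosts_to_keep


def min_powered_on_cpu_mhz(cores_per_host: int, core_mhz: int, hosts_to_keep: int) -> int:
    """Value for MinPoweredOnCpuCapacity (MHz) covering `hosts_to_keep` hosts."""
    return cores_per_host * core_mhz * hosts_to_keep


# 48 GB hosts, capacity of two hosts kept powered on.
print(min_powered_on_mem_mb(48, 2))        # 98304

# Hypothetical hosts with 8 cores at 2400 MHz each.
print(min_powered_on_cpu_mhz(8, 2400, 2))  # 38400
```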
Using DPM in an HA-enabled cluster can introduce some extra challenges in your environment. Always think carefully about your design, the consequences of possible outages, and your HA recovery options. Use the comment option to share your thoughts 🙂