Recently I’ve been working on some vRealize Operations (vRops) dashboards and reports that provide information on the availability of virtual machines. Although this seems to be simple, it can be quite challenging to report uptime and SLAs in a proper way.
Of course things always depend on the exact requirements, but try to answer the following question yourself: when is a VM considered to be available?
- When the VM is powered on?
- When the VM is powered on and is running (so no BSODs, not stuck in the BIOS, etc.)?
- When the VM is powered on, is running and the application/service that the VM is servicing responds as we would expect?
In the first situation, the VM only needs to be powered on to be considered available. I think this one is a bit limited, because you’re not evaluating if and how the OS is running. The second option is more valuable: the VM is powered on and I’m also checking if the OS is running properly. In the third option I’m also checking if the application/service that’s running inside the VM is doing what we expect.
Let’s skip option 1 and have a closer look at options 2 and 3. As you can imagine, the checks we need in option 2 (check the VM and the OS) are more generic than those we will need for option 3. To be more precise, option 3 would include VM checks, OS checks and on top of that application checks. These application checks depend on the application you’re running inside a VM: for example, you want to check if a webserver is up and running and if the user is presented with a logon screen. Maybe you also want to check the result of a logon, and check how fast the application is responding to requests.
For this post I will focus on the more generic checks as mentioned in scenario 2: VM and OS should be running.
How to determine VM availability?
So, the question now is how to determine VM availability? In an excellent post, Iwan Rahabok explains how you can think about VM availability. I will not repeat Iwan’s post here, but I will summarize some of the most important conclusions.
A virtual machine is up and running, if:
- It’s powered-on, and;
- The guest OS is running and showing ‘normal’ activity, and;
- The VM uptime is 5 minutes or higher.
The first condition is pretty straightforward. For the second condition you can check memory activity, network activity, disk activity and VMware Tools status. Checking CPU activity can lead to false positives: for example, a BSOD can cause a 100% CPU load, so a crashed VM would be reported as up and running. So don’t use the CPU metrics here.
The last condition is the result of the 5 minute monitoring interval that vRops uses. Every 5 minutes vRops collects the metrics of the endpoints (for example vSphere) it is connected to. If a VM is rebooted between two collection cycles, this should result in a ‘VM was down’ condition. By testing the VM uptime (uptime should be higher than 5 minutes/300 seconds), you can determine whether a VM was rebooted or not. If it was rebooted, the uptime percentage is the uptime value in seconds (capped at 300) divided by 3. For example 216/3=72%.
If we create a supermetric according to this model, we get:
( This Resource: sys|poweredOn * ceil ( min([ max([ This Resource: sys|osUptime_latest, This Resource: mem|workload, This Resource: datastore:Aggregate of all instances|commandsAveraged_average, This Resource: net|usage_average ]) ,1 ]) ) * (min([ 300, This Resource: sys|osUptime_latest ]) /3) )
Note: This supermetric is based on the one in Iwan’s post. I noticed that vRops 6.6 lacks the VMware Tools status metric (sys|heartbeat_latest), so I removed this metric from the supermetric.
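To make the logic of the supermetric easier to follow, here is a minimal Python sketch of the same expression for a single collection cycle. This is not vRops supermetric syntax; the function name and parameter names are my own, chosen only for illustration.

```python
import math

def vm_uptime_percent(powered_on, os_uptime_s, mem_workload, disk_commands, net_usage):
    """Mirror of the supermetric expression for one 5-minute collection cycle."""
    # 1 if at least one of the activity counters is above zero, otherwise 0
    activity = math.ceil(min(max(os_uptime_s, mem_workload, disk_commands, net_usage), 1))
    # Uptime is capped at 300 s; dividing by 3 scales 0..300 s to 0..100 %
    uptime_share = min(300, os_uptime_s) / 3
    return powered_on * activity * uptime_share

print(vm_uptime_percent(1, 86400, 35.0, 12.0, 4.0))  # VM that has been running for a day
print(vm_uptime_percent(1, 216, 35.0, 12.0, 4.0))    # VM rebooted 216 seconds ago
print(vm_uptime_percent(0, 0, 0.0, 0.0, 0.0))        # VM that is powered off
```

The three calls correspond to the three cases discussed above: fully available, rebooted within the last collection interval, and down.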
Some uptime calculations
So with this new supermetric we’re able to determine VM availability every 5 minutes. If the VM is available, the value is 100. If the VM is unavailable the value would be 0, or a value between 0-99 if the VM is rebooted.
With the new supermetric you would get 24 * 60 / 5 = 288 data points throughout the day. Using these datapoints we can calculate an uptime for a VM.
Let’s say we have 10 minutes (2 data points) of downtime during a day; the availability would be:
((288 - 2) / 288) * 100 = 99,31%
Or in minutes:
((1440 - 10) / 1440) * 100 = 99,31%
Another way to approach this, is to take an average of all the collected datapoints (for a day):
(286 * 100 + 2 * 0) / 288 = 99,31%
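The three calculations can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are hypothetical and not part of vRops.

```python
INTERVAL_MIN = 5
POINTS_PER_DAY = 24 * 60 // INTERVAL_MIN  # 288 data points per day

def availability_from_points(down_points, total_points=POINTS_PER_DAY):
    """Availability in % based on the number of 'down' data points."""
    return (total_points - down_points) / total_points * 100

def availability_from_minutes(down_minutes, total_minutes=24 * 60):
    """Availability in % based on minutes of downtime."""
    return (total_minutes - down_minutes) / total_minutes * 100

def availability_from_average(values):
    """Availability in % as the plain average of all collected data points."""
    return sum(values) / len(values)

day = [100] * 286 + [0] * 2  # 286 'up' data points and 2 'down' data points
print(round(availability_from_points(2), 2))
print(round(availability_from_minutes(10), 2))
print(round(availability_from_average(day), 2))
```

All three approaches give the same result, which is why taking the average of the supermetric over a time range works as an uptime percentage.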
This last approach is something we can get out of vRops using the Views feature:
- Create a new View, give it a name and choose a presentation method. List or trend can be valuable here;
- The subject for the view is the Virtual Machine object;
- The Data you want to evaluate/display is the brand new VM uptime supermetric we’ve just configured. Depending on what you want to display, you can show the latest value, calculate an average value (select ‘Average’ as the transformation option to calculate the average for the selected time range) or show a trend.
- Configure the required time range in the view under ‘Time Settings’ to determine the VM uptime for a specific time interval.
Now add the new View to a Dashboard or Report and get uptime information on the selected virtual machine. In the following example we see the uptime trend for a specific VM.
Uptime for all VMs in a cluster/dc/custom group
So now we have a way to display uptime/availability information for a single VM. Let’s say we want to provide information on the uptime for all the VMs in a specific cluster, (custom) datacenter or custom group. To achieve this we need one extra supermetric that calculates the average of a set of VM instances/objects. The configuration for this new supermetric is straightforward:
avg(Virtual Machine: Super Metric|VM uptime %)
You’re actually pointing this new supermetric to the existing VM uptime metric. By linking this new supermetric to a cluster, datacenter or custom group the average SLA is calculated based on the uptime SLAs for all the VMs that are running in the cluster. This average value is calculated every 5 minutes, so again you will get a datapoint every 5 minutes. Of course you can also create a View for this scenario, and calculate the average.
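As an illustration of what this second supermetric does in one collection cycle, here is a small Python sketch. The VM names and values are made up, and the function name is hypothetical.

```python
from statistics import mean

def group_uptime_percent(vm_uptimes):
    """Average the per-VM uptime values of one collection cycle."""
    return mean(vm_uptimes.values())

# Hypothetical cluster: two healthy VMs, one rebooted VM and one VM that is down
cluster = {"vm-01": 100.0, "vm-02": 100.0, "vm-03": 72.0, "vm-04": 0.0}
print(group_uptime_percent(cluster))
```

Keep in mind that this is an average across VMs: a single VM that is down for a long time can be hidden by a large number of healthy VMs in the same group.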
Maintenance windows and non 7×24 SLAs
Although we’re now able to do some calculations on the uptime of your VM and generate reports, there are a few things remaining that we have to think about. How do we deal with maintenance windows, and what is the impact if uptime calculations are only required during business hours (let’s say between 7:00 and 19:00)?
Any downtime during a maintenance window should not be accounted to the SLA of a virtual machine. This can be achieved through the maintenance mode option in vRealize Operations. You can put an object in vRops into maintenance mode; this means that no metrics are saved for an object. Because there are no data points, there’s no value for the uptime (0 or 100) and there’s no impact on the average uptime SLA value. You can manually place an object in maintenance mode, however it’s also possible to create a maintenance schedule and link it to a policy. In this way all objects that are linked to the policy are automatically placed in maintenance mode. More info in this article.
A big disadvantage of the maintenance mode method is that while an object (a VM) is in maintenance mode none of the metrics are saved. You’re actually losing visibility into the VM. This means that if you want to use this method only to monitor the VM uptime during business hours, you’re losing all visibility after business hours. Not something you want.
Another approach is to always save uptime information (except if a VM is down for maintenance) and do uptime calculations only for the applicable time range. For this scenario we will have to use the View option again (although there are some limitations).
The View option offers various time and range settings:
What we would like to see is a View that displays the VM uptime for weekdays between 7:00 and 19:00. Unfortunately a View is not capable of analyzing multiple time ranges; you can set a (one) relative time range, or set a (one) absolute time range. With an absolute time range you can set date & time. So if you want to have a report on each day of the week, you have to create different views with each view analyzing a specific date/time. This will cost you a lot of effort to get uptime information for all weekdays during a specific month.
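The effect of missing data points on the average can be sketched in Python. This assumes, as described above, that maintenance mode simply leaves gaps in the series; the gaps are represented here as None, which is my own convention for the example.

```python
def average_uptime(points):
    """Average only the data points that exist; None marks an interval without data."""
    collected = [p for p in points if p is not None]
    return sum(collected) / len(collected) if collected else None

# Same 30 minutes, with the two 'down' intervals either recorded or not collected
without_maintenance = [100, 100, 0, 0, 100, 100]
with_maintenance = [100, 100, None, None, 100, 100]

print(round(average_uptime(without_maintenance), 2))
print(round(average_uptime(with_maintenance), 2))
```

In the second series the downtime falls inside the maintenance window, so it never reaches the average.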
Another way to deal with this is to use the relative start date, and use the “Nth” option.
With the Nth option you can set a time interval between 9-19, so you can get info on the uptime within this time interval. The only constraint here is that this time range is only valid for the current day; it’s not possible to point to a specific day of the week. In a dashboard that includes this view, you will have the uptime information for the current day, only available today. If you want to archive uptime for a VM throughout the week, you can schedule a report that includes this view and run it Monday till Friday after 19:00. It’s a bit of a workaround, but I haven’t found another way to retrieve this information.
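If you export the supermetric data points from vRops, the weekday/business hours calculation that a View cannot do could be done outside vRops. The following Python sketch assumes the export is available as (timestamp, value) pairs; that format, the function names and the sample data are assumptions for illustration only.

```python
from datetime import datetime, time

BUSINESS_START = time(7, 0)
BUSINESS_END = time(19, 0)

def in_business_hours(ts):
    """True for Monday to Friday between 7:00 (inclusive) and 19:00 (exclusive)."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END

def business_hours_uptime(points):
    """Average uptime over the data points that fall within business hours."""
    values = [value for ts, value in points if in_business_hours(ts)]
    return sum(values) / len(values) if values else None

points = [
    (datetime(2024, 1, 8, 6, 55), 0),     # Monday, before business hours: ignored
    (datetime(2024, 1, 8, 7, 0), 100),    # Monday, counted
    (datetime(2024, 1, 8, 12, 0), 0),     # Monday, counted
    (datetime(2024, 1, 8, 18, 55), 100),  # Monday, counted
    (datetime(2024, 1, 8, 19, 0), 0),     # Monday, after business hours: ignored
    (datetime(2024, 1, 13, 12, 0), 0),    # Saturday: ignored
]
print(round(business_hours_uptime(points), 2))
```

Only the three data points within business hours are part of the average, so downtime in the evening or in the weekend does not affect the result.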
One more thing…
One more thing that is important to know is how, or actually when, the supermetric (the uptime) is calculated. vRops collects metrics and calculates supermetrics every 5 minutes. However, the supermetric of collection interval ‘t’ is calculated at interval t+1. Interval t+1 is the next time interval, 5 minutes after the ‘source metric’ is retrieved. This is illustrated in the following graph:
What you see here is that the latest uptime metric is available at 10:28 pm, while the latest available supermetric is from 10:23 pm. So in a dashboard that displays the latest information you will always get the t-1 information for a supermetric. Notice that if there was downtime at 10:13 pm, the supermetric will show this at time interval 10:13 pm. It will just take until 10:18 pm before you have the information available.
I hope this was helpful again. If you have any questions, please don’t hesitate to leave a comment below.
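The timing can be expressed in a few lines of Python. This is only a sketch of the behaviour described here; the function names are hypothetical and the date is arbitrary.

```python
from datetime import datetime, timedelta

COLLECTION_INTERVAL = timedelta(minutes=5)

def supermetric_available_at(metric_timestamp):
    """The supermetric for interval t is calculated one interval later (t+1)."""
    return metric_timestamp + COLLECTION_INTERVAL

def latest_supermetric_timestamp(latest_metric_timestamp):
    """The newest supermetric data point belongs to the previous interval (t-1)."""
    return latest_metric_timestamp - COLLECTION_INTERVAL

downtime = datetime(2024, 1, 8, 22, 13)
print(supermetric_available_at(downtime).strftime("%H:%M"))
print(latest_supermetric_timestamp(datetime(2024, 1, 8, 22, 28)).strftime("%H:%M"))
```

The first call corresponds to the downtime at 10:13 pm that becomes visible at 10:18 pm, the second to the latest metric at 10:28 pm versus the latest supermetric at 10:23 pm.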
Roberto Gaetano Pulvirenti
interesting article, but I think that VM uptime is reset after a vmotion. This would make any availability calculation unreliable. What’s your view?
Good one, let me dive into this and I will come back to you!
Hi, I’ve tested the scenario but I don’t see uptime being reset after a vMotion.
I’m thinking of trying something similar for hosts. It could be used for SLA compliance, general uptime/downtime reporting and tied in with server costs to see how much host downtime really costs.
I see the system host properties Runtime|Connection State and Runtime|Maintenance State could be used for calculating this uptime/downtime.
Good one! I actually played with these counters in another dashboard. Specifically, the Runtime|Maintenance State can be used to get a quick overview of which hosts are in maintenance state.
Hi Viktorious, thank you very much for this post, I have been looking for the same thing for a long time. Based on the super metric mentioned in this post, is it possible to get a one-month uptime report? Like VM was up 80% and VM was down 20% in the month?
Yes you can do this through a Report or a View using the available statistical constructs (you probably would need “average”).
Does the metric – Resource Availability and vROPS generated | availability reset after a restart of the agent/server? Also, what happens in case of any Windows service?