A while ago I wrote an article on high storage latency, which discussed the relationship between the amount of latency and the I/O block size used by the OS running in your virtual machine.
In this new article I want to discuss a latency message I recently encountered in several vSphere 5 environments. The alert reads:
“Device x performance has deteriorated. I/O latency increased from average value of y microseconds to z microseconds.”
The alert shows up both in the vSphere Client and in the vmkernel.log. Here x is the naa ID of the storage device, and y and z are latency values in microseconds. The alert indicates that latency increased from the normal value y (let’s say 10000 microseconds = 10 milliseconds) to values as high as 500000 microseconds (= 500 ms), and sometimes even higher.
Of course latency values of 500 ms are not desired, so there might be a problem with the storage device. A good way to determine what’s happening on your storage is to run an analysis using esxtop in batch mode. After collecting performance data you can analyze it using Windows perfmon. More information on that is available in this article.
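As a starting point, a batch-mode collection could look like the sketch below. The interval, iteration count, and output path are example values of my own choosing, not prescribed ones; size the window so it covers the period around the alert.

```shell
# Run esxtop in batch mode (-b) with all counters enabled (-a),
# sampling every 2 seconds (-d 2) for 300 iterations (-n 300) -- roughly 10 minutes.
# Redirect the CSV output to a file you can later open in Windows perfmon.
esxtop -b -a -d 2 -n 300 > /tmp/esxtop-results.csv
```

Copy the resulting CSV off the host and open it in perfmon to browse the storage device counters.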
Now look for the time of the alert and compare this with the actual values that the esxtop results show. Important: vmkernel.log and esxtop results always show UTC time, while the vSphere Client uses local time for messages and alerts.
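When comparing timestamps, a quick way to convert a UTC time from vmkernel.log into local time is the GNU date command (the timestamp below is just a made-up example):

```shell
# Convert an example UTC timestamp (as found in vmkernel.log) to local time,
# so it can be matched against the vSphere Client alert time.
date -d '2012-03-15 14:05:32 UTC'
```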
My experience is that the values displayed in the “performance has deteriorated” alert (converted to milliseconds) usually do not match the values observed in esxtop. In fact, the esxtop values are usually two to six times lower than the values reported in the alert. Important: don’t investigate only the latency value; the behavior of the latency graph matters just as much. For example, a latency spike is something different from an increase in latency that slowly decreases. A latency spike is usually nothing to worry about, since it indicates an increase in latency for a very short period of time. An increase in storage latency that persists for a while, however, could indicate a performance problem.
I’ve seen various “Device x performance has deteriorated. I/O latency increased from average value of y microseconds to z microseconds” alerts that could usually be linked to latency spikes. Always be careful, but in these cases there was not much to worry about.
Moreover, when using perfmon to analyze esxtop results, you might also notice latency values fluctuating slightly depending on the timeframe you’re analyzing. This has to do with the way perfmon works: perfmon shows an average value in the graph, and this value depends on the esxtop samples and the timeframe you’re observing. At the maximum zoom level, perfmon shows the actual esxtop values (which are still lower than those in the “performance has deteriorated” alert).
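To see why averaging can hide a spike, consider ten hypothetical 2-second samples: nine at 10 ms and one spiking to 500 ms. Averaged over the whole window they come out at 59 ms, far below the peak. A quick awk sketch with these made-up numbers:

```shell
# Average ten hypothetical latency samples (in ms): one short 500 ms spike
# among nine 10 ms samples averages out to 59 ms over the window.
echo '10 10 10 10 500 10 10 10 10 10' |
  awk '{ sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum / NF }'
```

This is exactly what happens when perfmon is zoomed out: the wider the timeframe per data point, the more a short spike is flattened into the average.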
I hope this post helps you think about storage latency. Good luck resolving the issues you might face! Also check this VMware KB article.
Important remark: although the alert discussed in this article can be a “false positive”, there are also numerous posts about a networking bug in vSphere 5 that can affect software iSCSI. More information is available on vmtoday.com.
One Comment
Manish
Please include the link to KB 1008205 to use esxtop for finding storage performance issues as well.