Over the last few weeks I’ve been talking to different customers about the benefits IaaS can offer. More and more organizations are investigating how they can use cloud solutions. We’re discussing SaaS, PaaS, IaaS and some of the “sub” offerings in these different service models. One of the subjects discussed is the availability of the different offerings. It’s always good to take a closer look at the cloud provider’s Service Level Agreement.
In this article I want to take a closer look at the IaaS offerings of Microsoft and VMware. Both companies have an IaaS offering:
- Microsoft Azure offers the Virtual Machine service;
- VMware offers vCloud Air (f.k.a. vCloud Hybrid Service) dedicated cloud and virtual private cloud to run virtual machines.
Before we look at the SLAs, it’s good to get some background info on the types of IaaS workloads. I am talking here about traditional datacenter applications versus cloud native applications. Read on to learn more…
Traditional datacenter versus cloud native applications
In the world of on-premises datacenters we originally ran our applications on physical servers. With the introduction of virtualization a little more than 10 years ago, more and more applications were moved to virtual machines. Nowadays for most organizations virtual machines are the standard, and physical workloads are the exception.
Availability has always been and still is an important subject in the design of datacenters: redundant power, cooling, network, storage and so on. The goal here is to make our applications as highly available as possible (or at least as highly available as required). With the introduction of virtualization, availability options in the virtualization layer were introduced on top of the existing hardware availability features: think of live migration, HA and Fault Tolerance.
Another property of these traditional datacenter applications (or call them enterprise applications) is that they are scale-up applications. If you need more capacity, you give the application more CPU and memory. In some scenarios you can put an enterprise application behind a load balancer, but most of the time this is not the primary deployment scenario. Some refer to traditional or enterprise applications as legacy applications. I think this is an incorrect name, because a lot of today’s traditional applications are still used and new versions appear daily.
Now let’s take a look at cloud native applications. Cloud native applications are designed for failure. This means that the application is designed with failure in mind; the high availability comes from the application itself. Availability is achieved by deploying multiple instances that work together. This brings us to another property of cloud native applications: scalability is achieved through scale out (deploying more nodes) instead of scale up (increasing the resources available to a particular node). Scaling a cloud native application is very easy: you can quickly deploy new nodes when load requires it, or scale down by removing nodes when load decreases. Most internet applications like Facebook, Dropbox and Netflix are cloud native applications, but R&D tooling that needs a lot of compute power can also be designed as a cloud native application. I would recommend this session on how Netflix is architected if you would like to learn how Netflix achieves availability.
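To make the scale-out idea a bit more concrete, here is a minimal Python sketch. It rests on two simplifying assumptions that are mine, not taken from any provider: instances fail independently of each other, and the application stays up as long as at least one instance is running. The function names and the numbers in the usage lines are purely illustrative.

```python
import math


def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of a group of instances, assuming independent failures
    and that one surviving instance is enough to keep the application up."""
    return 1 - (1 - instance_availability) ** instances


def nodes_needed(load: float, capacity_per_node: float, minimum: int = 2) -> int:
    """Scale out: number of nodes for the current load, never fewer than
    `minimum` so that a single node failure does not take the app down."""
    return max(minimum, math.ceil(load / capacity_per_node))


# Illustrative numbers only: two instances that are each up 99% of the time.
print(f"{combined_availability(0.99, 2):.4%}")

# Load goes up, nodes are added; load goes down, nodes are removed again.
print(nodes_needed(load=950, capacity_per_node=200))
print(nodes_needed(load=100, capacity_per_node=200))
```

In practice failures are rarely fully independent (shared power, network or software bugs), so treat the combined figure as an upper bound rather than a promise.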
It’s obvious that enterprise and cloud native applications require different clouds. My co-worker Marcel van den Berg calls them reliable clouds (for enterprise applications) and best-effort clouds (for cloud native applications). Of course both clouds have an SLA (it’s not really best effort), but you get the point. Gartner also makes the distinction: some clouds are more suitable for Mode 1 applications (enterprise or traditional applications), while other clouds are more suitable for Mode 2 applications (cloud native or agile applications). Read more about Mode 1 and Mode 2 cloud in the Magic Quadrant for Cloud Infrastructure as a Service here, or use Google to find a vendor that will provide you with the report for free (after leaving your e-mail address).
Note that this doesn’t mean you cannot run an enterprise application on a best-effort cloud; what it does mean is that availability is managed in a way that is not suitable for your application. You might not want to run a production load on a best-effort cloud, but a test & dev workload (with lower availability requirements) might be suitable to run there.
Public IaaS clouds and SLAs
And now the SLA part. Let’s take a first look at the SLAs of these services. Microsoft’s SLA tells us that:
For all Internet facing Virtual Machines that have two or more instances deployed in the same Availability Set, we guarantee you will have external connectivity at least 99.95% of the time.
And the VMware SLA states:
VMware will use commercially reasonable efforts to ensure that each class of service purchased for an identified user of an instance of a Service Offering (“you” or “Customer”) is “Available” during a given calendar month equal to “Availability Commitment” provided in the table below: for dedicated cloud 99.99%, for virtual private cloud 99.95%.
Both providers define availability as:
((total minutes in a calendar month – total minutes Unavailable) / total minutes in a calendar month) x 100
So an availability commitment of 99.95% to 99.99% allows a maximum downtime of almost 22 minutes down to a little more than 4 minutes per month (based on a 30-day month). If the Service Levels in the SLA are not achieved, you are eligible for a credit.
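The arithmetic behind these numbers can be checked with a few lines of Python. This is only a sketch of the availability formula quoted above; the function names and the default of a 30-day month are my own assumptions.

```python
def availability_pct(total_minutes: float, unavailable_minutes: float) -> float:
    """Monthly availability percentage, following the formula from the SLAs."""
    return (total_minutes - unavailable_minutes) / total_minutes * 100


def max_downtime_minutes(commitment_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month that still meet the commitment.
    The 30-day default is an assumption; pass 28, 29 or 31 as needed."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - commitment_pct) / 100


for commitment in (99.95, 99.99):
    budget = max_downtime_minutes(commitment)
    print(f"{commitment}% -> {budget:.2f} minutes per 30-day month")
```

For a 31-day month the 99.95% budget grows slightly, to a little over 22 minutes, because the formula is relative to the total minutes in that calendar month.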
The big difference between the Microsoft and VMware SLA is that Microsoft requires you to have a minimum of two virtual machines running in the same Availability Set, while VMware’s SLA has no such requirement. There’s a good article here on Azure availability sets and how they relate to Update Domains (UD) and Fault Domains (FD). Because Microsoft Azure’s virtual machine service doesn’t offer live migration for planned maintenance, you have to take into account that virtual machines in the same Update Domain will go down when a reboot is required after an update to the Azure service. You should also consider Azure Fault Domains (FD): an FD defines a group of virtual machines that share a common power source and network switch. In case of a failure, virtual machines in the same FD will go down at the same time.
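The effect of Update Domains and Fault Domains is easy to see in a small simulation. The sketch below is a toy model, not Azure’s actual placement logic: the round-robin placement, the domain counts (2 fault domains, 5 update domains) and the function names are all assumptions made for illustration.

```python
def place_vms(vm_names, fault_domains=2, update_domains=5):
    """Hypothetical round-robin placement of VMs in one availability set.
    Returns {vm_name: (fault_domain, update_domain)}."""
    return {
        name: (index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    }


def survivors(placement, failed_fd=None, rebooting_ud=None):
    """VMs that stay up when one fault domain fails and/or
    one update domain is rebooted for maintenance."""
    return [
        name
        for name, (fd, ud) in placement.items()
        if fd != failed_fd and ud != rebooting_ud
    ]


single = place_vms(["web-0"])
pair = place_vms(["web-0", "web-1"])

# One VM: a reboot of its update domain takes the whole application down.
print(survivors(single, rebooting_ud=0))

# Two VMs in the same availability set: one of them keeps running.
print(survivors(pair, rebooting_ud=0))
print(survivors(pair, failed_fd=1))
```

This is why the SLA only applies from two instances onwards: with a single virtual machine there is nothing left running during a reboot or a fault domain failure.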
At this point you might think: but I don’t want my application to go down in case of a failure, and certainly not in case of an update! Well, design your application with failure in mind. Microsoft Azure is more or less a best-effort cloud: more suitable for cloud native applications and less suitable for enterprise or traditional applications that require a highly available cloud infrastructure. VMware’s public cloud service, vCloud Air, uses VMware vSphere as the underlying virtualization platform and a customized version of vCloud Director for multi-tenancy and self-service. vCloud Air leverages features like vMotion, HA and DRS, which will certainly improve the availability of the cloud service. Note that Microsoft is not running System Center Virtual Machine Manager and Hyper-V in the Azure cloud; they’re using a customized version of their hypervisor for Azure. In the end you can run enterprise/traditional applications in the Azure cloud, but keep in mind the availability you want and how Azure is designed. Dev/test workloads might be more suitable. In the case of VMware vCloud Air you can run both production and dev/test workloads, because VMware’s cloud is a more reliable cloud.
Another factor to consider when evaluating cloud options is which virtualization platform you are running on-premises. The optimal integration is achieved with Microsoft SCVMM/Hyper-V -> Azure and VMware vSphere -> vCloud Air. On top of this you should think of these options in a broader perspective: what do you want to achieve with cloud, what are your business drivers and what is your cloud strategy? Enough food for thought for another article on this subject.
I hope this is useful. You can leave your thoughts in the comment box below!