One of the top breakout sessions I attended at VMworld 2014 was session BCO1916: Site Recovery Manager and Stretched Storage Tech Preview. This session is about two important Disaster Recovery (DR) architectures in the virtualization space and how they might (will) be combined in a future version of Site Recovery Manager (SRM):
- Disaster recovery using a stretched cluster architecture (active/active);
- Disaster recovery using a ‘traditional’ SRM architecture (active/active or active/standby).
As a consultant I frequently have the discussion with customers: what is the best DR architecture/solution for your environment? Unfortunately there’s no single correct answer and “it just depends” on the exact case. Specifically for The Netherlands (where I live), a tiny country with little chance of a big disaster like an earthquake or hurricane, a geographic dispersion of 10 km is often considered a true DR solution. This contrasts with the US, where greater distances are required for a good DR solution. Because of the limited requirements regarding geographic dispersion, a stretched cluster configuration is quite popular in my region.
Before I elaborate on NextGen SRM, let’s first have a closer look at the differences between current SRM and Stretched Cluster solutions. The next figure details some of the key differences:
This figure is taken from the “Stretched Clusters and VMware vCenter Site Recovery Manager” whitepaper available through vmware.com.
Some important conclusions here are:
- There’s a big difference between Disaster Recovery and Disaster Avoidance. Disaster Recovery is the process in which you recover your datacenter at the secondary site from an unexpected outage at the primary site (which is partially or completely lost). Disaster Avoidance is the process in which a disaster is about to strike but you still have time to move the workload to the secondary site. In such a case both datacenters are still up and running.
- SRM is a better Disaster Recovery (DR) solution while a Stretched Cluster solution is a better Disaster Avoidance (DA) solution. SRM’s orchestration options are a lot better and it allows you to create a pre-defined recovery plan. This recovery plan includes the exact steps to execute in case of a failover. SRM always involves downtime when a failover is initiated. A stretched cluster allows you to vMotion virtual machines from the primary datacenter to the secondary datacenter, which is ideal for DA purposes; however, some manual reconfiguration of the storage solution might be necessary. In case of an unexpected outage of the primary site, HA is used to restart your virtual machines. HA lacks the orchestration options of SRM, which may result in problems when virtual machines are started in an incorrect order.
- A stretched cluster solution is the better solution for live site balancing; you can move virtual machines between the sites to balance the load. There’s one pitfall here: because the storage volumes are actively pinned to one of the datacenters, your virtual machines must run in the datacenter location where the data lives. Because of this you have to configure DRS groups, which pin virtual machines to the correct datacenter.
- Another challenge in a stretched cluster scenario is: where will your vCenter Server live? Nowadays a stretched cluster configuration will always use just one vCenter Server configured with one cluster, while SRM always requires two vCenter Servers to be in place.
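The site-pinning pitfall above can be illustrated with a small sketch. It is not vSphere or DRS code; the datastore names, VM names and site labels are all made up, and the only assumption is that you know which site each stretched volume is pinned to. It derives the per-site VM groups you would then configure as DRS groups, and flags VMs that currently run away from their data.

```python
from collections import defaultdict

# Hypothetical inventory (all names invented for this example):
# the site each stretched volume is pinned to, and the datastore of each VM.
DATASTORE_SITE = {"ds-gold-01": "site-a", "ds-gold-02": "site-b"}
VM_DATASTORE = {"web01": "ds-gold-01", "db01": "ds-gold-01", "app01": "ds-gold-02"}


def build_vm_groups(vm_datastore, datastore_site):
    """Group VMs by the site that owns their datastore."""
    groups = defaultdict(list)
    for vm, datastore in sorted(vm_datastore.items()):
        groups[datastore_site[datastore]].append(vm)
    return dict(groups)


def find_misplaced(vm_groups, vm_running_site):
    """Return the VMs that run at a different site than their data."""
    return sorted(
        vm
        for site, vms in vm_groups.items()
        for vm in vms
        if vm_running_site.get(vm) != site
    )


groups = build_vm_groups(VM_DATASTORE, DATASTORE_SITE)
running = {"web01": "site-a", "db01": "site-b", "app01": "site-b"}
misplaced = find_misplaced(groups, running)
print(groups)
print(misplaced)
```

In this example db01 is reported as misplaced: it runs at site-b while its datastore is pinned to site-a, which is exactly the situation the DRS groups are meant to prevent.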
As you can see, both SRM and stretched cluster solutions have their advantages and disadvantages. Let’s see how we can improve these solutions and eliminate some of the shortcomings both solutions have.
Improve SRM and stretched cluster architectures?
You can think of two ways to improve SRM and stretched cluster architectures:
- Improve the orchestration features of HA in such a way that HA will allow you to create some kind of recovery plan which can be used in a stretched cluster configuration. In this case you’re still facing the challenge of running only one vCenter Server.
- Improve SRM in such a way that it will also support stretched cluster configurations.
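To make the idea of a recovery plan a bit more concrete, here is a minimal sketch of what ordered startup means compared to an unordered restart. It is a simplified model and not SRM's or HA's actual interface: the VM names and the numeric priority tiers are assumptions for illustration, with a lower number meaning "start earlier".

```python
from itertools import groupby

# Hypothetical plan: (vm_name, priority); lower priority numbers start first.
PLAN = [("web01", 3), ("db01", 1), ("app01", 2), ("dc01", 1)]


def startup_batches(plan):
    """Return lists of VMs that may start together, ordered by priority tier."""
    ordered = sorted(plan, key=lambda item: (item[1], item[0]))
    return [
        [vm for vm, _ in tier]
        for _, tier in groupby(ordered, key=lambda item: item[1])
    ]


batches = startup_batches(PLAN)
for number, batch in enumerate(batches, start=1):
    print(f"step {number}: power on {', '.join(batch)}")
```

With this plan the domain controller and database come up first, then the application server, then the web server. A plain restart without such ordering gives no guarantee that the database is available before the application that depends on it.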
It will not surprise you, but session BCO1916 is about the second option. Let’s take a closer look at this new architecture:
(picture was taken from the BCO1916 presentation)
In this new scenario SRM will manage your stretched cluster and allow you to do a classic “SRM failover” or, in case both datacenters are available, a vMotion-based failover (ideal for Disaster Avoidance).
A key feature required for this setup is Cross vCenter vMotion, one of the new features in vSphere 6. An improvement to the SRA (Storage Replication Adapter) is also required, of course, because the SRA has to support stretched storage configurations.
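The choice between the two failover types can be summarized in a few lines. This is only a sketch of the decision as I understood it from the session, not SRM code; the function name, the enum and the per-VM "on stretched storage" flag are assumptions made for this example.

```python
from enum import Enum


class FailoverMode(Enum):
    LIVE_MIGRATION = "vMotion-based failover (disaster avoidance)"
    CLASSIC_RECOVERY = "classic SRM failover (VMs restart at the recovery site)"


def choose_failover_mode(protected_site_up: bool, on_stretched_storage: bool) -> FailoverMode:
    """Pick the failover type for one VM (hypothetical helper).

    A live migration needs the source site to be running and the VM's
    storage to be stretched across both sites; in every other case the
    VM has to be recovered the classic way.
    """
    if protected_site_up and on_stretched_storage:
        return FailoverMode.LIVE_MIGRATION
    return FailoverMode.CLASSIC_RECOVERY


print(choose_failover_mode(True, True).value)
print(choose_failover_mode(False, True).value)
```

The second call shows the Disaster Recovery case: once the protected site is gone there is nothing left to migrate from, so only the classic failover remains, even for VMs on stretched storage.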
Advantages of this new architecture are:
- “SRM managed” stretched cluster architecture;
- Implementation of a recovery plan for both classic failovers and stretched cluster-based failovers;
- Freedom of choice: you can make some VMs members of a stretched cluster environment and protect others with a traditional SRM-like failover;
- SRM test bubble functionality is available;
- IP customization is available for failover test;
- No more site preference configuration through DRS groups; you have two different datacenters available;
- No single vCenter Server, which eliminates this Single Point of Failure.
Disadvantages of this new architecture are:
- Extra licensing costs: a secondary vCenter Server and Site Recovery Manager are both required;
- Extra complexity in some respect (SRM is an add-on product);
- No DRS across sites, although it’s questionable if this is something you want.
I think the advantages outweigh the disadvantages and this is a great development in the Disaster Avoidance/Recovery space.
Another valid question is whether the old HA-based stretched cluster option will remain valid. I think this can still be a good solution, depending on the exact requirements. As far as I understood, at this point no development is planned to enhance the orchestration features of HA.
For further information about the future SRM I would suggest watching this YouTube video, which is a recording of session BCO1916: Site Recovery Manager and Stretched Storage Tech Preview.