Recently I had a discussion with one of my co-workers about Recovery Point Objective, or RPO. RPO is the maximum targeted period in which data might be lost from an IT service due to a major incident. RPO can be zero, a couple of minutes, a few hours, or maybe even a day or a couple of days. RPO is an important factor when designing a BC/DR solution and its replication strategy.
When talking about BC/DR solutions in virtual environments, you can leverage array based replication, hypervisor based replication and/or guest OS based replication. For example:
- VMware’s Site Recovery Manager can integrate with both array based replication (provided by the storage vendor) as well as hypervisor based replication using vSphere Replication;
- Zerto Virtual Replication is a hypervisor based replication solution for both Microsoft and VMware virtual environments. Zerto also supports replication to AWS and Azure;
- VMware vCloud Air DR uses vSphere Replication;
- Azure Site Recovery uses Hyper-V Replica for Microsoft virtual environments, whereas in-guest replication is used for VMware virtual environments.
I recommend reading the “DR for virtual environments” comparison at WhatMatrix.com if you want to learn more about these different solutions.
Ok, now back to the title of this article: “What do you prefer…a low RPO or a configurable RPO?”
Products that are currently available in the market take different approaches to replication and RPO. With array based replication, for example, you can set a replication interval for a LUN/volume; the array takes a snapshot at the configured interval and then replicates the data. Data consistency can be a challenge, although some vendors provide integration with techniques like VSS. Notice that replication interval and RPO are different things: the replication interval is how often data is replicated, while RPO is the maximum targeted period in which data might be lost in case of a major incident. You cannot say: my RPO is 4 hours, so I set a replication interval of 4 hours. In that scenario a replication starts every 4 hours, but the snapshot it carries is only usable at the DR site once the transfer has finished, so while a replication is still running you are violating the RPO:
- At 12:00 pm your replication starts; it takes 45 minutes and finishes at 12:45 pm;
- The next replication starts at 4:00 pm, again takes 45 minutes, and thus finishes at 4:45 pm;
- If a major incident occurs at 4:30 pm (replication is not yet finished), the latest replicated dataset is from 12:00 pm, so you lose 4.5 hours of data, which exceeds the 4-hour RPO.
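The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are made up for this article, the transfer time is treated as constant, and a snapshot is considered usable only once its transfer has completed.

```python
from datetime import datetime, timedelta


def data_loss_at(incident, first_start, interval, transfer_time):
    """Return the time span of data lost if a disaster strikes at `incident`.

    Assumes a snapshot is taken every `interval` starting at `first_start`
    and that each transfer takes a constant `transfer_time` (a simplification).
    """
    # A snapshot is only usable once its transfer has finished.
    elapsed = incident - first_start - transfer_time
    if elapsed < timedelta(0):
        raise ValueError("no replication had completed before the incident")
    completed_cycles = elapsed // interval  # timedelta // timedelta gives an int
    last_good_snapshot = first_start + completed_cycles * interval
    return incident - last_good_snapshot


def worst_case_data_loss(interval, transfer_time):
    """Worst case: the incident hits just before the next transfer completes."""
    return interval + transfer_time


noon = datetime(2024, 1, 1, 12, 0)  # arbitrary date, only the times matter
loss = data_loss_at(
    incident=noon + timedelta(hours=4, minutes=30),
    first_start=noon,
    interval=timedelta(hours=4),
    transfer_time=timedelta(minutes=45),
)
print(loss)  # 4:30:00
print(worst_case_data_loss(timedelta(hours=4), timedelta(minutes=45)))  # 4:45:00
```

Under these assumptions, meeting a 4-hour RPO with a 45-minute transfer would require a replication interval of at most 3 hours and 15 minutes.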
VMware’s vSphere Replication allows you to configure the required RPO and will automatically optimize the replication schedule. Because the replication transfer time largely determines whether the desired RPO can be met (the same as with array based replication), vSphere Replication uses the duration of the last few replications to estimate the required replication transfer time. The following example shows how this works (taken from the vSphere Replication documentation):
Assume that during replication configuration you set the RPO to 15 minutes. If the synchronization starts at 12:00 and it takes 5 minutes to transfer to the target site, the instance becomes available on the target site at 12:05, but it reflects the state of the virtual machine at 12:00. The next synchronization can start no later than 12:10. This replication instance is then available at 12:15 when the first replication instance that started at 12:00 expires.
If you set the RPO to 15 minutes and the replication takes 7.5 minutes to transfer an instance, vSphere Replication transfers an instance all the time. If the replication takes more than 7.5 minutes, the replication encounters periodic RPO violations. For example, if the replication starts at 12:00 and takes 10 minutes to transfer an instance, the replication finishes at 12:10. You can start another replication immediately, but it finishes at 12:20. During the time interval 12:15-12:20, an RPO violation occurs because the latest available instance started at 12:00 and is too old.
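The numbers in the documentation example follow from simple arithmetic, sketched below. This is a simplified model with hypothetical function names and a constant transfer time; it is not how vSphere Replication schedules internally, which estimates transfer time from recent replications.

```python
def latest_next_start(rpo_min, transfer_min):
    """Minutes after the previous start by which the next sync must begin.

    A negative result means the RPO cannot be met at this transfer time.
    """
    return rpo_min - transfer_min


def violation_minutes_per_cycle(rpo_min, transfer_min):
    """Minutes of RPO violation per cycle when syncs run back to back."""
    # With back-to-back syncs the newest available instance is at most
    # 2 * transfer_min old; any age beyond rpo_min counts as a violation.
    return min(transfer_min, max(0, 2 * transfer_min - rpo_min))


print(latest_next_start(15, 5))              # 10 -> next sync no later than 12:10
print(violation_minutes_per_cycle(15, 7.5))  # 0  -> replicating all the time, no violation
print(violation_minutes_per_cycle(15, 10))   # 5  -> the 12:15-12:20 window
```

In this model, the tipping point is a transfer time of half the RPO: above that, even continuous back-to-back replication cannot avoid periodic violations.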
Some solutions (like Zerto and Azure Site Recovery for VMware) offer a “near zero” RPO: the DR solution tries to satisfy a near zero RPO by continuously replicating all changes to the failover site. This is done asynchronously, so you have no guarantee that the RPO of almost zero is actually achieved. This is in contrast to synchronous replication, which marks a write as complete only after it has been written to both the primary and the secondary site. Notice that this is not a discussion about the best option; it is more about the characteristics of both options. Zerto specifically provides a unique feature called journal based replication/protection. With the launch of version 5.0 they’re able to provide a synchronization journal of the last 30 days (this used to be 15 days). This means you can recover to a point in time, which allows you to recover not only from a datacenter wide disaster but also from database corruption, faulty upgrades and virus or cryptoware outbreaks. Think of Zerto not only as a DR solution but also as a point-in-time backup/restore solution.
A lower RPO puts higher demands on the bandwidth between the original and the DR site. You will face challenges especially when you have an IT environment with a high data change rate. For example, if you have a server that runs a batch job several times a day and changes a lot of data, a near zero RPO solution has to replicate all these changes. When the same data is changed over and over again, you’re repeatedly replicating data that is no longer relevant at the end of the day. If the link cannot keep up with the change rate, you might end up with a higher RPO than designed.
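To get a feeling for when a burst of changes outruns the link, here is a rough back-of-the-envelope sketch. The function name and all numbers are hypothetical, and the model ignores compression, deduplication and other traffic on the link.

```python
def catch_up_seconds(burst_mb, burst_seconds, link_mb_per_s):
    """Seconds the DR site still needs to catch up after a write burst ends.

    Returns 0.0 when the link can keep up with the burst's write rate.
    """
    write_rate = burst_mb / burst_seconds
    if write_rate <= link_mb_per_s:
        return 0.0
    # Data that could not be shipped while the burst was running.
    backlog_mb = burst_mb - link_mb_per_s * burst_seconds
    return backlog_mb / link_mb_per_s


# Hypothetical batch job: 60,000 MB changed in 10 minutes over a 25 MB/s link.
print(catch_up_seconds(60_000, 600, 25))  # 1800.0 seconds, i.e. 30 minutes behind
```

During that catch-up period the achieved RPO is clearly not “near zero”, even though the product is configured that way.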
This scenario is different when you’ve configured an RPO of, let’s say, 4 hours (and you’re not using a near zero RPO strategy). If the batch job runs every hour and changes the same data over and over again, your DR solution only has to replicate the result of the latest batch job in each replication cycle. In this case it might be easier to satisfy the required RPO.
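The difference between the two approaches can be illustrated with a toy model. The block size, block counts and function names below are assumptions made up for this sketch; real products track changes in their own ways.

```python
BLOCK_SIZE_MB = 1  # hypothetical block size


def continuous_mb(writes):
    """Near zero RPO: every single write is shipped to the DR site."""
    return len(writes) * BLOCK_SIZE_MB


def interval_mb(writes, interval_hours):
    """Configured RPO: per interval, each changed block is shipped only once."""
    changed = {}
    for hour, block in writes:
        window = hour // interval_hours
        changed.setdefault(window, set()).add(block)
    return sum(len(blocks) for blocks in changed.values()) * BLOCK_SIZE_MB


# A batch job rewrites the same 1000 blocks every hour for 8 hours.
writes = [(hour, block) for hour in range(8) for block in range(1000)]

print(continuous_mb(writes))   # 8000
print(interval_mb(writes, 4))  # 2000
```

In this toy case, replicating every 4 hours moves a quarter of the data, because three out of every four rewrites of a block never need to leave the site.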
So, you have to decide what’s more important for you: a low RPO or a configurable RPO? Do you need synchronous or asynchronous replication? Some solutions provide a near zero RPO. This can be attractive, but you have to do some research on your applications and how they behave. Maybe you will never achieve the low RPO because of the change rate of your data and the bandwidth that is required. A configurable RPO might be more appropriate in some scenarios. In that case you can set the RPO to 15 minutes, 1 hour or maybe 4 hours, depending on the requirements of your organisation and the characteristics of your data set. On top of the RPO there are of course other considerations that will affect your decision. My “DR for virtual environments” comparison at whatmatrix.com will help you and will give you a comprehensive understanding of the solutions available in the market.
I hope this was helpful, thanks for reading.