See all blog posts in this series:
- From VMware to IBM Cloud VPC VSI, part 1: Introduction
- From VMware to IBM Cloud VPC VSI, part 2: VPC network design
- From VMware to IBM Cloud VPC VSI, part 3: Migrating virtual machines
- From VMware to IBM Cloud VPC VSI, part 4: Backup and restore
- From VMware to IBM Cloud VPC VSI, part 5: VPC object model
- From VMware to IBM Cloud VPC VSI, part 6: Disaster recovery
In this article, we will briefly consider the native capabilities of IBM Cloud VPC VSI that you could use to build a disaster recovery solution, and compare them with alternative approaches.
Copied snapshots
As we saw previously, the IBM Cloud Backup for VPC backup policies allow you not only to schedule the creation of snapshots for your VSI volumes, but you can also schedule the copying of these snapshots to another IBM Cloud region. You could use this approach to perform periodic replication of all of your VSI data to another region for the purpose of disaster recovery. This approach has a number of limitations that you should take into consideration:
- We saw that for in-region snapshots, you can create policies that generate consistent snapshots of all of the volumes of an individual VSI, but not of multiple VSIs. By contrast, policy-based cross-region snapshot copies are available only if you snapshot volumes by tag, without regard to their associated VSI. These snapshots and copies are not write-order consistent even within a single VSI. To work around this limitation, you would need to use the API or CLI to invoke VSI-consistent snapshots and then separately invoke individual copies of each volume to another region.
- You could combine such automation with automated quiesce activity in your VSIs if you wanted to ensure that you had stable replicas, or even to achieve write-order consistency across a fleet of VSIs.
- Backup for VPC allows you to use crontab-style expressions to schedule the snapshot and copy. In principle, your snapshots and copies within a given region exist in a space-efficient chain. However, the size of your volumes affects the time it takes to perform the initial full copy from region to region. Furthermore, for performance reasons you will need to back off your snapshot and copy frequency based on your volume size if you want the cross-region copy to be incremental; see this reference table. For example, in my testing I had a 250 GB boot volume and needed to set my snapshot and copy frequency to 2 hours.
- The table seems to indicate a minimum expectation. I found that in some cases, even with the 2-hour interval, the copied snapshot size reflected a full copy rather than an incremental copy.
- In any case, whether the snapshot is being copied incrementally or in full, the copy time results in an effective RPO for this setup that is somewhat longer than the 2-hour interval. Although these copies are space efficient, they are not true replicas, which would typically have a much lower RPO.
- I recommend that you configure your policy to keep at least 3 copies in the source region and 3 copies in the destination region. This ensures not only that you have a viable copy at all times (which requires a minimum of 2 copies), but also that you are not asking the IBM Cloud storage system to calculate an increment from the most recent snapshot at the same time it is deleting and consolidating an older snapshot into that same snapshot.
- As we discussed in the posts on backup and restore and the VPC object model, you will need to plan to reconstitute all aspects of your environment, not just the storage volumes. Some resources, like public IP addresses, must change in the new region. You also need to reroute private network connectivity to the new region.
- When you recreate your VM in the new region, it will be provisioned with a new UUID, which will cause cloud-init to re-run; you should be prepared for its side effects, such as your root password and authorized SSH keys being reset.
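To reason about what these limitations mean for your recovery point, the arithmetic above can be sketched as a small helper. This is an illustrative sketch only: the 2-hour interval comes from my 250 GB test, but the 30-minute copy duration in the example is an assumption, not an IBM-documented value.

```python
from datetime import timedelta

def effective_rpo(snapshot_interval: timedelta, copy_duration: timedelta) -> timedelta:
    """Worst-case recovery point: restored data can be as stale as one full
    snapshot interval plus the time it takes the copy to land in the
    destination region."""
    return snapshot_interval + copy_duration

def retention_is_safe(copies_kept: int) -> bool:
    """At least 3 copies per region: 2 so that a viable copy exists at all
    times, plus 1 so that increment calculation and snapshot consolidation
    do not contend on the same snapshot."""
    return copies_kept >= 3

# Example: a 2-hour interval (my 250 GB boot volume) with an assumed
# 30-minute copy time yields an effective RPO of 2.5 hours.
rpo = effective_rpo(timedelta(hours=2), timedelta(minutes=30))
print(rpo)                   # 2:30:00
print(retention_is_safe(3))  # True
```

The point of the helper is that the interval you configure in the policy is a floor, not your RPO; you need to add the observed copy time (and a margin for the full-copy cases noted above) to get an honest number.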
Depending on your application and requirements, you may be able to work with these limitations. If not, you will need to devise an alternate approach.
Moving up the stack
It is well known that you need to move up the stack—or invest in solutions that stretch across layers of the stack—to achieve more stringent BCDR goals. For example, you may be able to leverage storage array replication for highly efficient replication with low RPOs, but you will need to pair this with a solution that is able to quiesce your file system or your database if you want your replicas to be transactionally consistent rather than merely crash consistent.
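As a concrete illustration of pairing quiesce with replication, here is a minimal sketch that freezes a Linux filesystem with `fsfreeze` (a standard util-linux tool) around a snapshot trigger. The mount point and the snapshot command are placeholders, and production code would thaw in a `finally` block; the `dry_run` flag lets you inspect the planned commands without root privileges.

```python
import subprocess

def quiesced_snapshot(mount_point: str, snapshot_cmd: list, dry_run: bool = False) -> list:
    """Freeze the filesystem, trigger the snapshot, then thaw.

    fsfreeze --freeze blocks new writes and flushes dirty data so the
    snapshot is filesystem-consistent rather than merely crash-consistent.
    Returns the command list so a dry run can be inspected.
    """
    commands = [
        ["fsfreeze", "--freeze", mount_point],
        snapshot_cmd,  # placeholder: e.g. your CLI or API call to create the snapshot
        ["fsfreeze", "--unfreeze", mount_point],
    ]
    if not dry_run:
        for cmd in commands:
            # NOTE: a real implementation must unfreeze even if the
            # snapshot step fails (try/finally), omitted here for brevity.
            subprocess.run(cmd, check=True)
    return commands

# Dry run against a hypothetical data volume: show what would be executed.
plan = quiesced_snapshot("/data", ["echo", "snapshot"], dry_run=True)
```

Database quiesce works the same way conceptually, except that the freeze and thaw steps become database-specific commands (for example, suspending writes before the snapshot and resuming them afterwards).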
Thus, enterprise architectures often need to leverage agent-based tools or application- and database-specific methods either to perform the replication or at least to orchestrate it. Such approaches are highly dependent on your solution architecture, including your choice of operating systems, application software, and messaging and database software.
Because of this, you need to investigate and evaluate which tools and techniques are suitable for your solution architecture and your disaster recovery objectives. For example, if you are using DB2, you might consider Q Replication or SQL Replication to replicate your database between IBM Cloud regions. Use of OS agents tends to be more common in the backup realm than in the disaster recovery realm, but this may be a viable option for you depending on your RPO. However, for agent-based backups you will need to investigate whether your recovery options are limited by the current lack of support for booting a VSI from an ISO image.
Approaches like this typically depend on having active infrastructure running in both your production and DR locations. This complicates some aspects of planning and execution; for example, your replicated infrastructure will likely not have the same IP addressing as your original infrastructure, and you will likely use DNS updates to hide this from your application users. On the other hand, it simplifies other aspects of your planning and execution, because you will have pre-created most of the necessary resources instead of needing to create them during the failover.
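For the DNS cutover mentioned above, one common mechanism is dynamic DNS update via BIND's `nsupdate` tool, scripted as part of your failover runbook. The zone, record name, TTL, and address below are purely hypothetical placeholders; adapt them to whatever DNS service fronts your application.

```
; nsupdate input file -- hypothetical zone and addresses
server ns1.example.com
zone example.com
update delete app.example.com. A
update add app.example.com. 60 A 10.240.64.5
send
```

A short TTL on the records you plan to repoint is what keeps this cutover fast; decide on that TTL well before a disaster, not during one.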