Business continuity and disaster recovery

This tutorial covers business continuity and disaster recovery (DR) in Google Cloud. This part gives an overview of the DR planning process: what you need to know in order to design and implement a DR plan.

DR is a subset of business continuity planning. DR planning begins with a business impact analysis that defines two key metrics:

Firstly, a recovery time objective (RTO), which is the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).
Secondly, a recovery point objective (RPO), which is the maximum acceptable length of time during which data might be lost from your application due to a major incident. For example, user data that are frequently modified could have an RPO of just a few minutes. In contrast, less critical, infrequently modified data could have an RPO of several hours.
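
To make the two metrics concrete, the following minimal Python sketch checks a single incident against an RTO and an RPO. The `Incident` class, the `meets_objectives` function, and the timestamps are hypothetical and exist only for illustration; downtime is compared with the RTO, and the gap between the last recoverable copy of the data and the start of the outage is compared with the RPO.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_backup: datetime   # most recent recoverable copy of the data
    outage_start: datetime       # when the application went offline
    service_restored: datetime   # when the application was back online


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Compare actual downtime and data-loss window with the RTO and RPO."""
    downtime = incident.service_restored - incident.outage_start
    data_loss_window = incident.outage_start - incident.last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Illustrative values only: 45 minutes offline, last backup 10 minutes before the outage.
incident = Incident(
    last_good_backup=datetime(2024, 1, 1, 11, 50),
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 45),
)
result = meets_objectives(incident, rto=timedelta(hours=1), rpo=timedelta(minutes=5))
print(result["rto_met"], result["rpo_met"])  # True False
```

In this example the service came back within the one-hour RTO, but ten minutes of data fell outside the five-minute RPO, so the backup frequency would need to increase.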

Creating a detailed DR plan

Design according to your recovery goals

When you design your DR plan, you need to combine your application and data recovery techniques and look at the bigger picture. The typical way to do this is to look at your RTO and RPO values and decide which DR pattern you can adopt to meet those values. For example, in the case of historical compliance-oriented data, you probably don’t need speedy access to the data, so a large RTO value and a cold DR pattern are appropriate. However, if your online service experiences an interruption, you’ll want to be able to recover both the data and the customer-facing part of the application as quickly as possible. In that case, a hot pattern would be more appropriate. Your email notification system, which typically isn’t business-critical, is probably a candidate for a warm pattern.
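
One way to picture this decision is as a simple lookup from RTO to pattern. The sketch below is only an illustration: the cut-off values `HOT_MAX_RTO` and `WARM_MAX_RTO` and the workload RTOs are assumptions made up for the example, not recommendations, and a real decision would also weigh RPO and cost.

```python
from datetime import timedelta

# Hypothetical cut-offs; choose values that match your own SLAs.
HOT_MAX_RTO = timedelta(minutes=15)
WARM_MAX_RTO = timedelta(hours=4)


def choose_dr_pattern(rto: timedelta) -> str:
    """Return 'hot', 'warm', or 'cold' for a given recovery time objective."""
    if rto <= HOT_MAX_RTO:
        return "hot"
    if rto <= WARM_MAX_RTO:
        return "warm"
    return "cold"


# Illustrative workloads mirroring the examples in the text.
workloads = {
    "customer-facing service": timedelta(minutes=5),
    "email notifications": timedelta(hours=2),
    "compliance archive": timedelta(days=2),
}
patterns = {name: choose_dr_pattern(rto) for name, rto in workloads.items()}
for name, pattern in patterns.items():
    print(f"{name}: {pattern}")
```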

Implementing control measures

Add controls to prevent disasters from occurring and to detect issues before they cause damage. For example, add a monitor that sends an alert when a data-destructive flow, such as a deletion pipeline, exhibits unexpected spikes or other unusual activity. This monitor could also terminate the pipeline processes if a certain deletion threshold is reached, thus preventing a catastrophic situation.
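
The deletion monitor described above could be sketched roughly as follows. This is a minimal in-process illustration, not a Google Cloud API: the `DeletionMonitor` class, its thresholds, and the `alert` and `stop_pipeline` callbacks are all hypothetical, and in practice the counts would come from your pipeline's own metrics.

```python
from collections import deque
from typing import Callable


class DeletionMonitor:
    """Tracks deletions over a sliding window of recent batches."""

    def __init__(
        self,
        alert_threshold: int,
        stop_threshold: int,
        window: int,
        alert: Callable[[int], None],
        stop_pipeline: Callable[[], None],
    ) -> None:
        if alert_threshold > stop_threshold:
            raise ValueError("alert_threshold must not exceed stop_threshold")
        self.alert_threshold = alert_threshold
        self.stop_threshold = stop_threshold
        self._recent = deque(maxlen=window)  # oldest batch drops out automatically
        self._alert = alert
        self._stop_pipeline = stop_pipeline

    def record(self, deleted_count: int) -> str:
        """Record one batch and return 'ok', 'alerted', or 'stopped'."""
        self._recent.append(deleted_count)
        total = sum(self._recent)
        if total >= self.stop_threshold:
            self._stop_pipeline()
            return "stopped"
        if total >= self.alert_threshold:
            self._alert(total)
            return "alerted"
        return "ok"


events = []
monitor = DeletionMonitor(
    alert_threshold=100,
    stop_threshold=500,
    window=3,
    alert=lambda total: events.append(f"alert:{total}"),
    stop_pipeline=lambda: events.append("stop"),
)
statuses = [monitor.record(n) for n in (10, 120, 400)]
print(statuses)  # ['ok', 'alerted', 'stopped']
print(events)    # ['alert:130', 'stop']
```

Using two thresholds keeps the alert and the hard stop separate, so operators get a warning before the pipeline is terminated.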

Preparing your software

Part of your DR planning is to make sure that the software you rely on is ready for a recovery event.

Firstly, verify that you can install your software. Make sure that your application software can be installed from source or from a preconfigured image. Further, make sure that you are appropriately licensed for any software that you will deploy on Google Cloud.
Secondly, design continuous deployment for recovery. Your continuous deployment (CD) toolset is an integral component when you are deploying your applications. However, as part of your recovery plan, you must consider where in your recovered environment you will deploy artifacts. Plan where you want to host your CD environment and artifacts.

Implementing security and compliance controls

When you design a DR plan, security is important. The same controls that you have in your production environment must apply to your recovered environment. Compliance regulations will also apply to your recovered environment.

Configure security the same for the DR and production environments

Make sure that your network controls provide the same separation and blocking that the source production environment uses. Learn how to configure Shared VPC and Google Cloud firewalls. Next, understand how to use service accounts to implement least privilege for applications that access Google Cloud APIs. Further, make sure to use service accounts as part of the firewall rules.

The following list outlines ways to synchronize permissions between environments:

Firstly, if your production environment is Google Cloud, replicating IAM policies in the DR environment is straightforward. You can use infrastructure as code (IaC) methods and employ tools such as Cloud Deployment Manager to deploy the same IAM policies to both the production and the DR environment.
Secondly, if your production environment is on-premises, you map the functional roles, such as your network administrator and auditor roles, to IAM policies that have the appropriate IAM roles.
Thirdly, you have to configure IAM policies to grant appropriate permissions to products. For example, you might want to restrict access to specific Cloud Storage buckets.
Lastly, if your production environment is another cloud provider, map the permissions in the provider’s IAM policies to Google Cloud IAM policies.
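
As an illustration of mapping functional roles to IAM roles, the sketch below builds the `bindings` list of an IAM policy from a role map. The mapping in `ROLE_MAP`, the group addresses, and the `build_bindings` helper are assumptions chosen for the example; which predefined roles fit your network administrators or auditors depends on your organization.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative mapping from on-premises functional roles to predefined IAM roles.
ROLE_MAP: Dict[str, List[str]] = {
    "network-administrator": ["roles/compute.networkAdmin"],
    "auditor": ["roles/iam.securityReviewer", "roles/logging.viewer"],
}


def build_bindings(assignments: Dict[str, List[str]]) -> List[dict]:
    """Turn {functional role: [members]} into IAM policy bindings."""
    members_by_role = defaultdict(set)
    for functional_role, members in assignments.items():
        if functional_role not in ROLE_MAP:
            raise KeyError(f"no IAM mapping defined for {functional_role!r}")
        for iam_role in ROLE_MAP[functional_role]:
            members_by_role[iam_role].update(members)
    # Sort roles and members so the output is stable and easy to diff.
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
    ]


# Hypothetical group addresses.
assignments = {
    "network-administrator": ["group:netadmins@example.com"],
    "auditor": ["group:auditors@example.com"],
}
bindings = build_bindings(assignments)
for binding in bindings:
    print(binding["role"], binding["members"])
```

Keeping the mapping in one reviewed data structure means the same input can be applied to both the production and the DR environment.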

Verify your DR security

After you’ve configured permissions for the DR environment, make sure that you test everything. Create a test environment. Then, use IaC methods that employ tools like Deployment Manager to deploy your Google Cloud policies to the test environment. Finally, verify that the access that you grant users confers the same permissions that the users are granted on-premises.
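
Part of this verification can be automated by comparing the policy you expect with the one actually deployed. The following sketch assumes both policies are available as dictionaries in the IAM policy shape (a `bindings` list of `role` and `members`); the `diff_policies` helper and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def binding_pairs(policy: dict) -> Set[Tuple[str, str]]:
    """Flatten a policy into a set of (role, member) pairs."""
    return {
        (binding["role"], member)
        for binding in policy.get("bindings", [])
        for member in binding.get("members", [])
    }


def diff_policies(expected: dict, deployed: dict) -> Dict[str, List[Tuple[str, str]]]:
    """Report grants missing from, or unexpectedly present in, the deployed policy."""
    expected_pairs = binding_pairs(expected)
    deployed_pairs = binding_pairs(deployed)
    return {
        "missing": sorted(expected_pairs - deployed_pairs),
        "unexpected": sorted(deployed_pairs - expected_pairs),
    }


# Illustrative policies with hypothetical members.
expected_policy = {
    "bindings": [
        {"role": "roles/logging.viewer", "members": ["group:auditors@example.com"]},
        {"role": "roles/compute.networkAdmin", "members": ["group:netadmins@example.com"]},
    ]
}
deployed_policy = {
    "bindings": [
        {"role": "roles/logging.viewer", "members": ["group:auditors@example.com"]},
        {"role": "roles/storage.admin", "members": ["user:temp@example.com"]},
    ]
}
report = diff_policies(expected_policy, deployed_policy)
print(report["missing"])
print(report["unexpected"])
```

An empty report is necessary but not sufficient: it shows the grants match, while a login test by real users confirms that the access works end to end.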

Make sure users can log in to the DR environment

Similarly, don’t wait for a disaster to occur before checking that your users can access the DR environment. Make sure that you have granted appropriate access rights to users, developers, operators, data scientists, security administrators, and network administrators. If you are using an alternative identity system, make sure that accounts have been synced with your Cloud Identity account.

Make sure that the DR environment meets compliance requirements

Verify that access to your DR environment is restricted to only those who need access. Make sure that PII is redacted and encrypted. If you perform regular penetration tests on your production environment, you should include your DR environment as part of that scope and carry out regular tests by standing up a DR environment.

Making sure your DR plan works

You want your planning to pay off: if disaster does strike, everything should work as you intend.

Maintain more than one data recovery path

In the event of a disaster, your connection method to Google Cloud might become unavailable. Implement an alternative means of access to Google Cloud to help ensure that you can transfer data to Google Cloud. Regularly test that the backup path is operational.
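
A regular check of each access path might look like the sketch below. The probes are injected as plain functions so the example stays self-contained; the path names and the simulated outage are hypothetical, and a real probe would attempt an actual connection or a small test transfer over that path.

```python
from typing import Callable, Dict, Optional


def probe_paths(probes: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every probe; a probe that raises counts as an unavailable path."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def first_available(results: Dict[str, bool]) -> Optional[str]:
    """Return the first path (in priority order) that passed its probe."""
    for name, ok in results.items():
        if ok:
            return name
    return None


def primary_path() -> bool:
    raise ConnectionError("primary link down")  # simulated outage


def backup_path() -> bool:
    return True  # simulated healthy backup


results = probe_paths({"primary": primary_path, "backup": backup_path})
print(results)                   # {'primary': False, 'backup': True}
print(first_available(results))  # backup
```

Probing every path on each run, rather than stopping at the first success, is what reveals a broken backup path while the primary is still healthy.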

Test your plan regularly

After you have a DR plan in place, test it regularly, noting any issues that come up and adjusting your plan accordingly. Using Google Cloud, you can test recovery scenarios at a minimal cost. We recommend that you implement the following in order to help with your testing:

Firstly, automate infrastructure provisioning with Deployment Manager. You can use Deployment Manager to automate the provisioning of VM instances and other Google Cloud infrastructure. If you’re running your production environment on-premises, make sure that you have a monitoring process that can start the DR process when it detects a failure and can trigger the appropriate recovery actions.
Secondly, monitor and debug your tests with Cloud Logging and Cloud Monitoring. Google Cloud has excellent logging and monitoring tools that you can access through API calls, allowing you to automate the deployment of recovery scenarios by reacting to metrics. When you’re designing tests, make sure that you have appropriate monitoring and alerting in place that can trigger appropriate recovery actions.
Lastly, perform the testing noted earlier:
- Test that permissions and user access work in the DR environment like they do in the production environment.
- Perform penetration testing on your DR environment.
- Perform a test in which your usual access path to Google Cloud doesn’t work.
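
The monitoring process mentioned above, one that starts the DR process when it detects a failure, can be sketched as a small state machine. The `FailoverTrigger` class, the failure limit, and the `start_recovery` callback are hypothetical; requiring several consecutive failed health checks is one common way to avoid starting recovery on a single transient error.

```python
from typing import Callable


class FailoverTrigger:
    """Starts recovery once after N consecutive failed health checks."""

    def __init__(self, failure_limit: int, start_recovery: Callable[[], None]) -> None:
        if failure_limit < 1:
            raise ValueError("failure_limit must be at least 1")
        self.failure_limit = failure_limit
        self.triggered = False
        self._failures = 0
        self._start_recovery = start_recovery

    def observe(self, healthy: bool) -> bool:
        """Record one health check; return True if this check started recovery."""
        if healthy:
            self._failures = 0  # any success resets the streak
            return False
        self._failures += 1
        if self._failures >= self.failure_limit and not self.triggered:
            self.triggered = True
            self._start_recovery()
            return True
        return False


recoveries = []
trigger = FailoverTrigger(failure_limit=3, start_recovery=lambda: recoveries.append("started"))
checks = [True, False, False, True, False, False, False, False]
fired = [trigger.observe(healthy) for healthy in checks]
print(fired.index(True))  # 6
print(recoveries)         # ['started']
```

The `triggered` flag ensures that recovery starts only once, even if health checks keep failing while the DR environment is being brought up.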

Reference: Google Documentation