Using Geo-redundancy for designing highly available applications


In this tutorial, we will learn about designing applications to handle an outage in the primary region using geo-redundancy.

Cloud-based infrastructures like Azure Storage provide a highly available and durable platform for hosting data and applications. Azure Storage offers geo-redundant storage to ensure high availability even in the event of a regional outage. In a storage account configured for geo-redundancy, data is replicated synchronously within the primary region and then replicated asynchronously to a secondary region.

Azure Storage offers two options for geo-redundant replication. The only difference between them is how data is replicated within the primary region:

  • Firstly, Geo-zone-redundant storage (GZRS). Here, data is replicated synchronously across three Azure availability zones in the primary region using zone-redundant storage (ZRS), and then replicated asynchronously to the secondary region. 
  • Secondly, Geo-redundant storage (GRS). Here, data is replicated synchronously three times within the primary region using locally redundant storage (LRS), and then replicated asynchronously to the secondary region. 

Application design considerations when reading from the secondary

When designing your application for RA-GRS or RA-GZRS, keep the following points in mind:

  • Firstly, Azure Storage maintains a read-only copy of the data you store in your primary region in a secondary region. 
  • Secondly, the read-only copy is eventually consistent with the data in the primary region.
  • Thirdly, for blobs, tables, and queues, you can query the secondary region for a Last Sync Time value that tells you when the last replication from the primary to the secondary region took place.
  • Fourthly, you can use the Storage Client Library to read and write data in either the primary or secondary region. You can also have it redirect read requests automatically to the secondary region if a read request to the primary region times out.
  • Lastly, if the primary region becomes unavailable, you can initiate an account failover. After failover to the secondary region, the DNS entries that pointed to the primary region are updated to point to the secondary region.
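The Last Sync Time check in the third point can be reduced to a simple staleness test. The sketch below is illustrative only: it assumes you have already obtained the secondary's Last Sync Time as a timezone-aware timestamp, and the function name and 15-minute tolerance are assumptions, not part of any Azure API.

```python
from datetime import datetime, timedelta, timezone

def secondary_is_fresh(last_sync_time, max_staleness=timedelta(minutes=15)):
    """Return True when the secondary's Last Sync Time falls within the
    staleness window the application is willing to tolerate for reads."""
    return datetime.now(timezone.utc) - last_sync_time <= max_staleness

# Hypothetical timestamps: one recent, one two hours old.
recent = datetime.now(timezone.utc) - timedelta(minutes=5)
stale = datetime.now(timezone.utc) - timedelta(hours=2)
print(secondary_is_fresh(recent))  # True
print(secondary_is_fresh(stale))   # False
```

An application could use a check like this to decide whether data read from the secondary is recent enough to serve, or whether to warn the user that results may be out of date.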

Running your application in read-only mode

To prepare for an outage in the primary region, you must be able to handle both failed read requests and failed update requests. If the primary region fails, read requests can be redirected to the secondary region, but update requests cannot, because the secondary is read-only. For this reason, you need to design your application so that it can run in read-only mode. Moreover, if you decide to handle errors for each service separately, you will also need to handle the ability to run your application in read-only mode per service. 
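One way to structure this is a thin facade between the application and its storage client, with a flag that rejects writes while still serving reads. This is a minimal sketch; the `InMemoryBackend` class and its `read`/`write` interface are assumptions standing in for a real storage client.

```python
class InMemoryBackend:
    """Stand-in for a storage service; the name and interface are
    assumptions for illustration only."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value


class StorageFacade:
    """Application-level switch: reads always pass through, while
    writes are rejected once read-only mode is enabled."""
    def __init__(self, backend):
        self.backend = backend
        self.read_only = False

    def read(self, key):
        return self.backend.read(key)

    def write(self, key, value):
        if self.read_only:
            # Reject (or queue) updates while the primary is unavailable.
            raise RuntimeError("application is running in read-only mode")
        self.backend.write(key, value)


facade = StorageFacade(InMemoryBackend())
facade.write("greeting", "hello")
facade.read_only = True          # primary region outage detected
print(facade.read("greeting"))   # reads still succeed: hello
```

Handling read-only mode per service would simply mean keeping one such flag per storage service (blobs, tables, queues) rather than a single global one.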

Handling updates when running in read-only mode

There are various ways to handle update requests while running in read-only mode, but here are a few patterns to consider:

  • Firstly, you can respond to your users and tell them that you are not currently accepting updates.
  • Secondly, you can enqueue updates in another region by writing pending update requests to a queue in a different region, and then process those requests after the primary data center comes back online.
  • Thirdly, you can write your updates to a storage account in another region and, when the primary data center comes back online, merge those updates into the primary data.
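The second pattern above can be sketched as a pending-update queue that is filled during the outage and replayed afterwards. This is a simplified in-memory illustration; in a real deployment the queue would be a durable queue service hosted in another region, and the class and method names here are assumptions.

```python
from collections import deque

class PendingUpdateQueue:
    """In-memory stand-in for a queue hosted in another region."""
    def __init__(self):
        self._pending = deque()

    def enqueue(self, update):
        # Record an update request while the primary is unavailable.
        self._pending.append(update)

    def replay(self, apply_fn):
        """Apply queued updates in arrival order once the primary
        data center is back online; returns the number applied."""
        applied = 0
        while self._pending:
            apply_fn(self._pending.popleft())
            applied += 1
        return applied


store = {}
queue = PendingUpdateQueue()
queue.enqueue(("user:1", "alice"))
queue.enqueue(("user:2", "bob"))
# Primary is back: merge the pending updates into the primary data.
count = queue.replay(lambda kv: store.__setitem__(kv[0], kv[1]))
print(count, store)  # 2 {'user:1': 'alice', 'user:2': 'bob'}
```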

Read requests

Read requests can be redirected to secondary storage if there is a problem with primary storage. If you are using the Storage Client Library to access data from the secondary, you can specify the retry behavior of a read request by setting the LocationMode property to one of the following:

  • PrimaryOnly (the default)
  • PrimaryThenSecondary
  • SecondaryOnly
  • SecondaryThenPrimary
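The four modes simply determine the order in which the two endpoints are tried when a retryable error occurs. The sketch below models that ordering generically; `primary` and `secondary` are caller-supplied callables standing in for endpoint reads, and `TransientError` stands in for a retryable storage error, so none of these names come from the actual client library.

```python
class TransientError(Exception):
    """Stands in for a retryable storage error (e.g. a timeout)."""

def read_with_location_mode(key, primary, secondary, mode="PrimaryThenSecondary"):
    """Try the endpoints in the order implied by the given mode,
    falling through to the next endpoint on a retryable error."""
    order = {
        "PrimaryOnly": [primary],
        "PrimaryThenSecondary": [primary, secondary],
        "SecondaryOnly": [secondary],
        "SecondaryThenPrimary": [secondary, primary],
    }[mode]
    last_error = None
    for endpoint in order:
        try:
            return endpoint(key)
        except TransientError as err:
            last_error = err  # retryable: try the next endpoint in order
    raise last_error


def failing_primary(key):
    raise TransientError("primary timed out")

def healthy_secondary(key):
    return f"{key} (from secondary)"

print(read_with_location_mode("blob1", failing_primary, healthy_secondary))
# blob1 (from secondary)
```

With PrimaryOnly, the same failure would instead propagate to the caller, which is why that default gives no automatic fallback.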

With LocationMode set to PrimaryThenSecondary, if the initial read request to the primary endpoint fails with a retryable error, the client automatically makes another read request to the secondary endpoint. If the error is a server timeout, the client has to wait for the timeout to expire before it receives a retryable error from the service. There are two scenarios to consider when deciding how to respond to a retryable error:

  • Firstly, the problem is isolated and subsequent requests to the primary endpoint will not return a retryable error. In this case, there is no significant performance penalty in leaving LocationMode set to PrimaryThenSecondary, because this happens only infrequently.
  • Secondly, the problem affects at least one of the storage services in the primary region, and all subsequent requests to that service in the primary region are likely to return retryable errors. This incurs a performance penalty, because all your read requests will try the primary endpoint first and wait for the timeout to expire before switching to the secondary endpoint.

Update requests

The Circuit Breaker pattern can also be applied to update requests. However, update requests cannot be redirected to secondary storage, which is read-only. For these requests, you should leave the LocationMode property set to PrimaryOnly. To handle the errors, you can apply a metric to these requests, such as 10 failures in a row, and when your threshold is met, switch the application into read-only mode. 
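The "10 failures in a row" threshold described above can be sketched as a small circuit-breaker counter. The class name, method names, and default threshold are illustrative assumptions, not part of any library.

```python
class UpdateCircuitBreaker:
    """Trip into read-only mode after `threshold` consecutive
    failed update requests; a success resets the count."""
    def __init__(self, threshold=10):
        self.threshold = threshold
        self.consecutive_failures = 0
        self.read_only = False

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.read_only = True  # circuit opens: stop accepting updates

    def record_success(self):
        self.consecutive_failures = 0  # circuit stays closed


breaker = UpdateCircuitBreaker(threshold=10)
for _ in range(10):
    breaker.record_failure()
print(breaker.read_only)  # True
```

A fuller implementation would also probe the primary periodically and close the circuit again (leave read-only mode) once updates start succeeding.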

Options for monitoring the error frequency

There are three main options for monitoring the frequency of retries against the primary region, in order to determine when to switch over to the secondary region and change the application to run in read-only mode.

  • Firstly, add a handler for the Retrying event on the OperationContext object you pass to your storage requests. These events fire whenever the client retries a request, enabling you to track how often the client encounters retryable errors on the primary endpoint.
  • Secondly, in the Evaluate method of a custom retry policy, you can run custom code whenever a retry takes place. In addition to recording when a retry happens, this gives you the opportunity to modify the retry behavior.
  • Thirdly, implement a custom monitoring component in your application that continually pings your primary storage endpoint with dummy read requests to determine its health. This takes up some resources, but not a significant amount. 
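The third option reduces to a small health probe. In this sketch, `ping` is a caller-supplied callable (an assumption for illustration) that issues one dummy read against the primary endpoint and returns True on success.

```python
def primary_is_healthy(ping, attempts=3):
    """Report the primary endpoint healthy if any of `attempts`
    dummy read probes succeeds."""
    return any(ping() for _ in range(attempts))


print(primary_is_healthy(lambda: True))   # True
print(primary_is_healthy(lambda: False))  # False
```

A monitoring loop would run this on a timer and flip the application out of read-only mode once the primary reports healthy again.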

Handling eventually consistent data

Geo-redundant storage works by replicating transactions from the primary to the secondary region. This replication process guarantees that the data in the secondary region is eventually consistent: all transactions in the primary region will eventually appear in the secondary region, but there may be a lag before they appear, and there is no guarantee that they arrive in the secondary region in the same order as in the primary region. If your transactions arrive in the secondary region out of order, you should consider the data in the secondary region to be in an inconsistent state until the service catches up.
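One practical consequence: an entity read from the secondary can only be trusted as fully replicated if its last-modified time is at or before the secondary's Last Sync Time, since later writes may still be in flight or may have arrived out of order. The function name and sample timestamps below are illustrative assumptions.

```python
from datetime import datetime, timezone

def replicated_by(entity_last_modified, last_sync_time):
    """True only when the entity's last modification is covered by the
    secondary's Last Sync Time, i.e. guaranteed to have replicated."""
    return entity_last_modified <= last_sync_time


last_sync = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
old_write = datetime(2023, 5, 1, 11, 55, tzinfo=timezone.utc)  # before sync
new_write = datetime(2023, 5, 1, 12, 5, tzinfo=timezone.utc)   # after sync
print(replicated_by(old_write, last_sync))  # True
print(replicated_by(new_write, last_sync))  # False
```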

Reference: Microsoft Documentation
