High availability and failover design


This tutorial covers high availability (HA) and failover design for Cloud SQL.

HA configuration overview

The HA configuration, sometimes called a cluster, provides data redundancy. A Cloud SQL instance configured for HA is also called a regional instance, and it is located in a primary and a secondary zone within the configured region. Within a regional instance, the configuration consists of a primary instance and a standby instance. Through synchronous replication to each zone's persistent disk, all writes made to the primary instance are also made to the standby instance. In the event of an instance or zone failure, this configuration reduces downtime, and your data remains available to client applications.

Failover overview

If an HA-configured instance becomes unresponsive, Cloud SQL automatically switches to serving data from the standby instance. This is called a failover.

Process

The following process occurs:

  • The primary instance or zone fails.
    • Each second, the primary instance writes a heartbeat signal to a system database. If multiple heartbeats are not detected, failover is initiated. This happens when the primary instance is unresponsive for approximately 60 seconds, or when the zone containing the primary instance experiences an outage.
  • The standby instance serves data once clients reconnect.
    • Through a static IP address shared with the primary instance, the standby instance serves data from the secondary zone.
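
The heartbeat rule above can be illustrated with a small simulation. This is only a toy model under stated assumptions, not Cloud SQL's actual implementation: the class `FailoverMonitor` and the function `simulate` are hypothetical names, and the only values taken from the text are the one-second heartbeat interval and the approximate 60-second threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple

HEARTBEAT_INTERVAL_S = 1   # the primary writes a heartbeat each second (from the text)
FAILOVER_THRESHOLD_S = 60  # approximate unresponsive time before failover (from the text)


@dataclass
class FailoverMonitor:
    """Toy model of heartbeat-based failover detection (hypothetical, for illustration)."""
    last_heartbeat: float = 0.0
    serving: str = "primary"

    def record_heartbeat(self, now: float) -> None:
        # Called whenever the primary successfully writes its heartbeat.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        # Switch to the standby once heartbeats have been missing for the threshold.
        if self.serving == "primary" and now - self.last_heartbeat >= FAILOVER_THRESHOLD_S:
            self.serving = "standby"
        return self.serving


def simulate(primary_fails_at: int, duration: int) -> List[Tuple[int, str]]:
    """Return (time, new_role) transitions for a primary that stops at primary_fails_at."""
    monitor = FailoverMonitor()
    transitions = []
    previous = monitor.serving
    for t in range(0, duration + 1, HEARTBEAT_INTERVAL_S):
        if t < primary_fails_at:
            monitor.record_heartbeat(t)
        current = monitor.check(t)
        if current != previous:
            transitions.append((t, current))
            previous = current
    return transitions
```

In this model, a primary that stops writing heartbeats at second 10 (last heartbeat at second 9) is replaced by the standby at second 69, which is 60 seconds after the last heartbeat that was seen.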

Requirements

For Cloud SQL to allow a failover, the configuration must meet the following requirements:

  • The primary instance must be in a normal operating state (not stopped, undergoing maintenance, or performing a long-running Cloud SQL instance operation such as a backup, import, or export).
  • The secondary zone and the standby instance must both be healthy. When the standby instance is unresponsive or replication to the secondary zone is interrupted, failover operations are blocked. Once Cloud SQL repairs the standby instance and the secondary zone is available, replication resumes and Cloud SQL allows failover again.
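
The requirements above amount to a simple eligibility check, sketched below. The `HaStatus` type, its field names, and the state labels are hypothetical and chosen for illustration only; Cloud SQL does not expose its internal check in this form.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical labels for primary states that block failover, per the list above.
BLOCKING_PRIMARY_STATES = {"stopped", "maintenance", "long_running_operation"}


@dataclass
class HaStatus:
    primary_state: str              # e.g. "running", "stopped", "maintenance"
    standby_healthy: bool
    secondary_zone_available: bool
    replication_active: bool


def failover_allowed(status: HaStatus) -> Tuple[bool, str]:
    """Return (allowed, reason) following the two requirements in the text."""
    if status.primary_state in BLOCKING_PRIMARY_STATES:
        return False, "primary is not in a normal operating state"
    if not status.standby_healthy:
        return False, "standby instance is unresponsive"
    if not status.secondary_zone_available:
        return False, "secondary zone is unavailable"
    if not status.replication_active:
        return False, "replication to the secondary zone is interrupted"
    return True, "all requirements met"
```

For example, `failover_allowed(HaStatus("running", True, True, False))` reports that failover is blocked because replication is interrupted. Once replication resumes, the same check passes.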

Backup and restore

Automated backups and point-in-time recovery must be enabled for high availability (point-in-time recovery uses binary logging).
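
As a rough illustration, the check below inspects a settings dictionary for these prerequisites. The field names (`availabilityType`, `backupConfiguration.enabled`, `binaryLogEnabled`, `pointInTimeRecoveryEnabled`) are assumed to mirror the instance settings of the Cloud SQL Admin API, so verify them against the current API reference before relying on them. The function `ha_prerequisite_problems` is a hypothetical helper, not part of any Google library.

```python
from typing import Any, Dict, List


def ha_prerequisite_problems(settings: Dict[str, Any]) -> List[str]:
    """List the HA prerequisites that a settings dictionary does not satisfy."""
    problems = []
    backup = settings.get("backupConfiguration", {})

    # Assumption: a regional (HA) instance is marked with availabilityType "REGIONAL".
    if settings.get("availabilityType") != "REGIONAL":
        problems.append("availabilityType is not REGIONAL")

    if not backup.get("enabled", False):
        problems.append("automated backups are disabled")

    # Assumption: point-in-time recovery is signalled by binary logging or by a
    # dedicated point-in-time recovery flag, depending on the database engine.
    if not (backup.get("binaryLogEnabled") or backup.get("pointInTimeRecoveryEnabled")):
        problems.append("point-in-time recovery is disabled")

    return problems
```

An empty list means the settings satisfy the prerequisites described above. Otherwise each entry names one missing prerequisite.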

Applications and instances

  • Applications work with HA and non-HA instances in the same way, so your application does not require any special configuration.
  • When failover occurs, any existing connections to the primary instance and to read replicas are closed. Reestablishing connections to the primary instance takes approximately 2-3 minutes, and reconnecting to read replicas may take longer.
  • After failover, your application reconnects using the same connection string or IP address, so you do not have to update it.
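
Because connections are closed during failover and may take a few minutes to come back, client code usually benefits from a retry loop. The following is a minimal sketch, assuming a caller-supplied `connect` function that raises `ConnectionError` while the instance is unavailable. The name `connect_with_retry` and the default timings are illustrative choices, not values from the Cloud SQL documentation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    max_wait_s: float = 300.0,      # assumed budget, longer than the 2-3 minutes in the text
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Call connect() until it succeeds, backing off exponentially between attempts."""
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        try:
            # The same connection string or IP address is reused on every attempt.
            return connect()
        except ConnectionError:
            # Give up if the next wait would exceed the overall budget.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. In an application you would pass just the `connect` callable.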

Maintenance downtime

  • Maintenance events affect primary instances configured for HA in the same way as any other instance, so expect the primary instance to be down during maintenance. To minimize the impact on your service, set a maintenance window to control when the downtime occurs.
  • An instance undergoing maintenance does not fail over to the standby instance. Maintenance updates are applied to the standby instance at the same time as the primary instance.

Performance

  • The performance of regional persistent disks depends on several factors:
    • Review the size of your VM instance type and the input/output (I/O) profile of your workload.
    • A regional persistent disk with solid-state drives (SSD) has higher latency than a persistent disk with local SSD. This matters if your workload is not a streaming one and you are concerned about latency.
    • Because of this higher latency, a regional persistent disk with SSD cannot reach the input/output operations per second (IOPS) limit.
    • The higher latency arises because the redundancy required to write two copies increases tail latency.
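
The tail-latency point can be illustrated with a small simulation. A synchronous write to two copies completes only when the slower copy finishes, so its latency is the maximum of the two. The log-normal latency distribution and its parameters below are arbitrary assumptions for illustration, not measurements of persistent disks.

```python
import random
from typing import List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Return a simple nearest-rank style percentile of the given values."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


def simulate_write_latencies(n_writes: int = 10000, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Simulate latencies (arbitrary units) for single-copy and two-copy writes."""
    rng = random.Random(seed)
    single, replicated = [], []
    for _ in range(n_writes):
        first = rng.lognormvariate(0.0, 0.5)   # latency of the first copy
        second = rng.lognormvariate(0.0, 0.5)  # latency of the second copy
        single.append(first)
        # A synchronous two-copy write waits for the slower of the two copies.
        replicated.append(max(first, second))
    return single, replicated


single, replicated = simulate_write_latencies()
print("p99 single copy:", percentile(single, 99))
print("p99 two copies: ", percentile(replicated, 99))
```

In this model every two-copy write is at least as slow as the corresponding single-copy write, so the high percentiles of the replicated case can never be lower than those of the single-copy case.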

Reference: Google Documentation
