Scalability to meet growth requirements

In this tutorial, we will learn about designing for scalability to meet growth requirements.

Reliability

Reliability is the most important feature of any application, because if the application is not reliable, users will eventually leave, and the other features won’t matter.

  • Firstly, an application must have measurable reliability goals, and deviations must be promptly corrected.
  • Secondly, the application must be architected for scalability, high availability, and automated change management.
  • Thirdly, the application must be self-healing where possible, and it must be instrumented for observability.
  • Lastly, the operational procedures used to run the application must impose minimal manual work and cognitive load on operators, while ensuring rapid mitigation of failures.

Strategies to achieve reliability

  • Firstly, Reliability is defined by the user. For user-facing workloads, measure the user experience, for example, query success ratio, as opposed to just server metrics such as CPU usage. For batch and streaming workloads, you might need to measure KPIs to ensure a quarterly report is on track to finish on time.
  • Secondly, Use sufficient reliability. Your systems should be reliable enough that users are happy, but not excessively reliable such that the investment is unjustified. Define Service Level Objectives (SLOs) that set the reliability threshold, and use error budgets to manage the rate of change.
  • Thirdly, Create redundancy. Systems with high reliability needs must have no single points of failure, and their resources must be replicated across multiple failure domains.
  • Fourthly, Include horizontal scalability. Ensure that every component of your system can accommodate growth in traffic or data by adding more resources.
  • Fifthly, Ensure overload tolerance. Design services to degrade gracefully under load.
  • Next, Include rollback capability. Any change an operator makes to a service must have a well-defined method to undo it—that is, roll back the change.
  • After that, Detect failure. There is a tradeoff between alerting too soon and burning out the operations team versus alerting too late and having extended service outages.
  • Lastly, Instrument systems for observability. Systems must be sufficiently well instrumented to enable rapid triaging, troubleshooting, and diagnosis of problems to minimize time to mitigate (TTM).

Define your reliability goals

We recommend measuring your existing customer experience and your users’ tolerance for errors, and establishing reliability goals based on those measurements. For instance, an overall system uptime goal of 100% over an infinite amount of time can’t be achieved, and uptime alone is not meaningful if the data that the user expects isn’t there. Set SLOs based on the user experience, and measure reliability metrics as close to the user as possible. If possible, instrument the mobile or web client; if that’s not possible, instrument the load balancer. Measuring reliability at the server should be the last resort.
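As a rough sketch of measuring reliability close to the user, a query-success-ratio SLI can be computed from client-side request events. The event schema here (a dict with a boolean `ok` field) is a hypothetical assumption for illustration:

```python
def success_ratio(events):
    """Query-success-ratio SLI from client-side request events.

    Each event is a dict with a boolean 'ok' field (hypothetical schema);
    measuring at the client captures the user experience more faithfully
    than server-side metrics such as CPU usage.
    """
    if not events:
        return 1.0  # no traffic: treat the window as fully successful
    return sum(1 for e in events if e["ok"]) / len(events)

events = [{"ok": True}] * 997 + [{"ok": False}] * 3
sli = success_ratio(events)  # 0.997
```

In practice, the events would come from client telemetry or load-balancer logs rather than an in-memory list.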

In other words, achievable reliability goals that are tuned to the customer experience tend to help define the maximum pace and scope of changes (that is, feature velocity) that customers can tolerate.

SLIs, SLOs, and SLAs
  • Firstly, a Service Level Indicator (SLI) is a quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.
  • Secondly, Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to making data-driven decisions about reliability, they’re at the core of SRE practices.
  • Thirdly, Error budgets are calculated as (100% – SLO) over a period of time. They tell you if your system is more or less reliable than is needed over a certain time window.
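The error-budget formula above, (100% – SLO) over a time window, can be sketched as a quick calculation. The 28-day window is a common convention, used here as an assumption:

```python
def error_budget_minutes(slo, window_days=28):
    """Error budget = (100% - SLO), expressed as allowed bad minutes per window."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.9% SLO over a 28-day window allows roughly 40.3 minutes of unavailability.
budget = error_budget_minutes(0.999)
```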

If the application has a multi-tenant architecture, typical of SaaS applications used by multiple independent customers, be sure to capture SLIs at a per-tenant level. If you measure SLIs only at a global aggregate level, your monitoring will be unable to flag critical problems affecting individual customers or a minority of customers.
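A minimal sketch of per-tenant SLI aggregation, assuming the same hypothetical event schema with a `tenant` field, shows how a global average can mask a single failing tenant:

```python
from collections import defaultdict

def per_tenant_sli(events):
    """Success ratio per tenant; a global aggregate can hide a failing tenant."""
    ok, total = defaultdict(int), defaultdict(int)
    for e in events:
        total[e["tenant"]] += 1
        ok[e["tenant"]] += int(e["ok"])
    return {t: ok[t] / total[t] for t in total}

# Tenant "a" fails half of its queries while tenant "b" is healthy; the
# global ratio (0.75) would mask tenant "a"'s outage.
events = ([{"tenant": "a", "ok": i % 2 == 0} for i in range(10)]
          + [{"tenant": "b", "ok": True} for _ in range(10)])
slis = per_tenant_sli(events)
```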

Error budgets

Use error budgets to manage development velocity. When the error budget is not yet consumed, continue to launch new features quickly. When the error budget is close to zero, freeze or slow down service changes and invest engineering resources in reliability features. Google Cloud minimizes the effort of setting up SLOs and error budgets with its service monitoring features.
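The gating logic above can be sketched as a small policy function. The 80% "slow down" threshold is an illustrative assumption, not a Google-prescribed value:

```python
def change_policy(budget_minutes, consumed_minutes):
    """Gate release velocity on error-budget consumption (thresholds are
    illustrative assumptions, not prescribed values)."""
    consumed = consumed_minutes / budget_minutes if budget_minutes else 1.0
    if consumed >= 1.0:
        return "freeze"   # budget exhausted: stop launches, fix reliability
    if consumed >= 0.8:
        return "slow"     # budget nearly gone: slow down service changes
    return "launch"       # budget healthy: keep shipping features quickly
```

For example, with a 40.32-minute budget and 10 minutes consumed, the policy is still `"launch"`.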

Design for scale and high availability

Design a multi-region architecture with failover

If your service needs to be up even when an entire region is down, then design it to use pools of compute resources spread across different regions, with automatic failover when a region goes down. Eliminate single points of failure, such as a single-region master database that can cause a global outage when it is unreachable.

Eliminate scalability bottlenecks

Identify system components that cannot grow beyond the resource limits of a single VM or a single zone. Some applications are designed for vertical scaling, where more CPU cores, memory, or network bandwidth are needed on a single VM to handle increased load. However, such applications have hard limits on their scalability, and often require manual reconfiguration to handle growth. Redesign these components to be horizontally scalable using sharding, so that growth in traffic or usage can be handled easily by adding more shards.
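As a sketch of the sharding idea, keys can be mapped to shards with a stable hash, so growth is absorbed by adding shards rather than growing a single VM. This is deliberately simplified modulo sharding; production systems often use consistent hashing so that adding a shard does not remap most keys:

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash (modulo-sharding sketch).

    Note: changing num_shards remaps most keys; consistent hashing avoids
    that in real deployments.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Usage: `shard_for("customer-42", 8)` deterministically returns an index in the range 0–7, so any replica of the routing layer agrees on the placement.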

Degrade service levels gracefully

Design your services to detect overload and return lower quality responses to the user or partially drop traffic rather than failing completely under overload. For example, a service can respond to user requests with static web pages while temporarily disabling dynamic behavior that is more expensive. Or it can allow read-only operations while temporarily disabling data updates.
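The read-only degradation pattern above can be sketched as a request handler that switches behavior based on load. The handlers, the 80% load threshold, and the response shapes are hypothetical placeholders:

```python
def render_dynamic(request):
    # Hypothetical full-cost handler (placeholder for illustration).
    return {"status": 200, "mode": "dynamic"}

def render_static(request):
    # Hypothetical cheap handler serving cached/static content.
    return {"status": 200, "mode": "static"}

def handle_request(request, current_load, capacity):
    """Degrade gracefully instead of failing outright under overload."""
    if current_load < 0.8 * capacity:          # normal operation
        return render_dynamic(request)
    if not request.get("is_write"):            # overloaded: allow reads only
        return render_static(request)
    return {"status": 503, "retry_after": 30}  # shed writes, ask client to retry
```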

Predict peak traffic events and plan for them

If your system experiences known periods of peak traffic, such as Black Friday for retailers, invest time in preparing for such events to avoid significant loss of traffic and revenue. Forecast the size of the traffic spike, add a buffer, and ensure that your system has sufficient compute capacity to handle the spike. Load test the system with the expected mix of user requests to ensure that its estimated load-handling capacity matches the actual capacity. Further, run exercises where your Ops team conducts simulated outage drills, rehearsing their response procedures and exercising the collaborative cross-team incident management procedures discussed below.
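The "forecast, add a buffer, ensure capacity" step can be sketched as a back-of-the-envelope fleet-sizing calculation. The 25% buffer and per-instance throughput figures are illustrative assumptions:

```python
import math

def required_instances(peak_rps, rps_per_instance, buffer=0.25):
    """Size the fleet for a forecast peak plus a safety buffer.

    Sketch only: the 25% buffer and per-instance capacity are assumptions;
    validate the real capacity with load tests, as described above.
    """
    return math.ceil(peak_rps * (1 + buffer) / rps_per_instance)

# Forecast 12,000 RPS at peak, 200 RPS per instance, 25% buffer -> 75 instances.
fleet = required_instances(12_000, 200)
```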

Manage risk with controls

Before creating and deploying resources on Google Cloud, assess the security features you need to meet your internal security requirements and external regulatory requirements. Three control areas that focus on mitigating risk are:

  • Firstly, technical controls refer to the features and technologies that you use to protect your environment. These include native cloud security controls, such as firewalls and enabling logging, and can also encompass third-party tools and vendors to reinforce or support your security strategy.
  • Secondly, Contractual protections refer to the legal commitments made by the cloud vendor around Google Cloud services.
  • Lastly, third-party verifications or attestations refer to having a third party audit the cloud provider. This is to ensure that the provider meets compliance requirements.

Technical controls

We start from the fundamental premise that Google Cloud customers own their data and control how it is used. The data a customer stores and manages on Google Cloud systems is used only to provide that customer with Google Cloud services. Further, we have robust internal controls and auditing to protect against insider access to customer data.

Contractual controls

Google Cloud maintains and expands its compliance portfolio. The Data Processing and Security Terms (DPST) document defines our commitment to maintaining our ISO 27001, 27017, and 27018 certifications and to updating our SOC 2 and SOC 3 reports every 12 months. The DPST also outlines the access controls in place to limit Google support engineers’ access to customers’ environments.

Implement compute security controls

It is always a best practice to secure how you expose your resources to the network. Here are controls available in Google Kubernetes Engine (GKE) and Compute Engine.

Private IPs
You can disable External IP access to your production VMs using organization policies. Moreover, you can deploy private clusters with Private IPs within GKE to limit possible network attacks.

Compute instance usage
It’s also important to control who can spin up instances, using IAM access control, because you can incur significant cost if there is a break-in. Further, Google Cloud lets you define custom quotas on projects to limit such activity.

Compute OS images
Google provides you with curated OS images that are maintained and patched regularly. You can also bring your own custom images and run them on Compute Engine, but then you have to patch, update, and maintain them yourself.

GKE and Docker
The App Engine flexible environment runs application instances within Docker containers, letting you run any runtime. You can also enable SSH access to the underlying instances, but we do not recommend this unless you have a valid business use case. Further, to provide infrastructure security for your cluster, GKE lets you use IAM with role-based access control (RBAC) to manage access to your cluster and namespaces.

Runtime security
GKE integrates with various partner solutions for runtime security, providing you with robust tools to monitor and manage your deployment. All of these solutions can integrate with Security Command Center, giving you a single pane of glass.

Partner solutions for host-protection
In addition to using the curated hardened OS images provided by Google, you can use various Google Cloud partner solutions for host protection. Most partner solutions offered on Google Cloud integrate with Security Command Center, from where you can go to the partner portal for advanced threat analysis or extra runtime security.


Reference: Google Documentation, Documentation 2
