Developing procedures to ensure the resilience of a solution in production

In this tutorial, we will learn about developing procedures to ensure the resilience of a solution in production.

Scalability: Adjusting capacity to meet demand

Scalability is the measure of a system’s ability to handle varying amounts of work by adding or removing resources from the system. For example, a scalable web app is one that works well with one user or many users, and that gracefully handles peaks and dips in traffic. The flexibility to adjust the resources consumed by an app is a key business driver for moving to the cloud. With proper design, you can reduce costs by removing under-utilized resources without compromising performance or user experience. Similarly, you can maintain a good user experience during periods of high traffic by adding more resources.

Google Cloud provides products and features to help you build scalable, efficient apps:

  • Compute Engine virtual machines and Google Kubernetes Engine (GKE) clusters integrate with autoscalers.
  • Google Cloud’s serverless platform provides managed compute, database, and other services.
  • Database products like BigQuery, Cloud Spanner, and Cloud Bigtable can deliver consistent performance across massive data sizes.
  • Cloud Monitoring provides metrics across your apps and infrastructure, helping you make data-driven scaling decisions.
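
To make the idea of adjusting capacity to meet demand concrete, here is a minimal sketch of a proportional scaling rule. It assumes a single utilization metric and a fixed target; the function name `desired_replicas` and its parameters are hypothetical, and this is not the algorithm that the Compute Engine or GKE autoscalers actually run.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return a replica count that should bring utilization back to target.

    Hypothetical helper for illustration only.
    """
    if current_replicas < 1 or target_utilization <= 0:
        raise ValueError("need at least one replica and a positive target")
    # Proportional rule: scale the fleet by how far utilization is from target.
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp so the fleet never drops below the floor or grows without limit.
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas running at 75% utilization against a 50% target, the function returns 6. At 25% utilization it returns 2, which is the cost-saving case of removing under-utilized resources.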

Resilience: Designing to withstand failures

A resilient app is one that continues to function despite failures of system components. Resilience requires planning at all levels of your architecture. It influences how you lay out your infrastructure and network and how you design your app and data storage. Building and operating resilient apps is hard. This is especially true for distributed apps, which might contain multiple layers of infrastructure, networks, and services. With proper processes and organizational culture, you can also learn from failures to further increase your app’s resilience.

Google Cloud provides tools and services to help you build highly available and resilient apps:

  • Google Cloud services are available in regions and zones across the globe, enabling you to deploy your app to best meet your availability goals.
  • Compute Engine instance groups and GKE clusters can be distributed and managed across the available zones in a region.
  • Compute Engine regional persistent disks are synchronously replicated across zones in a region.
  • Google Cloud provides a range of load-balancing options to manage your app traffic, including global load balancing that can direct traffic to a healthy region closest to your users.
  • Google Cloud’s serverless platform includes managed compute and database products that offer built-in redundancy and load balancing.
  • Google Cloud supports CI/CD through native tools and integrations with popular open-source technologies, to help automate building and deploying your apps.

Drivers and constraints

There are varying requirements and motivations for improving the scalability and resilience of your app. There might also be constraints that limit your ability to meet your scalability and resilience goals. The relative importance of these requirements and constraints varies depending on the type of app, the profile of your users, and the scale and maturity of your organization.

Drivers

To help prioritize your requirements, consider the drivers from the different parts of your organization.

Business drivers

Common drivers from the business side include the following:

  • Optimize costs and resource consumption.
  • Minimize app downtime.
  • Ensure that user demand can be met during periods of high usage.
  • Improve the quality and availability of service.
  • Ensure that user experience and trust are maintained during any outages.

Development drivers

Common drivers from the development side include the following:

  • Minimize time spent investigating failures.
  • Increase the time spent on developing new features.
  • Minimize repetitive toil through automation.
  • Build apps using the latest industry patterns and practices.

Operations drivers

Requirements to consider from the operations side include the following:

  • Reduce the frequency of failures requiring human intervention.
  • Increase the ability to automatically recover from failures.
  • Minimize repetitive toil through automation.
  • Minimize the impact of the failure of any particular component.

Constraints

Constraints might limit your ability to increase the scalability and resilience of your app. Ensure that your design decisions do not introduce or contribute to these constraints:

  • Dependencies on hardware or software that is difficult to scale.
  • Dependencies on hardware or software that is difficult to operate in a high-availability configuration.
  • Dependencies between apps.
  • Licensing restrictions.
  • Lack of skills or experience in your development and operations teams.
  • Organizational resistance to automation.

Use appropriate database and storage technology

Certain databases and types of storage are difficult to scale and make resilient. Make sure that your database choices don’t constrain your app’s availability and scalability.

Evaluate your database needs

The pattern of designing your app as a set of independent services also extends to your databases and storage. It might be appropriate to choose different types of storage for different parts of your app, which results in heterogeneous storage.

Traditional apps often operate exclusively with relational databases. Relational databases offer useful functionality such as transactions, strong consistency, referential integrity, and sophisticated querying across tables. However, relational databases also have some constraints. They are typically hard to scale, and they require careful management in a high-availability configuration.

Non-relational databases, often referred to as NoSQL databases, take a different approach. Although details vary across products, NoSQL databases typically sacrifice some features of relational databases in favor of increased availability and easier scalability. In terms of the CAP theorem, NoSQL databases often choose availability over consistency.

Implement caching

A cache’s primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer.

Caching supports improved scalability by reducing reliance on disk-based storage. Because requests can be served from memory, request latencies to the storage layer are reduced, typically allowing your service to handle more requests. In addition, caching can reduce the load on services that are downstream of your app, especially databases, allowing other components that interact with that downstream service to scale more easily, or to scale at all.
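
As an illustration of how a cache takes load off a slower storage layer, here is a minimal in-memory sketch with a time-to-live. The class name `ReadThroughCache`, the `loader` callback, and the hit and miss counters are assumptions made for this example; a production system would more likely use a dedicated caching service than a Python dictionary.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny in-memory cache in front of a slower loader (illustrative only)."""

    def __init__(self,
                 loader: Callable[[str], str],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # stands in for the slow storage layer
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: serve from memory without touching storage.
            self.hits += 1
            return entry[1]
        # Missing or expired: load from storage and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Every hit is a request that the downstream database never sees. The time-to-live bounds how stale a served value can be, which is the trade-off to tune for each kind of data.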

Modernize your development processes and culture

DevOps can be considered a broad collection of processes, culture, and tooling that promote agility and reduced time-to-market for apps and features by breaking down silos between development, operations, and related teams. DevOps techniques aim to improve the quality and reliability of software.

Design for testability

Automated testing is a key component of modern software delivery practices. The ability to execute a comprehensive set of unit, integration, and system tests is essential to verify that your app behaves as expected and that it can progress to the next stage of the deployment cycle. Testability is a key design criterion for your app.

We recommend that you use unit tests for the bulk of your testing because they are quick to execute and typically easy to maintain. Automated testing is also an integral component of continuous integration. Executing a robust set of automated tests on each code commit provides fast feedback on changes, improving the quality and reliability of your software. Google Cloud-native tools like Cloud Build and third-party tools like Jenkins can help you implement continuous integration.
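
One common way to design for testability is to pass dependencies in rather than reach for them directly, so that unit tests can substitute fast fakes. The sketch below assumes a hypothetical `is_service_healthy` function whose network call is injected as a callable; the names are illustrative and do not come from any Google Cloud library.

```python
import unittest
from typing import Callable


def is_service_healthy(fetch_status: Callable[[], int]) -> bool:
    """Return True when the injected status check reports HTTP 200."""
    try:
        return fetch_status() == 200
    except ConnectionError:
        # An unreachable dependency counts as unhealthy, not as a crash.
        return False


class IsServiceHealthyTest(unittest.TestCase):
    """Unit tests that run without any real network access."""

    def test_ok_status_is_healthy(self) -> None:
        self.assertTrue(is_service_healthy(lambda: 200))

    def test_server_error_is_unhealthy(self) -> None:
        self.assertFalse(is_service_healthy(lambda: 503))

    def test_connection_error_is_unhealthy(self) -> None:
        def unreachable() -> int:
            raise ConnectionError("simulated outage")

        self.assertFalse(is_service_healthy(unreachable))
```

Saved as a module, these tests can be run with `python -m unittest`. Because they need no network, they are quick enough to execute on every commit in a continuous integration pipeline.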

Automate your deployments

Continuous integration and comprehensive test automation give you confidence in the stability of your software. When they are in place, your next step is automating the deployment of your app. The level of deployment automation varies depending on the maturity of your organization.

Choosing an appropriate deployment strategy is essential to minimize the risks associated with deploying new software. With the right strategy, you can gradually increase the exposure of new versions to larger audiences, verifying behavior along the way. You can also set clear provisions for rollback if problems occur.
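
Gradually increasing exposure can be sketched as a deterministic percentage split. The example below assumes that each request carries a stable user identifier; the function name `serves_new_version` and the hash-bucket scheme are illustrative choices, not a description of how any particular Google Cloud load balancer splits traffic.

```python
import hashlib


def serves_new_version(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user inside or outside the rollout group."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Hash the identifier so the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket, 0..99
    return bucket < rollout_percent
```

Because a user's bucket never changes, raising the percentage only adds users to the new version and never flips anyone back and forth. Setting the percentage to 0 acts as the rollback switch.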

Adopt SRE practices for dealing with failure

For distributed apps that operate at scale, some degree of failure in one or more components is common. If you adopt the patterns covered in this tutorial, your app can better handle disruptions caused by a defective software release, unexpected termination of virtual machines, or even an infrastructure outage that affects an entire zone.

However, even with careful app design, you inevitably encounter unexpected events that require human intervention. If you put structured processes in place to manage these events, you can greatly reduce their impact and resolve them more quickly. Furthermore, if you examine the causes and responses to the event, you can help protect your app against similar events in the future.

Validate and review your architecture

As your app evolves, user behavior, traffic profiles, and even business priorities can change. Similarly, other services or infrastructure that your app depends on can evolve. Therefore, it’s important to periodically test and validate the resilience and scalability of your app.

Test your resilience

It’s critical to test that your app responds to failures in the way you expect. The overarching theme is that the best way to avoid failure is to introduce failure and learn from it. Simulating and introducing failures is complex. In addition to verifying the behavior of your app or service, you must also ensure that the expected alerts and appropriate metrics are generated. For example, you might proceed as follows, validating and documenting behavior at each stage:

  • Firstly, introduce intermittent failures.
  • Secondly, block access to dependencies of the service.
  • Thirdly, block all network communication.
  • Lastly, terminate hosts.
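
The first two stages above can be approximated in code by wrapping a dependency call so that it fails on purpose. This is a minimal sketch under stated assumptions: `with_intermittent_failures` and `InjectedFailure` are hypothetical names, and real fault-injection exercises usually operate at the network or infrastructure level rather than inside the app.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a resilience test."""


def with_intermittent_failures(call: Callable[[], T],
                               failure_rate: float,
                               rng: Optional[random.Random] = None) -> Callable[[], T]:
    """Wrap a dependency call so that a fraction of invocations fail."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")
    if rng is None:
        rng = random.Random()

    def wrapped() -> T:
        # random() returns a value in [0.0, 1.0), so a rate of 0.0 never
        # fails and a rate of 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated dependency failure")
        return call()

    return wrapped
```

A small failure rate models intermittent failures, while a rate of 1.0 models a fully blocked dependency. At each setting, check that the caller degrades as designed and that the expected alerts and metrics appear.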

Reference: Google Documentation
