Data retention and data life cycle management

In this tutorial, we will learn about data retention and data life cycle management in Google Cloud.

Retention policies

When creating a new bucket, you can specify a retention policy, or you can add one to an existing bucket. Adding a retention policy to a bucket ensures that all current and future objects in the bucket cannot be deleted or replaced until they reach the age defined by the retention policy. Attempts to delete or replace objects that have not yet reached the retention period fail with a 403 – retentionPolicyNotMet error.

For example, say you have a bucket with two objects in it: Object A, which you added a month ago, and Object B, which you added two years ago. If you apply a retention policy with a retention period of one year to the bucket, you cannot delete or replace Object A for another 11 months: it is currently one month old but must be at least one year old before it can be deleted or replaced. Object B, however, can be deleted or replaced immediately, because its age already exceeds the retention period. If you replace Object B, the age of the new version of Object B resets to 0.
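As a rough illustration, here is a minimal sketch of adding a one-year retention policy to an existing bucket using the google-cloud-storage Python client; the bucket name is a placeholder:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")  # placeholder bucket name

# Retention periods are measured in seconds; 31,557,600 seconds is one year
# (365.25 days).
bucket.retention_period = 31_557_600
bucket.patch()

print(f"Retention period: {bucket.retention_period} seconds")
print(f"Effective time: {bucket.retention_policy_effective_time}")
```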

Further, when working with retention policies, keep in mind the following:
  • Firstly, unless the retention policy is locked, you can increase, decrease, or remove the retention policy from a bucket.
  • Secondly, changing a retention policy is considered a single Class A operation, regardless of the number of objects affected.
  • An object’s editable metadata is not subject to the retention policy and can be modified even when the object itself cannot be.
  • Thirdly, a retention policy contains an effective time, the time after which all objects in the bucket are guaranteed to be in compliance with the retention period.
  • Next, retention policies and Object Versioning are mutually exclusive features in Cloud Storage; for a given bucket, only one of these can be enabled at a time.
  • Object Lifecycle Management can be used to delete objects in a bucket, even if the bucket has a locked retention policy. However, a lifecycle rule will not delete an object until the object has met the retention period.
  • Finally, if your bucket has a retention policy, you should avoid performing parallel composite uploads, because the component parts cannot be deleted until each of them has met the bucket’s minimum retention period.

Retention periods

Cloud Storage measures retention periods in seconds, but for convenience some tools, such as the Google Cloud Console and gsutil, allow you to specify and view retention periods in other time units. In such cases, the following conversions apply:

  • Firstly, a day is considered to be 86,400 seconds.
  • Secondly, a month is considered to be 31 days, which is 2,678,400 seconds.
  • Lastly, a year is considered to be 365.25 days, which is 31,557,600 seconds.

Further, you can set a maximum retention period of 3,155,760,000 seconds (100 years).

Retention policy locks

When you lock a retention policy on a bucket, you prevent the policy from ever being removed and the retention period from ever being reduced. If you try to remove a locked policy or shorten its retention period, the request fails with a 400 BadRequestException error. Once a bucket has a locked retention policy, you cannot delete the bucket until every object in the bucket has met its retention period.

Locking a retention policy is irreversible, so be aware of the consequences before using this option. With an unlocked retention policy, you can remove the policy at any time and resume deleting objects as needed. With a locked retention policy, the only way to “remove” it is to delete the entire bucket, and you cannot delete the bucket while it still contains objects that have not yet reached the end of their retention period. In practice, you must wait until every object in the bucket has met its retention period; at that point, you can delete the bucket.
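Because locking is irreversible, it is usually done deliberately and separately from creating the policy. A minimal sketch with the google-cloud-storage Python client (the bucket name is a placeholder) might look like this:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")  # placeholder bucket name

# get_bucket() loads the bucket's current metageneration, which the lock
# request requires. Locking cannot be undone.
bucket.lock_retention_policy()
print(f"Policy locked: {bucket.retention_policy_locked}")
```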

The data lifecycle has four steps.

  • Firstly, Ingest: The first stage is to pull in the raw data, such as streaming data from devices, on-premises batch data, app logs, or mobile-app user events and analytics.
  • Secondly, Store: After the data is retrieved, it needs to be stored in a format that is durable and can be easily accessed.
  • Thirdly, Process and analyze: In this stage, the data is transformed from its raw form into actionable information.
  • Lastly, Explore and visualize: The final stage is to convert the results of the analysis into a format that is easy to draw insights from and to share with colleagues and peers.

Ingest

There are a number of approaches you can take to collect raw data, based on the data’s size, source, and latency.

  • Firstly, App: Data from app events, such as log files or user events, is typically collected in a push model, where the app calls an API to send the data to storage.
  • Secondly, Streaming: The data consists of a continuous stream of small, asynchronous messages.
  • Thirdly, Batch: Large amounts of data are stored in a set of files that are transferred to storage in bulk.

Figure: Mapping Google Cloud services to app, streaming, and batch data (image source: GCP).

Ingesting app data

Apps and services generate data in large quantities. App event logs, clickstream data, social network interactions, and e-commerce transactions are examples of this type of data. Gathering and analysing this event-driven data can reveal user tendencies and provide useful business insights. Google Cloud offers a number of services for hosting apps, ranging from Compute Engine’s virtual machines to App Engine’s managed platform to container management with Google Kubernetes Engine (GKE).

Consider the following examples:

  • Firstly, Writing data to a file: An app outputs batch CSV files to the object store of Cloud Storage (see the sketch after this list).
  • Secondly, Writing data to a database: An app writes data to one of the databases that Google Cloud provides.
  • Lastly, Streaming data as messages: An app streams data to Pub/Sub, a real-time messaging service.
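As a hedged example of the first case, the following sketch uploads a batch CSV file to a Cloud Storage bucket with the google-cloud-storage Python client; the bucket, object, and local file names are placeholders:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-ingest-bucket")  # placeholder bucket name

# Upload a locally produced batch CSV file as an object in the bucket.
blob = bucket.blob("exports/events-2024-01-01.csv")  # placeholder object name
blob.upload_from_filename("events-2024-01-01.csv")   # placeholder local file
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```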

Ingesting streaming data

Streaming data is sent asynchronously, with no expectation of a response, and the individual messages are small. Streaming data is frequently used in telemetry, which collects data from geographically separated devices. It can also be used to fire event triggers, perform complex session analysis, and feed machine learning tasks.

Here are two common uses of streaming data.

  • Firstly, Telemetry data: Internet of Things (IoT) devices are network-connected devices that gather data from the surrounding environment through sensors.
  • Secondly, User events and analytics: A mobile app might log events when the user opens the app and whenever an error or crash occurs.

Pub/Sub: Real-time messaging

Pub/Sub is a real-time messaging service that allows you to send and receive messages between apps. One of the primary use cases for inter-app messaging is to ingest streaming event data. With streaming data, Pub/Sub automatically manages the details of sharding, replication, load-balancing, and partitioning of the incoming data streams.
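As a small illustration, this sketch publishes a single telemetry message to a Pub/Sub topic with the google-cloud-pubsub Python client; the project ID, topic name, payload, and attribute are placeholders:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Project and topic IDs are placeholders.
topic_path = publisher.topic_path("my-project", "device-telemetry")

# Message payloads are raw bytes; attributes are optional string metadata.
future = publisher.publish(topic_path, b'{"temp_c": 21.4}', device_id="sensor-42")
print(f"Published message ID: {future.result()}")
```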

Ingesting bulk data

Bulk data consists of large datasets that require a lot of aggregate bandwidth between a few sources and the target. The data could be stored in a relational or NoSQL database, or in files such as CSV, JSON, Avro, or Parquet. Consider the following scenarios for ingesting large amounts of data.

  • Firstly, Scientific workloads. Genetics data stored in Variant Call Format (VCF) text files are uploaded to Cloud Storage for later import into Genomics.
  • Secondly, Migrating to the cloud. Moving data stored in an on-premises Oracle database to a fully managed Cloud SQL database using Informatica.
  • Thirdly, Backing up data. Replicating data stored in an AWS bucket to Cloud Storage using Cloud Storage Transfer Service.
  • Lastly, Importing legacy data. Copying ten years worth of website log data into BigQuery for long-term trend analysis.

Storage Transfer Service: Managed file transfer

Storage Transfer Service manages the transfer of data to a Cloud Storage bucket. The data source can be an AWS S3 bucket, a web-accessible URL, or another Cloud Storage bucket. Storage Transfer Service is intended for bulk transfer and is optimized for data volumes greater than 1 TB.

Backing up data is a common use of Storage Transfer Service. You can back up data from other storage providers to a Cloud Storage bucket, or move data between Cloud Storage buckets, such as archiving data from a Standard Storage bucket to an Archive Storage bucket to lower storage costs.
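As a hedged sketch of that archiving case, the following uses the google-cloud-storage-transfer Python client to create a one-time job that copies one Cloud Storage bucket into another; the project and bucket names are placeholders, and a real job would typically also tune the schedule and transfer options:

```python
from datetime import datetime, timezone

from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

# Run once, today; project and bucket names are placeholders.
today = datetime.now(timezone.utc)
run_date = {"day": today.day, "month": today.month, "year": today.year}

job = client.create_transfer_job(
    {
        "transfer_job": {
            "project_id": "my-project",
            "description": "Archive standard bucket to archive bucket",
            "status": storage_transfer.TransferJob.Status.ENABLED,
            "schedule": {
                "schedule_start_date": run_date,
                "schedule_end_date": run_date,
            },
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": "my-standard-bucket"},
                "gcs_data_sink": {"bucket_name": "my-archive-bucket"},
            },
        }
    }
)
print(f"Created transfer job: {job.name}")
```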

BigQuery Data Transfer Service: Managed application data transfer

BigQuery Data Transfer Service schedules and manages data flow from software as a service (SaaS) apps such as Google Ads and Google Ad Manager. Without writing a single line of code, you can lay the groundwork for a data warehouse. Once a data transfer is configured, BigQuery Data Transfer Service loads data into BigQuery on a regular schedule. It also allows for user-initiated data backfills to fill in any gaps or outages.
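As a hedged sketch, the following uses the google-cloud-bigquery-datatransfer Python client to configure a recurring transfer; the project, dataset, display name, data source ID, and params shown here are illustrative assumptions, since each connector defines its own data_source_id and params:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Dataset, display name, data source ID, and params are illustrative
# assumptions; consult the documentation of the connector you actually use.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="ads_reporting",
    display_name="Google Ads daily load",
    data_source_id="google_ads",
    params={"customer_id": "1234567890"},
    schedule="every 24 hours",
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("my-project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")
```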

Transfer Appliance: Shippable, high-capacity storage server

The Google Transfer Appliance is a high-capacity storage server that you rent. You connect it to your network, load it with data, and send it to an upload facility, where it is uploaded to Cloud Storage. The Transfer Appliance is available in a variety of sizes. Furthermore, depending on the nature of your data, you may be able to apply deduplication and compression to significantly boost the appliance’s effective capacity. Calculate the time it will take to upload your data over a network connection to determine when to use Transfer Appliance.

If you determine that the upload would take a week or more, or if you have more than 60 TB of data (regardless of transfer speed), it might be more reliable and expedient to transfer your data using the Transfer Appliance.
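A back-of-the-envelope calculation like the one below is often enough to make that decision; the dataset size and bandwidth are illustrative placeholders:

```python
# Rough estimate of how long a bulk upload takes over a network link.
# Substitute your own dataset size and sustained uplink bandwidth.
dataset_tb = 60          # dataset size in terabytes
bandwidth_mbps = 300     # sustained uplink in megabits per second

dataset_bits = dataset_tb * 1e12 * 8
seconds = dataset_bits / (bandwidth_mbps * 1e6)
print(f"~{seconds / 86_400:.1f} days")  # ~18.5 days here, so Transfer Appliance is worth considering
```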

Process and analyze

In order to derive business value and insights from data, you must transform and analyze it. This requires a processing framework that can either analyze the data directly or prepare it for downstream analysis, as well as tools to analyze and understand the processing results.

  • Firstly, Processing: Data from source systems is cleansed, normalized, and processed across multiple machines, and stored in analytical systems.
  • Secondly, Analysis: Processed data is stored in systems that allow for ad-hoc querying and exploration.
  • Thirdly, Understanding: Based on analytical results, data is used to train and test automated machine-learning models.

Processing large-scale data

Large-scale data processing typically involves reading data from source systems such as Cloud Storage, Cloud Bigtable, or Cloud SQL, and then performing complex normalizations or aggregations on that data. Because the data is frequently too large to fit on a single machine, frameworks are used to manage distributed compute clusters and provide software tools that help with processing.
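This section does not prescribe a specific framework, but as one illustration, here is a minimal Apache Beam pipeline (the SDK used by Dataflow); the bucket paths are placeholders, and running it on Dataflow would additionally require pipeline options such as the runner, project, and region:

```python
import apache_beam as beam

# Read raw CSV lines from Cloud Storage, normalize them, and write the
# results back out. Bucket paths are placeholders.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw lines" >> beam.io.ReadFromText("gs://my-raw-bucket/events-*.csv")
        | "Normalize" >> beam.Map(lambda line: line.strip().lower())
        | "Write results" >> beam.io.WriteToText("gs://my-processed-bucket/events")
    )
```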
