Data growth management

In this tutorial, we will learn about data growth management on Google Cloud Platform (GCP).

Cloud storage growth

Cloud storage is excellent at resolving today’s data storage consolidation and capacity issues. Many Google Cloud Platform (GCP) customers rely on Cloud Storage for unified object storage because it gives them the freedom to store and relocate data as required. You can use Transfer Appliance to easily move petabytes into Cloud Storage, Dataflow to stream data into Cloud Storage, and Storage Transfer Service to transfer data from AWS S3 to Cloud Storage.

It’s easy to store and use data in Cloud Storage, but data is still being created at an astonishing and unpredictable rate, and unpredictable creation means unpredictable cost. Google developed the Storage Growth Plan to help enterprise customers manage storage costs and deliver the forecasting and predictability that is often asked of IT organizations.

Adding geo-redundancy and price drops for Cloud Storage

In addition to launching this new approach to purchasing Cloud Storage, Google passes continuous technical advancement on to customers in the form of price reductions. Cloud Storage Coldline, the storage class intended for the least frequently accessed data, is now geo-redundant in multi-regional locations: Coldline data is protected against regional failure by storing a second copy of your data at least 100 miles away in another region.
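
As a minimal sketch (assuming the google-cloud-storage Python client; the bucket name is hypothetical), creating a Coldline bucket in a multi-regional location looks like this:

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical bucket name; bucket names must be globally unique.
bucket = storage.Bucket(client, name="example-coldline-archive")
bucket.storage_class = "COLDLINE"

# "US" is a multi-regional location, so Coldline data stored here
# is geo-redundant across regions.
client.create_bucket(bucket, location="US")
```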

The data lifecycle has four steps:

  • Firstly, Ingest: The first stage is to pull in the raw data, such as streaming data from devices, on-premises batch data, app logs, or mobile-app user events and analytics.
  • Secondly, Store: Once the data has been ingested, it needs to be stored in a format that is durable and can be easily accessed.
  • Thirdly, Process and analyze: In this stage, the data is transformed from its raw form into actionable information.
  • Lastly, Explore and visualize: The final stage is to convert the results of the analysis into a format that is easy to draw insights from and to share with colleagues and peers.

Ingest

You may acquire raw data in a variety of ways, depending on the amount, source, and latency of the data.

  • Firstly, App: Data from app events, such as log files or user events, is typically collected in a push model, where the app calls an API to send the data to storage.
  • Secondly, Streaming: The data consists of a continuous stream of small, asynchronous messages.
  • Thirdly, Batch: Large amounts of data are stored in a set of files that are transferred to storage in bulk.

Figure: Mapping Google Cloud services to app, streaming, and batch data. (Image source: Google Cloud)

Ingesting app data

Apps and services generate data in large quantities. App event logs, clickstream data, social network interactions, and e-commerce transactions are examples of this type of data. Gathering and analyzing this event-driven data can reveal user tendencies and give useful business insights. Google Cloud offers a number of services for hosting apps, ranging from Compute Engine’s virtual machines to App Engine’s managed platform to Google Kubernetes Engine’s (GKE) container management.

Consider the following examples:

  • Firstly, Writing data to a file: An app outputs batch CSV files to the object store of Cloud Storage (see the sketch after this list).
  • Secondly, Writing data to a database: An app writes data to one of the databases that Google Cloud provides.
  • Lastly, Streaming data as messages: An app streams data to Pub/Sub, a real-time messaging service.
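
For the first example, a minimal sketch using the google-cloud-storage Python client (the bucket and file names are hypothetical):

```python
from google.cloud import storage

client = storage.Client()

# Hypothetical bucket and object names.
bucket = client.bucket("example-app-exports")
blob = bucket.blob("events/2024-01-01.csv")

# Upload a batch CSV file produced by the app.
blob.upload_from_filename("/tmp/events.csv")
```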

Ingesting streaming data

Streaming data is sent asynchronously, with no expectation of a response, and the individual messages are small. Streaming data is frequently used in telemetry, which collects data from geographically separated sensors. Streaming data may also be used to fire event triggers, perform complex session analysis, and feed machine learning tasks.

Here are two common uses of streaming data.

  • Firstly, Telemetry data: Internet of Things (IoT) devices are network-connected devices that gather data from the surrounding environment through sensors.
  • Secondly, User events and analytics: A mobile app might log events when the user opens the app and whenever an error or crash occurs.

Pub/Sub: Real-time messaging

Pub/Sub is a real-time messaging platform that lets you send and receive messages between apps in real time. One of its most common uses is ingesting streaming event data for inter-app communication. Pub/Sub automatically manages the complexities of sharding, replication, load-balancing, and partitioning of incoming data streams.
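
As a minimal sketch (assuming the google-cloud-pubsub Python client; the project, topic, and attribute names are hypothetical), publishing a small event message looks like this:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()

# Hypothetical project and topic IDs.
topic_path = publisher.topic_path("example-project", "device-telemetry")

# Messages are small, asynchronous byte payloads; attributes are
# optional string metadata attached to the message.
future = publisher.publish(topic_path, data=b'{"temp": 21.5}', device_id="sensor-42")
print(future.result())  # Blocks until the server returns a message ID.
```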

Ingesting bulk data

Bulk data consists of enormous datasets that require a lot of aggregate bandwidth between a small number of sources and the destination. The data might be stored in a relational or NoSQL database, or in files such as CSV, JSON, Avro, or Parquet. Consider the following scenarios for ingesting large amounts of data.

  • Firstly, Scientific workloads. Genetics data stored in Variant Call Format (VCF) text files are uploaded to Cloud Storage for later import into Genomics.
  • Secondly, Migrating to the cloud. Moving data stored in an on-premises Oracle database to a fully managed Cloud SQL database using Informatica.
  • Thirdly, Backing up data. Replicating data stored in an AWS bucket to Cloud Storage using Cloud Storage Transfer Service.
  • Lastly, Importing legacy data. Copying ten years’ worth of website log data into BigQuery for long-term trend analysis (see the sketch after this list).
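
For the last scenario, a minimal sketch using the google-cloud-bigquery Python client (the bucket, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table and source files.
table_id = "example-project.web_analytics.access_logs"
uri = "gs://example-bucket/logs/*.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema from the files
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to complete.
```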

Storage Transfer Service: Managed file transfer

Storage Transfer Service manages the transfer of data to a Cloud Storage bucket. The data source can be an AWS S3 bucket, a web-accessible URL, or another Cloud Storage bucket. Storage Transfer Service is intended for bulk transfer and is optimized for data volumes greater than 1 TB.

Backing up data is a common use of Storage Transfer Service. You can back up data from other storage providers to a Cloud Storage bucket, or move data between Cloud Storage buckets, such as archiving data from a Standard Storage bucket to an Archive Storage bucket to lower storage costs.
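
A minimal sketch of creating a one-time S3-to-Cloud-Storage transfer job through the Storage Transfer API, using the google-api-python-client discovery client (the project and bucket names are hypothetical, and the AWS key fields are placeholders you would fill from your own configuration):

```python
import googleapiclient.discovery

# Build a client for the Storage Transfer API
# (uses application default credentials).
storagetransfer = googleapiclient.discovery.build("storagetransfer", "v1")

transfer_job = {
    "description": "One-time backup of an S3 bucket to Cloud Storage",
    "status": "ENABLED",
    "projectId": "example-project",           # hypothetical project ID
    "transferSpec": {
        "awsS3DataSource": {
            "bucketName": "example-aws-bucket",
            "awsAccessKey": {
                "accessKeyId": "AWS_ACCESS_KEY_ID",          # placeholder
                "secretAccessKey": "AWS_SECRET_ACCESS_KEY",  # placeholder
            },
        },
        "gcsDataSink": {"bucketName": "example-gcs-bucket"},
    },
    "schedule": {
        # Equal start and end dates make this a one-time transfer.
        "scheduleStartDate": {"year": 2024, "month": 1, "day": 1},
        "scheduleEndDate": {"year": 2024, "month": 1, "day": 1},
    },
}

result = storagetransfer.transferJobs().create(body=transfer_job).execute()
print(result["name"])
```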

BigQuery Data Transfer Service: Managed application data transfer

BigQuery Data Transfer Service schedules and manages data movement from software-as-a-service (SaaS) apps such as Google Ads and Google Ad Manager into BigQuery. Without writing a single line of code, you can lay the groundwork for a data warehouse. Once a transfer is set up, BigQuery Data Transfer Service loads data into BigQuery on a regular schedule, and it also allows user-initiated data backfills to fill in any gaps or outages.
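
A minimal sketch using the google-cloud-bigquery-datatransfer Python client (the project, dataset, and customer ID are hypothetical, and the data_source_id value and params keys are assumptions; the exact parameters depend on the data source you connect):

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical project and destination dataset.
parent = client.common_project_path("example-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="marketing_warehouse",
    display_name="Daily Google Ads import",
    data_source_id="google_ads",           # assumption: ID of the Ads source
    params={"customer_id": "1234567890"},  # hypothetical Ads customer ID
)

config = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)
print(config.name)
```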

Transfer Appliance: Shippable, high-capacity storage server

The Google Transfer Appliance is a high-capacity storage server that you rent. You connect it to your network, load it with data, and ship it to an upload facility, where the data is transferred to Cloud Storage. The Transfer Appliance is available in a variety of sizes, and depending on the nature of your data, you may be able to apply deduplication and compression to significantly boost the appliance’s effective capacity. To decide when to use Transfer Appliance, calculate the time it would take to upload your data over your network connection.

However, if you determine that it would take a week or more, or if you have more than 60 TB of data (regardless of transfer speed), it might be more reliable and expedient to transfer your data by using the Transfer Appliance.
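
As a back-of-the-envelope sketch of that calculation (the dataset size, link speed, and utilization factor are hypothetical):

```python
# Rough upload-time estimate for deciding between network transfer
# and Transfer Appliance. All values are hypothetical.
data_tb = 60                      # dataset size in terabytes
link_mbps = 1000                  # nominal network bandwidth in megabits/s
utilization = 0.7                 # realistic share of the link you can sustain

data_bits = data_tb * 1e12 * 8    # terabytes -> bits
seconds = data_bits / (link_mbps * 1e6 * utilization)
days = seconds / 86400

print(f"~{days:.1f} days")        # ~7.9 days here: Transfer Appliance territory
```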

Process and analyze

To derive business value and insights from data, you must transform and analyze it. This requires a processing framework that can either analyze the data directly or prepare it for downstream analysis, as well as tools to analyze and understand the processing results.

  • Firstly, Processing: Data from source systems is cleansed, normalized, and processed across multiple machines, and stored in analytical systems.
  • Secondly, Analysis: Processed data is stored in systems that allow for ad-hoc querying and exploration.
  • Thirdly, Understanding: Based on analytical results, data is used to train and test automated machine-learning models.

Processing large-scale data

Large-scale data processing typically involves reading data from source systems such as Cloud Storage, Cloud Bigtable, or Cloud SQL, and then performing complicated normalizations or aggregations on that data. Because the data is frequently too large to fit on a single machine, frameworks manage distributed compute clusters and provide software tools that help with processing.
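
As a minimal sketch of this pattern using Apache Beam’s Python SDK, which Dataflow runs as a managed service (the file paths are hypothetical; by default this runs locally on the DirectRunner):

```python
import apache_beam as beam

# Read raw logs from Cloud Storage, normalize each line,
# and write the cleaned results back to Cloud Storage.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/*.csv")
        | "Normalize" >> beam.Map(lambda line: line.strip().lower())
        | "Write" >> beam.io.WriteToText("gs://example-bucket/clean/output")
    )
```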


Reference: Google Cloud documentation
