Movement of data

In this tutorial, we will learn about the Google Cloud services that you can use to manage data throughout its entire lifecycle, from initial acquisition to final visualization. You’ll learn about the features and functionality of each service so you can make an informed decision about which services best fit your workload.

The data lifecycle has four steps.

  • Firstly, Ingest: The first stage is to pull in the raw data, such as streaming data from devices, on-premises batch data, app logs, or mobile-app user events and analytics.
  • Secondly, Store: After the data is ingested, it needs to be stored in a format that is durable and can be easily accessed.
  • Thirdly, Process and analyze: In this stage, the data is transformed from its raw form into actionable information.
  • Lastly, Explore and visualize: The final stage is to convert the results of the analysis into a format that is easy to draw insights from and to share with colleagues and peers.

Ingest

There are a number of approaches you can take to collect raw data, based on the data’s size, source, and latency.

  • Firstly, App: Data from app events, such as log files or user events, is typically collected in a push model, where the app calls an API to send the data to storage.
  • Secondly, Streaming: The data consists of a continuous stream of small, asynchronous messages.
  • Thirdly, Batch: Large amounts of data are stored in a set of files that are transferred to storage in bulk.

Figure: Mapping Google Cloud services to app, streaming, and batch data. (Image source: Google Cloud)

Ingesting app data

Apps and services generate a significant amount of data, including app event logs, clickstream data, social network interactions, and e-commerce transactions. Collecting and analyzing this event-driven data can reveal user trends and provide valuable business insights. Google Cloud provides a variety of services that you can use to host apps, from the virtual machines of Compute Engine, to the managed platform of App Engine, to the container management of Google Kubernetes Engine (GKE).

Consider the following examples:

  • Firstly, Writing data to a file: An app outputs batch CSV files to the object store of Cloud Storage (see the sketch after this list).
  • Secondly, Writing data to a database: An app writes data to one of the databases that Google Cloud provides.
  • Lastly, Streaming data as messages: An app streams data to Pub/Sub, a real-time messaging service.
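
To make the "writing data to a file" pattern concrete, here is a minimal Python sketch that uploads a batch CSV file to Cloud Storage with the google-cloud-storage client library; the project ID, bucket name, and object path are hypothetical.

```python
from google.cloud import storage

# Hypothetical project, bucket, and object names.
client = storage.Client(project="my-project")
bucket = client.bucket("my-app-batch-data")

# Upload a locally generated batch CSV file as an object in Cloud Storage.
blob = bucket.blob("events/2024-01-01/events.csv")
blob.upload_from_filename("events.csv")
print(f"Uploaded to gs://{bucket.name}/{blob.name}")
```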

Ingesting streaming data

Streaming data is delivered asynchronously, without expecting a reply, and the individual messages are small in size. Commonly, streaming data is used for telemetry, collecting data from geographically dispersed devices. Streaming data can also be used for firing event triggers, performing complex session analysis, and as input for machine learning tasks.

Here are two common uses of streaming data.

  • Firstly, Telemetry data: Internet of Things (IoT) devices are network-connected devices that gather data from the surrounding environment through sensors.
  • Secondly, User events and analytics: A mobile app might log events when the user opens the app and whenever an error or crash occurs.

Pub/Sub: Real-time messaging

Pub/Sub is a real-time messaging service that allows you to send and receive messages between apps. One of the primary use cases for inter-app messaging is to ingest streaming event data. With streaming data, Pub/Sub automatically manages the details of sharding, replication, load-balancing, and partitioning of the incoming data streams.
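
Below is a minimal Python sketch of publishing a streaming event to a Pub/Sub topic with the google-cloud-pubsub client library; the project ID, topic name, and event payload are hypothetical.

```python
import json
from google.cloud import pubsub_v1

# Hypothetical project and topic names.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "device-telemetry")

# A small, asynchronous telemetry message; Pub/Sub payloads are byte strings.
event = {"device_id": "sensor-42", "temperature_c": 21.7}
future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))

# publish() is asynchronous; result() blocks until the server assigns a message ID.
print(f"Published message: {future.result()}")
```

A subscriber application, or a pipeline service such as Dataflow, would then pull these messages from a subscription attached to the topic.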

Ingesting bulk data

Bulk data consists of large datasets where ingestion requires high aggregate bandwidth between a small number of sources and the target. The data could be stored in files, such as CSV, JSON, Avro, or Parquet files, or in a relational or NoSQL database. Consider the following use cases for ingesting bulk data.

  • Firstly, Scientific workloads. Genetics data stored in Variant Call Format (VCF) text files are uploaded to Cloud Storage for later import into Genomics.
  • Secondly, Migrating to the cloud. Moving data stored in an on-premises Oracle database to a fully managed Cloud SQL database using Informatica.
  • Thirdly, Backing up data. Replicating data stored in an AWS bucket to Cloud Storage using Cloud Storage Transfer Service.
  • Lastly, Importing legacy data. Copying ten years worth of website log data into BigQuery for long-term trend analysis.

Storage Transfer Service: Managed file transfer

Storage Transfer Service manages the transfer of data to a Cloud Storage bucket. The data source can be an AWS S3 bucket, a web-accessible URL, or another Cloud Storage bucket. Storage Transfer Service is intended for bulk transfer and is optimized for data volumes greater than 1 TB.

Backing up data is a common use of Storage Transfer Service. You can back up data from other storage providers to a Cloud Storage bucket, or you can move data between Cloud Storage buckets, such as archiving data from a Standard Storage bucket to an Archive Storage bucket to lower storage costs.
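
As an illustration of the backup use case, the following Python sketch creates a transfer job that copies an AWS S3 bucket into a Cloud Storage bucket using the google-cloud-storage-transfer client library. The project ID and bucket names are hypothetical, and credentials for the S3 source are assumed to be configured separately.

```python
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

# Hypothetical project and bucket names. Credentials for the S3 source
# (for example, an AWS access key or role) are assumed to be supplied
# separately in the transfer_spec.
transfer_job = {
    "project_id": "my-project",
    "description": "Backup of an S3 bucket to Cloud Storage",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    "transfer_spec": {
        "aws_s3_data_source": {"bucket_name": "my-aws-bucket"},
        "gcs_data_sink": {"bucket_name": "my-gcs-backup-bucket"},
    },
}

job = client.create_transfer_job({"transfer_job": transfer_job})
print(f"Created transfer job: {job.name}")
```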

BigQuery Data Transfer Service: Managed application data transfer

BigQuery Data Transfer Service automates data movement from software as a service (SaaS) applications such as Google Ads and Google Ad Manager on a scheduled, managed basis, laying the foundation for a data warehouse without writing a single line of code. After the data transfer is configured, BigQuery Data Transfer Service automatically loads data into BigQuery on a regular basis. It also supports user-initiated data backfills to recover from any outages or gaps.
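
The following Python sketch shows roughly what configuring such a transfer looks like with the google-cloud-bigquery-datatransfer client library; the project ID, destination dataset, connector ID, and customer ID are hypothetical, and the exact parameters depend on the data source.

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical project, destination dataset, and Google Ads customer ID.
parent = client.common_project_path("my-project")

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="ads_warehouse",
    display_name="Daily Google Ads import",
    data_source_id="google_ads",            # connector ID for the SaaS source
    params={"customer_id": "1234567890"},   # source-specific parameters
    schedule="every 24 hours",
)

created = client.create_transfer_config(parent=parent, transfer_config=transfer_config)
print(f"Created transfer config: {created.name}")
```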

Transfer Appliance: Shippable, high-capacity storage server

Transfer Appliance is a high-capacity storage server that you lease from Google. You connect it to your network, load it with data, and ship it to an upload facility where the data is uploaded to Cloud Storage. Transfer Appliance comes in multiple sizes. In addition, depending on the nature of your data, you might be able to use deduplication and compression to substantially increase the effective capacity of the appliance. To determine when to use Transfer Appliance, calculate the amount of time it would take to upload your data over a network connection.

If you determine that it would take a week or more, or if you have more than 60 TB of data (regardless of transfer speed), it might be more reliable and expedient to transfer your data by using Transfer Appliance.
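
As a rough illustration of that calculation, the short Python sketch below estimates how long a network upload would take; the dataset size, link speed, and utilization figure are illustrative assumptions.

```python
# Back-of-the-envelope estimate of network upload time, to help decide between
# a network transfer and Transfer Appliance. All figures are illustrative.
data_tb = 60          # dataset size in terabytes
link_gbps = 1         # available bandwidth in gigabits per second
utilization = 0.7     # assume the link is not fully dedicated to the transfer

data_bits = data_tb * 1e12 * 8                        # terabytes -> bits
seconds = data_bits / (link_gbps * 1e9 * utilization)
print(f"Estimated upload time: {seconds / 86400:.1f} days")   # roughly 8 days here
```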

Process and analyze

In order to derive business value and insights from data, you must transform and analyze it. This requires a processing framework that can either analyze the data directly or prepare it for downstream analysis, as well as tools to analyze and understand the processing results.

  • Firstly, Processing: Data from source systems is cleansed, normalized, and processed across multiple machines, and stored in analytical systems.
  • Secondly, Analysis: Processed data is stored in systems that allow for ad-hoc querying and exploration.
  • Thirdly, Understanding: Based on analytical results, data is used to train and test automated machine-learning models.

Processing large-scale data

Large-scale data processing typically involves reading data from source systems such as Cloud Storage, Cloud Bigtable, or Cloud SQL, and then conducting complex normalizations or aggregations of that data. In many cases, the data is too large to fit on a single machine, so frameworks are used to manage distributed compute clusters and to provide software tools that aid processing.

Dataproc: Managed Apache Hadoop and Apache Spark

The capability to deal with extremely large datasets has evolved since Google first published the MapReduce paper in 2004. Many organizations now load and store data in the Hadoop Distributed File System (HDFS) and run periodic aggregations, reports, or transformations using traditional batch-oriented tools such as Hive or Pig. Hadoop also has a large ecosystem that supports activities such as machine learning using Mahout, log ingestion using Flume, and statistics using R. The results of this Hadoop-based data processing are business critical, and it is a non-trivial exercise for an organization dependent on these processes to migrate them to a new framework.

On the other hand, Spark has gained popularity over the past few years as an alternative to Hadoop MapReduce. Spark’s performance is generally considerably faster than Hadoop MapReduce, which it achieves by distributing datasets and computation in memory across a cluster. In addition to speed increases, this distribution gives Spark the ability to deal with streaming data using Spark Streaming, as well as traditional batch analytics, with transformations and aggregations expressed through Spark SQL and a simple API. The Spark community is very active, with several popular libraries including MLlib, which can be used for machine learning.
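
To give a flavor of the Spark SQL-style batch aggregation described above, here is a minimal PySpark sketch; the input path, column names, and output location are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# Read raw app logs from Cloud Storage and aggregate them with Spark SQL-style
# transformations; the path and column names are hypothetical.
events = spark.read.json("gs://my-bucket/app-logs/*.json")
daily_counts = events.groupBy("event_type").agg(F.count("*").alias("event_count"))

daily_counts.write.mode("overwrite").parquet("gs://my-bucket/aggregates/daily")
```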

Dataproc provides the ease and flexibility to spin up Spark or Hadoop clusters on demand when they are needed, and to terminate them when they are no longer needed. Consider the following use cases.

  • Firstly, Log processing. With minimal modification, you can process large amounts of text log data per day from several sources using existing MapReduce.
  • Secondly, Reporting: Aggregate data into reports and store the data in BigQuery. Then you can push the aggregate data to apps that power dashboards and conduct analysis.
  • Thirdly, On-demand Spark clusters. Quickly launch ad hoc clusters to analyze data stored in blob storage using Spark (Spark SQL, PySpark, Spark shell), as shown in the sketch after this list.
  • Lastly, Machine learning. Use the Spark Machine Learning Libraries (MLlib), which are preinstalled on the cluster, to customize and run classification algorithms.
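
As a sketch of the on-demand workflow, the following Python example submits a PySpark job to an existing Dataproc cluster using the google-cloud-dataproc client library; the project ID, region, cluster name, and script location are hypothetical.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Hypothetical project, cluster, and PySpark script location in Cloud Storage.
job = {
    "placement": {"cluster_name": "ephemeral-spark-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/aggregate.py"},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
)
result = operation.result()  # blocks until the job finishes
print(f"Job finished with state: {result.status.state.name}")
```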

Reference: Google Documentation
