Ingest: Google Professional Data Engineer (GCP)

  • Capture raw data; the ingest approach depends on the data’s size, source, and latency requirements.
  • Various ingest sources:
    • App: Data from app events, such as log files or user events.
    • Streaming: A continuous stream of small, asynchronous messages.
    • Batch: Large amounts of data in a set of files, transferred to storage in bulk.

[Figure: Google Cloud services map for app/streaming and batch workloads]

The data transfer model you choose depends on your workload, and each model has different infrastructure requirements.

Ingesting app data

  • Consists of data from apps and services, including:
    • app event logs
    • clickstream data
    • social network interactions
    • e-commerce transactions
  • App data reveals user trends and provides business insights.
  • GCP hosts apps on App Engine (a managed platform) and Google Kubernetes Engine (GKE, a container management service).
  • Use cases for GCP-hosted apps:
    • Writing data to a file: The app outputs batch CSV files to the Cloud Storage object store, which are then imported into BigQuery, a data warehouse, for analysis and querying (see the sketch after this list).
    • Writing data to a database: The app writes data to a GCP database service.
    • Streaming data as messages: The app streams data to Pub/Sub, and another app subscribed to the messages can transfer the data to storage or process it immediately, e.g. for fraud detection.
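
A minimal sketch of the file-based path, using the BigQuery Python client to import a CSV file that an app has already written to Cloud Storage; the project, bucket, and table names are hypothetical:

```python
from google.cloud import bigquery

# Hypothetical names; substitute your own project, bucket, and table.
table_id = "my-project.app_events.daily_events"
uri = "gs://my-app-bucket/events/2024-01-01.csv"

client = bigquery.Client()

# Describe the CSV layout so BigQuery can parse the file on import.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # infer the schema from the file
)

# Start the load job and block until it finishes.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```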

Cloud Logging

  • A centralized log management service.
  • Collects log data from apps running on GCP.
  • Data collected by Cloud Logging can be exported to Cloud Storage, Pub/Sub, and BigQuery.
  • Many GCP services, such as App Engine, automatically record log data to Cloud Logging.
  • Apps can also emit custom logging messages to stdout and stderr (see the sketch after this list).
  • Displays log data in the Logs Viewer.
  • Involves a logging agent, based on fluentd, which runs on VM instances.
  • The agent streams log data to Cloud Logging.
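
A minimal sketch of custom logging from a Python app, assuming the google-cloud-logging client library; once attached, standard logging calls are shipped to Cloud Logging and show up in the Logs Viewer:

```python
import logging

from google.cloud import logging as cloud_logging

# Attach Cloud Logging to Python's standard logging module.
client = cloud_logging.Client()
client.setup_logging()

# These records now flow to Cloud Logging (and the Logs Viewer),
# alongside the app's normal stdout/stderr output.
logging.info("user signed in")
logging.error("payment failed for order 123")
```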

Ingesting streaming data

  • Streaming data is
    • delivered asynchronously
    • sent without expecting a reply
    • small in message size
  • Streaming data can
    • fire event triggers
    • feed complex session analysis
    • serve as input for ML tasks.
  • Streaming data use cases:
    • Telemetry data: Data from network-connected Internet of Things (IoT) devices that gather information about the surrounding environment through sensors.
    • User events and analytics: A mobile app logging events about app usage, crashes, etc. (see the publish sketch below).
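
For instance, a device or mobile app can publish each telemetry event as one small Pub/Sub message. A minimal sketch with the Pub/Sub Python client; the project, topic, and payload fields are hypothetical:

```python
import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "device-telemetry")

# One small, self-contained telemetry event per message.
event = {"device_id": "sensor-42", "temp_c": 21.5, "ts": "2024-01-01T12:00:00Z"}

# publish() is asynchronous: it returns a future rather than waiting for a reply.
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print(f"Published message {future.result()}")  # result() returns the message ID
```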

Pub/Sub

  • A real-time messaging service
    • sends and receives messages between apps
  • A common use case is inter-app messaging to ingest streaming event data.
  • Pub/Sub automatically manages
    • sharding
    • replication
    • load-balancing
    • partitioning of the incoming data streams.
  • Pub/Sub provides global endpoints through GCP’s load balancer, with minimal latency.
  • Scales automatically to meet demand, with no pre-provisioning of system resources.
  • Message streams are organized as topics.
    • Streaming data is published to a topic.
    • Each message carries a unique identifier and timestamp.
  • After data ingestion, apps can retrieve messages by using a topic subscription in a pull or push model.
    • In a push subscription, the Pub/Sub server sends a request to the subscriber app at a preconfigured URL endpoint.
    • In the pull model, the subscriber requests messages from the server and acknowledges receipt (see the sketch after this list).
  • Pub/Sub guarantees message delivery at least once per subscriber.
  • No guarantees about the order of message delivery.
  • For strict message ordering with buffering, use Dataflow for real-time processing
    • After processing, move the data into Datastore/BigQuery.
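
A minimal sketch of the pull model with the Pub/Sub Python client: the subscriber pulls messages from a subscription and explicitly acknowledges each one; the project and subscription names are hypothetical:

```python
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Hypothetical project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "device-telemetry-sub")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # Process the message, then acknowledge receipt so it is not redelivered.
    print(f"Received {message.message_id}: {message.data!r}")
    message.ack()

# Open a streaming pull: the client requests messages from the server.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=30)  # listen for 30 seconds
    except TimeoutError:
        streaming_pull_future.cancel()    # stop pulling new messages
        streaming_pull_future.result()    # wait for the shutdown to finish
```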

Ingesting bulk data

  • Bulk data is
    • made up of large datasets
    • ingested with high aggregate bandwidth between a small number of sources and the target.
  • Data can be
    • files (CSV, JSON, Avro, or Parquet)
    • a relational database
    • a NoSQL database
  • Source data can be on-premises or on other cloud platforms.
  • Use cases
    • Scientific workloads
    • Migrating to the cloud
    • Backing up or replicating data
    • Importing legacy data

Storage Transfer Service

  • Managed file transfer to a Cloud Storage bucket
  • The data source can be
    • an AWS S3 bucket
    • a web-accessible URL
    • another Cloud Storage bucket.
  • Used for bulk transfer
  • Optimized for 1 TB or more data volumes.
  • Commonly used for backing up data to an archive storage bucket.
  • Supports one-time transfers or recurring transfers.
  • Offers advanced filters based on file creation dates, filenames, and times of day.
  • Supports deleting the source data after it has been copied (see the sketch after this list).
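
A sketch of creating a one-time S3-to-Cloud-Storage job, assuming the google-cloud-storage-transfer Python client; the project, bucket names, and credentials are hypothetical placeholders:

```python
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()

transfer_job = {
    "project_id": "my-project",
    "description": "One-time S3 -> Cloud Storage backup",
    "status": storage_transfer.TransferJob.Status.ENABLED,
    "schedule": {
        # Same start and end date makes this a one-time transfer.
        "schedule_start_date": {"year": 2024, "month": 1, "day": 1},
        "schedule_end_date": {"year": 2024, "month": 1, "day": 1},
    },
    "transfer_spec": {
        "aws_s3_data_source": {
            "bucket_name": "my-s3-bucket",
            "aws_access_key": {
                "access_key_id": "AWS_ACCESS_KEY_ID",         # placeholder
                "secret_access_key": "AWS_SECRET_ACCESS_KEY",  # placeholder
            },
        },
        "gcs_data_sink": {"bucket_name": "my-archive-bucket"},
        # Delete from the source once objects have been copied.
        "transfer_options": {"delete_objects_from_source_after_transfer": True},
    },
}

response = client.create_transfer_job({"transfer_job": transfer_job})
print(f"Created transfer job: {response.name}")
```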

Transfer Appliance

  • A shippable, high-capacity storage server.
  • It is leased from Google.
  • Connect it to your network, load the data, and ship the appliance to an upload facility.
  • The appliance comes in multiple sizes.
  • Use the appliance when it is more cost- and time-effective than transferring the data over the network.
  • The appliance deduplicates, compresses, and encrypts the captured data with strong AES-256 encryption, using a password and passphrase supplied by the user. The same password and passphrase are required later when reading the data back from Cloud Storage.

gsutil

  • A command-line utility
  • Moves file-based data from any existing file system into Cloud Storage (see the sketch after this list).
  • Written in Python; runs on Linux, macOS, and Windows.
  • It can also
    • create and manage Cloud Storage buckets
    • edit access rights of objects
    • copy objects from Cloud Storage.
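
A small sketch that drives gsutil from Python via subprocess; each command could equally be run directly in a shell, and the bucket, file, and account names are hypothetical:

```python
import subprocess

def run(*cmd: str) -> None:
    """Run a gsutil command, failing loudly if it errors."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hypothetical bucket, file, and account names.
run("gsutil", "mb", "gs://my-ingest-bucket")                  # create a bucket
run("gsutil", "cp", "events.csv", "gs://my-ingest-bucket/")   # upload a local file
run("gsutil", "acl", "ch", "-u", "analyst@example.com:R",
    "gs://my-ingest-bucket/events.csv")                       # grant read access
run("gsutil", "cp", "gs://my-ingest-bucket/events.csv", ".")  # copy the object back
```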

Database migration

  • For RDBMS data, migrate to Cloud SQL or Cloud Spanner.
  • For data warehouse data, migrate to BigQuery.
  • For NoSQL databases, migrate to Bigtable (for column-oriented NoSQL) or Datastore (for JSON/document-oriented NoSQL).