Ingest Google Professional Data Engineer GCP
- Ingestion captures raw data; the approach depends on the data’s size, source, and latency.
 - Various ingest sources
- App: Data from app events, such as log files or user events.
 - Streaming: A continuous stream of small, asynchronous messages.
 - Batch: Large amounts of data in a set of files, transferred to storage in bulk.
 
 
[Figure: Google Cloud services map for app/streaming and batch workloads]

The data transfer model you choose depends on workload, and each model has different infrastructure requirements.
Ingesting app data
- App data is generated by apps and services and includes
 - app event logs
 - clickstream data
 - social network interactions
 - e-commerce transactions
 - App data reveals user trends and provides business insights.
 - GCP hosts apps on services such as App Engine (a managed platform) and Google Kubernetes Engine (GKE, container management).
 - Use cases for GCP-hosted apps
- Writing data to a file: The app outputs batch CSV files to Cloud Storage, the object store. From there, the import function of BigQuery, a data warehouse, loads the data for analysis and querying.
 - Writing data to a database: The app writes data to one of the GCP database services.
 - Streaming data as messages: The app streams data to Pub/Sub. Another app, subscribed to the messages, can transfer the data to storage or process it immediately in situations such as fraud detection.
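
The file-based use case above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the event fields (`user_id`, `event`, `ts`) and the function name `events_to_csv` are hypothetical, and the upload to Cloud Storage and the BigQuery import are deliberately not shown.

```python
import csv
import io

# Hypothetical event schema, chosen only for illustration.
FIELDS = ["user_id", "event", "ts"]


def events_to_csv(events):
    """Serialize a batch of event dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for event in events:
        writer.writerow(event)
    return buf.getvalue()


# A small batch of app events, as an app might accumulate before writing a file.
batch = [
    {"user_id": "u1", "event": "login", "ts": "2024-01-01T00:00:00Z"},
    {"user_id": "u2", "event": "purchase", "ts": "2024-01-01T00:00:05Z"},
]
csv_text = events_to_csv(batch)
print(csv_text)
```

The resulting text would be written to a file and copied to a Cloud Storage bucket, where a BigQuery import could pick it up.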
 
 
Cloud Logging
- A centralized log management service
 - Collects log data from apps running on GCP.
 - Data collected by Cloud Logging can be exported to Cloud Storage, Pub/Sub, and BigQuery.
 - Many GCP services, such as App Engine, automatically record log data to Cloud Logging.
 - Apps can also write custom log messages to stdout and stderr.
 - Log data is displayed in the Logs Viewer.
 - Uses a logging agent, based on fluentd, that runs on VM instances.
 - The agent streams log data to Cloud Logging.
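
Writing custom log messages to stdout and stderr can be as simple as emitting one JSON object per line. The sketch below uses only the standard library; the helper names `log_line` and `emit` and the `order_id` field are hypothetical. It assumes the hosting runtime treats the `severity` and `message` keys as structured log fields, which should be checked for the runtime in use.

```python
import json
import sys


def log_line(severity, message, **fields):
    """Build one JSON log line; extra keyword fields are illustrative."""
    entry = {"severity": severity, "message": message}
    entry.update(fields)
    return json.dumps(entry, sort_keys=True)


def emit(severity, message, stream=sys.stdout, **fields):
    """Write the line to stdout (or stderr) so it can be collected as log data."""
    stream.write(log_line(severity, message, **fields) + "\n")
    stream.flush()


emit("INFO", "order placed", order_id="o-123")
emit("ERROR", "payment failed", stream=sys.stderr, order_id="o-124")
```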
 
Ingesting streaming data
- Streaming data is
- delivered asynchronously
 - without expecting a reply
 - small in size
 
 - Streaming data can
- fire event triggers
 - perform complex session analysis
 - be input for ML tasks.
 
 - Streaming data use cases
- Telemetry data: Data from network-connected Internet of Things (IoT) devices that gather data about their surrounding environment through sensors.
 - User events and analytics: A mobile app logs events about app usage, crashes, and so on.
 
 
Pub/Sub
- A real-time messaging service
- sends and receives messages between apps
 
 - One use case is inter-app messaging to ingest streaming event data.
 - Pub/Sub automatically manages
- sharding
 - replication
 - load-balancing
 - partitioning of the incoming data streams.
 
 - Pub/Sub has global endpoints that use the GCP load balancer, with minimal latency.
 - It scales automatically to meet demand, without pre-provisioning system resources.
 - Message streams are organized as topics.
- Streaming data targets a topic.
 - Each message has a unique identifier and a timestamp.
 
 - After data ingestion, apps can retrieve messages by using a topic subscription in a pull or push model.
- In the push model, the server sends a request to the subscriber app at a preconfigured URL endpoint.
 - In the pull model, the subscriber requests messages from the server and acknowledges receipt.
 
 - Pub/Sub guarantees message delivery at least once per subscriber.
 - It makes no guarantees about the order of message delivery.
 - If strict message ordering with buffering is required, use Dataflow for real-time processing of the stream.
- After processing, move the data into Datastore or BigQuery.
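
The pull model, at-least-once delivery, and the lack of ordering described above can be illustrated with a small in-memory simulation. This does not use the Pub/Sub client library: `FakeSubscription`, `pull`, `ack`, and `consume` are hypothetical names, and the buffering consumer is only a simplified stand-in for what a Dataflow pipeline would do.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str   # unique identifier assigned at publish time
    timestamp: float  # publish timestamp
    data: str


class FakeSubscription:
    """In-memory stand-in for a pull subscription (not the real API)."""

    def __init__(self, messages, seed=0):
        self._pending = {m.message_id: m for m in messages}
        self._rng = random.Random(seed)

    def pull(self, max_messages=10):
        # Unacknowledged messages come back in arbitrary order; anything
        # that is never acked is returned again (at-least-once delivery).
        batch = list(self._pending.values())
        self._rng.shuffle(batch)
        return batch[:max_messages]

    def ack(self, message_id):
        self._pending.pop(message_id, None)


def consume(subscription, lose_first_ack=True):
    """Pull until empty, drop duplicates by ID, then order by timestamp."""
    seen = {}
    ack_lost = not lose_first_ack
    while True:
        batch = subscription.pull()
        if not batch:
            break
        for msg in batch:
            seen.setdefault(msg.message_id, msg)  # ignore redeliveries
            if not ack_lost:
                ack_lost = True  # simulate one lost ack, forcing a redelivery
                continue
            subscription.ack(msg.message_id)
    return sorted(seen.values(), key=lambda m: m.timestamp)


msgs = [Message(f"id-{i}", float(i), f"event-{i}") for i in range(5)]
ordered = consume(FakeSubscription(msgs))
print([m.data for m in ordered])
```

The consumer has to be idempotent (here, by remembering message IDs) because a message can arrive more than once, and it has to buffer before sorting because arrival order carries no meaning.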
 
 
Ingesting bulk data
- Bulk data
- consists of large datasets
 - needs high aggregate bandwidth between a small number of sources and the target for ingestion.
 
 - Data can be stored in
- files (CSV, JSON, Avro, or Parquet)
 - a relational database
 - a NoSQL database
 
 - Source data can be on-premises or on other cloud platforms.
 - Use cases
- Scientific workloads
 - Migrating to the cloud
 - Backing up or replicating data
 - Importing legacy data
 
 
Storage Transfer Service
- Managed file transfer to a Cloud Storage bucket
 - Data source can be
- an AWS S3 bucket
 - a web-accessible URL
 - another Cloud Storage bucket.
 
 - Used for bulk transfer
 - Optimized for data volumes of 1 TB or more.
 - Commonly used for backing up data to an archive storage bucket.
 - Supports one-time transfers or recurring transfers.
 - Supports advanced filters based on file creation dates, filenames, and times of day.
 - Supports the deletion of the source data after it’s been copied.
 
Transfer Appliance
- A shippable, high-capacity storage server
 - It is leased from Google.
 - Connect it to your network, load your data, and ship it to an upload facility.
 - The appliance comes in multiple sizes.
 - Use the appliance when it is more feasible in cost and time than transferring the data over the network.
 - The appliance deduplicates, compresses, and encrypts captured data with AES-256 encryption, using a password and passphrase that you provide. The same password and passphrase are needed when reading the data from Cloud Storage.
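
A rough way to judge time feasibility is to estimate how long an online transfer would take at the available bandwidth. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a hypothetical `efficiency` factor for the share of nominal bandwidth actually achieved; the example figures are illustrative, not measured.

```python
def transfer_days(data_tb, bandwidth_mbps, efficiency=0.8):
    """Estimate the days needed to move data_tb terabytes over a network link.

    data_tb: data size in terabytes (1 TB = 10**12 bytes).
    bandwidth_mbps: nominal link speed in megabits per second.
    efficiency: assumed fraction of nominal bandwidth actually achieved.
    """
    if bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    bits = data_tb * 10**12 * 8
    seconds = bits / (bandwidth_mbps * 10**6 * efficiency)
    return seconds / 86_400  # seconds per day


days_100mbps = transfer_days(100, 100)     # 100 TB over a 100 Mbps link
days_10gbps = transfer_days(100, 10_000)   # 100 TB over a 10 Gbps link
print(f"{days_100mbps:.1f} days vs {days_10gbps:.2f} days")
```

Under these assumptions, 100 TB takes roughly 116 days at 100 Mbps but a little over one day at 10 Gbps, which is the kind of gap that makes a shipped appliance worth considering on slower links.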
 
gsutil
- A command-line utility
 - moves file-based data from any existing file system into Cloud Storage.
 - Written in Python and runs on Linux, macOS and Windows.
 - It can also
- create and manage Cloud Storage buckets
 - edit access rights of objects
 - copy objects from Cloud Storage.
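
The operations listed above map onto a handful of gsutil subcommands. The sketch below only builds the command lines in Python and does not run them; the bucket, path, and user names are placeholders, and the helper function names are hypothetical. It assumes the `mb`, `cp`, and `acl ch` subcommands and the top-level `-m` (parallel) option as documented for gsutil.

```python
import shlex


def make_bucket_cmd(bucket):
    """Command line that creates a Cloud Storage bucket."""
    return ["gsutil", "mb", f"gs://{bucket}"]


def upload_cmd(local_path, bucket, recursive=False, parallel=False):
    """Command line that copies local files into a bucket."""
    cmd = ["gsutil"]
    if parallel:
        cmd.append("-m")  # top-level option: run the copy in parallel
    cmd.append("cp")
    if recursive:
        cmd.append("-r")  # copy a directory tree
    cmd += [local_path, f"gs://{bucket}/"]
    return cmd


def download_cmd(bucket, object_name, local_path):
    """Command line that copies an object from a bucket to a local path."""
    return ["gsutil", "cp", f"gs://{bucket}/{object_name}", local_path]


def grant_read_cmd(bucket, object_name, user_email):
    """Command line that grants one user read access to an object."""
    target = f"gs://{bucket}/{object_name}"
    return ["gsutil", "acl", "ch", "-u", f"{user_email}:R", target]


commands = [
    make_bucket_cmd("example-bucket"),
    upload_cmd("./exports", "example-bucket", recursive=True, parallel=True),
    download_cmd("example-bucket", "exports/a.csv", "./a.csv"),
    grant_read_cmd("example-bucket", "exports/a.csv", "user@example.com"),
]
for cmd in commands:
    print(shlex.join(cmd))
```

To execute a command rather than print it, the argument list could be passed to `subprocess.run` on a machine where gsutil is installed and authenticated.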
 
 
Database migration
- For RDBMS data, migrate to Cloud SQL or Cloud Spanner.
 - For data warehouse data, migrate to BigQuery.
 - For NoSQL databases, migrate to Bigtable (for column-oriented NoSQL) or Datastore (for JSON-oriented NoSQL).
 
		