Cloud Dataflow Google Professional Data Engineer GCP

  • Fully managed service.
  • Executes data processing patterns.
  • Uses the Apache Beam SDK.
  • Can develop both batch and streaming pipelines (a minimal example follows this list).
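
A minimal sketch of a batch Beam pipeline (Python SDK), assuming hypothetical gs:// paths, that reads text files, counts words, and writes sharded output:

```python
import apache_beam as beam

# Read text, count words, write results. The gs:// paths are hypothetical placeholders.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadLines" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "SumPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "WriteCounts" >> beam.io.WriteToText("gs://my-bucket/output/counts")  # sharded by default
    )
```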

Pipeline

  • It manages data, similar to a factory assembly line.
  • Tasks include
    • Transforming data
    • Aggregating data and computing results.
    • Enriching data.
    • Moving data.
  • Dataflow views all data as streaming (batch is simply a bounded stream).
  • It breaks dataflows into smaller units
  • Tasks are written in Java or Python using the Beam SDK.
  • Jobs are executed in parallel
  • Same code is used for streaming and batch
  • Pipeline is a directed graph of steps
  • Source/sink can be a filesystem, Cloud Storage (GCS), BigQuery, or Pub/Sub.
  • Runner can be a local machine (DirectRunner) or the Dataflow service in the cloud (see the runner-options sketch after this list).
  • Output data written can be sharded or unsharded.
  • Inputs and outputs are PCollections; a PCollection is not necessarily held in memory and can be unbounded.
  • Each transform is given a name.
  • Read from a source, write to a sink.
  • Common tasks
    • Convert incoming data to a common format.
    • Prepare data for analysis and visualization.
    • Migrate between databases.
    • Share data processing logic across web apps, batch jobs, and APIs.
    • Power data ingestion and integration tools.
    • Consume large XML, CSV, and fixed-width files.
    • Replace batch jobs with real-time data.
  • Pipeline components
    • Data nodes – for input data
    • Activities – the definition of the work to do
    • Preconditions
    • Resources
    • Actions – an action that is triggered if conditions are met
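
As noted in the runner bullet above, the same pipeline code can target either a local runner or the Dataflow service just by changing its options. A hedged sketch, assuming hypothetical project, region, and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner runs on the local machine; DataflowRunner submits the job
# to the Cloud Dataflow service. All names below are hypothetical.
options = PipelineOptions(
    runner="DataflowRunner",        # swap to "DirectRunner" for local testing
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "ParseRecord" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 5)  # assumed record width
        | "FormatLine" >> beam.Map(lambda fields: ",".join(fields))
        | "WriteSharded" >> beam.io.WriteToText("gs://my-bucket/clean/part")  # sharded output
    )
```

Because each step is a named transform over a PCollection, these names are what appear as steps in the job graph.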

 

Types

Types of pipelines are

  • Batch
    • useful for processing large volumes of data at a regular interval
    • no real-time processing needed
  • Real-time
    • processes data as it arrives (see the streaming sketch below)
    • reads from a streaming source
  • Cloud native
    • optimized to work with cloud-based data, such as data in AWS S3 buckets
    • tools are hosted in the cloud
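
A sketch of a real-time pipeline, assuming a hypothetical Pub/Sub topic and BigQuery table: messages are read as they arrive, windowed into one-minute intervals, and the per-window count is appended to BigQuery.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Mark the pipeline as streaming; topic, project, and table names are hypothetical.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Window1Min" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "CountPerWindow" >> beam.combiners.Count.Globally().without_defaults()
        | "Format" >> beam.Map(lambda n: {"event_count": n})
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.event_counts",      # hypothetical table
            schema="event_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```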