Cloud Dataflow Overview Google Professional Data Engineer GCP

Google Cloud Dataflow

  • is a managed data transformation service.
  • has a unified data processing model that handles both unbounded and bounded datasets.
  • is a serverless platform.
  • lets you write code in the form of pipelines and submit them to Cloud Dataflow for execution.
  • offers autoscaling workers and dynamically rebalances workloads across those workers.
  • provides the open Apache Beam programming model as a managed service for processing data in multiple ways, including
    • batch operations
    • extract-transform-load (ETL) patterns
    • continuous, streaming computation.
  • pipelines operate on data in terms of collections, using the PCollection abstraction.
    • Each PCollection is a distributed set of homogeneous data (all elements have the same type) in the pipeline.
    • A PCollection can represent bounded data (such as a CSV file in Cloud Storage) or an unbounded data source (such as a Cloud Pub/Sub topic).
    • A PCollection is immutable: a transform does not modify its input, it produces a new PCollection.
    • Each element in a PCollection has an associated timestamp, either assigned by the data's source or explicitly defined.
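
The PCollection properties listed above (homogeneous elements, immutability, a timestamp on every element) can be illustrated with a small plain-Python toy model. This is only a sketch and is not the Apache Beam API: the names `Timestamped`, `ToyCollection`, and the sample records are hypothetical and exist purely for illustration. In Beam itself, the analogous step would be applying a transform such as `beam.Map` to a real `PCollection`.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Timestamped(Generic[T]):
    """One element plus its timestamp (hypothetical helper, not a Beam class)."""
    value: T
    timestamp: float  # seconds, either from the source or set explicitly


class ToyCollection(Generic[T]):
    """Toy stand-in for a PCollection: homogeneous, immutable, timestamped."""

    def __init__(self, elements: Iterable[Timestamped[T]]) -> None:
        # Store elements in a tuple so the collection cannot be changed in place.
        self._elements: Tuple[Timestamped[T], ...] = tuple(elements)

    def elements(self) -> Tuple[Timestamped[T], ...]:
        return self._elements

    def map(self, fn: Callable[[T], U]) -> "ToyCollection[U]":
        # A transform never edits its input; it builds a new collection
        # and carries each element's timestamp over unchanged.
        return ToyCollection(
            Timestamped(fn(e.value), e.timestamp) for e in self._elements
        )


# Assumed sample data: two CSV-like records with explicit timestamps.
lines = ToyCollection([
    Timestamped("alice,3", 0.0),
    Timestamped("bob,5", 60.0),
])

# Applying a transform yields a new collection; `lines` is left untouched.
upper = lines.map(str.upper)
```

Here `lines` still holds the original records after `map` runs, which mirrors why a Beam pipeline is expressed as a chain of transforms that each produce a new PCollection instead of mutating one in place.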