Key Concepts Google Professional Data Engineer GCP


Pipelines

  • Encapsulates all the steps involved in reading input data, transforming that data, and writing output data.
  • The input source and output sink can be the same or of different types
  • Apache Beam programs start by constructing a Pipeline object and then using that object as the basis for creating the pipeline’s datasets, as in the sketch below.
  • Each pipeline represents a single, repeatable job.
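
A minimal sketch using the Beam Python SDK; input.txt and output are placeholder paths:

    import apache_beam as beam

    # Every Beam program starts by constructing a Pipeline object; the
    # `with` block runs the pipeline on exit.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("input.txt")   # input source
            | "Upper" >> beam.Map(str.upper)                 # transform
            | "Write" >> beam.io.WriteToText("output")       # output sink
        )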

PCollection

  • Represents a distributed, multi-element dataset that acts as the pipeline’s data.
  • Apache Beam transforms use PCollection objects as inputs and outputs for each step in the pipeline.
  • A PCollection can hold a dataset of a fixed size or an unbounded dataset from a continuously updating data source.
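
A short sketch of a bounded PCollection built from an in-memory list:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        # A bounded PCollection of fixed size; every step's input and
        # output is a PCollection.
        numbers = pipeline | beam.Create([1, 2, 3, 4, 5])
        doubled = numbers | beam.Map(lambda n: n * 2)
        doubled | "Print" >> beam.Map(print)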

Transforms

  • Represents a processing operation that transforms data.
  • Takes one or more PCollections as input, performs a specified operation, and produces one or more PCollections as output.
  • Can perform many kinds of processing operations, as in the sketch below.
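
A sketch chaining two built-in transforms, each consuming and producing a PCollection:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create(["apple", "banana", "avocado"])
            | "StartsWithA" >> beam.Filter(lambda w: w.startswith("a"))
            | "Lengths" >> beam.Map(len)
            | beam.Map(print)   # 5, 7
        )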


ParDo

  • It is the core parallel processing operation in the Apache Beam SDKs.
  • It invokes a user-specified function on each of the elements of the input PCollection.
  • ParDo collects the zero or more output elements into an output PCollection.
  • The ParDo transform processes elements independently and possibly in parallel.
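
A sketch of a user-specified DoFn applied with ParDo; each input element may yield zero or more outputs:

    import apache_beam as beam

    class SplitWords(beam.DoFn):
        # User-specified function invoked on each element of the input
        # PCollection, independently and possibly in parallel.
        def process(self, element):
            for word in element.split():
                yield word   # zero or more output elements per input

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create(["the quick brown fox", ""])
            | beam.ParDo(SplitWords())
            | beam.Map(print)
        )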

Pipeline I/O

  • Lets you read data into your pipeline and write output data from your pipeline.
  • Consists of a source and a sink.
  • You can also write a custom I/O connector.
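
A sketch using the built-in text connectors; the gs:// paths are placeholders:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Source" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
            | "Lengths" >> beam.Map(lambda line: str(len(line)))
            | "Sink" >> beam.io.WriteToText("gs://my-bucket/output/lengths")
        )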

Aggregation

  • The process of computing some value from multiple input elements.
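
A sketch of per-key aggregation with CombinePerKey:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create([("a", 1), ("b", 2), ("a", 3)])
            | beam.CombinePerKey(sum)   # one value per key from many elements
            | beam.Map(print)           # ('a', 4), ('b', 2)
        )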


Side input

  • Can be a static value such as a constant.
  • Can also be a list or map. If the side input is a PCollection, first convert it to a view (list, map, or singleton) and pass that view as the side input.
  • In the Java SDK, call ParDo’s withSideInputs with the map or list view; in the Python SDK, pass the view as an extra argument, as in the sketch below.
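
A Python sketch where a PCollection is converted to a singleton view and passed as a side input; the average-length logic is illustrative:

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        words = pipeline | beam.Create(["a", "bb", "ccc"])
        # Convert the PCollection into a side-input view first.
        avg_len = words | beam.Map(len) | beam.combiners.Mean.Globally()
        (
            words
            | beam.Filter(lambda w, avg: len(w) >= avg,
                          avg=beam.pvalue.AsSingleton(avg_len))
            | beam.Map(print)   # 'bb', 'ccc' (average length is 2.0)
        )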


MapReduce

  • Map operates in parallel; reduce aggregates based on key.
  • ParDo acts on one item at a time, similar to the map operation in MapReduce, and should not keep state/history across elements. Useful for filtering and mapping.
  • In Python, mapping is done with Map for 1:1 output and FlatMap for non-1:1 output; in Java it is done with ParDo. See the sketch below.
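
A word-count sketch showing the map side (Map/FlatMap) and the reduce side (per-key aggregation):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create(["to be", "or not to be"])
            | beam.FlatMap(str.split)      # non-1:1: one line -> many words
            | beam.Map(lambda w: (w, 1))   # 1:1: word -> (word, 1)
            | beam.CombinePerKey(sum)      # reduce: aggregate by key
            | beam.Map(print)
        )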

User-defined functions (UDFs)

  • Apache Beam allows executing user-defined code to configure a transform.
  • For ParDo, user-defined code specifies the operation to apply to every element.
  • UDFs can be written in a different language than the language of the runner.
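
A sketch of a user-defined function configuring a transform other than ParDo, here Partition; the parity logic is illustrative:

    import apache_beam as beam

    def by_parity(n, num_partitions):
        # User-defined code that configures the Partition transform.
        return n % num_partitions

    with beam.Pipeline() as pipeline:
        evens, odds = (
            pipeline
            | beam.Create([1, 2, 3, 4])
            | beam.Partition(by_parity, 2)
        )
        odds | beam.Map(print)   # 1, 3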

Runner

  • The software that accepts a pipeline and executes it.
  • Runners are translators or adapters to massively parallel big-data processing systems.
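
A sketch of selecting the runner through pipeline options; the commented Dataflow settings are placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # DirectRunner executes locally; DataflowRunner submits the same
    # pipeline to Google Cloud Dataflow.
    options = PipelineOptions(
        runner="DirectRunner",
        # runner="DataflowRunner", project="my-project",
        # region="us-central1", temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as pipeline:
        pipeline | beam.Create([1, 2, 3]) | beam.Map(print)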

Event time

  • The time a data event occurs, determined by the timestamp on the data element itself.
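
A sketch attaching each element's own event time as its Beam timestamp; the Unix timestamps are placeholders:

    import apache_beam as beam
    from apache_beam.transforms.window import TimestampedValue

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create([("click", 1609459200), ("view", 1609459260)])
            # Use the event's own timestamp field as the element timestamp.
            | beam.Map(lambda e: TimestampedValue(e, e[1]))
            | beam.Map(print)
        )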


Windowing

  • Enables grouping operations over unbounded collections.
  • Divides the unbounded collection into finite windows, as in the sketch below.
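
A sketch dividing a collection into fixed 60-second windows and counting per window; the data and timestamps are illustrative:

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | beam.Create([("user1", 0), ("user1", 30), ("user1", 90)])
            | beam.Map(lambda e: TimestampedValue(e, e[1]))
            | beam.WindowInto(FixedWindows(60))   # finite 60s windows
            | beam.Map(lambda e: (e[0], 1))
            | beam.CombinePerKey(sum)             # count per user per window
            | beam.Map(print)                     # ('user1', 2), ('user1', 1)
        )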

Watermarks

  • Apache Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline.


Trigger

  • Triggers determine when to emit aggregated results as data arrives.
  • For bounded data, results are emitted after all of the input has been processed.
  • For unbounded data, results are emitted when the watermark passes the end of the window. A non-default trigger can emit earlier or later, as in the sketch below.
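
A minimal sketch of a non-default trigger configuration; it would be applied to an unbounded, timestamped PCollection and also emits early, speculative results every 30 seconds of processing time before the watermark passes the end of the window:

    import apache_beam as beam
    from apache_beam.transforms.window import FixedWindows
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterProcessingTime, AfterWatermark)

    # Emit early results every 30s of processing time, then the final
    # result once the watermark passes the end of each 60s window.
    windowed = beam.WindowInto(
        FixedWindows(60),
        trigger=AfterWatermark(early=AfterProcessingTime(30)),
        accumulation_mode=AccumulationMode.ACCUMULATING,
    )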