Key Concepts: Google Professional Data Engineer (GCP)
Pipelines
- A pipeline encapsulates the entire process of reading input data, transforming that data, and writing output data.
- The input source and output sink can be of the same or different types.
- Apache Beam programs start by constructing a Pipeline object, then using that object as the basis for creating the pipeline's datasets.
- Each pipeline represents a single, repeatable job (see the sketch below).
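A minimal sketch of this flow using the Beam Python SDK; the element values and step labels are illustrative only:

```python
import apache_beam as beam

# A pipeline is a single, repeatable job: construct it, apply transforms, run it.
with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.Create(['alpha', 'beta', 'gamma'])  # input (in-memory here)
     | 'Transform' >> beam.Map(str.upper)                 # processing step
     | 'Write' >> beam.Map(print))                        # output sink (stdout here)
```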
 
PCollection
- It represents a distributed, multi-element dataset
- and acts as the pipeline's data.
- Apache Beam transforms use PCollection objects as inputs and outputs for each step in the pipeline.
- A PCollection can hold a dataset of a fixed size (bounded) or an unbounded dataset from a continuously updating data source (see the sketch below).
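A small illustration, assuming an in-memory list as the bounded source; an unbounded PCollection would instead come from a streaming source such as Pub/Sub:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # A bounded PCollection of fixed size, built from an in-memory list.
    numbers = p | beam.Create([1, 2, 3, 4])

    # The same PCollection can serve as input to multiple downstream steps.
    squares = numbers | 'Square' >> beam.Map(lambda x: x * x)
    doubles = numbers | 'Double' >> beam.Map(lambda x: x * 2)
```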
 
Transforms
- Represents a processing operation that transforms data.
- Takes one or more PCollections as input, performs a specified operation, and produces one or more PCollections as output.
- Can perform many kinds of processing operations (see the Filter sketch below).
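For instance, the built-in Filter transform takes one PCollection and produces a new one; the sample words here are illustrative:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(['cat', 'elephant', 'dog'])
    # The transform consumes one PCollection and produces a new PCollection.
    long_words = words | 'KeepLong' >> beam.Filter(lambda w: len(w) > 3)
    long_words | beam.Map(print)  # -> elephant
```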
 
ParDo
- It is the core parallel processing operation in the Apache Beam SDKs.
- It invokes a user-specified function on each of the elements of the input PCollection.
- ParDo collects the zero or more output elements into an output PCollection.
- The ParDo transform processes elements independently and possibly in parallel (see the DoFn sketch below).
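A minimal DoFn sketch in the Python SDK; the class name and sample elements are made up for illustration:

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    """User-specified function invoked on each element of the input PCollection."""
    def process(self, element):
        # Yield zero or more output elements per input element.
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    lines = p | beam.Create(['to be or', 'not to be'])
    words = lines | beam.ParDo(SplitWords())  # elements processed independently, possibly in parallel
    words | beam.Map(print)
```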
 
Pipeline I/O
- Lets you read data into your pipeline and write output data from your pipeline.
- Consists of a source and a sink.
- You can also write a custom I/O connector; the built-in text connectors are sketched below.
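A sketch using the built-in text connectors; the gs:// paths are placeholders, not real buckets:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Source' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')        # read into the pipeline
     | 'Lengths' >> beam.Map(len)
     | 'Sink' >> beam.io.WriteToText('gs://my-bucket/output/line-lengths'))  # write out of the pipeline
```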
 
Aggregation
- The process of computing some value from multiple input elements (a per-key sum is sketched below).
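For example, a per-key sum over key/value pairs; the values are illustrative:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    sales = p | beam.Create([('a', 3), ('b', 5), ('a', 7)])
    # Aggregation: compute one value per key from multiple input elements.
    totals = sales | beam.CombinePerKey(sum)  # -> ('a', 10), ('b', 5)
    totals | beam.Map(print)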
 
Side input
- Can be a static value, such as a constant.
- Can also be a list or map. If the side input is a PCollection, first convert it to a list or map view and pass that view as the side input.
- In Java, call ParDo.withSideInputs with the map or list view; the Python equivalent is sketched below.
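In the Python SDK the view is passed as an extra argument to the per-element function; a minimal sketch with made-up stop words:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | 'Words' >> beam.Create(['the', 'quick', 'brown', 'fox'])
    stop = p | 'Stop' >> beam.Create(['the', 'a', 'an'])

    # Convert the stop-word PCollection to a list view and pass it as a side input.
    kept = words | beam.Filter(
        lambda word, stop_list: word not in stop_list,
        stop_list=beam.pvalue.AsList(stop))
    kept | beam.Map(print)
```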
 
Mapreduce
- Map operates in parallel; Reduce aggregates based on key.
- ParDo acts on one item at a time, similar to the Map operation in MapReduce, and should not rely on state/history. Useful for filtering and mapping.
- In Python, mapping is done with Map for 1:1 output and FlatMap for non-1:1 output (sketch below); in Java, it is done with ParDo.
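A short comparison of the two in Python; the sample strings are illustrative:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | beam.Create(['hello world', 'goodbye'])

    # Map: exactly one output per input element (1:1).
    lengths = lines | beam.Map(len)                     # -> 11, 7

    # FlatMap: zero or more outputs per input element (non 1:1).
    words = lines | beam.FlatMap(lambda l: l.split())   # -> 'hello', 'world', 'goodbye'
```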
 
User-defined functions (UDFs)
- Apache Beam allows executing user-defined code to configure a transform.
- For ParDo, the user-defined code specifies the operation to apply to every element.
- UDFs can be written in a different language than the language of the runner.
 
Runner
- The software that accepts a pipeline and executes it.
- Runners are translators or adapters to massively parallel big-data processing systems (a runner-selection sketch follows).
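Choosing a runner is a pipeline-options setting; the project, region, and bucket names below are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',             # or 'DirectRunner' to execute locally
    project='my-gcp-project',            # placeholder project id
    region='us-central1',
    temp_location='gs://my-bucket/tmp')  # placeholder bucket

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```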
 
Event time
- The time a data event occurs, determined by the timestamp on the data element itself (a timestamp-assignment sketch follows).
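A sketch of attaching event-time timestamps carried in the data itself; the (value, unix_seconds) element shape is an assumption for illustration:

```python
import apache_beam as beam
from apache_beam import window

class AddEventTimestamp(beam.DoFn):
    def process(self, element):
        # The event time comes from the data element, not from processing time.
        value, event_seconds = element
        yield window.TimestampedValue(value, event_seconds)

with beam.Pipeline() as p:
    events = p | beam.Create([('click', 1700000000), ('view', 1700000060)])
    stamped = events | beam.ParDo(AddEventTimestamp())
```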
 
Windowing
- Enables grouping operations over unbounded collections.
- Divides the collection into windows of finite collections (sketch below).
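A sketch that divides a collection into fixed 60-second windows and aggregates within each window; the keys and timestamps are illustrative:

```python
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    events = (p
              | beam.Create([('user1', 1, 0), ('user1', 1, 30), ('user1', 1, 90)])
              | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2])))
    per_window = (events
                  | beam.WindowInto(window.FixedWindows(60))  # finite 60-second windows
                  | beam.CombinePerKey(sum))                  # aggregated per key, per window
    per_window | beam.Map(print)  # -> ('user1', 2) for [0, 60), ('user1', 1) for [60, 120)
```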
 
Watermarks
- Apache Beam tracks a watermark, which is the system's notion of when all data in a certain window can be expected to have arrived in the pipeline.
 
Trigger
- Triggers determine when to emit aggregated results as data arrives.
- For bounded data, results are emitted after all of the input has been processed.
- For unbounded data, results are emitted when the watermark passes the end of the window (a trigger configuration sketch follows).
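A configuration sketch, assuming fixed 60-second windows with speculative early firings every 30 seconds of processing time; the data is illustrative:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

with beam.Pipeline() as p:
    events = (p
              | beam.Create([('user1', 1, 10), ('user1', 1, 50)])
              | beam.Map(lambda e: window.TimestampedValue((e[0], e[1]), e[2])))
    results = (events
               | beam.WindowInto(
                   window.FixedWindows(60),
                   # Emit when the watermark passes the end of the window,
                   # plus early results every 30 s of processing time.
                   trigger=AfterWatermark(early=AfterProcessingTime(30)),
                   accumulation_mode=AccumulationMode.DISCARDING)
               | beam.CombinePerKey(sum))
    results | beam.Map(print)
```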
 