Cloud Dataflow Google Professional Data Engineer GCP

  • Fully managed service.
  • Executes data processing patterns.
  • Uses the Apache Beam SDK.
  • Can develop both batch and streaming pipelines (a minimal example follows this list).
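
A minimal sketch of a batch Beam pipeline (Python SDK), assuming hypothetical gs:// paths, that reads text files, counts words, and writes sharded output:

```python
import apache_beam as beam

# Read text, count words, write results. The gs:// paths are hypothetical placeholders.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadLines" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "SumPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")
        | "WriteCounts" >> beam.io.WriteToText("gs://my-bucket/output/counts")  # sharded by default
    )
```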

Pipeline

  • It manages data, similar to a factory assembly line.
  • Tasks include
    • Transforming data
    • Aggregating data and computing results.
    • Enriching data.
    • Moving data.
  • Dataflow views all data as streaming (batch is simply a bounded stream).
  • It breaks dataflows into smaller units
  • Tasks are written in Java or Python using the Beam SDK.
  • Jobs are executed in parallel
  • Same code is used for streaming and batch
  • Pipeline is a directed graph of steps
  • Source/sink can be a filesystem, Cloud Storage (GCS), BigQuery, or Pub/Sub.
  • Runner can be a local machine (DirectRunner) or the Dataflow service in the cloud (see the runner-options sketch after this list).
  • Output data written can be sharded or unsharded.
  • Inputs and outputs are PCollections; a PCollection is not necessarily held in memory and can be unbounded.
  • Each transform is given a name.
  • Read from a source, write to a sink.
  • Common tasks
    • Convert incoming data to a common format.
    • Prepare data for analysis and visualization.
    • Migrate between databases.
    • Share data processing logic across web apps, batch jobs, and APIs.
    • Power data ingestion and integration tools.
    • Consume large XML, CSV, and fixed-width files.
    • Replace batch jobs with real-time data.
  • Pipeline components
    • Data nodes – for input data
    • Activities – the definition of the work to do
    • Preconditions
    • Resources
    • Actions – an action that is triggered if conditions are met
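
As noted in the runner bullet above, the same pipeline code can target either a local runner or the Dataflow service just by changing its options. A hedged sketch, assuming hypothetical project, region, and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner runs on the local machine; DataflowRunner submits the job
# to the Cloud Dataflow service. All names below are hypothetical.
options = PipelineOptions(
    runner="DataflowRunner",        # swap to "DirectRunner" for local testing
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "ParseRecord" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 5)  # assumed record width
        | "FormatLine" >> beam.Map(lambda fields: ",".join(fields))
        | "WriteSharded" >> beam.io.WriteToText("gs://my-bucket/clean/part")  # sharded output
    )
```

Because each step is a named transform over a PCollection, these names are what appear as steps in the job graph.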

 

Types

Types of pipelines are

  • Batch
    • useful for processing large volumes of data at a regular interval
    • no real-time processing needed
  • Real-time
    • processes data as it arrives (see the streaming sketch below)
    • reads from a streaming source
  • Cloud native
    • optimized to work with cloud-based data, such as data in AWS S3 buckets
    • tools are hosted in the cloud
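
A sketch of a real-time pipeline, assuming a hypothetical Pub/Sub topic and BigQuery table: messages are read as they arrive, windowed into one-minute intervals, and the per-window count is appended to BigQuery.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Mark the pipeline as streaming; topic, project, and table names are hypothetical.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        | "Window1Min" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "CountPerWindow" >> beam.combiners.Count.Globally().without_defaults()
        | "Format" >> beam.Map(lambda n: {"event_count": n})
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.event_counts",      # hypothetical table
            schema="event_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```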