Streaming Pipeline Google Professional Data Engineer GCP


In this article, we will learn the concepts of streaming pipelines for the Google Professional Data Engineer (GCP) exam.

Streaming pipelines

In streaming pipelines, data is represented by unbounded PCollections, or unbounded collections. An unbounded collection holds data from a continuously updating source, such as Pub/Sub. You cannot use only a key to group elements in an unbounded collection: because the data source constantly adds new elements, there may be infinitely many elements for a given key in streaming data. To aggregate elements in unbounded collections, you can use windows, watermarks, and triggers. Windows also apply to the bounded PCollections that represent data in batch pipelines.

Windows and windowing functions

Windowing functions divide unbounded collections into logical components, or windows. They group the elements of an unbounded collection by the timestamps of the individual elements, so each window contains a finite number of elements.

With the Apache Beam SDK or Dataflow SQL streaming extensions, you can define the following kinds of windows:

  • Tumbling windows (called fixed windows in Apache Beam): A tumbling window represents a consistent time interval in the data stream. Tumbling windows do not overlap.
  • Hopping windows (called sliding windows in Apache Beam): A hopping window also represents a consistent time interval in the data stream, but whereas tumbling windows are disjoint, hopping windows can overlap.
  • Session windows: A session window contains elements that arrive within a gap duration of one another. The gap duration is the interval between new data in the stream; if data arrives after the gap duration, it is assigned to a new window.
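The three window types above can be illustrated with a plain-Python sketch of the window-assignment logic (this is not Apache Beam code, and the sizes, periods, and gap values below are arbitrary illustrative numbers):

```python
from typing import List, Tuple

def tumbling_window(ts: float, size: float) -> Tuple[float, float]:
    """A timestamp falls into exactly one fixed, non-overlapping window."""
    start = ts - (ts % size)
    return (start, start + size)

def hopping_windows(ts: float, size: float, period: float) -> List[Tuple[float, float]]:
    """A timestamp falls into every overlapping window of length `size`
    that starts every `period` seconds and still covers `ts`."""
    windows = []
    start = ts - (ts % period)        # most recent window start at or before ts
    while start > ts - size:          # earlier starts whose window still covers ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

def session_windows(timestamps: List[float], gap: float) -> List[Tuple[float, float]]:
    """Group timestamps into sessions: a silence longer than `gap`
    closes the current session and starts a new one."""
    sessions: List[Tuple[float, float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the current session
        else:
            sessions.append((ts, ts))             # gap exceeded: new session
    return sessions
```

For example, a timestamp of 65 seconds lands in the single 60-second tumbling window starting at 60, but in three 30-second hopping windows that start every 10 seconds.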

Watermarks

A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If data arrives with a timestamp that is inside the window but older than the watermark, it is considered late data.

Dataflow tracks watermarks for the following reasons:

  • Data is not guaranteed to arrive in time order or at predictable intervals.
  • Data events are not guaranteed to appear in the pipeline in the same order in which they were generated.
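The late-data rule described above can be sketched in plain Python; the watermark value used here is just an illustrative number, not something an application would compute itself:

```python
def is_late(element_ts: float, watermark: float) -> bool:
    """An element is late if its event timestamp is older than the
    watermark Dataflow has already advanced past."""
    return element_ts < watermark

# Suppose the watermark has advanced to t=100: Dataflow estimates that all
# data with timestamps below 100 should already have arrived.
watermark = 100.0
print(is_late(95.0, watermark))   # arrives now with an older timestamp: late
print(is_late(105.0, watermark))  # timestamp ahead of the watermark: on time
```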

Triggers

Triggers determine when to emit aggregated results as data arrives. By default, results are emitted when the watermark passes the end of the window. You can use the Apache Beam SDK to create or modify triggers for each collection in a streaming pipeline; you cannot set triggers with Dataflow SQL. The Apache Beam SDK can set triggers that operate on any combination of the following conditions:

  • Event time, as indicated by the timestamp on each data element.
  • Processing time, which is the time at which a data element is processed at any given stage in the pipeline.
  • The number of data elements in a collection.
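In the Apache Beam SDK these conditions correspond to trigger classes such as `AfterWatermark`, `AfterProcessingTime`, and `AfterCount`. The element-count condition can be sketched in plain Python (a simulation of the idea, not Beam's actual trigger machinery):

```python
from typing import List

def fire_panes(element_times: List[float], count_threshold: int) -> List[List[float]]:
    """Simulate a data-driven trigger (the AfterCount idea): emit an early
    result pane each time `count_threshold` elements accumulate, then flush
    whatever remains when the window closes (the default watermark firing)."""
    panes: List[List[float]] = []
    pending: List[float] = []
    for t in element_times:
        pending.append(t)
        if len(pending) >= count_threshold:   # count condition met: early firing
            panes.append(pending)
            pending = []
    if pending:                               # window closes: final firing
        panes.append(pending)
    return panes

print(fire_panes([1, 2, 3, 4, 5], 2))  # → [[1, 2], [3, 4], [5]]
```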
