Dataprep Overview Google Professional Data Engineer GCP

  • Cloud Dataprep is an intelligent data service for visually exploring, cleaning, and preparing data that is not yet ready for analysis.

Flow

  • A flow is a container for holding one or more datasets, their associated recipes, and other objects.
  • It is a means of packaging Cloud Dataprep objects for the following actions:
    • Creating relationships between datasets, their recipes, and other datasets.
    • Sharing with other users
    • Copying
    • Execution of pre-configured jobs
    • Creating references between recipes and external flows

Imported Dataset

  • An imported dataset is a reference to the original data; the data itself does not exist within the platform.
  • It is a pointer to a source of data, which can be a file, multiple files, a database table, or another type of data.
  • An imported dataset can be referenced in recipes.
  • Imported datasets are created through the Import Data Page.
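
The pointer idea can be illustrated with a small Python sketch. The `ImportedDataset` class, its fields, and the `gs://example-bucket/...` locations are hypothetical and are not part of any Dataprep API; the only point is that the object records where the data lives and holds no rows itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ImportedDataset:
    """Hypothetical model: a named pointer to source data, holding no rows."""
    name: str
    sources: Tuple[str, ...]  # one or more file paths or table identifiers


# Hypothetical locations, used only for illustration.
single = ImportedDataset("orders", ("gs://example-bucket/orders.csv",))
multi = ImportedDataset(
    "logs",
    ("gs://example-bucket/logs-01.csv", "gs://example-bucket/logs-02.csv"),
)
```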

Recipe

  • A recipe is a user-defined sequence of steps to transform a dataset.
  • A recipe object is created from an imported dataset or another recipe.
  • You can create a recipe from another recipe to chain recipes together.
  • Recipes are interpreted by Cloud Dataprep by TRIFACTA INC. and turned into commands that can be executed against data.
  • When initially created, a recipe contains no steps.
  • Recipes are augmented and modified using the various visual tools in the Transformer Page.
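
As a rough mental model, a recipe behaves like an ordered list of functions, and chaining means running the upstream recipe first. The sketch below assumes this simplification; the `Recipe` class, its methods, and the two example steps are hypothetical and do not reflect how Cloud Dataprep stores or executes recipes internally.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


class Recipe:
    """Hypothetical model of a recipe: an ordered list of transform steps."""

    def __init__(self, parent: Optional["Recipe"] = None) -> None:
        self.parent = parent          # upstream recipe when chained, else None
        self.steps: List[Step] = []   # a new recipe starts with no steps

    def add_step(self, step: Step) -> None:
        self.steps.append(step)

    def run(self, rows: List[Row]) -> List[Row]:
        # Apply upstream steps first so a chained recipe sees transformed data.
        if self.parent is not None:
            rows = self.parent.run(rows)
        for step in self.steps:
            rows = step(rows)
        return rows


def drop_blank_names(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["name"].strip()]


def upper_names(rows: List[Row]) -> List[Row]:
    return [{**r, "name": r["name"].upper()} for r in rows]


clean = Recipe()
clean.add_step(drop_blank_names)

shout = Recipe(parent=clean)  # chained off the first recipe
shout.add_step(upper_names)

data = [{"name": "ada"}, {"name": " "}, {"name": "grace"}]
result = shout.run(data)
```

Each step returns new rows instead of mutating its input, so the source data stays untouched, which mirrors the idea that an imported dataset is only referenced.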

In a flow, the following objects are associated with each recipe:

  • Outputs
  • References

Outputs

  • Outputs contain one or more publishing destinations, which define the output format, location, and other publishing options applied to the results generated from a job run on the recipe.

References

  • References allow you to create a reference to the output of the recipe’s steps for use in another dataset.
  • When you select a recipe’s reference object, you can add it to another flow.
  • A reference dataset is a read-only version of the output data generated from the execution of a recipe’s steps.

Samples

  • A sample is a subset of the entire dataset.
  • For smaller datasets, the sample may be the entire dataset.
  • As you build or modify a recipe, the results of each modification are immediately reflected in the sampled data.
  • You can generate additional samples.
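
The relationship between a sample and a small dataset can be shown with a minimal Python sketch. The first-rows strategy and the `limit` parameter are assumptions made for illustration; the sampling methods and size limits that Cloud Dataprep actually uses are not described here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def first_rows_sample(rows: Sequence[T], limit: int) -> List[T]:
    """Return the first `limit` rows, or the whole dataset if it is smaller."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return list(rows[:limit])


large = list(range(10))
small = [1, 2]

sample_of_large = first_rows_sample(large, 3)  # a strict subset
sample_of_small = first_rows_sample(small, 5)  # the entire dataset
```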

 

Macros

  • You can create macros: reusable sequences of steps that can be parameterized for use in other recipes.
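
One way to picture a macro is as a function that takes parameters and returns a ready-made sequence of steps. The sketch below assumes that analogy; `trim_and_fill`, `apply_steps`, and the sample rows are hypothetical names invented for illustration and are not Dataprep features.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def trim_and_fill(column: str, default: str) -> List[Step]:
    """Hypothetical macro: two steps parameterized by column and default."""

    def trim(rows: List[Row]) -> List[Row]:
        return [{**r, column: r[column].strip()} for r in rows]

    def fill(rows: List[Row]) -> List[Row]:
        # Replace empty strings with the supplied default value.
        return [{**r, column: r[column] or default} for r in rows]

    return [trim, fill]


def apply_steps(rows: List[Row], steps: List[Step]) -> List[Row]:
    for step in steps:
        rows = step(rows)
    return rows


# The same macro reused with different parameters.
cities = apply_steps(
    [{"city": " Paris "}, {"city": "  "}], trim_and_fill("city", "unknown")
)
codes = apply_steps([{"code": ""}], trim_and_fill("code", "N/A"))
```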

 

Run Jobs

A job may be composed of one or more of the following job types:

  • Transform job: Executes the set of recipe steps that you defined against the sample(s) across the entire dataset, generating the transformed set of results.
  • Profile job: Optionally generates a visual profile of the results of the transform job.
  • When a job completes, you can review the resulting data and identify data that still needs fixing.
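
The split between the two job types can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `transform_job` and `profile_job` are hypothetical functions, and the profile here is only a count of present and missing values per column, far simpler than a real visual profile.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[str]]
Step = Callable[[List[Row]], List[Row]]


def transform_job(full_dataset: List[Row], steps: List[Step]) -> List[Row]:
    """Apply every recipe step to the entire dataset, not just the sample."""
    rows = full_dataset
    for step in steps:
        rows = step(rows)
    return rows


def profile_job(results: List[Row]) -> Dict[str, Dict[str, int]]:
    """Count present and missing values per column in the transform results."""
    profile: Dict[str, Counter] = {}
    for row in results:
        for column, value in row.items():
            key = "missing" if value in (None, "") else "present"
            profile.setdefault(column, Counter())[key] += 1
    return {column: dict(counts) for column, counts in profile.items()}


def lowercase_email(rows: List[Row]) -> List[Row]:
    # Leave missing values untouched so the profile can still report them.
    return [
        {**r, "email": r["email"].lower() if r["email"] else r["email"]}
        for r in rows
    ]


results = transform_job(
    [{"email": "A@X.COM"}, {"email": ""}, {"email": None}], [lowercase_email]
)
report = profile_job(results)
```

A report like this is what lets you spot values that still need fixing after the job completes.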

 

Schedules

  • A schedule is associated with a flow.
  • A schedule is a combination of one or more triggers and the outputs.
  • A flow can have only one schedule associated with it.
  • A trigger is a scheduled time of execution.
  • A schedule can have multiple triggers associated with it.
  • A recipe can have only one scheduled destination.
  • Each recipe in a flow can have a scheduled destination.
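
The cardinality rules above can be captured in a small Python model. The `Schedule` and `Flow` classes, the trigger strings, and the destination paths are all hypothetical; the sketch only encodes the constraints listed in this section and does not describe how Cloud Dataprep represents schedules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Schedule:
    triggers: List[str]  # hypothetical: each trigger as a time description

    def __post_init__(self) -> None:
        if not self.triggers:
            raise ValueError("a schedule needs at least one trigger")


@dataclass
class Flow:
    name: str
    schedule: Optional[Schedule] = None
    # Maps a recipe name to its single scheduled destination.
    scheduled_destinations: Dict[str, str] = field(default_factory=dict)

    def set_schedule(self, schedule: Schedule) -> None:
        if self.schedule is not None:
            raise ValueError("a flow can have only one schedule")
        self.schedule = schedule

    def set_scheduled_destination(self, recipe: str, destination: str) -> None:
        if recipe in self.scheduled_destinations:
            raise ValueError("a recipe can have only one scheduled destination")
        self.scheduled_destinations[recipe] = destination


flow = Flow("sales")
flow.set_schedule(Schedule(triggers=["daily 02:00", "weekly Mon 06:00"]))
flow.set_scheduled_destination("Recipe 1", "gs://example-bucket/out/recipe1.csv")
flow.set_scheduled_destination("Recipe 2", "gs://example-bucket/out/recipe2.csv")
```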

Example

| Type | Datasets | Description |
| --- | --- | --- |
| Standard job execution | Recipe 1/Job 1 | Results of the job are used to create a new imported dataset (I-Dataset 2). |
| Create dataset from generated results | Recipe 2/Job 2 | Recipe 2 is created off of I-Dataset 2 and then modified. A job has been specified for it, but the results of the job are unused. |
| Chaining datasets | Recipe 3/Job 3 | Recipe 3 is chained off of Recipe 2. The results of running jobs off of Recipe 2 include all of the upstream changes as specified in I-Dataset 1/Recipe 1 and I-Dataset 2/Recipe 2. |
| Reference dataset | Recipe 4/Job 4 | I-Dataset 4 is created as a reference off of Recipe 3. It can have its own recipe, job, destinations, and results. |

 

Workflow

Basic Workflow

  • Review object overview
  • Import data
  • Profile data
  • Build transform recipes
  • Run job
  • Export results

 
