Dataprep Overview Google Professional Data Engineer GCP
- an intelligent data service
 - used to visually explore, clean, and prepare data that is not yet ready for analysis.
 
Flow
- a container for holding one or more datasets, associated recipes and other objects.
 - a means of packaging Cloud Dataprep objects for the following actions:
- Creating relationships between datasets, their recipes, and other datasets.
 - Sharing with other users
 - Copying
 - Execution of pre-configured jobs
 - Creating references between recipes and external flows
 
 
Imported Dataset
- a reference to the original data
 - the data does not exist within the platform.
 - can be a reference to a file, multiple files, database table, or other type of data.
 - is a pointer to a source of data.
 - An imported dataset can be referenced in recipes.
 - Imported datasets are created through the Import Data Page.
 
Recipe
- a user-defined sequence of steps to transform a dataset.
 - A recipe object is created from an imported dataset or another recipe.
 - You can create a recipe from another recipe to chain recipes together.
 - Recipes are interpreted by Cloud Dataprep by Trifacta and turned into commands that can be executed against data.
 - When initially created, a recipe contains no steps.
 - Recipes are augmented and modified using the various visual tools in the Transformer Page.
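
The relationship between an imported dataset and chained recipes can be pictured with a small Python model. This is only an illustrative sketch, not Cloud Dataprep's API: the `ImportedDataset` and `Recipe` classes, the `load` callable, and the step functions are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


@dataclass
class ImportedDataset:
    """Hypothetical stand-in: a pointer to source data, not the data itself."""
    load: Callable[[], List[Row]]  # the source is read only when asked

    def rows(self) -> List[Row]:
        return self.load()


@dataclass
class Recipe:
    """Hypothetical stand-in: an ordered list of steps over a parent object."""
    parent: Union[ImportedDataset, "Recipe"]
    steps: List[Step] = field(default_factory=list)  # a new recipe has no steps

    def rows(self) -> List[Row]:
        data = self.parent.rows()  # upstream steps are applied first
        for step in self.steps:
            data = step(data)
        return data


def trim_names(rows: List[Row]) -> List[Row]:
    return [{**r, "name": r["name"].strip()} for r in rows]


def drop_empty_names(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["name"]]


source = ImportedDataset(load=lambda: [{"name": " Ada "}, {"name": "  "}])
recipe_1 = Recipe(parent=source)            # created from an imported dataset
recipe_1.steps.append(trim_names)
recipe_2 = Recipe(parent=recipe_1, steps=[drop_empty_names])  # chained recipe

print(recipe_2.rows())
```

Because `recipe_2` pulls its input from `recipe_1`, its result includes every upstream step, which is the idea behind chaining.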
 
In a flow, the following objects are associated with each recipe:
- Outputs
 - References
 
Outputs
- contain one or more publishing destinations
 - which define the output format, location, and other publishing options
 - applied to the results generated from a job run on the recipe.
 
References
- let you use the output of a recipe’s steps in another dataset.
 - When you select a recipe’s reference object, you can add it to another flow.
 - A reference dataset is a read-only version of the output data generated from the execution of a recipe’s steps.
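
The read-only nature of a reference dataset can be illustrated in plain Python. This is a sketch under stated assumptions, not how Cloud Dataprep implements references: `make_reference` is a hypothetical helper and the rows are made-up sample data.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


def make_reference(output_rows: List[Dict[str, str]]) -> Tuple[Mapping[str, str], ...]:
    """Return a read-only snapshot of a recipe's output (hypothetical helper)."""
    return tuple(MappingProxyType(dict(row)) for row in output_rows)


recipe_output = [{"city": "Oslo"}, {"city": "Lima"}]
reference = make_reference(recipe_output)

# Another flow can read the referenced rows and build its own steps on them...
upper = [{"city": row["city"].upper()} for row in reference]

# ...but it cannot change the referenced output itself.
try:
    reference[0]["city"] = "Rome"
    modified = True
except TypeError:
    modified = False

print(upper, modified)
```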
 
Samples
- A sample is a subset of the entire dataset.
 - For smaller datasets, the sample may be the entire dataset.
 - As you build or modify a recipe, the results of each modification are immediately reflected in the sampled data.
 - You can generate additional samples.
 
Macros
- are reusable sequences of steps that can be parameterized for use in other recipes.
 
Run Jobs
A job may be composed of one or more of the following job types:
- Transform job: Executes the recipe steps, which you defined against the sample(s), across the entire dataset to generate the transformed results.
 - Profile job: Optionally generates a visual profile of the results of the transform job.
 - When a job completes, you can review the resulting data and identify data that still needs fixing.
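
The difference between working on a sample and running a job can be sketched as follows. The code is illustrative only and does not use any Cloud Dataprep API: `run_steps`, `profile`, and `uppercase_country` are hypothetical functions, and the four-row dataset is invented for the example.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def run_steps(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each recipe step in order (hypothetical transform job)."""
    for step in steps:
        rows = step(rows)
    return rows


def profile(rows: List[Row], column: str) -> Dict[str, object]:
    """Summarize one column of the results (hypothetical profile job)."""
    values = [row.get(column, "") for row in rows]
    filled = [v for v in values if v]
    return {
        "rows": len(values),
        "missing": len(values) - len(filled),
        "most_common": Counter(filled).most_common(1),
    }


def uppercase_country(rows: List[Row]) -> List[Row]:
    return [{**row, "country": row["country"].upper()} for row in rows]


full_dataset = [{"country": "us"}, {"country": "US"}, {"country": ""}, {"country": "de"}]
steps = [uppercase_country]

sample = full_dataset[:2]                 # the recipe is built against a sample
preview = run_steps(sample, steps)        # what you see while editing
results = run_steps(full_dataset, steps)  # transform job: the entire dataset
summary = profile(results, "country")     # optional profile job

print(preview)
print(summary)
```

Here the profile reports one missing value that the two-row sample never showed, which is the kind of issue a completed job can surface.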
 
Schedules
- Associate a schedule with a flow.
 - A schedule is a combination of one or more triggers and the outputs.
 - A flow can have only one schedule associated with it.
 - A trigger is a scheduled time of execution.
 - A schedule can have multiple triggers associated with it.
 - A recipe can have only one scheduled destination.
 - Each recipe in a flow can have a scheduled destination.
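
The cardinality rules above (one schedule per flow, many triggers per schedule, at most one scheduled destination per recipe) can be expressed as a small model. This is a hypothetical sketch, not Cloud Dataprep's API: the class names, the trigger strings, and the `gs://example-bucket/...` paths are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Schedule:
    # Trigger format is an assumption; each entry is one scheduled execution time.
    triggers: List[str] = field(default_factory=list)


@dataclass
class Flow:
    name: str
    recipes: List[str]
    schedule: Optional[Schedule] = None
    scheduled_destinations: Dict[str, str] = field(default_factory=dict)

    def set_schedule(self, schedule: Schedule) -> None:
        if self.schedule is not None:
            raise ValueError("a flow can have only one schedule")
        self.schedule = schedule

    def set_scheduled_destination(self, recipe: str, destination: str) -> None:
        if recipe not in self.recipes:
            raise KeyError(recipe)
        if recipe in self.scheduled_destinations:
            raise ValueError("a recipe can have only one scheduled destination")
        self.scheduled_destinations[recipe] = destination


flow = Flow(name="sales", recipes=["clean", "aggregate"])
flow.set_schedule(Schedule(triggers=["daily 02:00", "weekly Mon 06:00"]))
flow.set_scheduled_destination("clean", "gs://example-bucket/clean/")
flow.set_scheduled_destination("aggregate", "gs://example-bucket/agg/")

errors = []
for attempt in (
    lambda: flow.set_schedule(Schedule(triggers=["hourly"])),
    lambda: flow.set_scheduled_destination("clean", "gs://example-bucket/other/"),
):
    try:
        attempt()
    except ValueError as exc:
        errors.append(str(exc))

print(errors)
```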
 
Example

| Type | Datasets | Description |
| --- | --- | --- |
| Standard job execution | Recipe 1/Job 1 | Results of the job are used to create a new imported dataset (I-Dataset 2). |
| Create dataset from generated results | Recipe 2/Job 2 | Recipe 2 is created off of I-Dataset 2 and then modified. A job has been specified for it, but the results of the job are unused. |
| Chaining datasets | Recipe 3/Job 3 | Recipe 3 is chained off of Recipe 2. The results of running jobs off of Recipe 2 include all of the upstream changes as specified in I-Dataset 1/Recipe 1 and I-Dataset 2/Recipe 2. |
| Reference dataset | Recipe 4/Job 4 | I-Dataset 4 is created as a reference off of Recipe 3. It can have its own recipe, job, destinations, and results. |

Workflow
Basic Workflow
- Review object overview:
 - Import data
 - Profile data
 - Build transform recipes
 - Run job
 - Export results
 