Datalab Overview Google Professional Data Engineer GCP

  • Cloud Datalab is an interactive tool for large-scale data exploration, analysis, and visualization.
  • It is based on the open source Jupyter project.
  • Cloud Datalab is packaged as a container and runs in a VM instance.
  • It uses notebooks instead of text files containing code.
  • Notebooks combine code, documentation written as Markdown, and the results of code execution.
  • Notebooks help you write code: you execute it interactively and iteratively, and the results are rendered in the notebook.
  • Notebooks can be shared with team members.
  • Import data from flat files, databases, or distributed storage systems
  • Locate and remove or modify missing or mismatched data
  • Unnest complex data structures
  • Identify statistical outliers in data for review and management
  • Perform lookups from one dataset into another reference dataset
  • Aggregate columnar data using a variety of aggregation functions
  • Normalize column values for more consistent usage and statistical modeling
  • Merge datasets with joins
  • Append one dataset to another through union operations
  • Notebooks can be stored in Google Cloud Source Repositories, a Git repository.
  • The Git repository is cloned onto the persistent disk attached to the VM.
  • Notebooks are automatically saved to the persistent disk periodically.
  • Do not delete the persistent disk; it holds the cloned repository and the saved notebooks.
  • The VM used for running Cloud Datalab is a shared resource accessible to all members of the associated cloud project.
  • Results saved in a notebook persist on the disk.
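
Several of the data preparation tasks listed above (importing a flat file, removing missing values, lookups, aggregation, normalization, and union) can be illustrated with a minimal sketch. This is not Datalab-specific code: it uses only the Python standard library so that it runs in any notebook cell, and every dataset, column name, and function name here is a hypothetical example. In practice a notebook would more likely use a library such as pandas for the same steps.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for flat files; all values are made up.
ORDERS_CSV = """order_id,customer_id,amount
1,c1,10.0
2,c2,
3,c1,30.0
4,c3,20.0
"""
MORE_ORDERS_CSV = """order_id,customer_id,amount
5,c2,40.0
"""
# Hypothetical reference dataset used for lookups.
CUSTOMERS = {"c1": "Ada", "c2": "Bob"}


def load_orders(text):
    """Import rows from a flat file (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing amount and convert the rest to float."""
    return [dict(r, amount=float(r["amount"])) for r in rows if r["amount"]]


def with_customer_name(rows, customers):
    """Look up each row's customer in the reference dataset."""
    return [dict(r, name=customers.get(r["customer_id"], "unknown")) for r in rows]


def total_by_customer(rows):
    """Aggregate amounts per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer_id"]] += r["amount"]
    return dict(totals)


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Union: append one dataset to the other, then clean and enrich the result.
orders = clean(load_orders(ORDERS_CSV) + load_orders(MORE_ORDERS_CSV))
orders = with_customer_name(orders, CUSTOMERS)
totals = total_by_customer(orders)
normalized = min_max_normalize([r["amount"] for r in orders])
```

Running the cell leaves `orders`, `totals`, and `normalized` available for inspection in later cells, which is the iterative style of working that notebooks are meant to support.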