Configure Dataproc Cluster and Submit Job – Google Professional Data Engineer GCP


Configuration terms

Cluster Region:

  • specify either the global region or a specific region for the cluster.
  • The global region is a special multi-region endpoint that can deploy instances into any user-specified Compute Engine zone.
  • A specific region (for example, us-central1) can also be specified to keep cluster resources within that region (see the gcloud sketch below).
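
A minimal gcloud sketch of setting the region at cluster creation; the cluster name my-cluster, region us-central1, and zone us-central1-a are placeholder values:

    # Create a cluster in a specific region and zone (placeholder names)
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --zone=us-central1-a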

 

Compute Engine Virtual Machine instances (VMs)

  • a cluster consists of master and worker VMs,
  • which require full internal IP networking access to each other.
  • The default network, used when no network is specified at cluster creation, helps ensure this access (see the sketch below).
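
A hedged sketch of placing the cluster VMs on a particular VPC network; the network name default and the other values are placeholders:

    # Place master and worker VMs on a network that allows
    # full internal IP connectivity between them
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --network=default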

Labels –

  • apply user labels to cluster and job resources
  • to group resources and related operations for later filtering and listing.
  • associate labels when the resource is created, i.e. at cluster creation or job submission.
  • Labels are propagated to operations performed on the resource (see the example after this list).
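
A minimal sketch of attaching and filtering by labels with gcloud; the label keys env and team, the bucket path, and the resource names are placeholder values:

    # Attach labels at cluster creation
    gcloud dataproc clusters create my-cluster \
        --region=us-central1 \
        --labels=env=dev,team=data

    # Attach a label at job submission, then list jobs carrying it
    gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
        --cluster=my-cluster \
        --region=us-central1 \
        --labels=env=dev
    gcloud dataproc jobs list --region=us-central1 --filter='labels.env = dev'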

 

Cluster Update / Delete

  • a cluster can be updated through the Dataproc API, gcloud, or the Configuration tab of the Cluster details page for the cluster in the Google Cloud Console.
  • The following can be updated (see the gcloud sketch after this list):
    • the number of standard worker nodes in a cluster
    • the number of secondary worker nodes in a cluster
    • whether to use graceful decommissioning to control shutting down a worker after its jobs are completed
    • adding or deleting cluster labels
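
A hedged gcloud sketch of the update operations above; the cluster name, worker counts, timeout, and label values are placeholders:

    # Scale down primary workers, letting each worker finish its
    # running jobs before shutdown (graceful decommissioning)
    gcloud dataproc clusters update my-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --graceful-decommission-timeout=1h

    # Resize secondary workers
    gcloud dataproc clusters update my-cluster \
        --region=us-central1 \
        --num-secondary-workers=0

    # Add and delete cluster labels
    gcloud dataproc clusters update my-cluster \
        --region=us-central1 \
        --update-labels=env=prod \
        --remove-labels=team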

Deleting a cluster –

  • delete a cluster through
    • the Dataproc API
    • gcloud
    • the Google Cloud Console (a gcloud sketch follows).
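
A minimal sketch with gcloud; my-cluster and us-central1 are placeholder values:

    gcloud dataproc clusters delete my-cluster --region=us-central1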

Submit a job

  • submit a job to an existing cluster through the Dataproc API, gcloud, or the Google Cloud Console (see the sketch after this list).
  • can also SSH into the cluster's master instance and run a job directly there.
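
A hedged sketch of both approaches; the bucket path, cluster name, region, and zone are placeholder values, and Dataproc names the master VM with an -m suffix:

    # Submit a PySpark job through gcloud
    gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
        --cluster=my-cluster \
        --region=us-central1

    # Alternatively, SSH into the master VM and run the job there
    gcloud compute ssh my-cluster-m --zone=us-central1-a
    # ...then, on the master:
    spark-submit job.py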

Log and Monitor

Job and cluster logs can be viewed, searched, filtered, and archived in Cloud Logging (a gcloud sketch follows).
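
A hedged sketch of reading Dataproc cluster logs from Cloud Logging with gcloud; the cluster name in the filter is a placeholder:

    gcloud logging read \
        'resource.type=cloud_dataproc_cluster AND resource.labels.cluster_name=my-cluster' \
        --limit=20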
