Dataproc Overview Google Professional Data Engineer GCP

  • A managed Spark and Hadoop service
  • Use for batch processing, querying, streaming, and machine learning.
  • Automation helps you:
    • create clusters quickly
    • manage them easily
    • save money by turning clusters off when you don’t need them.
  • Use cases:
    • Move Hadoop and Spark clusters to the cloud
    • Data science on Dataproc
  • Apache Hadoop ecosystem components are automatically installed on the cluster.
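  • Initialization actions run scripts on each node during cluster creation to install or configure additional software.
  • Clusters can be provisioned with a custom image that has software pre-installed, which gives faster cluster startup times than repeating the same setup in initialization actions.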
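  • Dataproc manages the addition and deletion of preemptible nodes.
    • A cluster cannot consist of preemptible workers only; at least one regular (non-preemptible) worker is needed.
    • The additional (secondary) workers can be preemptible.
    • Preemptible worker nodes do not store HDFS data; they are used for processing only.
    • Preemptible workers use the same machine configuration as the regular worker nodes.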
  • Web interfaces on the master node use TCP ports 8088 (Hadoop YARN ResourceManager), 9870 (HDFS NameNode), and 8080 (Datalab).
  • Access Dataproc:
    • Through the REST API
    • Using the Cloud SDK
    • Using the Dataproc UI
    • Through the Cloud Client Libraries
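
To make the REST API route and the preemptible worker notes more concrete, here is a minimal sketch of the URL and JSON body for a `clusters.create` request. It only builds the request and sends nothing. The project ID, region, cluster name, machine type, and worker counts are placeholder assumptions, the helper name `build_cluster_request` is hypothetical, and the regular-worker check simply mirrors the note above rather than any documented client-side validation.

```python
import json


def build_cluster_request(project_id, region, cluster_name,
                          regular_workers=2, preemptible_workers=2):
    """Return (url, body) for a Dataproc clusters.create REST call.

    Hypothetical helper for illustration; all argument values are assumptions.
    """
    # Mirrors the note that a cluster cannot be made of preemptible workers only.
    if regular_workers < 1:
        raise ValueError("at least one regular (non-preemptible) worker is needed")

    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/regions/{region}/clusters"
    )
    body = {
        "projectId": project_id,
        "clusterName": cluster_name,
        "config": {
            "masterConfig": {
                "numInstances": 1,
                "machineTypeUri": "n1-standard-4",  # placeholder machine type
            },
            # Regular workers: these hold the HDFS data.
            "workerConfig": {
                "numInstances": regular_workers,
                "machineTypeUri": "n1-standard-4",
            },
            # Secondary workers: processing only. No machine type is set here
            # because they follow the regular workers' machine type.
            "secondaryWorkerConfig": {
                "numInstances": preemptible_workers,
                "preemptibility": "PREEMPTIBLE",
            },
        },
    }
    return url, body


if __name__ == "__main__":
    # Placeholder identifiers; replace with real values before use.
    url, body = build_cluster_request("my-project", "us-central1", "example-cluster")
    print("POST", url)
    print(json.dumps(body, indent=2))
```

An authenticated POST of this body to the URL would be the actual creation step. The Cloud SDK and the Cloud Client Libraries wrap the same request, so the field names carry over.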