Dataproc Overview (Google Professional Data Engineer, GCP)
- Dataproc is a managed Spark and Hadoop service.
  - Use it for batch processing, querying, streaming, and machine learning.
- Automation helps you:
  - create clusters quickly (see the cluster-creation sketch below)
  - manage them easily
  - save money by turning clusters off when you don’t need them
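A minimal sketch of quick cluster creation with the google-cloud-dataproc Python client library (the gcloud CLI or Console work just as well). The project ID, region, cluster name, and machine types below are placeholder assumptions, not values from the course.

```python
# Minimal sketch: create a Dataproc cluster with the Python client library.
# Assumes `pip install google-cloud-dataproc` and Application Default Credentials.
# Project, region, cluster name, and machine types are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# The cluster controller client must point at the regional endpoint.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "example-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() waits for completion.
operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```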
 
- Use cases:
  - Move existing Hadoop and Spark clusters to the cloud
  - Data science on Dataproc
 
- Apache Hadoop ecosystem components are automatically installed on the cluster.
- Initialization actions run custom scripts on cluster nodes at creation time.
- Clusters can instead be provisioned from a custom image, which gives faster cluster startup.
- Dataproc manages the addition and deletion of preemptible nodes.
- At least one regular (non-preemptible) worker is required.
- Additional workers can be preemptible.
- Preemptible worker nodes do not provide HDFS storage.
- Preemptible workers use the same machine configuration as the regular worker nodes (see the config sketch after this list).
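As a rough sketch of how these options map onto a cluster configuration with the same Python client library, the fragment below adds preemptible secondary workers and an initialization action. The GCS script path and instance counts are assumptions for illustration only.

```python
# Sketch: ClusterConfig fragment with preemptible secondary workers and an
# initialization action (google-cloud-dataproc Python client, dict form).
# The GCS script path and instance counts are placeholders.
cluster_config = {
    # At least one regular (primary) worker group is required.
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    # Preemptible workers: no HDFS storage, same machine type as primary workers.
    "secondary_worker_config": {
        "num_instances": 4,
        "preemptibility": "PREEMPTIBLE",
    },
    # Initialization action: a script run on each node when the cluster is created.
    "initialization_actions": [
        {"executable_file": "gs://my-bucket/scripts/install-extra-libs.sh"}
    ],
}
# Pass this dict as the "config" field of the cluster in create_cluster (see above).
```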
 
- Web UI ports (TCP): 8088 for the Hadoop YARN ResourceManager, 9870 for the HDFS NameNode, and 8080 for Datalab.
- Access Dataproc:
  - through the REST API
  - using the Cloud SDK (gcloud)
  - using the Dataproc UI
  - through the Cloud Client Libraries (see the job-submission sketch below)
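A sketch of the client-library path: submitting a PySpark job to an existing cluster with google-cloud-dataproc. The project, region, cluster name, and GCS job URI are placeholder assumptions.

```python
# Sketch: submit a PySpark job through the Cloud Client Libraries.
# Project, region, cluster name, and job URI are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "example-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/word_count.py"},
}

# submit_job_as_operation returns an operation; result() waits for the job to finish.
operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()
print(f"Job finished with state: {response.status.state.name}")
```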
 