Google(GCP) Professional Data Engineer Sample Questions

Google Professional Data Engineer Sample Questions

Advanced Sample Questions

What is the best method to store and analyze large amounts of data in GCP?

  • a. Google Bigtable
  • b. Google BigQuery
  • c. Google Cloud SQL
  • d. Google Datastore

Answer: b. Google BigQuery

Explanation: Google BigQuery is a fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google’s infrastructure. It is the best option for storing and analyzing large amounts of data in GCP.
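For context, here is a minimal sketch (not part of the exam question) of running an aggregation query against BigQuery with the Python client library. It assumes the google-cloud-bigquery package is installed and default application credentials are configured, and it queries one of Google's public datasets.

# Minimal sketch: querying BigQuery from Python.
# Assumes `pip install google-cloud-bigquery` and default application credentials.
from google.cloud import bigquery

client = bigquery.Client()  # picks up the project from the environment

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():  # runs the query job and waits for it
    print(row.name, row.total)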

What is the recommended approach to secure data stored in GCP?

  • a. Use network firewall rules to restrict access
  • b. Encrypt data at rest using encryption keys
  • c. Use both network firewall rules and encryption at rest
  • d. Do not encrypt data at rest as it is secure in GCP

Answer: c. Use both network firewall rules and encryption at rest

Explanation: Network firewall rules and encryption at rest are the best ways to secure data stored in GCP. Firewall rules can restrict access to specific IP addresses, while encryption keys protect sensitive data in case of a data breach.
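Data in GCP is encrypted at rest by default; for stricter control you can add customer-managed encryption keys (CMEK) from Cloud KMS. Below is a hedged sketch of setting a default CMEK on a Cloud Storage bucket; the project, bucket, and key names are placeholders, and the google-cloud-storage library plus an existing KMS key are assumed.

# Sketch: new objects written to this bucket are encrypted at rest with a customer-managed key.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-secure-bucket")  # placeholder existing bucket
bucket.default_kms_key_name = (
    "projects/example-project/locations/us/keyRings/example-ring/cryptoKeys/example-key"
)
bucket.patch()  # applies the default CMEK to the bucket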

What is the most cost-effective method to store large amounts of data in GCP?

  • a. Google Cloud Storage
  • b. Google Bigtable
  • c. Google Datastore
  • d. Google BigQuery

Answer: a. Google Cloud Storage

Explanation: Google Cloud Storage is the most cost-effective method to store large amounts of data in GCP. It is a fully-managed, highly scalable object storage solution that is cost-effective for storing large amounts of data.
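As an illustration (with placeholder bucket and file names, assuming the google-cloud-storage library), a short sketch of creating a bucket with a colder default storage class to keep per-GB costs down for infrequently accessed data, then uploading an object into it:

# Sketch: Nearline storage costs less per GB than Standard for infrequently accessed data.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-archive-bucket")  # placeholder bucket name
bucket.storage_class = "NEARLINE"
client.create_bucket(bucket, location="US")

blob = bucket.blob("exports/2024/events.csv")
blob.upload_from_filename("events.csv")  # local file to upload
print(blob.name)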

What is the recommended way to run complex data processing tasks in GCP?

  • a. Google Dataproc
  • b. Google Bigtable
  • c. Google BigQuery
  • d. Google Cloud Functions

Answer: a. Google Dataproc

Explanation: Google Dataproc is the recommended way to run complex data processing tasks in GCP. It is a fast, easy-to-use, fully-managed cloud service that makes it simple to run Apache Spark and Apache Hadoop clusters.
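For context, a hedged sketch of submitting a PySpark job to an existing Dataproc cluster with the Python client (google-cloud-dataproc); the project, region, cluster name, and gs:// script path are placeholders.

# Sketch: submit a PySpark job to a Dataproc cluster and wait for it to finish.
from google.cloud import dataproc_v1

region = "us-central1"
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "example-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://example-bucket/jobs/wordcount.py"},
}

operation = job_client.submit_job_as_operation(
    request={"project_id": "example-project", "region": region, "job": job}
)
result = operation.result()  # blocks until the Spark job completes
print(result.driver_output_resource_uri)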

What is the best method to store real-time data in GCP?

  • a. Google Cloud Storage
  • b. Google Bigtable
  • c. Google Datastore
  • d. Google BigQuery

Answer: b. Google Bigtable

Explanation: Google Bigtable is the best method to store real-time data in GCP. It is a fully-managed, NoSQL database service that supports real-time data processing and high-speed data access.

What is the recommended method to process large amounts of data in real-time in GCP?

  • a. Google Cloud Functions
  • b. Google Dataflow
  • c. Google Bigtable
  • d. Google Datastore

Answer: b. Google Dataflow

Explanation: Google Dataflow is the recommended method to process large amounts of data in real-time in GCP. It is a fully-managed service that allows you to build and run data processing pipelines, which can handle real-time and batch data processing.
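To make this concrete, here is a hedged sketch of a streaming pipeline written with the Apache Beam Python SDK, which Dataflow runs. The Pub/Sub topic, BigQuery table, and schema are placeholders, and running it on Dataflow also requires runner, project, and region options.

# Sketch: read messages from Pub/Sub, parse them, and stream them into BigQuery.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner etc. to run on Dataflow

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/example-project/topics/sensor-events")
        | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:sensors.readings",
            schema="sensor_id:STRING,temp_c:FLOAT,ts:TIMESTAMP",
        )
    )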

What is the best method to store and retrieve large amounts of structured data in GCP?

  • a. Google Bigtable
  • b. Google BigQuery
  • c. Google Cloud Storage
  • d. Google Datastore

Answer: b. Google BigQuery

Explanation: Google BigQuery is the best method to store and retrieve large amounts of structured data in GCP. It is a fully-managed data warehousing solution that allows you to store and analyze large amounts of structured data quickly and easily.

What is the recommended method to store and retrieve unstructured data in GCP?

  • a. Google Bigtable
  • b. Google BigQuery
  • c. Google Cloud Storage
  • d. Google Datastore

Answer: c. Google Cloud Storage

Explanation: Google Cloud Storage is the recommended method to store and retrieve unstructured data in GCP. It is a fully-managed, highly scalable object storage solution that is ideal for storing unstructured data, such as images, videos, and audio files.

What is the recommended method to deploy machine learning models in GCP?

  • a. Google AI Platform
  • b. Google Bigtable
  • c. Google Datastore
  • d. Google BigQuery

Answer: a. Google AI Platform

Explanation: Google AI Platform is the recommended method to deploy machine learning models in GCP. It is a fully-managed service that provides a seamless and easy way to deploy machine learning models at scale, so you can quickly start using them in your applications.
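As an illustration of using a deployed model, here is a hedged sketch of requesting an online prediction from an AI Platform model version through the Google API client library; the project, model, version, and instance format are placeholders that depend on how the model was exported.

# Sketch: call a deployed AI Platform model version for online prediction.
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/example-project/models/housing_model/versions/v1"

response = (
    service.projects()
    .predict(name=name, body={"instances": [[1200.0, 3.0, 2.0]]})
    .execute()
)
print(response["predictions"])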

What is the recommended method to perform real-time streaming analytics in GCP?

  • a. Google Cloud Functions
  • b. Google Dataflow
  • c. Google Cloud Pub/Sub
  • d. Google BigQuery

Answer: c. Google Cloud Pub/Sub

Explanation: Google Cloud Pub/Sub is the recommended method to perform real-time streaming analytics in GCP. It is a fully-managed messaging service that enables you to perform real-time streaming analytics by exchanging messages between independent applications.
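For context, a minimal sketch of publishing events to a Pub/Sub topic that a downstream streaming pipeline (for example, Dataflow) can consume; the project and topic names are placeholders and the google-cloud-pubsub library is assumed.

# Sketch: publish a JSON event to a Pub/Sub topic.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("example-project", "clickstream-events")

# The payload must be bytes; keyword arguments become string attributes on the message.
future = publisher.publish(topic_path, b'{"user_id": "u-123", "action": "click"}', source="web")
print("Published message ID:", future.result())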

Basic Sample Questions

Question 1. A customer of John’s organization provides daily dumps of data from their database, delivered to Google Cloud Storage as comma-separated-values (CSV) files. John’s job is to analyze this data using Google BigQuery; however, the data might contain rows that are formatted incorrectly or corrupted. How should John build this pipeline?
  1. Using federated data sources, and checking data in the SQL query.
  2. Enabling BigQuery monitoring in Google Stackdriver and creating an alert.
  3. Importing the data into BigQuery using the gcloud CLI and setting max_bad_records to 0.
  4. Running a Google Cloud Dataflow batch pipeline for importing the data into BigQuery, and pushing errors to another dead-letter table for analysis.

Correct Answer: 4
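To illustrate option 4, below is a hedged sketch of the dead-letter pattern in an Apache Beam (Cloud Dataflow) batch pipeline; the bucket, tables, and CSV column layout are hypothetical placeholders.

# Sketch: parse CSV lines, load clean rows into BigQuery, and push bad rows to a dead-letter table.
import csv
import io
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

COLUMNS = ["customer_id", "name", "email", "amount", "purchased_at"]  # assumed CSV layout

def parse_csv_line(line):
    try:
        fields = next(csv.reader(io.StringIO(line)))
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        yield dict(zip(COLUMNS, fields))
    except Exception as err:
        yield TaggedOutput("dead_letter", {"raw_line": line, "error": str(err)})

with beam.Pipeline() as pipeline:  # add Dataflow options to run this as a batch job on GCP
    rows = (
        pipeline
        | "ReadCSV" >> beam.io.ReadFromText("gs://example-bucket/daily_dump.csv",
                                            skip_header_lines=1)
        | "Parse" >> beam.FlatMap(parse_csv_line).with_outputs("dead_letter", main="good")
    )
    rows.good | "WriteGoodRows" >> beam.io.WriteToBigQuery(
        "example-project:imports.customer_data",
        schema="customer_id:STRING,name:STRING,email:STRING,amount:FLOAT,purchased_at:TIMESTAMP",
    )
    rows.dead_letter | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
        "example-project:imports.customer_data_errors",
        schema="raw_line:STRING,error:STRING",
    )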

Question 2.  Alexi has the job of creating a model that could predict housing prices. Owing to budget constraints, she must now run it on a single resource-constrained virtual machine. Which learning algorithm would you suggest she use?
  1. Linear regression
  2. Logistic classification
  3. Recurrent neural network
  4. Feedforward neural network

Correct Answer: 1

Question 3. ABC company is using wildcard tables to query data across multiple tables with similar names. The SQL statement currently produces the following error:

Syntax error: Expected end of statement but got "-" at [4:11]

The statement is:

SELECT age
FROM
bigquery-public-data.noaa_gsod.gsod
WHERE
age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY
age DESC

What table name will ensure that the SQL statement works correctly?

  1. “˜bigquery-public-data.noaa_gsod.gsod”˜
  2. bigquery-public-data.noaa_gsod.gsod*
  3. “˜bigquery-public-data.noaa_gsod.gsod’*
  4. “˜bigquery-public-data.noaa_gsod.gsod*`

Correct Answer: 4

Reference: https://cloud.google.com/bigquery/docs/wildcard-tables
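For reference, a hedged sketch of running the corrected query (option 4, with the wildcard inside backticks) from Python using the BigQuery client library; the age column is used exactly as the question writes it.

# Sketch: wildcard-table query filtered with _TABLE_SUFFIX.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT age
    FROM `bigquery-public-data.noaa_gsod.gsod*`
    WHERE age != 99
      AND _TABLE_SUFFIX = '1929'
    ORDER BY age DESC
"""
for row in client.query(query).result():
    print(row.age)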

Question 4. Company L&T operates in a highly regulated fielIndividual users should have access to only the information necessary to perform their tasks. This is one of your requirements. The goal is to enforce this requirement with Google BigQuery. Could you describe three possible approaches? (Choose three.)
  1. Disabling the writes for certain tables.
  2. Restricting access to tables by role.
  3. Ensuring that the data is encrypted at all times.
  4. Restricting BigQuery API access to approved users.
  5. Segregating the data across multiple tables or databases.
  6. Using Google Stackdriver Audit Logging for determining policy violations.

Correct Answer: 2,4,6 

Question 5. Your company handles data processing for a wide range of clients. Some clients access the data directly in Google BigQuery with their own suite of analytics tools, while others rely on their own software. It is crucial to keep the data secure so that your clients cannot see each other’s data, and you must make sure that the data can be accessed in an appropriate manner.
Which three steps should you take? (Choose three.)
  1. Loading data into different partitions.
  2. Loading data into a different dataset for each client.
  3. Putting each client’s BigQuery dataset into a different table.
  4. Restricting a client’s dataset to approved users.
  5. Only allowing a service account to access the datasets.
  6. Using the appropriate identity and access management (IAM) roles for each client’s users.

Correct Answer: 2,4,6

Question 6. Your company is deploying 10,000 new Internet of Things devices to monitor temperature in your warehouses around the world. These very large datasets must be processed, stored, and analyzed in real-time. How should you proceed?
  1. Sending the data to Google Cloud Datastore and then exporting to BigQuery.
  2. Sending the data to Google Cloud Pub/Sub, streaming Cloud Pub/Sub to Google Cloud Dataflow, and storing in Google BigQuery.
  3. Sending the data to Cloud Storage and then spinning up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
  4. Exporting logs in batch to Google Cloud Storage and then spinning up a Google Cloud SQL instance, importing the data from Cloud Storage, and running an analysis as needed.

Correct Answer: 2

Question 7. Your company streams real-time data from the factory floor into Bigtable, and query performance is suffering. How should you redesign Bigtable’s row key so that queries populating real-time dashboards perform well?
  1. Using a row key of the form <timestamp>.
  2. Using a row key of the form <sensorid>.
  3. Using a row key of the form <timestamp>#<sensorid>.
  4. Using a row key of the form <sensorid>#<timestamp>.

Correct Answer: 4
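As a hedged illustration of option 4, the sketch below writes rows keyed as <sensorid>#<timestamp> with the Cloud Bigtable Python client and then scans one sensor's readings by key range; the instance, table, and column-family names are placeholders.

# Sketch: lead the row key with the sensor ID so one sensor's rows are contiguous
# and writes are spread across tablets instead of hotspotting on a timestamp prefix.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="example-project")
table = client.instance("factory-metrics").table("sensor-readings")

ts = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
row = table.direct_row(f"press-17#{ts}".encode())
row.set_cell("measurements", b"temp_c", b"21.5")
row.commit()

# Dashboard read: fetch one sensor's readings with a key-range scan
# ('$' is the byte immediately after '#', so this covers the "press-17#" prefix).
for reading in table.read_rows(start_key=b"press-17#", end_key=b"press-17$"):
    print(reading.row_key, reading.cells["measurements"][b"temp_c"][0].value)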

Question 8. Your company has experienced rapid growth recently and is now ingesting data at a much higher rate than before. Managing batch MapReduce jobs in Hadoop is one of your daily responsibilities, and the batch jobs have fallen behind because of the increase in data. You have been asked for suggestions on boosting the responsiveness of the analytics without increasing costs. How should you proceed?
  1. Rewriting the job in Pig.
  2. Rewriting the job in Apache Spark.
  3. Increasing the size of the Hadoop cluster.
  4. Decreasing the size of the Hadoop cluster but also rewriting the job in Hive.

Correct Answer: 1

Question 9. One of Henry’s jobs is to batch the log files from all applications into a single log file at 2:00 am each day. This log file is processed by a Google Cloud Dataflow job. He must make sure that the log file is processed once a day, as cheaply as possible. What should he do?
  1. Changing the processing job to use Google Cloud Dataproc instead.
  2. Manually starting the Cloud Dataflow job each morning when you get into the office.
  3. Creating a cron job with Google App Engine Cron Service for running the Cloud Dataflow job.
  4. Configuring the Cloud Dataflow job as a streaming job so that it processes the log data immediately.

Correct Answer: 3

Question 10. Riya is training a spam classifier. It appears to her that she is overfitting the training data. Which three actions among the following can she take for resolving this problem? (Choose three.)
  1. Getting more training examples
  2. Reducing the number of training examples
  3. Using a smaller set of features 
  4. Using a larger set of features
  5. Increasing the regularization parameters 
  6. Decreasing the regularization parameters

Correct Answer: 1,4,6

Question 11. Google BigQuery has been used for 6 months by your organization to collect and analyze data. The data is analyzed mainly in a time-partitioned table named events_partitioned. To reduce query costs, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will connect to BigQuery through ODBC to access the events data. Which two actions must be taken to ensure that the applications can connect? (Choose two.)
  1. Creating a new view over events using standard SQL
  2. Creating a new partitioned table using a standard SQL query
  3. Creating a new view over events_partitioned using standard SQL 
  4. Creating a service account for the ODBC connection to use for authentication 
  5. Creating a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared “events”

Correct Answer: 1,5

Question 12. You have enabled the Firebase Analytics and Google BigQuery integration. With this integration, Firebase creates a new BigQuery table every day in the format app_events_YYYYMMDD. You would like to query all of the tables for the past 30 days in legacy SQL. What should you use?
  1. TABLE_DATE_RANGE function
  2. WHERE _PARTITIONTIME pseudo column
  3. WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD
  4. SELECT IF(date >= YYYY-MM-DD AND date <= YYYY-MM-DD)

Correct Answer: 1

Reference: https://cloud.google.com/blog/products/gcp/using-bigquery-and-firebase-analytics-to-understand-your-mobile-app?hl=am
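As a hedged sketch of option 1, the query below uses TABLE_DATE_RANGE over the daily Firebase export tables and must run with legacy SQL enabled; the project and dataset names are placeholders, and the event_dim field follows the older Firebase Analytics export schema.

# Sketch: query the last 30 days of app_events_YYYYMMDD tables with legacy SQL.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT event_dim.name, COUNT(*) AS events
    FROM TABLE_DATE_RANGE([example-project:firebase_analytics.app_events_],
                          DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
                          CURRENT_TIMESTAMP())
    GROUP BY event_dim.name
"""
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)  # TABLE_DATE_RANGE is legacy-SQL only
for row in client.query(query, job_config=job_config).result():
    print(row)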

Question 13. Your company receives both batch- and stream-based event data. You are required to process the data with Google Cloud Dataflow over a predictable time period.
However, in some instances the data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle late or out-of-order data?
  1. Setting a single global window for capturing all the data.
  2. Setting sliding windows for capturing all the lagged data.
  3. Using watermarks and timestamps for capturing the lagged data.
  4. Ensuring every data source type (stream or batch) has a timestamp, and using the timestamps for defining the logic for lagged data.

Correct Answer: 3
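To make option 3 concrete, here is a hedged sketch using the Beam Python SDK: events carry their own timestamps, are grouped into fixed event-time windows, and the watermark plus an allowed-lateness setting determine how late data is handled. Topic names, field names, and window sizes are placeholders.

# Sketch: event-time windows with watermark-driven triggering and allowed lateness.
import json
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode
from apache_beam.options.pipeline_options import PipelineOptions

def with_event_time(event):
    # Stamp each element with its own event time (seconds since epoch) so
    # out-of-order records still fall into the correct window.
    return window.TimestampedValue(event, event["event_time"])

with beam.Pipeline(options=PipelineOptions(streaming=True)) as pipeline:
    (
        pipeline
        | beam.io.ReadFromPubSub(topic="projects/example-project/topics/events")
        | beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | beam.Map(with_event_time)
        | beam.WindowInto(
            window.FixedWindows(60),                  # 1-minute event-time windows
            trigger=AfterWatermark(),                 # fire when the watermark passes the window end
            allowed_lateness=600,                     # still accept data up to 10 minutes late
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | beam.Map(lambda e: (e["sensor_id"], 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )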

Question 14. The objective of your data pipeline deployment on Google Cloud is to store 20 TB of text files. The input data is in CSV format. You must minimize the cost of querying aggregate values for the multiple users who will query the data in Cloud Storage using multiple engines. Which storage service and schema design should you use?
  1. Cloud Bigtable for storage: Installing the HBase shell on a Compute Engine instance for querying the Cloud Bigtable data.
  2. Cloud Bigtable for storage: Linking as permanent tables in BigQuery for the query.
  3. Cloud Storage for storage: Linking as permanent tables in BigQuery for a query.
  4. Cloud Storage for storage: Linking as temporary tables in BigQuery for a query.

Correct Answer: 1

Question 15. Using Google Cloud, you need to design storage for two relational tables within a 10-TB database. Your goal is to support horizontally scalable transactions.
The data should also be optimized for range queries on non-key columns. What should you be using?
  1. Cloud SQL for storage: Adding secondary indexes to support query patterns.
  2. Cloud SQL for storage: Using Cloud Dataflow for transforming data to support query patterns.
  3. Cloud Spanner for storage: Adding secondary indexes to support query patterns.
  4. Cloud Spanner for storage: Using Cloud Dataflow for transforming data to support query patterns.

Correct Answer: 3
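As a hedged sketch of option 3, the snippet below adds a secondary index on a non-key column in Cloud Spanner and then runs a range query against it; the instance, database, table, and column names are hypothetical.

# Sketch: create a secondary index, then range-scan on the indexed non-key column.
import datetime
from google.cloud import spanner

client = spanner.Client(project="example-project")
database = client.instance("example-instance").database("orders-db")

operation = database.update_ddl(
    ["CREATE INDEX OrdersByShipDate ON Orders(ShipDate)"]
)
operation.result()  # schema changes are long-running operations

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT OrderId, ShipDate FROM Orders "
        "WHERE ShipDate BETWEEN @start AND @end",
        params={"start": datetime.date(2024, 1, 1), "end": datetime.date(2024, 1, 31)},
        param_types={"start": spanner.param_types.DATE, "end": spanner.param_types.DATE},
    )
    for order_id, ship_date in results:
        print(order_id, ship_date)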

Question 16. You plan to store 50 TB of financial time-series data in the cloud as your financial services company moves to cloud technology. New data will stream in constantly because this data is frequently updated. To gain insight into this data, your company wants to move its existing Apache Hadoop jobs to the cloud.
Which product would be most suitable to store the data?
  1. Cloud Bigtable
  2. Google BigQuery
  3. Google Cloud Storage
  4. Google Cloud Datastore

Correct Answer: 1

Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series

Question 17. You need to decrease the time it takes to train your neural network model. How would you proceed?
  1. Subsampling your test dataset.
  2. Subsampling your training dataset.
  3. Increasing the number of input features to your model.
  4. Increasing the number of layers in your neural network.

Correct Answer: 4

Reference: https://towardsdatascience.com/how-to-increase-the-accuracy-of-a-neural-network-9f5d1c6f407d

Question 18. You are responsible for creating the ETL pipelines for your company’s Apache Hadoop cluster. The pipelines will need to be checkpointed and split during construction. Which method would you prefer for writing the pipelines?
  1. PigLatin using Pig
  2. HiveQL using Hive
  3. Java using MapReduce
  4. Python using MapReduce

Correct Answer: 4

Question 19. At your large enterprise company, you are in charge of BI for multiple business units with different priorities and budgets, and you use on-demand pricing for BigQuery with a quota of 2K concurrent slots per project. Users at your organization sometimes cannot get slots to execute their queries, and this needs to be corrected. You prefer to avoid introducing new projects in your account.
How would you proceed?
  1. Converting the batch BQ queries into interactive BQ queries.
  2. Creating an additional project for overcoming the 2K on-demand per-project quota.
  3. Switching to flat-rate pricing and establishing a hierarchical priority model for the projects.
  4. Increasing the amount of concurrent slots per project at the Quotas page at the Cloud Console.

Correct Answer: 3

Reference: https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery

Question 20. Your job is to migrate a 2 TB relational database to Google Cloud Platform. However, you lack the resources to significantly refactor the application that uses this database. Operating cost is your primary concern.
Which service is appropriate for storing and serving the data?
  1. Cloud Spanner
  2. Cloud Bigtable
  3. Cloud Firestore
  4. Cloud SQL

Correct Answer: 4

Google Cloud Certified Professional Data Engineer free practice tests