AWS Data Analytics Specialty (DAS-C01) Sample Questions

Advanced Sample Questions

What is the most efficient way to store and analyze data that is generated from log files on AWS?

  • A) S3 bucket
  • B) Redshift
  • C) Amazon RDS
  • D) Amazon DynamoDB

Answer: B) Redshift

Explanation: Redshift is designed specifically for data warehousing and is optimized for large datasets and complex queries. It allows for fast querying and analysis of data, making it the most efficient option for storing and analyzing log data on AWS.

Which AWS service can be used to store and process large amounts of time-series data?

  • A) Amazon EBS
  • B) Amazon RDS
  • C) Amazon Kinesis Data Streams
  • D) Amazon S3

Answer: C) Amazon Kinesis Data Streams

Explanation: Amazon Kinesis Data Streams is designed to handle real-time data processing and analysis of large amounts of time-series data. It allows for data to be streamed and stored in real-time, making it ideal for time-series data analysis.
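
For orientation, here is a minimal producer sketch using the AWS SDK for Python (boto3) that pushes time-series readings into a Kinesis data stream; the stream name and record fields are hypothetical and not part of the question.

```python
# A minimal sketch (assumed names): write time-series records to a Kinesis data stream.
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")


def publish_reading(sensor_id: str, value: float) -> None:
    record = {"sensor_id": sensor_id, "value": value, "ts": int(time.time() * 1000)}
    kinesis.put_record(
        StreamName="sensor-metrics",                 # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=sensor_id,                      # keeps one sensor's readings on one shard
    )


publish_reading("sensor-001", 72.4)
```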

How can you run a batch processing job on AWS to process large amounts of data?

  • A) Amazon S3
  • B) Amazon DynamoDB
  • C) Amazon EC2
  • D) AWS Glue

Answer: D) AWS Glue

Explanation: AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to move data between data stores. It can be used to run batch processing jobs to process large amounts of data, making it an ideal option for data processing on AWS.
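
As a rough illustration of triggering such a batch job programmatically, the sketch below starts an existing AWS Glue job with boto3 and polls for completion; the job name and argument are assumptions.

```python
# A minimal sketch, assuming a Glue ETL job named "nightly-batch-etl" already exists.
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")

run_id = glue.start_job_run(
    JobName="nightly-batch-etl",                               # hypothetical job name
    Arguments={"--input_path": "s3://example-bucket/raw/"},    # hypothetical job argument
)["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="nightly-batch-etl", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(f"Job run {run_id} finished with state {state}")
```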

Which AWS service is best suited for performing in-depth analysis of complex data structures?

  • A) Amazon EBS
  • B) Amazon Redshift
  • C) Amazon S3
  • D) Amazon QuickSight

Answer: B) Amazon Redshift

Explanation: Amazon Redshift is optimized for data warehousing and is designed for large datasets and complex queries. It provides fast querying and analysis capabilities, making it the best choice for performing in-depth analysis of complex data structures on AWS.

How can you analyze data stored in Amazon S3 in real-time?

  • A) Amazon EC2
  • B) Amazon Redshift
  • C) Amazon Kinesis Data Streams
  • D) Amazon QuickSight

Answer: C) Amazon Kinesis Data Streams

Explanation: Amazon Kinesis Data Streams can be used to analyze data stored in Amazon S3 in real-time by streaming the data directly into Kinesis. This allows for real-time analysis and processing of data stored in S3, making it an ideal option for real-time data analysis on AWS.

Which AWS service allows you to process and analyze large amounts of streaming data in real-time?

  • A) Amazon S3
  • B) Amazon Redshift
  • C) Amazon Kinesis Data Streams
  • D) Amazon QuickSight

Answer: C) Amazon Kinesis Data Streams

Explanation: Amazon Kinesis Data Streams is a service designed to process and analyze large amounts of streaming data in real-time. It allows you to ingest, store, and process real-time data streams, making it an ideal option for real-time data analysis on AWS.
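
To make the ingest-and-process flow concrete, here is a minimal polling consumer sketch for a single shard; real applications would typically use the Kinesis Client Library or enhanced fan-out, and the stream name is hypothetical.

```python
# A minimal single-shard consumer sketch (assumed stream name); not production-grade.
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream_name = "clickstream-events"   # hypothetical stream name

shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

for _ in range(10):                              # bounded loop for the sketch
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["Data"])                    # process each record payload here
    iterator = resp["NextShardIterator"]
    time.sleep(1)                                # stay under per-shard read limits
```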

What is the best AWS service to use when you want to create interactive dashboards to visualize and analyze your data?

  • A) Amazon S3
  • B) Amazon Redshift
  • C) Amazon Kinesis Data Streams
  • D) Amazon QuickSight

Answer: D) Amazon QuickSight

Explanation: Amazon QuickSight is a business intelligence service that allows you to easily create interactive dashboards and visualizations for your data. It integrates with a variety of data sources, including Amazon S3, Amazon Redshift, and Amazon Kinesis Data Streams, making it an ideal option for data visualization and analysis on AWS.

Which AWS service is best suited for processing and analyzing large amounts of structured data?

  • A) Amazon S3
  • B) Amazon Redshift
  • C) Amazon Kinesis Data Streams
  • D) Amazon QuickSight

Answer: B) Amazon Redshift

Explanation: Amazon Redshift is a fully managed data warehousing service optimized for processing and analyzing large amounts of structured data. It provides fast querying and analysis capabilities, making it an ideal option for processing and analyzing large amounts of structured data on AWS.
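
For context, a query against such a warehouse might be submitted with the Redshift Data API as sketched below; the cluster, database, and table names are assumptions.

```python
# A minimal sketch using the Redshift Data API (assumed cluster/database/table names).
import time

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

stmt = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",   # hypothetical cluster
    Database="dev",
    DbUser="analyst",
    Sql="SELECT region, SUM(revenue) AS revenue FROM sales GROUP BY region ORDER BY revenue DESC;",
)

# Wait for the statement to finish, then print the result rows.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(2)

for row in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
    print(row)
```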

How can you process and analyze large amounts of unstructured data on AWS?

  • A) Amazon S3
  • B) Amazon Redshift
  • C) Amazon Kinesis Data Streams
  • D) AWS Glue

Answer: D) AWS Glue

Explanation: AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to move data between data stores. It can be used to process and analyze large amounts of unstructured data, making it an ideal option for unstructured data analysis on AWS.

What is the best AWS service to use when you want to store and analyze large amounts of event data from multiple sources?

  • A) Amazon S3
  • B) Amazon Redshift
  • C) Amazon Kinesis Data Streams
  • D) Amazon Glue

Answer: C) Amazon Kinesis Data Streams

Explanation: Amazon Kinesis Data Streams is a service designed to handle real-time data processing and analysis of large amounts of time-series data. It can ingest, store, and process event data from multiple sources, making it an ideal option for event data analysis on AWS.

Basic Sample Questions

Question 1 – A financial services company aggregates data from stock exchanges daily and requires that the data be streamed directly into a data store, where it is occasionally modified with SQL. The company needs a dashboard that enables viewing of the top contributors to stock price anomalies and that integrates complex, analytic queries with minimal latency. Which solution would meet the company’s requirements?
  • A. Using Amazon Kinesis Data Firehose for streaming data to Amazon S3. Using Amazon Athena as a data source for Amazon QuickSight for creating a business intelligence dashboard.
  • B. Using Amazon Kinesis Data Streams for streaming data to Amazon Redshift. Using Amazon Redshift as a data source for Amazon QuickSight for creating a business intelligence dashboard.
  • C. Using Amazon Kinesis Data Firehose for streaming data to Amazon Redshift. Using Amazon Redshift as a data source for Amazon QuickSight for creating a business intelligence dashboard.
  • D. Using Amazon Kinesis Data Streams for streaming data to Amazon S3. Using Amazon Athena as a data source for Amazon QuickSight for creating a business intelligence dashboard.

Correct Answer: D
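
Consistent with the answer given above (data landing in Amazon S3 and Amazon Athena serving as the QuickSight data source), the sketch below runs an Athena query with boto3; the database, table, and result location are hypothetical.

```python
# A minimal sketch of querying the S3-resident stream data with Athena (assumed names).
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

execution_id = athena.start_query_execution(
    QueryString="SELECT symbol, MAX(price_change) AS spike FROM trades GROUP BY symbol ORDER BY spike DESC LIMIT 10;",
    QueryExecutionContext={"Database": "stock_db"},                            # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},    # hypothetical bucket
)["QueryExecutionId"]

while athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
for row in rows[1:]:                       # the first row holds the column headers
    print([col.get("VarCharValue") for col in row["Data"]])
```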

Question 2 – A financial company hosts a data lake on Amazon S3 and a data warehouse on Amazon Redshift, and builds dashboards using Amazon QuickSight. The company wishes to secure access from its Active Directory. How should the data be secured?
  • A. Using an Active Directory connector and single sign-on (SSO) in a corporate network environment.
  • B. Using a VPC endpoint to connect to Amazon S3 from Amazon QuickSight and an IAM role for authenticating Amazon Redshift.
  • C. Establishing a secure connection by creating an S3 endpoint for connecting to Amazon QuickSight and a VPC endpoint for connecting to Amazon Redshift.
  • D. Placing Amazon QuickSight and Amazon Redshift in the security group and using an Amazon S3 endpoint for connecting Amazon QuickSight to Amazon S3.

Correct Answer: B

Question 3 – A real estate company has a mission-critical application that uses Apache HBase on Amazon EMR with a single master node. More than five terabytes of data are stored on the Hadoop Distributed File System (HDFS), and the company needs a cost-effective solution to keep its HBase data highly available. Which architectural pattern would meet the company’s requirements?
  • A. Using Spot Instances for core and task nodes and a Reserved Instance for the EMR master node. Configuring the EMR cluster with multiple master nodes. Scheduling automated snapshots using Amazon EventBridge.
  • B. Storing the data on an EMR File System (EMRFS) instead of HDFS. Enabling EMRFS consistent view. Creating an EMR HBase cluster with multiple master nodes. Pointing the HBase root directory to an Amazon S3 bucket.
  • C. Storing the data on an EMR File System (EMRFS) instead of HDFS and enabling EMRFS consistent view. Running two separate EMR clusters in two different Availability Zones. Pointing both clusters to the same HBase root directory in the same Amazon S3 bucket.
  • D. Storing the data on an EMR File System (EMRFS) instead of HDFS and enabling EMRFS consistent view. Creating a primary EMR HBase cluster with multiple master nodes. Creating a secondary EMR HBase read-replica cluster in a separate Availability Zone. Pointing both clusters to the same HBase root directory in the same Amazon S3 bucket.

Correct Answer: C

Reference: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase-s3.html
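
Options B, C, and D all rely on pointing the HBase root directory at Amazon S3 via EMRFS. As a rough sketch, the cluster launch below shows how that configuration might look with boto3; the release label, instance types, roles, and bucket are assumptions.

```python
# A minimal sketch (assumed names) of launching an EMR HBase cluster with its root directory on S3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="hbase-on-s3",
    ReleaseLabel="emr-6.10.0",                       # assumed release label
    Applications=[{"Name": "HBase"}],
    Configurations=[
        {"Classification": "hbase", "Properties": {"hbase.emr.storageMode": "s3"}},
        {"Classification": "hbase-site",
         "Properties": {"hbase.rootdir": "s3://example-hbase-bucket/hbase-root/"}},  # hypothetical bucket
    ],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
```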

Question 4 – Weekly updates are made to an application hosted on Amazon Web Services (AWS). As part of application testing, a solution must be developed to analyze the log files from each Amazon EC2 instance to ensure that the application is working as expected after each deployment. The collection and analysis solution must be highly available and must display new information with minimal delay. Which method should be used to collect and analyze the logs?
  • A. Enabling detailed monitoring on Amazon EC2, using Amazon CloudWatch agent to store logs in Amazon S3, and using Amazon Athena for fast, interactive log analytics.
  • B. Using the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 for collecting and sending data to Kinesis Data Streams to further push the data to Amazon OpenSearch Service (Amazon Elasticsearch Service) and visualizing using Amazon QuickSight.
  • C. Using the Amazon Kinesis Producer Library (KPL) agent on Amazon EC2 for collecting and sending data to Kinesis Data Firehose to further push the data to Amazon OpenSearch Service (Amazon Elasticsearch Service) and OpenSearch Dashboards (Kibana).
  • D. Using Amazon CloudWatch subscriptions for getting access to a real-time feed of logs and have the logs delivered to Amazon Kinesis Data Streams to further push the data to Amazon OpenSearch Service (Amazon Elasticsearch Service) and OpenSearch Dashboards (Kibana).

Correct Answer: D

Reference: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Subscriptions.html
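
As a rough sketch of option D, the call below subscribes a log group to a Kinesis data stream so new log events are forwarded in near real time; the log group name, stream ARN, and role ARN are hypothetical.

```python
# A minimal sketch (assumed names/ARNs): forward CloudWatch Logs events to a Kinesis data stream.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

logs.put_subscription_filter(
    logGroupName="/app/deployment-tests",                                        # hypothetical log group
    filterName="ship-to-kinesis",
    filterPattern="",                                                            # empty pattern forwards every event
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/app-logs",     # hypothetical stream ARN
    roleArn="arn:aws:iam::123456789012:role/CWLtoKinesisRole",                   # hypothetical IAM role
)
```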

Question 5 – A data analyst uses AWS Glue to organize, cleanse, validate, and format a 200 GB dataset, and triggers the job to run with the Standard worker type. The AWS Glue job has been running for 3 hours and there are no error codes in the logs. The data analyst needs to reduce the job execution time without overprovisioning. Which actions should the data analyst take?
  • A. Enabling job bookmarks in AWS Glue for estimating the number of data processing units (DPUs). Based on the profiled metrics, increasing the value of the executor-cores job parameter.
  • B. Enabling job metrics in AWS Glue for estimating the number of data processing units (DPUs). Based on the profiled metrics, increase the value of the maximum capacity job parameter.
  • C. Enabling job metrics in AWS Glue for estimating the number of data processing units (DPUs). Based on the profiled metrics, increasing the value of the spark.yarn.executor.memory.overhead job parameter.
  • D. Enabling job bookmarks in AWS Glue for estimating the number of data processing units (DPUs). Based on the profiled metrics, increasing the value of the num-executors job parameter.

Correct Answer: B

Reference: https://docs.aws.amazon.com/glue/latest/dg/monitor-debug-capacity.html
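
A rough sketch of option B is shown below: enable job metrics and raise the maximum capacity (DPUs) of an existing Glue job. The job name and the new capacity value are assumptions; note that UpdateJob overwrites the stored definition, so the current definition is read first.

```python
# A minimal sketch (assumed job name): enable job metrics and increase maximum capacity (DPUs).
import boto3

glue = boto3.client("glue", region_name="us-east-1")
job_name = "curate-200gb-dataset"            # hypothetical job name

job = glue.get_job(JobName=job_name)["Job"]

glue.update_job(
    JobName=job_name,
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": {**job.get("DefaultArguments", {}), "--enable-metrics": "true"},
        "MaxCapacity": 20.0,                 # assumed value, chosen from the profiled metrics
    },
)
```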

Question 6 – A company has a business unit uploading .csv files to an Amazon S3 bucket, and the company’s data platform team has set up an AWS Glue crawler to perform discovery and create tables and schemas. An AWS Glue job writes processed data from the created tables to an Amazon Redshift database, and handles column mapping and creating the Amazon Redshift table appropriately. When the AWS Glue job is rerun on the same day for any reason, duplicate records are introduced into the Amazon Redshift table. Which solution will update the Redshift table without duplicates when jobs are rerun?
  • A. Modifying the AWS Glue job for copying the rows into a staging table. Adding SQL commands for replacing the existing rows in the main table as post actions in the DynamicFrameWriter class.
  • B. Loading the previously inserted data into a MySQL database in the AWS Glue job. Performing an upsert operation in MySQL, and copying the results to the Amazon Redshift table.
  • C. Using Apache Spark’s DataFrame dropDuplicates() API for eliminating duplicates and then writing the data to Amazon Redshift.
  • D. Using the AWS Glue ResolveChoice built-in transform for selecting the most recent value of the column.

Correct Answer: B

Question 7 – A streaming application reads data from Amazon Kinesis Data Streams and immediately writes it to an Amazon S3 bucket every 10 seconds. The application reads from hundreds of shards, and the batch interval cannot be changed due to another requirement. Users access the data with Amazon Athena and are experiencing degraded query performance as time progresses. Which of the given actions could help improve query performance?
  • A. Merging the files in Amazon S3 to form larger files.
  • B. Increasing the number of shards in Kinesis Data Streams.
  • C. Adding more memory and CPU capacity to the streaming application.
  • D. Writing the files to multiple S3 buckets.

Correct Answer: C

Question 8 – A company stores and analyzes its website clickstream data using Amazon OpenSearch Service (Amazon Elasticsearch Service). The company ingests 1 TB of data per day through Amazon Kinesis Data Firehose, and the Amazon ES cluster stores one day’s worth of data. In addition to slow query performance on Amazon ES, the company occasionally encounters errors from Kinesis Data Firehose when writing to the index. The Amazon ES cluster has 10 data nodes running a single index and 3 dedicated master nodes; each data node has 1.5 TB of Amazon EBS storage attached, and the cluster is configured with 1,000 shards. JVMMemoryPressure errors are occasionally found in the cluster logs. Which of the following solutions would improve the performance of Amazon ES?
  • A. Increasing the memory of the Amazon ES master nodes.
  • B. Decreasing the number of Amazon ES data nodes.
  • C. Decreasing the number of Amazon ES shards for the index.
  • D. Increasing the number of Amazon ES shards for the index.

Correct Answer: C

Question 9 – A manufacturing company collects data from IoT sensors on its factory floor throughout the year and uses Amazon Redshift to analyze the data daily. At the expected ingestion rate of 2 TB per day, a data analyst determined that the cluster will be undersized within 4 months. The data analysts indicated that a long-term solution is needed: most queries reference only the most recent 13 months of data, but quarterly reports need to query all data generated over the last 7 years. The chief technology officer (CTO) is concerned about the long-term solution’s cost, administrative effort, and performance. Which solution should be used to meet these requirements?
  • A. Creating a daily job in AWS Glue to UNLOAD records older than 13 months to Amazon S3 and deleting those records from Amazon Redshift. Creating an external table in Amazon Redshift for pointing to the S3 location. Using Amazon Redshift Spectrum for joining data that is older than 13 months.
  • B. Taking a snapshot of the Amazon Redshift cluster. Restoring the cluster to a new cluster using dense storage nodes with additional storage capacity.
  • C. Executing a CREATE TABLE AS SELECT (CTAS) statement for moving records that are older than 13 months to quarterly partitioned data in Amazon Redshift Spectrum backed by Amazon S3.
  • D. Unloading all the tables in Amazon Redshift to an Amazon S3 bucket using S3 Intelligent-Tiering. Using AWS Glue for crawling the S3 bucket location for creating external tables in an AWS Glue Data Catalog. Creating an Amazon EMR cluster using Auto Scaling for any daily analytics needs, and using Amazon Athena for the quarterly reports, with both using the same AWS Glue Data Catalog.

Correct Answer: B

Question 10 – An insurance company delivers raw data in JSON format to an Amazon S3 bucket through an Amazon Kinesis Data Firehose delivery stream with no fixed delivery schedule. An AWS Glue crawler runs every 8 hours to update the schemas for the tables in the AWS Glue Data Catalog. Data analysis is performed using Apache Spark SQL on Amazon EMR with the AWS Glue Data Catalog as the metastore. Data analysts report that they occasionally receive stale data. A data engineer needs to provide the most up-to-date data. Which solution would help meet these requirements?
  • A. Creating an external schema based on the AWS Glue Data Catalog on the existing Amazon Redshift cluster for querying new data in Amazon S3 with Amazon Redshift Spectrum.
  • B. Using Amazon CloudWatch Events with the rate (1 hour) expression for executing the AWS Glue crawler every hour.
  • C. Using the AWS CLI, modifying the execution schedule of the AWS Glue crawler from 8 hours to 1 minute.
  • D. Running the AWS Glue crawler from an AWS Lambda function triggered by an S3:ObjectCreated:* event notification on the S3 bucket.

Correct Answer: A

Question 11 – A company producing network devices has millions of users. Hourly data from the devices is collected and stored in an Amazon S3 data lake. To detect abnormalities and to troubleshoot and resolve user issues, the company analyzes the most recent 24 hours’ worth of data flow logs. The company also analyzes historical logs dating back two years to uncover patterns and identify improvement opportunities. The data flow logs contain many metrics, including date, timestamps, source IPs, and target IPs. There are about 10 billion events every day. How should this data be stored for optimal performance?
  • A. In Apache ORC partitioned by date and sorted by source IP
  • B. In compressed .csv partitioned by date and sorted by source IP
  • C. In Apache Parquet partitioned by source IP and sorted by date
  • D. In compressed nested JSON partitioned by source IP and sorted by date

Correct Answer: D

Question 12 – A banking company uses an Amazon Redshift cluster with dense storage (DS) nodes to store sensitive data. An audit found that the cluster is unencrypted. Compliance requirements state that a database with sensitive data must be encrypted through a hardware security module (HSM) with automated key rotation. Which combination of steps will help achieve compliance? (Choose two.)
  • A. Setting up a trusted connection with HSM using a client and server certificate with automatic key rotation.
  • B. Modifying the cluster with an HSM encryption option and automatic key rotation.
  • C. Creating a new HSM-encrypted Amazon Redshift cluster and migrating the data to the new cluster.
  • D. Enabling HSM with key rotation through the AWS CLI.
  • E. Enabling Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) encryption in the HSM.

Correct Answer: BD

Reference: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-db-encryption.html
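
As a rough sketch of options B and D, the calls below switch the cluster to HSM encryption and then rotate the encryption key; the cluster identifier and HSM identifiers are assumed to exist already.

```python
# A minimal sketch (assumed identifiers): move a cluster to HSM encryption and rotate keys.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

redshift.modify_cluster(
    ClusterIdentifier="sensitive-data-cluster",           # hypothetical cluster
    Encrypted=True,
    HsmClientCertificateIdentifier="hsm-client-cert",     # hypothetical, created beforehand
    HsmConfigurationIdentifier="hsm-configuration",       # hypothetical, created beforehand
)

# Rotate the encryption keys (also available via the AWS CLI: `aws redshift rotate-encryption-key`).
redshift.rotate_encryption_key(ClusterIdentifier="sensitive-data-cluster")
```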

Question 13 – To demonstrate the capabilities of Amazon SageMaker, a company plans to use a subset of its 3 TB data warehouse for a machine learning (ML) proof of concept. AWS Direct Connect is set up and tested as part of the project. Data analysts are collecting and curating the data to prepare it for ML, and need to complete a number of tasks, including mapping, dropping null fields, resolving choices, and splitting fields. The company requires a fast solution for curating the data for this project. Which solution meets these requirements?
  • A. Ingesting data into Amazon S3 using AWS DataSync and using Apache Spark scripts for curating the data in an Amazon EMR cluster. Storing the curated data in Amazon S3 for ML processing.
  • B. Creating custom ETL jobs on-premises for curating the data. Using AWS DMS for ingesting data into Amazon S3 for ML processing.
  • C. Ingesting data into Amazon S3 using AWS DMS. Using AWS Glue for performing data curation and storing the data in Amazon S3 for ML processing.
  • D. Taking a full backup of the data store and shipping the backup files using AWS Snowball. Uploading Snowball data into Amazon S3 and scheduling data curation jobs using AWS Batch for preparing the data for ML.

Correct Answer: C
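
The curation tasks listed in the question map onto AWS Glue built-in transforms. The script skeleton below is a hedged sketch of that mapping; the catalog database, table, field names, and output bucket are hypothetical.

```python
# A minimal AWS Glue ETL script sketch (assumed catalog, field, and bucket names).
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, DropNullFields, ResolveChoice, SplitFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the data that AWS DMS landed in Amazon S3 (already cataloged by a crawler).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="ml_poc", table_name="warehouse_extract")       # hypothetical catalog entries

mapped = ApplyMapping.apply(frame=raw, mappings=[
    ("cust_id", "string", "customer_id", "string"),          # hypothetical field mapping
    ("amt", "double", "amount", "double"),
])
cleaned = DropNullFields.apply(frame=mapped)
resolved = ResolveChoice.apply(frame=cleaned, choice="cast:double")
# SplitFields returns a DynamicFrameCollection; only the resolved frame is written below for brevity.
split_frames = SplitFields.apply(frame=resolved, paths=["customer_id"],
                                 name1="ids_only", name2="everything_else")

glue_context.write_dynamic_frame.from_options(
    frame=resolved,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/ml/"},   # hypothetical bucket
    format="parquet",
)
job.commit()
```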

Question 14 – A US-based sneaker retail company has launched its global website. Detailed historical transaction data is stored in Amazon Redshift and all the transaction data is stored in Amazon RDS in the us-east-1 Region. The business intelligence (BI) team wants to provide a sneaker trend dashboard to enhance the user experience. The BI team plans to use Amazon QuickSight for rendering the website dashboards. During the development process, a team in Japan provisioned Amazon QuickSight in ap-northeast-1. Connecting Amazon QuickSight to Amazon Redshift from ap-northeast-1 is proving difficult. Which of the given solutions would solve this issue and meet the requirements?
  • A. In the Amazon Redshift console, configuring cross-Region snapshots and setting the destination Region as ap-northeast-1. Restoring the Amazon Redshift Cluster from the snapshot and connecting to Amazon QuickSight launched in ap-northeast-1.
  • B. Creating a VPC endpoint from the Amazon QuickSight VPC to the Amazon Redshift VPC so Amazon QuickSight can access data from Amazon Redshift.
  • C. Creating an Amazon Redshift endpoint connection string with Region information in the string and using this connection string in Amazon QuickSight for connecting to Amazon Redshift.
  • D. Creating a new security group for Amazon Redshift in us-east-1 with an inbound rule authorizing access from the appropriate IP address range for the Amazon QuickSight servers in ap-northeast-1.

Correct Answer: B

Question 15 – An airline has .csv-formatted data stored in Amazon S3 with an AWS Glue Data Catalog. As part of a batch process, analysts wish to join this data with call center data stored in Amazon Redshift. There is already a heavy load on the Amazon Redshift cluster. To minimize the load on the existing Amazon Redshift cluster, the solution must be managed, serverless, and well-performing. The solution should also require minimal effort and development activity. Which solution will meet these requirements?
  • A. Unloading the call center data from Amazon Redshift to Amazon S3 using an AWS Lambda function. Performing the join with AWS Glue ETL scripts.
  • B. Exporting the call center data from Amazon Redshift using a Python shell in AWS Glue. Performing the join with AWS Glue ETL scripts.
  • C. Creating an external table using Amazon Redshift Spectrum for the call center data and performing the join with Amazon Redshift.
  • D. Exporting the call center data from Amazon Redshift to Amazon EMR using Apache Sqoop. Performing the join with Apache Hive.

Correct Answer: C
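
One common way to apply the Redshift Spectrum pattern chosen above is to expose the S3 data already cataloged in AWS Glue as an external schema and run the join inside Amazon Redshift. The SQL below is a sketch submitted through the Redshift Data API; the schema, table, cluster, and IAM role names are hypothetical.

```python
# A minimal sketch (assumed names/ARNs): create a Spectrum external schema and join it with a local table.
import boto3

create_schema_sql = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS airline_s3
FROM DATA CATALOG
DATABASE 'airline_catalog_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
"""

join_sql = """
SELECT b.flight_id, b.passenger_count, c.call_count
FROM airline_s3.bookings b
JOIN call_center_calls c ON b.flight_id = c.flight_id;
"""

rsd = boto3.client("redshift-data", region_name="us-east-1")
for statement in (create_schema_sql, join_sql):
    rsd.execute_statement(
        ClusterIdentifier="callcenter-cluster",   # hypothetical cluster
        Database="dev", DbUser="analyst", Sql=statement)
```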

Question 16 – Data analysts use Amazon QuickSight to visualize data generated by multiple applications. Each application stores its files in a separate Amazon S3 bucket, and all application data in Amazon S3 is cataloged in the AWS Glue Data Catalog. After updating the catalog to include a new application’s data source (stored in its own S3 bucket), the data analyst created an Amazon QuickSight data source from an Amazon Athena table, but SPICE was unable to import it. How should the data analyst resolve the issue?
  • A. Editing the permissions for the AWS Glue Data Catalog from within the Amazon QuickSight console.
  • B. Editing the permissions for the new S3 bucket from within the Amazon QuickSight console.
  • C. Editing the permissions for the AWS Glue Data Catalog from within the AWS Glue console.
  • D. Editing the permissions for the new S3 bucket from within the S3 console.

Correct Answer: B

Reference: https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/

Question 17 – For their new investment strategy, a team of data scientists will analyze market trend data. The trend data is derived in large volumes from five different sources. As part of their use case, the team would like to use Amazon Kinesis to analyze trends and send notifications based on significant patterns found in the trends, using SQL-like queries for the analysis. The data scientists would also like to archive the data on Amazon S3 and re-process it historically whenever feasible using AWS-managed services. The team requires the lowest-cost solution. Which is the most suitable solution?
  • A. Publishing data to one Kinesis data stream. Deploying a custom application using the Kinesis Client Library (KCL) to analyze trends, and for sending notifications using Amazon SNS. Configuring Kinesis Data Firehose on the Kinesis data stream for persisting data to an S3 bucket.
  • B. Publishing data to one Kinesis data stream. Deploying Kinesis Data Analytics to the stream for analyzing trends, and configuring an AWS Lambda function as an output for sending notifications using Amazon SNS. Configuring Kinesis Data Firehose on the Kinesis data stream for persisting data to an S3 bucket.
  • C. Publishing data to two Kinesis data streams. Deploying Kinesis Data Analytics to the first stream for analyzing trends, and configuring an AWS Lambda function as an output for sending notifications using Amazon SNS. Configuring Kinesis Data Firehose on the second Kinesis data stream to persist data to an S3 bucket.
  • D. Publishing data to two Kinesis data streams. Deploying a custom application using the Kinesis Client Library (KCL) to the first stream for analyzing trends, and sending notifications using Amazon SNS. Configuring Kinesis Data Firehose on the second Kinesis data stream to persist data to an S3 bucket.

Correct Answer: A
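
The archival leg shared by the options above (persisting the stream to Amazon S3 through Kinesis Data Firehose) might be wired up as sketched below; the stream, bucket, and role ARNs are hypothetical.

```python
# A minimal sketch (assumed ARNs): a Firehose delivery stream that reads the Kinesis stream and archives to S3.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="trend-archive",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/market-trends",  # hypothetical
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseReadStreamRole",                 # hypothetical
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseToS3Role",   # hypothetical
        "BucketARN": "arn:aws:s3:::example-trend-archive",              # hypothetical
        "Prefix": "raw/",
    },
)
```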

Question 18 – A company is currently using Amazon Athena to query its global datasets. The regional data is stored in Amazon S3 in the us-east-1 and us-west-2 Regions, and the data is not encrypted. To simplify the query process and manage it centrally, the company wants to use Athena in us-west-2 to query data from Amazon S3 in both Regions. Which solution achieves this goal at the lowest cost?
  • A. Using AWS DMS for migrating the AWS Glue Data Catalog from us-east-1 to us-west-2. Running Athena queries in us-west-2.
  • B. Running the AWS Glue crawler in us-west-2 to catalog datasets in all Regions. Running Athena queries in us-west-2, once the data is crawled.
  • C. Enabling cross-Region replication for the S3 buckets in us-east-1 for replicating data in us-west-2. Running the AWS Glue crawler there for updating the AWS Glue Data Catalog in us-west-2 and running Athena queries, once the data is replicated in us-west-2.
  • D. Updating AWS Glue resource policies for providing us-east-1 AWS Glue Data Catalog access to us-west-2. Once the catalog in us-west-2 has access to the catalog in us-east-1, running Athena queries in us-west-2.

Correct Answer: C

Question 19 – Amazon EC2 is used by a large company to receive files from external parties throughout the day. After the data has been compiled, the file is compressed into a gzip file and uploaded to Amazon S3. There are approximately 100 GB of files every day. After the files get uploaded to Amazon S3, an AWS Batch program starts executing a COPY command for loading the files into an Amazon Redshift cluster. Which program modification would help accelerate this COPY process?
  • A. Uploading the individual files to Amazon S3 and running the COPY command as soon as the files become available.
  • B. Splitting the number of files so they are equal to a multiple of the number of slices in the Amazon Redshift cluster. Gzipping and uploading the files to Amazon S3. Running the COPY command on the files.
  • C. Splitting the number of files so they are equal to a multiple of the number of compute nodes in the Amazon Redshift cluster. Gzipping and uploading the files to Amazon S3. Running the COPY command on the files.
  • D. Applying sharding by breaking up the files so that rows with the same distribution key (DISTKEY) column values go to the same file. Gzipping and uploading the sharded files to Amazon S3. Running the COPY command on the files.

Correct Answer: B

Reference: https://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
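
A rough sketch of the splitting step in option B is shown below: the daily extract is broken into gzip parts whose count is a multiple of the cluster’s slice count so COPY can load them in parallel. The file name and slice count are assumptions.

```python
# A minimal sketch (assumed file name and slice count): split one extract into gzip parts for parallel COPY.
import gzip
from pathlib import Path

SLICES = 8                      # assumed number of slices in the Redshift cluster
PARTS = SLICES * 2              # a multiple of the slice count

lines = Path("daily_extract.csv").read_text().splitlines(keepends=True)
chunk = -(-len(lines) // PARTS)              # ceiling division: lines per part

for i in range(PARTS):
    with gzip.open(f"daily_extract_part{i:02d}.csv.gz", "wt") as out:
        out.writelines(lines[i * chunk:(i + 1) * chunk])

# Each part is then uploaded to S3 and loaded with one COPY command that points at the
# shared key prefix, e.g. COPY ... FROM 's3://example-bucket/daily_extract_part' GZIP.
```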

Question 20 – A large ride-sharing company has thousands of drivers globally serving millions of unique customers every day. The company plans to migrate an existing data mart to Amazon Redshift. The existing schema includes the following tables.
  • A trips fact table for information on completed rides.
  • A drivers dimension table for driver profiles.
  • A customers fact table holding customer profile information.
The trip details are analyzed by date and destination to examine profitability by region. The drivers’ data rarely changes, but the customers’ data changes frequently. Which of the given table designs would provide optimal query performance?
  • A. Using DISTSTYLE KEY (destination) for the trips table and sorting by date. Using DISTSTYLE ALL for the driver’s and customers’ tables.
  • B. Using DISTSTYLE EVEN for the trips table and sorting by date. Using DISTSTYLE ALL for the drivers’ table. Using DISTSTYLE EVEN for the customers’ table.
  • C. Using DISTSTYLE KEY (destination) for the trips table and sorting by date. Using DISTSTYLE ALL for the drivers’ table. Using DISTSTYLE EVEN for the customers’ table.
  • D. Using DISTSTYLE EVEN for the drivers’ table and sorting by date. Using DISTSTYLE ALL for both fact tables.

Correct Answer: A
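
The DDL implied by the answer above might look like the sketch below, submitted through the Redshift Data API; column lists are abbreviated and all names are hypothetical.

```python
# A minimal sketch (assumed names, abbreviated columns) of the DISTSTYLE/SORTKEY choices in option A.
import boto3

ddl_statements = [
    """CREATE TABLE trips (
           trip_id     BIGINT,
           trip_date   DATE,
           destination VARCHAR(64),
           fare        DECIMAL(10, 2)
       )
       DISTSTYLE KEY
       DISTKEY (destination)
       SORTKEY (trip_date);""",
    """CREATE TABLE drivers (
           driver_id   BIGINT,
           driver_name VARCHAR(128)
       )
       DISTSTYLE ALL;""",
    """CREATE TABLE customers (
           customer_id BIGINT,
           signup_date DATE
       )
       DISTSTYLE ALL;""",
]

rsd = boto3.client("redshift-data", region_name="us-east-1")
for ddl in ddl_statements:
    rsd.execute_statement(
        ClusterIdentifier="rides-cluster",   # hypothetical cluster
        Database="dev", DbUser="analyst", Sql=ddl)
```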
