AWS Certified Data Engineer Associate

The AWS Certified Data Engineer Associate (DEA-C01) exam confirms a candidate’s skill in setting up data pipelines and addressing issues related to cost and performance using best practices. The exam also verifies a candidate’s ability to:

Ingest and transform data, and manage data pipelines with programming concepts.
Opt for the best data store, devise data models, organize data schemas, and handle data lifecycles.
Operate, sustain, and supervise data pipelines.
Evaluate data and guarantee data quality.
Implement suitable authentication, authorization, data encryption, privacy, and governance.
Activate logging.

Target Audience

The ideal candidate should possess around 2–3 years of experience in data engineering. They should grasp how the volume, variety, and velocity of data impact aspects like ingestion, transformation, modeling, security, governance, privacy, schema design, and optimal data store design. Additionally, the candidate should have hands-on experience with AWS services for at least 1–2 years.

Recommended general IT knowledge includes:

Setting up and maintaining extract, transform, and load (ETL) pipelines from ingestion to destination
Application of high-level programming concepts, regardless of language, as required by the pipeline
Utilization of Git commands for source control
Knowledge of data lakes for storing data
General understanding of networking, storage, and compute concepts

Recommended AWS knowledge for the candidate includes:

Knowing how to utilize AWS services to complete the tasks outlined in the Introduction section of this exam guide
Grasping the AWS services related to encryption, governance, protection, and logging for all data within data pipelines
Being able to compare AWS services to comprehend the differences in cost, performance, and functionality
Having the skill to structure and execute SQL queries on AWS services
Understanding how to analyze data, check data quality, and maintain data consistency using AWS services

Exam Details

AWS Data Engineer Associate is an associate-level exam with 85 questions and a time limit of 170 minutes. The exam consists of two types of questions:

Multiple choice: You choose one correct response from four options, including three incorrect ones (distractors).
Multiple response: You pick two or more correct responses from five or more options.

The passing score for the exam is 720. The exam costs 75 USD and is available in English.

Course Outline

This exam course outline contains information about the weightings, content domains, and tasks for the exam. It provides extra details for each task statement to aid in your preparation. The exam is divided into different content domains, each with its own weighting.

Domain 1: Data Ingestion and Transformation

Task Statement 1.1: Perform data ingestion.

Knowledge of:

Throughput and latency characteristics for AWS services that ingest data
Data ingestion patterns (for example, frequency and data history) (AWS Documentation: Data ingestion patterns)
Streaming data ingestion (AWS Documentation: Streaming ingestion)
Batch data ingestion (for example, scheduled ingestion, event-driven ingestion) (AWS Documentation: Data ingestion methods)
Replayability of data ingestion pipelines
Stateful and stateless data transactions

Skills in:

Reading data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift) (AWS Documentation: Streaming ETL jobs in AWS Glue)
Reading data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow) (AWS Documentation: Loading data from Amazon S3)
Implementing appropriate configuration options for batch ingestion
Consuming data APIs (AWS Documentation: Using the Amazon Redshift Data API)
Setting up schedulers by using Amazon EventBridge, Apache Airflow, or time-based schedules for jobs and crawlers (AWS Documentation: Time-based schedules for jobs and crawlers)
Setting up event triggers (for example, Amazon S3 Event Notifications, EventBridge) (AWS Documentation: Using EventBridge)
Calling a Lambda function from Amazon Kinesis (AWS Documentation: Using Lambda with Kinesis Data Streams)
Creating allowlists for IP addresses to allow connections to data sources (AWS Documentation: IP addresses to add to your allow list)
Implementing throttling and overcoming rate limits (for example, DynamoDB, Amazon RDS, Kinesis) (AWS Documentation: Throttling issues for DynamoDB tables using provisioned capacity mode)
Managing fan-in and fan-out for streaming data distribution (AWS Documentation: Developing Enhanced Fan-Out Consumers with the Kinesis Data Streams API)
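
One common way to cope with throttling and rate limits, as listed in the skills above, is to retry with capped exponential backoff and jitter. The following is a minimal sketch in plain Python, not any AWS SDK's retry implementation. `ThrottledError`, `call_with_backoff`, and all parameter names are hypothetical; in real code the exception would be whatever throttling error the client raises.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a service throttling error."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Retry `operation` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so that the retry logic can be exercised without actually waiting.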

Task Statement 1.2: Transform and process data.

Knowledge of:

Creation of ETL pipelines based on business requirements (AWS Documentation: Build an ETL service pipeline)
Volume, velocity, and variety of data (for example, structured data, unstructured data)
Cloud computing and distributed computing (AWS Documentation: What is cloud computing?, What is Distributed Computing?)
How to use Apache Spark to process data (AWS Documentation: Apache Spark)
Intermediate data staging locations

Skills in:

Optimizing container usage for performance needs (for example, Amazon Elastic Kubernetes Service [Amazon EKS], Amazon Elastic Container Service [Amazon ECS])
Connecting to different data sources (for example, Java Database Connectivity [JDBC], Open Database Connectivity [ODBC]) (AWS Documentation: Connecting to Amazon Athena with ODBC and JDBC drivers)
Integrating data from multiple sources (AWS Documentation: What is Data Integration?)
Optimizing costs while processing data (AWS Documentation: Cost optimization)
Implementing data transformation services based on requirements (for example, Amazon EMR, AWS Glue, Lambda, Amazon Redshift)
Transforming data between formats (for example, from .csv to Apache Parquet) (AWS Documentation: Three AWS Glue ETL job types for converting data to Apache Parquet)
Troubleshooting and debugging common transformation failures and performance issues (AWS Documentation: Troubleshooting resources)
Creating data APIs to make data available to other systems by using AWS services (AWS Documentation: Using RDS Data API)
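
As a small illustration of transforming data between formats, the sketch below converts CSV text to JSON Lines using only the Python standard library. This is a deliberate substitution: the skill above names Apache Parquet, which needs a third-party library or a service such as AWS Glue, so JSON Lines stands in as the target format here. The function name `csv_to_json_lines` is hypothetical.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Convert CSV text with a header row into JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Each CSV row becomes one JSON object per line; values stay strings.
    return "\n".join(json.dumps(row) for row in reader)
```

Note that every value remains a string after this conversion; a real pipeline would usually apply a schema so that numeric and date columns are typed before writing a columnar format.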

Task Statement 1.3: Orchestrate data pipelines.

Knowledge of:

How to integrate various AWS services to create ETL pipelines
Event-driven architecture (AWS Documentation: Event-driven architectures)
How to configure AWS services for data pipelines based on schedules or dependencies (AWS Documentation: What is AWS Data Pipeline?)
Serverless workflows

Skills in:

Using orchestration services to build workflows for data ETL pipelines (for example, Lambda, EventBridge, Amazon Managed Workflows for Apache Airflow [Amazon MWAA], AWS Step Functions, AWS Glue workflows) (AWS Documentation: Migrating workloads from AWS Data Pipeline to Step Functions, Workflow orchestration)
Building data pipelines for performance, availability, scalability, resiliency, and fault tolerance (AWS Documentation: Building a reliable data pipeline)
Implementing and maintaining serverless workflows (AWS Documentation: Developing with a serverless workflow)
Using notification services to send alerts (for example, Amazon Simple Notification Service [Amazon SNS], Amazon Simple Queue Service [Amazon SQS]) (AWS Documentation: Getting started with Amazon SNS)
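
To make the orchestration skills above more concrete, here is a rough sketch of a two-step workflow written in Amazon States Language, the JSON format that AWS Step Functions uses for state machine definitions. The definition is built as a Python dictionary and serialized to JSON. The Lambda ARNs, the state names, and the helper `build_etl_state_machine` are hypothetical placeholders, and the retry settings are illustrative values rather than recommendations.

```python
import json

# Hypothetical Lambda function ARNs; replace with real ones.
EXTRACT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:extract"
LOAD_ARN = "arn:aws:lambda:us-east-1:123456789012:function:load"


def build_etl_state_machine(extract_arn, load_arn):
    """Return an Amazon States Language definition for a two-step pipeline."""
    return {
        "Comment": "Illustrative two-step ETL workflow",
        "StartAt": "Extract",
        "States": {
            "Extract": {
                "Type": "Task",
                "Resource": extract_arn,
                # Retry failed tasks with an increasing wait between attempts.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "Load",
            },
            "Load": {"Type": "Task", "Resource": load_arn, "End": True},
        },
    }


definition_json = json.dumps(
    build_etl_state_machine(EXTRACT_ARN, LOAD_ARN), indent=2
)
```

The resulting JSON string is the kind of document that would be supplied as the state machine definition when creating a workflow.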

Task Statement 1.4: Apply programming concepts.

Knowledge of:

Continuous integration and continuous delivery (CI/CD) (implementation, testing, and deployment of data pipelines) (AWS Documentation: Continuous delivery and continuous integration)
SQL queries (for data source queries and data transformations) (AWS Documentation: Using a SQL query to transform data)
Infrastructure as code (IaC) for repeatable deployments (for example, AWS Cloud Development Kit [AWS CDK], AWS CloudFormation) (AWS Documentation: Infrastructure as code)
Distributed computing (AWS Documentation: What is Distributed Computing?)
Data structures and algorithms (for example, graph data structures and tree data structures)
SQL query optimization

Skills in:

Optimizing code to reduce runtime for data ingestion and transformation (AWS Documentation: Code optimization)
Configuring Lambda functions to meet concurrency and performance needs (AWS Documentation: Understanding Lambda function scaling, Configuring reserved concurrency for a function)
Performing SQL queries to transform data (for example, Amazon Redshift stored procedures) (AWS Documentation: Overview of stored procedures in Amazon Redshift)
Structuring SQL queries to meet data pipeline requirements
Using Git commands to perform actions such as creating, updating, cloning, and branching repositories (AWS Documentation: Basic Git commands)
Using the AWS Serverless Application Model (AWS SAM) to package and deploy serverless data pipelines (for example, Lambda functions, Step Functions, DynamoDB tables) (AWS Documentation: What is the AWS Serverless Application Model (AWS SAM)?)
Using and mounting storage volumes from within Lambda functions (AWS Documentation: Configuring file system access for Lambda functions)

Domain 2: Data Store Management

Task Statement 2.1: Choose a data store.

Knowledge of:

Storage platforms and their characteristics (AWS Documentation: Storage)
Storage services and configurations for specific performance demands
Data storage formats (for example, .csv, .txt, Parquet) (AWS Documentation: Data format options for inputs and outputs in AWS Glue for Spark)
How to align data storage with data migration requirements (AWS Documentation: AWS managed migration tools)
How to determine the appropriate storage solution for specific access patterns (AWS Documentation: Choose the optimal storage based on access patterns, data growth, and the performance requirements)
How to manage locks to prevent access to data (for example, Amazon Redshift, Amazon RDS) (AWS Documentation: LOCK)

Skills in:

Implementing the appropriate storage services for specific cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, DynamoDB, Amazon Kinesis Data Streams, Amazon MSK) (AWS Documentation: Streaming ingestion)
Configuring the appropriate storage services for specific access patterns and requirements (for example, Amazon Redshift, Amazon EMR, Lake Formation, Amazon RDS, DynamoDB) (AWS Documentation: What is AWS Lake Formation?, Querying external data using Amazon Redshift Spectrum)
Applying storage services to appropriate use cases (for example, Amazon S3) (AWS Documentation: What is Amazon S3?)
Integrating migration tools into data processing systems (for example, AWS Transfer Family)
Implementing data migration or remote access methods (for example, Amazon Redshift federated queries, Amazon Redshift materialized views, Amazon Redshift Spectrum) (AWS Documentation: Querying data with federated queries in Amazon Redshift)

Task Statement 2.2: Understand data cataloging systems.

Knowledge of:

How to create a data catalog (AWS Documentation: Getting started with the AWS Glue Data Catalog)
Data classification based on requirements (AWS Documentation: Data classification models and schemes)
Components of metadata and data catalogs (AWS Documentation: AWS Glue Data Catalog)

Skills in:

Using data catalogs to consume data from the data’s source (AWS Documentation: Data discovery and cataloging in AWS Glue)
Building and referencing a data catalog (for example, AWS Glue Data Catalog, Apache Hive metastore) (AWS Documentation: Using the AWS Glue Data Catalog as the metastore for Hive)
Discovering schemas and using AWS Glue crawlers to populate data catalogs (AWS Documentation: Using crawlers to populate the Data Catalog)
Synchronizing partitions with a data catalog (AWS Documentation: Best practices when using Athena with AWS Glue)
Creating new source or target connections for cataloging (for example, AWS Glue) (AWS Documentation: Configuring data target nodes)
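
Partition synchronization is easier to reason about when data is laid out with Hive-style `key=value` prefixes, a layout that AWS Glue crawlers and Athena can recognize as partitions. The sketch below only builds such a prefix for a given date; the bucket name in the usage note and the function `partition_prefix` are hypothetical.

```python
from datetime import date


def partition_prefix(base_prefix, day):
    """Build a Hive-style year/month/day partition prefix for `day`."""
    return (
        f"{base_prefix.rstrip('/')}/"
        f"year={day:%Y}/month={day:%m}/day={day:%d}/"
    )
```

For example, `partition_prefix("s3://example-bucket/events/", date(2024, 1, 5))` yields `s3://example-bucket/events/year=2024/month=01/day=05/`.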

Task Statement 2.3: Manage the lifecycle of data.

Knowledge of:

Appropriate storage solutions to address hot and cold data requirements (AWS Documentation: Cold storage for Amazon OpenSearch Service)
How to optimize the cost of storage based on the data lifecycle (AWS Documentation: Storage optimization services)
How to delete data to meet business and legal requirements
Data retention policies and archiving strategies (AWS Documentation: Implement data retention policies for each class of data in the analytics workload)
How to protect data with appropriate resiliency and availability (AWS Documentation: Data protection in AWS Resilience Hub)

Skills in:

Performing load and unload operations to move data between Amazon S3 and Amazon Redshift (AWS Documentation: Unloading data to Amazon S3)
Managing S3 Lifecycle policies to change the storage tier of S3 data (AWS Documentation: Managing your storage lifecycle)
Expiring data when it reaches a specific age by using S3 Lifecycle policies (AWS Documentation: Expiring objects)
Managing S3 versioning and DynamoDB TTL (AWS Documentation: Time to Live (TTL))
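
The S3 Lifecycle skills above (changing the storage tier and expiring data at a specific age) can be sketched as a single lifecycle rule. The dictionary below follows the structure that, to my understanding, boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument; no AWS call is made here. The helper name, rule ID, prefix, and day counts are hypothetical.

```python
def build_lifecycle_configuration(rule_id, prefix, archive_after_days,
                                  expire_after_days):
    """Build one rule that archives objects, then expires them later."""
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the archive transition")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},  # rule applies under this prefix
                "Status": "Enabled",
                # Move objects to a colder storage class after N days.
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete objects once they reach the expiration age.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }
```

A call such as `build_lifecycle_configuration("raw-logs", "raw/logs/", 30, 365)` describes a rule that archives objects under `raw/logs/` after 30 days and deletes them after a year.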

Task Statement 2.4: Design data models and schema evolution.

Knowledge of:

Data modeling concepts (AWS Documentation: Data-modeling process steps)
How to ensure accuracy and trustworthiness of data by using data lineage
Best practices for indexing, partitioning strategies, compression, and other data optimization techniques (AWS Documentation: Optimize your data modeling and data storage for efficient data retrieval)
How to model structured, semi-structured, and unstructured data (AWS Documentation: What’s The Difference Between Structured Data And Unstructured Data?)
Schema evolution techniques (AWS Documentation: Handling schema updates)

Skills in:

Designing schemas for Amazon Redshift, DynamoDB, and Lake Formation (AWS Documentation: CREATE SCHEMA)
Addressing changes to the characteristics of data (AWS Documentation: Disaster recovery options in the cloud)
Performing schema conversion (for example, by using the AWS Schema Conversion Tool [AWS SCT] and AWS DMS Schema Conversion) (AWS Documentation: Converting database schemas using DMS Schema Conversion)
Establishing data lineage by using AWS tools (for example, Amazon SageMaker ML Lineage Tracking)

Domain 3: Data Operations and Support

Task Statement 3.1: Automate data processing by using AWS services.

Knowledge of:

How to maintain and troubleshoot data processing for repeatable business outcomes (AWS Documentation: Define recovery objectives to maintain business continuity)
API calls for data processing
Which services accept scripting (for example, Amazon EMR, Amazon Redshift, AWS Glue) (AWS Documentation: What is AWS Glue?)

Skills in:

Orchestrating data pipelines (for example, Amazon MWAA, Step Functions) (AWS Documentation: Workflow orchestration)
Troubleshooting Amazon managed workflows (AWS Documentation: Troubleshooting Amazon Managed Workflows for Apache Airflow)
Calling SDKs to access Amazon features from code (AWS Documentation: Code examples by SDK using AWS SDKs)
Using the features of AWS services to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue)
Consuming and maintaining data APIs (AWS Documentation: API management)
Preparing data transformation (for example, AWS Glue DataBrew) (AWS Documentation: What is AWS Glue DataBrew?)
Querying data (for example, Amazon Athena)
Using Lambda to automate data processing (AWS Documentation: AWS Lambda)
Managing events and schedulers (for example, EventBridge) (AWS Documentation: What is Amazon EventBridge Scheduler?)

Task Statement 3.2: Analyze data by using AWS services.

Knowledge of:

Tradeoffs between provisioned services and serverless services (AWS Documentation: Understanding serverless architectures)
SQL queries (for example, SELECT statements with multiple qualifiers or JOIN clauses) (AWS Documentation: Subquery examples)
How to visualize data for analysis (AWS Documentation: Analysis and visualization)
When and how to apply cleansing techniques
Data aggregation, rolling average, grouping, and pivoting (AWS Documentation: Aggregate functions, Using pivot tables)

Skills in:

Visualizing data by using AWS services and tools (for example, AWS Glue DataBrew, Amazon QuickSight)
Verifying and cleaning data (for example, Lambda, Athena, QuickSight, Jupyter Notebooks, Amazon SageMaker Data Wrangler)
Using Athena to query data or to create views (AWS Documentation: Working with views)
Using Athena notebooks that use Apache Spark to explore data (AWS Documentation: Using Apache Spark in Amazon Athena)
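
The SQL knowledge listed for this task (SELECT statements with multiple qualifiers, JOIN clauses, aggregation, and grouping) can be practiced locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine, so the SQL dialect is SQLite's rather than that of Athena or Amazon Redshift. The tables, columns, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
INSERT INTO orders VALUES
    (10, 1, 20.0), (11, 1, 30.0), (12, 2, 15.0), (13, 3, 50.0);
""")

# Join orders to customers, filter with two qualifiers, aggregate by region.
query = """
SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.amount >= 10 AND c.region IN ('EU', 'US')
GROUP BY c.region
ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
# rows == [('EU', 3, 100.0), ('US', 1, 15.0)]
```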

Task Statement 3.3: Maintain and monitor data pipelines.

Knowledge of:

How to log application data (AWS Documentation: What is Amazon CloudWatch Logs?)
Best practices for performance tuning (AWS Documentation: Best practices for performance tuning AWS Glue for Apache Spark jobs)
How to log access to AWS services (AWS Documentation: Enabling logging from AWS services)
Amazon Macie, AWS CloudTrail, and Amazon CloudWatch

Skills in:

Extracting logs for audits (AWS Documentation: Logging and monitoring in AWS Audit Manager)
Deploying logging and monitoring solutions to facilitate auditing and traceability (AWS Documentation: Designing and implementing logging and monitoring with Amazon CloudWatch)
Using notifications during monitoring to send alerts
Troubleshooting performance issues
Using CloudTrail to track API calls (AWS Documentation: AWS CloudTrail)
Troubleshooting and maintaining pipelines (for example, AWS Glue, Amazon EMR) (AWS Documentation: Building a reliable data pipeline)
Using Amazon CloudWatch Logs to log application data (with a focus on configuration and automation)
Analyzing logs with AWS services (for example, Athena, Amazon EMR, Amazon OpenSearch Service, CloudWatch Logs Insights, big data application logs) (AWS Documentation: Analyzing log data with CloudWatch Logs Insights)
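
For logging application data in a form that is easy to analyze later, one option is to emit each log record as a JSON object, since CloudWatch Logs Insights can discover fields in JSON log events. The sketch below uses only Python's standard `logging` module and writes to an arbitrary stream; shipping the output to CloudWatch Logs is outside its scope. `JsonFormatter`, `build_logger`, and the `pipeline` and `batch_id` fields are hypothetical conventions.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: pipeline context passed via `extra=`.
        for key in ("pipeline", "batch_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(stream, name="pipeline"):
    """Return a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger(sys.stdout).info("loaded %d rows", 42, extra={"pipeline": "orders"})` would then produce one JSON line per event.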

Task Statement 3.4: Ensure data quality.

Knowledge of:

Data sampling techniques (AWS Documentation: Using Spigot to sample your dataset)
How to implement data skew mechanisms (AWS Documentation: Data skew)
Data validation (data completeness, consistency, accuracy, and integrity)
Data profiling

Skills in:

Running data quality checks while processing the data (for example, checking for empty fields) (AWS Documentation: Data Quality Definition Language (DQDL) reference)
Defining data quality rules (for example, AWS Glue DataBrew) (AWS Documentation: Validating data quality in AWS Glue DataBrew)
Investigating data consistency (for example, AWS Glue DataBrew) (AWS Documentation: What is AWS Glue DataBrew?)
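
A data quality check such as "checking for empty fields" can be sketched in a few lines of plain Python. This is a generic completeness check, not AWS Glue Data Quality or DataBrew syntax; the function name `find_incomplete_rows` and the record layout are hypothetical.

```python
def find_incomplete_rows(records, required_fields):
    """Return (index, missing_fields) for records with empty required fields."""
    problems = []
    for index, record in enumerate(records):
        missing = [
            field
            for field in required_fields
            # Treat absent keys, None, and whitespace-only values as empty.
            if record.get(field) is None
            or str(record.get(field)).strip() == ""
        ]
        if missing:
            problems.append((index, missing))
    return problems
```

A pipeline could run a check like this on each batch and route failing rows to a quarantine location instead of loading them.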

Domain 4: Data Security and Governance

Task Statement 4.1: Apply authentication mechanisms.

Knowledge of:

VPC security networking concepts (AWS Documentation: What is Amazon VPC?)
Differences between managed services and unmanaged services
Authentication methods (password-based, certificate-based, and role-based) (AWS Documentation: Authentication methods)
Differences between AWS managed policies and customer managed policies (AWS Documentation: Managed policies and inline policies)

Skills in:

Updating VPC security groups (AWS Documentation: Security group rules)
Creating and updating IAM groups, roles, endpoints, and services (AWS Documentation: IAM Identities (users, user groups, and roles))
Creating and rotating credentials for password management (for example, AWS Secrets Manager) (AWS Documentation: Password management with Amazon RDS and AWS Secrets Manager)
Setting up IAM roles for access (for example, Lambda, Amazon API Gateway, AWS CLI, CloudFormation)
Applying IAM policies to roles, endpoints, and services (for example, S3 Access Points, AWS PrivateLink) (AWS Documentation: Configuring IAM policies for using access points)

Task Statement 4.2: Apply authorization mechanisms.

Knowledge of:

Authorization methods (role-based, policy-based, tag-based, and attribute-based) (AWS Documentation: What is ABAC for AWS?)
Principle of least privilege as it applies to AWS security
Role-based access control and expected access patterns (AWS Documentation: Types of access control)
Methods to protect data from unauthorized access across services (AWS Documentation: Mitigating Unauthorized Access to Data)

Skills in:

Creating custom IAM policies when a managed policy does not meet the needs (AWS Documentation: Creating IAM policies (console))
Storing application and database credentials (for example, Secrets Manager, AWS Systems Manager Parameter Store) (AWS Documentation: AWS Systems Manager Parameter Store)
Providing database users, groups, and roles access and authority in a database (for example, for Amazon Redshift) (AWS Documentation: Example for controlling user and group access)
Managing permissions through Lake Formation (for Amazon Redshift, Amazon EMR, Athena, and Amazon S3) (AWS Documentation: Managing Lake Formation permissions)
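
When a managed policy grants more than a workload needs, a custom IAM policy can narrow access in line with the principle of least privilege. The sketch below builds a read-only policy document for a single S3 prefix as a Python dictionary. The bucket name, prefix, statement IDs, and the helper `build_read_only_prefix_policy` are hypothetical, and the policy is an illustration rather than a reviewed security baseline.

```python
import json


def build_read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy allowing list and read under one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys that start with the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy_json = json.dumps(
    build_read_only_prefix_policy("example-data-lake", "curated/sales/"),
    indent=2,
)
```

The policy grants no write or delete actions, so a role that uses it can read the curated data but cannot modify it.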

Task Statement 4.3: Ensure data encryption and masking.

Knowledge of:

Data encryption options available in AWS analytics services (for example, Amazon Redshift, Amazon EMR, AWS Glue) (AWS Documentation: Data Encryption)
Differences between client-side encryption and server-side encryption (AWS Documentation: Client-side and server-side encryption)
Protection of sensitive data (AWS Documentation: Data protection in AWS Resource Groups)
Data anonymization, masking, and key salting

Skills in:

Applying data masking and anonymization according to compliance laws or company policies
Using encryption keys to encrypt or decrypt data (for example, AWS Key Management Service [AWS KMS]) (AWS Documentation: Encrypting and decrypting data keys)
Configuring encryption across AWS account boundaries (AWS Documentation: Allowing users in other accounts to use a KMS key)
Enabling encryption in transit for data.
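
Masking and salted hashing, both named in this task, can be illustrated without any AWS service. The sketch below uses a keyed hash (HMAC-SHA-256 with a secret value acting as the salt) to pseudonymize an identifier, and a simple rule to mask an email address for display. Both function names are hypothetical, and whether either approach satisfies a given compliance requirement is an assumption that would need checking against the applicable policy.

```python
import hashlib
import hmac


def pseudonymize(value, salt):
    """Return a keyed SHA-256 digest (HMAC) of `value` as a hex string."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email):
    """Keep the first character and the domain; mask the rest."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        # Not a recognizable address: mask everything.
        return "*" * len(email)
    return local[0] + "*" * (len(local) - 1) + "@" + domain
```

The same input and salt always produce the same digest, so pseudonymized values can still be joined across datasets, while a different salt yields unrelated digests.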

Task Statement 4.4: Prepare logs for audit.

Knowledge of:

How to log application data (AWS Documentation: What is Amazon CloudWatch Logs?)
How to log access to AWS services (AWS Documentation: Enabling logging from AWS services)
Centralized AWS logs (AWS Documentation: Centralized Logging on AWS)

Skills in:

Using CloudTrail to track API calls (AWS Documentation: AWS CloudTrail)
Using CloudWatch Logs to store application logs (AWS Documentation: What is Amazon CloudWatch Logs?)
Using AWS CloudTrail Lake for centralized logging queries (AWS Documentation: Querying AWS CloudTrail logs)
Analyzing logs by using AWS services (for example, Athena, CloudWatch Logs Insights, Amazon OpenSearch Service) (AWS Documentation: Analyzing log data with CloudWatch Logs Insights)
Integrating various AWS services to perform logging (for example, Amazon EMR in cases of large volumes of log data)

Task Statement 4.5: Understand data privacy and governance.

Knowledge of:

How to protect personally identifiable information (PII) (AWS Documentation: Personally identifiable information (PII))
Data sovereignty

Skills in:

Granting permissions for data sharing (for example, data sharing for Amazon Redshift) (AWS Documentation: Sharing data in Amazon Redshift)
Implementing PII identification (for example, Macie with Lake Formation) (AWS Documentation: Data Protection in Lake Formation)
Implementing data privacy strategies to prevent backups or replications of data to disallowed AWS Regions
Managing configuration changes that have occurred in an account (for example, AWS Config) (AWS Documentation: Managing the Configuration Recorder)

AWS Data Engineer Associate Exam FAQs

Check here for FAQs!

AWS Exam Policy

Amazon Web Services (AWS) lays out specific rules and procedures for their certification exams. These guidelines cover various aspects of exam training and certification. Some of the key policies include:

Exam Retake Policy:

If a candidate doesn’t pass the exam, they must wait for 14 days before being eligible for a retake. There’s no limit on the number of attempts until the exam is passed, but the full registration fee is required for each attempt.

Exam Rescheduling:

To reschedule or cancel an exam, follow these steps:

Sign in to aws.training/Certification.
Click on the “Go to your Account” button.
Choose “Manage PSI” or “Pearson VUE Exams.”
You’ll be directed to the PSI or Pearson VUE dashboard.
If the exam is with PSI, click “View Details” for the scheduled exam. If it’s with Pearson VUE, select the exam in the “Upcoming Appointments” menu.
Keep in mind that you can reschedule the exam up to 24 hours before the scheduled time, and each appointment can only be rescheduled twice. To change the date a third time, you must cancel the appointment and then schedule a new one for a suitable date.

AWS Data Engineer Associate Exam Study Guide

AWS Exam Page

AWS provides an exam page that includes the certification’s course outline, an overview, and other key details. This information is prepared by AWS experts to describe the skills tested and to guide candidates through hands-on exercises that reflect exam scenarios. The certification validates proficiency in core data-related AWS services and the ability to implement data pipelines, troubleshoot issues, and optimize cost and performance following best practices. If you’re keen on leveraging AWS technology to transform data for analysis and actionable insights, taking this exam provides an early chance to earn the new certification.

AWS Learning Resources

AWS offers a diverse range of learning resources to cater to individuals at various stages of their cloud computing journey. From beginners seeking foundational knowledge to experienced professionals aiming to refine their skills, AWS provides comprehensive documentation, tutorials, and hands-on labs. The AWS Training and Certification platform offers structured courses led by expert instructors, covering a wide array of topics from cloud fundamentals to specialized domains like machine learning and security. Several of these resources are aimed specifically at the AWS Data Engineer Associate exam.

Join Study Groups

Study groups offer a dynamic and collaborative approach to AWS exam preparation. By joining these groups, you gain access to a community of like-minded individuals who are also navigating the complexities of AWS certifications. Engaging in discussions, sharing experiences, and collectively tackling challenges can provide valuable insights and enhance your understanding of key concepts. Study groups create a supportive environment where members can clarify doubts, exchange tips, and stay motivated throughout their certification journey. This collaborative learning experience not only strengthens your grasp of AWS technologies but also fosters a sense of camaraderie among peers pursuing similar goals.

Use Practice Tests

Incorporating AWS practice tests into your preparation strategy is essential for achieving exam success. These practice tests simulate the actual exam environment, allowing you to assess your knowledge, identify areas for improvement, and familiarize yourself with the types of questions you may encounter. Regularly taking practice tests helps build confidence, refines your time-management skills, and ensures you are well-prepared for the specific challenges posed by AWS certification exams. The combination of study groups and practice tests creates a well-rounded and effective approach to mastering AWS technologies and earning your certification.
