AWS Certified Data Engineer Associate (DEA-C01)

Complete Exam Cheat Sheet — All 4 Domains

Sources: AWS Certified Data Engineer Associate Exam Guide (DEA-C01, v1.1) · AWS Certified Data Engineer (Mishra et al., O'Reilly)
Exam format: 65 questions total (50 scored + 15 unscored) | 130 minutes | Passing score 720/1000 (scaled)

EXAM WEIGHT SNAPSHOT

Domain	Weight	Approx. Scored Questions
1: Data Ingestion and Transformation	34%	~17
2: Data Store Management	26%	~13
3: Data Operations and Support	22%	~11
4: Data Security and Governance	18%	~9

Memory hook: ISOS = Ingest → Store → Operate → Secure (the four domains in order). The weight drops with each domain; ingestion and transformation is the biggest.

DOMAIN 1: DATA INGESTION AND TRANSFORMATION (34%)

1.1 Domain Overview

Task statements (from exam guide v1.1):

Task 1.1: Perform data ingestion (streaming + batch, schedulers, event triggers, rate limits, fan-in/fan-out, stateful/stateless)
Task 1.2: Transform and process data (containers, JDBC/ODBC, format conversion, cost optimization, troubleshooting)
Task 1.3: Orchestrate data pipelines (Step Functions, MWAA, Glue Workflows, EventBridge, SNS/SQS, fault tolerance)
Task 1.4: Apply programming concepts (Lambda concurrency, IaC, CI/CD, distributed computing, LLMs for data processing)

Core theme: Moving data from source to target, reliably and cost-efficiently, at any velocity.

1.2 Services by Category

Streaming Ingestion & Storage

Service	Primary Use	Key Exam Details	Choose When
Kinesis Data Streams (KDS)	Real-time stream storage + replay	Shard = 1 MB/s in, 2 MB/s out; up to 365-day retention; 70ms latency with enhanced fan-out; 200-500ms without	AWS-native integrations, millisecond latency, replay needed
Amazon MSK (Kafka)	Open-source Kafka stream storage	Lowest latency; tiered storage to S3; longer configurable retention; MSK Connect for CDC; MSK Replicator for cross-region	Open-source ecosystem (Debezium, DuckDB), highest throughput, Kafka expertise on team
Amazon Data Firehose	Near-real-time delivery to data stores	Fully managed, no code; buffers to S3/Redshift/OpenSearch/Splunk/Snowflake; can invoke Lambda for transformation; converts JSON→Parquet/ORC	Delivery to S3/Redshift/OpenSearch with zero ops overhead

KDS vs MSK Decision:

Attribute	Kinesis Data Streams	Amazon MSK
Management	Low	Low (Serverless) to Medium (Provisioned)
Scalability	Seconds (one click)	Minutes (one click)
Throughput	On-demand (capped limits)	Highest of the two
Open source	No	Yes (Apache Kafka)
Data retention	Up to 365 days	Longer; configurable; S3 tiered storage
Latency	70ms (enhanced fan-out) / 200-500ms	Lowest
CDC from databases	Via AWS DMS	Native via MSK Connect + Debezium

Batch Ingestion

Service	Use Case	Key Details
AWS Glue	Serverless batch ETL	Spark jobs, Python shell, streaming ETL; Glue Bookmarks for incremental loads; connectors to RDS, Redshift, Salesforce, S3
Amazon EMR	Large-scale big data	Hadoop, Spark, Hive, Flink, Presto; EMR on EC2/Serverless/EKS; use S3 not HDFS for persistent storage
AWS DMS	Database migration + CDC	Sources: Oracle, MySQL, PostgreSQL, MongoDB, S3; Targets: Redshift, S3, DynamoDB, Kafka, OpenSearch
Amazon AppFlow	SaaS-to-AWS batch ingestion	No code; Salesforce, Slack, SAP to S3/Redshift
AWS DataSync	File transfer (on-prem → AWS)	Copies only changed files after initial seed; preserves metadata/permissions
AWS Data Exchange	Third-party dataset ingestion	Marketplace for licensed data (S&P, weather, FINRA)

Data Transformation

Service	Type	Best For	Not Good For
AWS Glue	Batch + Streaming Spark	Serverless ETL, format conversion, managed Spark	Complex custom frameworks, >1 GB files (split them)
Amazon EMR	Batch + Streaming	Complex transforms, Hadoop ecosystem, full control	Teams wanting zero cluster management
Amazon Redshift	SQL batch	Data warehouse transforms, sub-second analytics, COPY/UNLOAD	Unstructured data, Python-heavy transforms
AWS Lambda	Event-driven lightweight	<15 min transforms, enrichment, glue logic	Large datasets, stateful operations
Amazon MSF (Managed Flink)	Streaming stateful	Complex windowing, exactly-once, out-of-order events	Spark-ecosystem teams, schema evolution
Glue DataBrew	Visual no-code prep	Non-technical personas, data quality, PII detection	Production-scale Spark workloads

Orchestration

Service	Best For	Key Detail
AWS Step Functions	Complex multi-step workflows with branching, retries	State machine; Standard (long-running) vs Express (high-volume, short)
Amazon MWAA (Airflow)	Python DAG-based complex workflows, teams with Airflow expertise	Fully managed Apache Airflow; higher complexity
AWS Glue Workflows	Glue-specific job chains	Triggers: on-demand, scheduled, event-based; simpler than Step Functions
Amazon EventBridge	Event-driven scheduling and routing	Cron schedules, event rules, connects 200+ AWS services
Redshift Scheduler	SQL-only pipeline scheduling within Redshift	Built-in; no external orchestrator needed

1.3 Decision Matrix

"Use X when..." — Batch Transformation

Requirement	Answer
Serverless + familiar Spark API	AWS Glue (Spark job)
Full framework control + custom libraries (Hudi, HBase)	Amazon EMR on EC2
SQL transforms on structured warehouse data	Amazon Redshift
Lightweight <15 min event-driven transforms	AWS Lambda
Spark but no cluster management	EMR Serverless
Non-technical user doing data prep	AWS Glue DataBrew

"Use X when..." — Streaming Transformation

Requirement	Answer
Stateful transforms, exactly-once, out-of-order events	Amazon MSF (Managed Flink)
Simple stateless Spark Structured Streaming	AWS Glue Streaming ETL
Lightweight transform before delivery to S3/Redshift	Amazon Data Firehose + Lambda
Low overhead near-real-time delivery	Amazon Data Firehose

"Use X when..." — Orchestration

Requirement	Answer
Branching logic, error handling, retries	AWS Step Functions
Python DAGs, team has Airflow background	Amazon MWAA
Glue-only pipeline with triggers	Glue Workflows
Cron or event-based trigger	Amazon EventBridge

1.4 Key Concepts Deep Dive

Kinesis Data Streams Architecture

Shard = fundamental unit: 1 MB/s or 1,000 records/s write, 2 MB/s read (standard, shared by all consumers), up to 5 read transactions/s
Enhanced fan-out: each consumer gets 2 MB/s dedicated → 70ms latency (standard polling = 200-500ms)
Record: max 1 MB, contains partition key + sequence number + data blob
Producers: KPL (Kinesis Producer Library), Kinesis Agent, AWS SDK, IoT Core, DMS, CloudWatch
Consumers: KCL (Kinesis Consumer Library), Lambda, Data Firehose, MSF, Glue Streaming
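
The per-shard limits above translate into a rough sizing rule. The following is a minimal sketch, assuming provisioned mode and standard (shared) consumers. The function name `required_shards` and its parameters are hypothetical, and the 1,000 records/s write cap per shard is the documented KDS limit rather than something stated in the original list.

```python
import math

# Per-shard limits for provisioned Kinesis Data Streams.
WRITE_MB_PER_S = 1.0        # ingest bandwidth per shard
WRITE_RECORDS_PER_S = 1000  # ingest record rate per shard
READ_MB_PER_S = 2.0         # read bandwidth per shard, shared by standard consumers


def required_shards(write_mb_per_s: float, records_per_s: float,
                    read_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_s / WRITE_MB_PER_S)
    by_write_records = math.ceil(records_per_s / WRITE_RECORDS_PER_S)
    by_read_bytes = math.ceil(read_mb_per_s / READ_MB_PER_S)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # 5 MB/s in, 2,000 records/s, 12 MB/s of total standard-consumer reads
    print(required_shards(5, 2000, 12))  # prints 6 (the read side dominates)
```

In this example the read side drives the shard count, which is the situation where enhanced fan-out is usually the cheaper fix than adding shards.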

Glue Worker Types (exam-tested)

Worker	vCPU	Memory	Use
G.025X	2	4 GB	Low-volume streaming (Glue 3.0 or later)
G.1X	4	16 GB	Standard batch jobs
G.2X	8	32 GB	Lightweight transforms (default recommendation)
G.4X	16	64 GB	Demanding aggregations, joins
G.8X	32	128 GB	Most demanding transforms

DPU = 4 vCPU + 16 GB RAM. Minimum 2 DPUs for Spark jobs. Python shell = 1/16 DPU (cheapest).
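
Glue bills by DPU-hour, so the worker table maps directly to cost. The sketch below is illustrative only: the DPU-per-worker mapping follows from the DPU definition above (a G.2X worker with 8 vCPU and 32 GB is 2 DPUs, and so on), while `price_per_dpu_hour` is a hypothetical input that should come from the current AWS pricing page for your region.

```python
# DPUs per worker, derived from DPU = 4 vCPU + 16 GB RAM.
DPU_PER_WORKER = {
    "G.025X": 0.25,
    "G.1X": 1.0,
    "G.2X": 2.0,
    "G.4X": 4.0,
    "G.8X": 8.0,
}


def dpu_hours(worker_type: str, workers: int, runtime_hours: float) -> float:
    """Total DPU-hours consumed by a job run."""
    return DPU_PER_WORKER[worker_type] * workers * runtime_hours


def job_cost(worker_type: str, workers: int, runtime_hours: float,
             price_per_dpu_hour: float) -> float:
    """Estimated cost; price_per_dpu_hour is an assumption supplied by the caller."""
    return dpu_hours(worker_type, workers, runtime_hours) * price_per_dpu_hour


if __name__ == "__main__":
    # Ten G.2X workers for 30 minutes use the same DPU-hours
    # as ten G.1X workers for a full hour.
    print(dpu_hours("G.2X", 10, 0.5), dpu_hours("G.1X", 10, 1.0))
```

A bigger worker only pays off if it shortens the run enough to offset its higher DPU count, which is the reasoning behind "match worker to workload".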

Glue Job Types

Spark jobs — batch, min 2 DPUs
Streaming ETL — Spark Structured Streaming, processes in configurable time windows (default 100s), min 2 DPUs
Python shell: lightweight single-node Python (no Spark), 1/16 or 1 DPU; unsuited to large datasets (the 15-minute cap belongs to Lambda, not Glue)

Glue Bookmarks

Stores previously processed data info (S3 file paths, JDBC primary key ranges). Enables incremental processing — processes only new/modified data. Think of it as CDC for Glue batch jobs.

Data Firehose Buffering

Buffer size (MB) and buffer interval (seconds) control when data is flushed
0-second interval = immediate delivery (real-time use cases)
Larger buffer + longer interval = bigger batches for high-throughput S3/Redshift
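
The size-or-interval rule above can be sketched with a toy in-memory buffer. This is not Firehose's implementation; `ToyBuffer`, its fields, and the injected `clock` are hypothetical names used only to show that whichever threshold is reached first triggers the flush.

```python
from typing import Callable, List


class ToyBuffer:
    """Flushes when buffered bytes reach max_bytes or max_seconds have elapsed."""

    def __init__(self, max_bytes: int, max_seconds: float,
                 clock: Callable[[], float]):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.clock = clock
        self.records: List[bytes] = []
        self.size = 0
        self.opened_at = clock()
        self.flushed: List[List[bytes]] = []  # each entry is one delivered batch

    def put(self, record: bytes) -> None:
        if not self.records:
            self.opened_at = self.clock()  # timer starts with the first record
        self.records.append(record)
        self.size += len(record)
        self._maybe_flush()

    def tick(self) -> None:
        """Call periodically so time-based flushes fire without new records."""
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if not self.records:
            return
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.flushed.append(self.records)
            self.records, self.size = [], 0


if __name__ == "__main__":
    now = [0.0]
    buf = ToyBuffer(max_bytes=10, max_seconds=60, clock=lambda: now[0])
    buf.put(b"abcd")     # 4 bytes, below both thresholds
    buf.put(b"efghij")   # 10 bytes total, size-based flush
    buf.put(b"k")
    now[0] = 61.0
    buf.tick()           # interval elapsed, time-based flush
    print(len(buf.flushed))  # prints 2
```

With `max_seconds=0` every `put` flushes immediately, which mirrors the 0-second interval case at the cost of many small batches.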

Stateful vs Stateless Transactions

Stateless: Each event processed independently (filter, route, enrich). No memory of past events.
Stateful: Aggregations, windowing, session detection — requires state across events. Flink excels here.
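
A small pure-Python contrast may help here. The event shape and function names are hypothetical, and the window logic is a simplified tumbling window that ignores late and out-of-order events, which a real Flink job would handle with watermarks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, float]  # (timestamp_seconds, account_id, amount)


def large_payments(events: Iterable[Event], threshold: float) -> List[Event]:
    """Stateless: each event is judged on its own, no memory of earlier events."""
    return [e for e in events if e[2] >= threshold]


def counts_per_window(events: Iterable[Event],
                      window_seconds: int) -> Dict[Tuple[int, str], int]:
    """Stateful: a running count per (window start, account) is kept across events."""
    state: Dict[Tuple[int, str], int] = defaultdict(int)
    for ts, account, _amount in events:
        window_start = ts - ts % window_seconds
        state[(window_start, account)] += 1
    return dict(state)


if __name__ == "__main__":
    events = [(0, "a", 10.0), (120, "a", 500.0), (290, "a", 20.0),
              (310, "a", 5.0), (315, "b", 900.0)]
    print(large_payments(events, threshold=100.0))
    print(counts_per_window(events, window_seconds=300))
```

The stateless filter could run in Lambda or a Firehose transform; the windowed count needs state that survives between events, which is what a stream processor manages and checkpoints for you.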

Fan-In and Fan-Out

Fan-in: Multiple producers → single stream (data aggregation)
Fan-out: Single stream → multiple consumers. Use enhanced fan-out (KDS) or consumer groups (MSK) to avoid read throttling.
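
The read-throttling point for KDS is simple arithmetic: standard consumers split a shard's 2 MB/s, while enhanced fan-out gives each consumer its own 2 MB/s. A minimal sketch, with a hypothetical function name:

```python
SHARD_READ_MB_PER_S = 2.0  # per-shard read bandwidth


def per_consumer_read_mb_per_s(consumers: int, enhanced_fan_out: bool) -> float:
    """Read bandwidth each consumer can expect from a single shard."""
    if consumers < 1:
        raise ValueError("consumers must be at least 1")
    if enhanced_fan_out:
        return SHARD_READ_MB_PER_S          # dedicated pipe per consumer
    return SHARD_READ_MB_PER_S / consumers  # shared among all standard consumers


if __name__ == "__main__":
    print(per_consumer_read_mb_per_s(4, enhanced_fan_out=False))  # prints 0.5
    print(per_consumer_read_mb_per_s(4, enhanced_fan_out=True))   # prints 2.0
```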

1.5 Exam Question Patterns

Common stems:

"...with the least operational overhead" → prefer managed/serverless: Glue > EMR, Data Firehose > custom consumer, Step Functions > custom retry logic
"...requires open source" → MSK over KDS, EMR over Glue
"...needs exactly-once processing" → MSF (Flink)
"...needs to handle out-of-order events" → MSF (Flink)
"...needs CDC from relational database" → AWS DMS or MSK Connect + Debezium
"...needs to convert JSON to Parquet" → Data Firehose (built-in), Glue job, or Lambda

Red herrings:

DMS for CDC to MSK when question says "open source only" (DMS is proprietary, MSK Connect + Debezium is open source)
MSF for schema evolution (Flink doesn't support schema evolution well; Spark Streaming + Glue does)
Lambda for jobs >15 minutes (Lambda max timeout = 15 min)
KDS for lowest latency (MSK has lower latency than KDS)

Keywords → Service mapping:

Keyword	Service
"replay data"	Kinesis Data Streams or MSK
"near-real-time delivery to S3"	Amazon Data Firehose
"incremental processing" / "job bookmarks"	AWS Glue
"exactly-once semantics"	Amazon MSF (Flink)
"open source Kafka"	Amazon MSK
"stateful stream processing"	Amazon MSF (Flink)
"batch big data, full control"	Amazon EMR
"Debezium CDC"	Amazon MSK Connect

1.6 Quick Reference Tables

Lambda Limits (exam-relevant)

Limit	Value
Max timeout	15 minutes
Max memory	10 GB
Max deployment package (zipped)	50 MB
Concurrency default (per region)	1,000

Firehose Destinations

S3, Redshift, OpenSearch, Splunk, Datadog, Snowflake, MongoDB, New Relic, HTTP endpoint

Format Support Matrix

Format	Splittable	Columnar	Schema Evolution	Use
CSV	Yes (by line)	No	No	Raw ingest
JSON	No	No	No	APIs, web
Avro	Yes	No	Yes	Streaming serialization
Parquet	Yes	Yes	Limited	Analytics, data lake
ORC	Yes	Yes	No	Hadoop-heavy analytics

1.7 Common Mistakes & Traps

Thinking KDS delivers to S3 directly — it doesn't. Firehose sits between KDS/MSK and S3.
Using Lambda for long-running transforms — 15-min max kills this for anything serious.
Ignoring Glue Bookmarks — without them, full re-scan on every run = expensive and slow.
Choosing EMR when question says "least ops overhead" — EMR requires more management than Glue or EMR Serverless.
MSF for schema evolution — Flink does not handle schema evolution; use Spark Streaming (Glue) for that.
Forgetting enhanced fan-out — standard KDS consumer = shared 2 MB/s per shard; enhanced fan-out = dedicated 2 MB/s per consumer.
Choosing G.8X workers for everything — overkill increases cost; match worker to workload.

1.8 Practice Scenarios

Scenario 1

Requirement: IoT devices publish MQTT data. Need to ingest to S3 data lake with lowest latency, least operational overhead, convert to Parquet.
Solution: IoT Core → Amazon MSK (lowest latency) → Amazon Data Firehose → S3 (Firehose converts JSON→Parquet natively)
Why not KDS? MSK has lower latency.
Why not Glue streaming? More operational overhead than Firehose for simple delivery.

Scenario 2

Requirement: Ingest CDC data from Aurora MySQL into a data lake using open-source tooling only.
Solution: MSK Connect + Debezium → Amazon MSK → S3 via Data Firehose
Wrong answer trap: AWS DMS. It is proprietary and violates the open-source requirement.

Scenario 3

Requirement: Real-time transaction fraud detection: detect patterns across events in a 5-minute window with exactly-once guarantees.
Solution: KDS or MSK → Amazon MSF (Flink), for stateful windowing + exactly-once semantics
Why not Glue Streaming? Glue Streaming has limitations in windowing and no native exactly-once.

Scenario 4

Requirement: Data team runs daily Spark ETL on S3 data. Jobs are intermittent, want zero cluster management.
Solution: AWS Glue (Spark jobs) or EMR Serverless
Tie-breaker: If they need Hive or open-source Spark runtime control → EMR Serverless (it runs Spark and Hive only; HBase and other frameworks need EMR on EC2, which brings back cluster management). If Spark-only → AWS Glue wins on simplicity.

Scenario 5

Requirement: Ingest Kinesis stream directly into Redshift with minimal operational overhead.
Solution: Amazon Redshift Streaming Ingestion (native feature, no intermediate S3)
Wrong answer trap: Firehose → S3 → COPY command adds operational steps and latency.

Scenario 6

Requirement: Multi-step pipeline: crawl S3 → Glue ETL → quality check → load to Redshift. Need retries and branching.
Solution: AWS Step Functions orchestrating Glue crawler, Glue job, Lambda quality check, Redshift COPY

Scenario 7

Requirement: Non-technical analyst needs to clean data, fill missing values, remove duplicates, with no code.
Solution: AWS Glue DataBrew, a visual, no-code data preparation tool for non-technical users.

DOMAIN 2: DATA STORE MANAGEMENT (26%)

2.1 Domain Overview

Task statements:

Task 2.1: Choose a data store (cost, performance, access patterns, migration, locks, open table formats, vector indexes)
Task 2.2: Understand data cataloging systems (Glue Data Catalog, crawlers, partitions, Hive metastore, SageMaker Catalog)
Task 2.3: Manage the lifecycle of data (S3 lifecycle policies, versioning, DynamoDB TTL, COPY/UNLOAD)
Task 2.4: Design data models and schema evolution (Redshift schema, DynamoDB keys/indexes, Lake Formation, lineage, compression)

Core theme: Picking the right store, keeping data accessible and cost-efficient throughout its lifecycle.

2.2 Services by Category

Data Warehouses

Service	Key Details	Choose When
Amazon Redshift Provisioned	MPP; RA3 nodes (managed storage); leader + compute nodes; WLM queues; full SQL	Complex OLAP, predictable high load, sub-second queries
Amazon Redshift Serverless	Auto-scales; RPU-based pricing; no cluster management	Variable/intermittent analytics workloads
Redshift Spectrum	Query S3 data directly from Redshift without loading	Lakehouse pattern — query cold data without ETL
Redshift Federated Query	Query live RDS/Aurora from Redshift	Avoid data movement, join warehouse with operational DB
Redshift Materialized Views	Pre-computed results, refresh on-demand or auto	Accelerate repeated complex queries

Data Lakes

Service	Key Details
Amazon S3	Primary data lake storage; 11 9s durability; unlimited scale; tiered storage classes
AWS Lake Formation	Governance layer over S3 + Glue Data Catalog; fine-grained permissions (table/column/row)
Amazon S3 Tables	Native managed Apache Iceberg tables in S3 (new feature)

Databases

Service	Type	Exam Use Case
Amazon DynamoDB	NoSQL key-value/document	High-throughput OLTP, millisecond latency, flexible schema; streams for CDC
Amazon RDS / Aurora	Relational OLTP	Traditional SQL apps; Aurora zero-ETL → Redshift
Amazon Aurora PostgreSQL	Relational	HNSW vector indexing for AI/ML similarity search
Amazon MemoryDB	Redis-compatible in-memory	Ultra-fast key/value, microsecond read latency, durable
Amazon ElastiCache	In-memory cache	Cache layer, session data (not durable like MemoryDB)
Amazon Neptune	Graph database	Social graphs, fraud detection, knowledge graphs
Amazon OpenSearch Service	Search + log analytics	Full-text search, log aggregation, dashboards (OpenSearch Dashboards)
Amazon Keyspaces	Managed Cassandra	Wide-column, high-scale OLTP; Cassandra workloads

Open Table Formats

Format	Key Features	Built On	Exam Signal
Apache Iceberg	ACID transactions, schema evolution, time-travel, snapshot isolation, concurrent writes	Any storage	"Schema evolution" or "concurrent writes" → Iceberg
Apache Hudi	Incremental processing, upserts, CDC from databases	Spark	"Upserts into data lake"
Delta Lake	ACID + metadata, built for Spark	Apache Spark	"Spark Delta" scenarios

Memory hook: IHD = Iceberg (most flexible), Hudi (upserts), Delta (Spark-native).

2.3 Decision Matrix

Choose a Data Store

Requirement	Answer
Complex SQL analytics, BI dashboards	Amazon Redshift
OLTP millisecond reads/writes, flexible schema	Amazon DynamoDB
Full-text search, log analytics	Amazon OpenSearch
Graph relationships	Amazon Neptune
Key/value at microsecond latency, durable	Amazon MemoryDB
Vector similarity search (AI/ML)	Aurora PostgreSQL with HNSW
Data lake (raw + curated)	Amazon S3 + Lake Formation
Traditional relational OLTP	Amazon RDS / Aurora

S3 Storage Classes (lifecycle progression)

Class	Access Pattern	Cost	Retrieval
S3 Standard	Frequent	Highest	Immediate
S3 Standard-IA	Infrequent	Medium	Immediate
S3 Intelligent-Tiering	Unknown pattern	Auto-optimizes	Immediate or delayed
S3 Glacier Instant Retrieval	Quarterly	Low	Milliseconds
S3 Glacier Flexible	Archive, annual	Very low	Minutes-hours
S3 Glacier Deep Archive	Compliance archive	Lowest	12-48 hours

2.4 Key Concepts Deep Dive

Redshift Data Modeling

Star schema: fact table + dimension tables (denormalized). Best for OLAP query performance.
Snowflake schema: normalized dimensions. Less redundancy, more joins.
3NF: fully normalized. Good for OLTP, bad for analytics (too many joins).

Redshift physical model settings:

Distribution style: AUTO (default), EVEN (round-robin), KEY (join column), ALL (small dimension tables)
Sort key: COMPOUND (range queries), INTERLEAVED (multiple filter columns)
Compression encoding: ZSTD (best general), AZ64 (numeric), BYTEDICT (low cardinality)
COPY command: loads from S3; parallel; supports CSV, JSON, Parquet, ORC
UNLOAD command: exports to S3; parallel; supports Parquet, CSV

WLM (Workload Management):

Manages query concurrency and memory allocation
AutoWLM recommended; or define manual queues
Monitor WLM queue disk spill metric — indicates memory pressure

DynamoDB Data Modeling

Partition key (PK): determines data distribution. Choose high-cardinality values to avoid hot partitions.
Sort key (SK): enables range queries within a partition. Optional.

Indexes:

GSI (Global Secondary Index): Different PK/SK; separate provisioned capacity; supports eventual consistency only; use for alternate access patterns
LSI (Local Secondary Index): Same PK, different SK; shares table capacity; must be defined at table creation; supports strong consistency

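DynamoDB Streams: Ordered sequence of item-level changes (24-hour retention) → triggers Lambda for CDC. Kinesis Data Streams for DynamoDB is the alternative when longer retention or more consumers are needed.
DynamoDB TTL: Auto-expire items by setting a TTL attribute (Unix epoch timestamp in seconds). TTL deletes consume no write capacity. Use for session data, temporary records.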

Data Catalog (Glue Data Catalog)

Central Hive Metastore-compatible metadata repository
Stores database, table, column, partition, statistics metadata
Glue Crawlers auto-detect schema by scanning data samples; populate Data Catalog
Integrated with Athena, Redshift Spectrum, EMR, Lake Formation, QuickSight
Technical metadata = schema, data types, partitions, lineage
Business metadata = SageMaker Catalog / DataZone (terms, ownership, policies)

Apache Iceberg Key Features

ACID transactions on data lake tables
Schema evolution — add/rename/drop columns without breaking existing queries
Time travel — query historical snapshots
Snapshot isolation — concurrent reads and writes safely
Native support in Athena, Glue, EMR, Redshift Spectrum

Vector Index Types (Skill 2.1.8)

Index	Full Name	Best For
HNSW	Hierarchical Navigable Small World	Fast approximate nearest neighbor search; Aurora PostgreSQL supports this
IVF	Inverted File Index	Large-scale vector search; clusters vectors into buckets

2.5 Data Lifecycle Management

S3 Lifecycle Policies

Transition objects between storage classes after N days
Expire (delete) objects after N days
Apply to specific prefixes or tags
Versioning → lifecycle policy can expire non-current versions
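
As a concrete illustration of the bullets above, the sketch below builds one lifecycle rule as a plain dictionary. It is shaped like the `LifecycleConfiguration` argument that boto3's `put_bucket_lifecycle_configuration` accepts, but it makes no AWS call. The prefix, rule ID, and day counts are hypothetical values chosen for the example.

```python
from typing import Any, Dict


def lifecycle_configuration(prefix: str) -> Dict[str, Any]:
    """One rule: tier down objects under a prefix, then expire them."""
    return {
        "Rules": [
            {
                "ID": f"tier-and-expire-{prefix.strip('/')}",
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # apply to a specific prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # delete current versions
                # Only meaningful on versioned buckets.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    }


if __name__ == "__main__":
    import json
    print(json.dumps(lifecycle_configuration("raw/"), indent=2))
```

To apply it, the dictionary would be passed as `LifecycleConfiguration=...` together with a `Bucket` name; that step is left out so the example runs without credentials.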

Medallion Architecture (Bronze/Silver/Gold)

Layer	Name	Content	Format
Raw / Bronze	Landing Zone	Unprocessed original data	CSV, JSON, Avro
Stage / Silver	Cleansed & Conformed	Validated, deduplicated, standardized	Parquet, ORC
Curated / Gold	Analytics-ready	Aggregated, business-logic applied	Parquet, Iceberg

Lakehouse Pattern: S3 + Redshift

Raw data → S3 (cold, cheap) → ETL (Glue/EMR) → COPY → Redshift (hot, fast SQL)
Redshift old data → UNLOAD → S3 → query via Spectrum or Athena

DynamoDB TTL

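Set a ttl attribute (Unix epoch seconds) on items → DynamoDB deletes expired items in the background, typically within a few days of expiry rather than instantly, so filter out expired items at read time if exactness matters. Use case: session data, temporary cache, log entries with retention window.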
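
Computing the TTL value is plain epoch arithmetic. The sketch below is a minimal example with hypothetical helper names; the attribute name `ttl` matches the text above but is configurable per table. The `is_expired` helper reflects the fact that expired items can linger until the background delete runs.

```python
import time
from typing import Any, Dict, Optional

SECONDS_PER_DAY = 86_400


def with_ttl(item: Dict[str, Any], retention_days: int,
             now: Optional[int] = None, ttl_attribute: str = "ttl") -> Dict[str, Any]:
    """Return a copy of the item with an expiry timestamp in epoch seconds."""
    current = int(time.time()) if now is None else int(now)
    expiring = dict(item)
    expiring[ttl_attribute] = current + retention_days * SECONDS_PER_DAY
    return expiring


def is_expired(item: Dict[str, Any], now: Optional[int] = None,
               ttl_attribute: str = "ttl") -> bool:
    """Read-side check for items that are past expiry but not yet deleted."""
    current = int(time.time()) if now is None else int(now)
    return item.get(ttl_attribute, float("inf")) <= current


if __name__ == "__main__":
    session = with_ttl({"session_id": "s1"}, retention_days=7, now=1_700_000_000)
    print(session["ttl"])  # prints 1700604800
```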

2.6 Exam Question Patterns

Common stems:

"ACID transactions on a data lake" → Apache Iceberg
"Schema evolution without breaking workflows" → Apache Iceberg
"Query S3 data from Redshift without loading" → Redshift Spectrum
"Query operational DB from Redshift" → Redshift Federated Query
"Near-real-time sync Aurora → Redshift, zero ETL" → Aurora zero-ETL integration
"Cost-effective index for S3 data" → use Glue Data Catalog + Athena
"Vector similarity search" → Aurora PostgreSQL with pgvector (HNSW index)
"Microsecond key/value reads, durable" → Amazon MemoryDB

Red herrings:

Choosing DynamoDB for complex SQL queries (it's NoSQL, no joins)
Using EBS for big data on EMR (use S3 — HDFS on EBS has operational cost)
Assuming Glue Data Catalog is only for S3 (it also catalogs RDS, Redshift, DynamoDB)
HNSW for large-scale batch search (IVF scales better for massive vector sets)

2.7 Common Mistakes & Traps

HDFS on EMR vs S3 — always prefer S3 for persistent storage (HDFS dies with cluster)
GSI vs LSI confusion — LSI = same partition key, must be created at table creation; GSI = different PK, anytime, separate capacity
Iceberg vs Parquet — Parquet is a file format; Iceberg is a table format (adds ACID, schema evolution on top of Parquet/ORC)
Redshift Spectrum vs Athena — both query S3; Spectrum is Redshift-native (join with Redshift tables); Athena is standalone serverless SQL
S3 Standard vs Intelligent-Tiering — if access pattern is unknown, Intelligent-Tiering automates tier movement
SageMaker Catalog vs Glue Data Catalog — Glue Data Catalog = technical (schemas, partitions); SageMaker Catalog = business (governed data sharing, lineage) [NEW in v1.1]

2.8 Practice Scenarios

Scenario 1

Requirement: Company migrates from HDFS on-prem + Spark analytics to AWS. Need scalable, cost-efficient storage.
Solution: Migrate data to Amazon S3; run Spark on Amazon EMR clusters using S3 as storage (not HDFS).
Why? S3 is durable, decoupled from compute, infinitely scalable.

Scenario 2

Requirement: Data lake tables break frequently when source schema changes (new columns). Concurrent writes corrupt data.
Solution: Migrate to Apache Iceberg table format.
Why Iceberg? Native schema evolution (add/rename columns without breaking readers) + snapshot isolation for concurrent writes.

Scenario 3

Requirement: Ecommerce app queries products by category and price range (different access patterns). DynamoDB primary table keyed by ProductID.
Solution: Create a GSI with Category as partition key and Price as sort key.
Why GSI? Supports different access patterns (category queries) across all partitions.
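
The GSI from this scenario can be written down as the keyword arguments for boto3's DynamoDB `update_table` call. This is only a sketch of the request shape and makes no AWS call. The index name and attribute types are hypothetical, and the request assumes an on-demand table (a provisioned table would also need `ProvisionedThroughput` inside `Create`).

```python
from typing import Any, Dict


def category_price_gsi(table_name: str) -> Dict[str, Any]:
    """Keyword arguments that add a Category/Price GSI to an existing table."""
    return {
        "TableName": table_name,
        # Every attribute used in the new key schema must be declared.
        "AttributeDefinitions": [
            {"AttributeName": "Category", "AttributeType": "S"},
            {"AttributeName": "Price", "AttributeType": "N"},
        ],
        "GlobalSecondaryIndexUpdates": [
            {
                "Create": {
                    "IndexName": "CategoryPriceIndex",  # hypothetical name
                    "KeySchema": [
                        {"AttributeName": "Category", "KeyType": "HASH"},
                        {"AttributeName": "Price", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                }
            }
        ],
    }


if __name__ == "__main__":
    print(category_price_gsi("Products")["GlobalSecondaryIndexUpdates"][0]["Create"]["KeySchema"])
```

Because this is a GSI, it can be added after the table exists; an LSI with the same sort key would have had to be defined at table creation and would share the ProductID partition key.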

Scenario 4

Requirement: Near-real-time analytics on MSK streaming data in Amazon Redshift with least operational overhead.
Solution: Amazon Redshift streaming ingestion from MSK → define an external schema for the MSK cluster and a Redshift materialized view that consumes the topic.
Wrong answer: Firehose → S3 → COPY adds steps and latency.

Scenario 5

Requirement: OpenSearch index growing too large; costs rising. Need to manage index lifecycle and reduce costs.
Solution: Use Index State Management (ISM) policies to automatically transition old indices to UltraWarm storage or delete them.

DOMAIN 3: DATA OPERATIONS AND SUPPORT (22%)

3.1 Domain Overview

Task statements:

Task 3.1: Automate data processing (MWAA, Step Functions, SDKs, EMR, Redshift, Glue, Athena, Lambda, EventBridge)
Task 3.2: Analyze data (QuickSight, Athena, Redshift SQL, DataBrew, SageMaker, Jupyter, data aggregations)
Task 3.3: Maintain and monitor pipelines (CloudWatch, CloudTrail, logs, alerts, SNS, troubleshooting Glue/EMR)
Task 3.4: Ensure data quality (Glue Data Quality, DataBrew rules, sampling, data skew, consistency checks)

Core theme: Keep pipelines running, observable, and producing trustworthy data.

3.2 Services by Category

Monitoring & Logging

Service	Use	Key Detail
Amazon CloudWatch	Metrics, alarms, log groups, dashboards	Set alarms on custom metrics; CloudWatch Logs Insights for log analysis
AWS CloudTrail	API call audit trail	Management events (control-plane API calls) logged by default; data events (S3 object-level, Lambda invoke) must be enabled; use for security audits
CloudWatch Logs Insights	Query logs with ad-hoc SQL-like syntax	Analyze structured logs at scale
Amazon OpenSearch	Log aggregation + full-text search	Store and search application logs via OpenSearch Dashboards
Amazon Athena	Ad-hoc SQL log analysis on S3	Query CloudTrail logs, application logs stored in S3

Data Quality

Service	Use	Key Detail
AWS Glue Data Quality	Define + run DQDL rules on datasets	Built into Glue; rules: completeness, uniqueness, referential integrity, freshness
AWS Glue DataBrew	Visual data profiling + quality rules	No-code; for non-technical users; 250+ built-in transforms
Amazon Deequ	Scala/Python data quality library	Open-source; runs on EMR Spark; define quality constraints in code
AWS Glue Crawlers	Schema detection	Auto-populates Data Catalog; runs on schedule or event-triggered

Analysis & Query

Service	Use	Key Detail
Amazon Athena	Serverless SQL on S3	Pay-per-query (per TB scanned); Parquet/ORC = cheaper; federated queries supported
Amazon QuickSight	BI dashboards + visualizations	SPICE for in-memory caching; row-level security; ML Insights built-in
Amazon Redshift	Complex SQL analytics	Materialized views, stored procedures, UDFs, Spectrum for S3
Amazon EMR + Spark	Large-scale data analysis	Athena notebooks use Apache Spark

Alerting

Service	Use
Amazon SNS	Push notifications (email, Lambda, SQS, HTTP) for pipeline failures
Amazon SQS	Decouple pipeline stages, buffer messages, dead-letter queue for failures
EventBridge	Route events to targets based on rules; schedule triggers

3.3 Key Concepts Deep Dive

CloudWatch vs CloudTrail

Aspect	CloudWatch	CloudTrail
What it monitors	AWS resource metrics + application logs	AWS API calls (who did what, when)
Use for	Performance monitoring, alarms	Security audits, compliance, access tracking
Log S3 access	Enable S3 metrics + access logging	Enable data events for S3
Key output	Metric alarms → SNS	CloudTrail logs → S3 / CloudWatch Logs

Data Quality Concepts (DQDL — Glue Data Quality Definition Language)

Rules you can define:

Completeness — check for nulls/empty fields
Uniqueness — detect duplicates
Referential integrity — FK relationships
Freshness — data recency checks
Custom rules — regex patterns, value ranges
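
To make the first two rule types concrete, here is a plain-Python sketch of what a completeness and a uniqueness score measure. It is not DQDL and not Glue's implementation; the function names and the exact scoring definitions are assumptions for illustration.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows where the column is neither missing, None, nor empty."""
    if not rows:
        return 1.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows: List[Row], column: str) -> float:
    """Fraction of non-null values that occur exactly once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 1.0
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


if __name__ == "__main__":
    rows = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
        {"id": 2, "email": "b@example.com"},
        {"id": 3, "email": ""},
    ]
    print(completeness(rows, "email"))  # prints 0.5
    print(uniqueness(rows, "id"))       # prints 0.5
```

A rule then becomes a threshold on such a score, for example requiring completeness above 0.95 before the ETL step is allowed to run.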

DataBrew quality actions:

Fill missing values (mean, median, custom)
Identify duplicate records
Formatting functions (date standardization, case normalization)
Nesting/unnesting JSON structures
PII detection and redaction

Data Sampling Techniques (Skill 3.4.4)

Simple random sampling — each record has equal probability
Stratified sampling — sample proportionally from subgroups
Systematic sampling — every Nth record
Reservoir sampling — fixed-size sample from streaming data
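
Reservoir sampling is the least obvious of the four, so a short sketch follows. This is the classic single-pass variant (often called Algorithm R), shown here as one reasonable way to do it; the function name and `seed` parameter are choices made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace with probability k / (i + 1)
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=5, seed=7)
    print(len(sample))  # prints 5
```

Memory stays fixed at k items no matter how long the stream runs, which is why this technique suits streaming sources where the total record count is not known up front.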

Data Skew in Spark (Skill 3.4.5)

Skew = some partitions have far more data than others → one task runs much longer than others. Solutions:

Salting partition keys (add random prefix)
Repartition before joins
Use Adaptive Query Execution (AQE) in Spark 3+
Split large partitions
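
Salting can be demonstrated without Spark. The sketch below assigns records to partitions with a deterministic hash and shows a single hot key spreading out once a salt is attached. Names are hypothetical, the hash is `zlib.crc32` rather than Spark's partitioner, and the salt is appended as a suffix here (a prefix works the same way).

```python
import random
import zlib
from collections import Counter
from typing import Iterable


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted(key: str, salt_buckets: int, rng: random.Random) -> str:
    """Attach a random salt so one logical key maps to several physical keys."""
    return f"{key}_{rng.randrange(salt_buckets)}"


def partition_counts(keys: Iterable[str], num_partitions: int) -> Counter:
    return Counter(partition_for(k, num_partitions) for k in keys)


if __name__ == "__main__":
    rng = random.Random(42)
    hot_keys = ["hot"] * 1000
    # Unsalted: every record for the hot key lands in the same partition.
    print(partition_counts(hot_keys, 8))
    # Salted: the same records spread over several partitions.
    print(partition_counts([salted(k, 8, rng) for k in hot_keys], 8))
```

In a real join, the other side has to be expanded to carry every salt value for the key, and the salt is stripped after the aggregation, so salting trades extra data volume for balanced tasks.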

Athena Key Details

Serverless; no cluster to manage
Charged per TB of data scanned
Partition pruning reduces scan cost (partition by date, region, etc.)
Result reuse — cache query results to avoid re-scanning identical queries (use --result-reuse-configuration)
Federated query — query RDS, DynamoDB, on-prem via Lambda connectors
Athena notebooks — Apache Spark for interactive exploration
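
Because Athena charges per TB scanned, the effect of partition pruning and columnar formats is easy to estimate. The sketch below is back-of-the-envelope arithmetic only: `price_per_tb` is a hypothetical input (check current regional pricing), a decimal terabyte is assumed, and `column_fraction` only applies to columnar formats such as Parquet or ORC.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabyte


def bytes_scanned(table_bytes: float, partitions_total: int,
                  partitions_read: int, column_fraction: float = 1.0) -> float:
    """Estimate bytes scanned after partition pruning and column projection."""
    return table_bytes * (partitions_read / partitions_total) * column_fraction


def scan_cost(scanned: float, price_per_tb: float) -> float:
    """Cost of one query given bytes scanned and a per-TB price."""
    return scanned / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    price = 5.0  # hypothetical USD per TB
    full = scan_cost(bytes_scanned(10**12, 100, 100), price)
    # Read 7 of 100 daily partitions and about 10% of the columns.
    pruned = scan_cost(bytes_scanned(10**12, 100, 7, column_fraction=0.1), price)
    print(full, round(pruned, 3))
```

The same query over the same table drops from scanning everything to scanning under 1% of the bytes, which is where the large savings quoted for Parquet plus partitioning come from.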

QuickSight Key Details

SPICE (Super-fast Parallel In-memory Calculation Engine) — caches dataset; faster queries; not free (extra capacity cost)
Direct Query mode — queries data source live (no SPICE, slower)
Row-level security (RLS) — create permissions dataset mapping users/groups to rows they can see
Column-level security — via Lake Formation integration
ML Insights — anomaly detection, forecasting, auto-narratives built-in

Athena vs Redshift Spectrum

Attribute	Amazon Athena	Redshift Spectrum
Engine	Presto/Trino	Redshift (MPP)
Use	Standalone S3 queries	S3 queries joined with Redshift tables
Requires cluster	No	Yes (Redshift cluster or serverless)
Cost	Per TB scanned	Provisioned: cluster cost + per TB scanned; Serverless: RPU-hours
When to use	Ad-hoc, no Redshift needed	Need to join S3 cold data with Redshift hot data

3.4 Exam Question Patterns

Common stems:

"Reduce Athena query costs" → partition data by common filter columns + use Parquet/ORC format
"Monitor pipeline failures and notify team" → CloudWatch Alarm → SNS → email/Lambda
"Audit who accessed data" → AWS CloudTrail
"Ensure data quality before ETL" → AWS Glue Data Quality rules
"Non-technical analyst needs data quality" → AWS Glue DataBrew
"Query logs stored in S3 with SQL" → Amazon Athena
"Search logs with full-text search" → Amazon OpenSearch Service
"Dashboard refresh but reduce costs" → Athena result reuse caching
"Serverless vs provisioned" → provisioned = predictable high load; serverless = variable/intermittent

Keywords → Service mapping:

Keyword	Service
"audit trail" / "who called API"	AWS CloudTrail
"pipeline metrics" / "alerts"	Amazon CloudWatch + SNS
"data quality rules" / "DQDL"	AWS Glue Data Quality
"visual data preparation" / "no-code"	AWS Glue DataBrew
"serverless SQL on S3"	Amazon Athena
"BI dashboard" / "SPICE"	Amazon QuickSight
"log full-text search"	Amazon OpenSearch
"data skew"	Salting / Spark AQE

3.5 Common Mistakes & Traps

CloudWatch vs CloudTrail confusion — CloudWatch = what is happening (metrics); CloudTrail = who did it (API calls)
Athena cost vs performance: Parquet/ORC + partitioning can cut data scanned (and therefore cost) by 90% or more vs raw CSV/JSON, depending on the query
QuickSight SPICE — SPICE caches data for performance but has capacity cost; large datasets may prefer Direct Query
Using Glue Data Quality for PII detection — Glue Data Quality checks rules (completeness, uniqueness); PII detection uses Amazon Macie or Glue Studio sensitive data detection
Thinking Step Functions = Airflow — Step Functions is for AWS-service orchestration; MWAA is for Python DAG-based workflows
Forgetting dead letter queues — SQS DLQ captures messages that failed processing; essential for pipeline reliability

3.6 Practice Scenarios

Scenario 1

Requirement: Financial services company ingests data to S3. Before Glue ETL, must check for missing values, duplicates, format issues. Least operational overhead.
Solution: AWS Glue Data Quality. Define DQDL rules on source data; configure CloudWatch Alarms to notify the team of violations.
Wrong answers: Lambda custom script (more code/ops), EMR Spark (overkill, more ops), Athena daily check (not real-time, reactive not proactive).

Scenario 2

Requirement: Data lake tables updated every 24h. BI layer queries via Athena hourly. Need to reduce query costs while ensuring fresh dashboards.
Solution: Use Athena result reuse (--result-reuse-configuration) with max age = 24h so hourly queries reuse cached results until next ETL run.
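
For SDK users, the same setting is a nested structure on the Athena `StartQueryExecution` request. The sketch below only builds the keyword arguments that boto3's `start_query_execution` accepts and makes no AWS call; the helper name, SQL text, and workgroup name are hypothetical.

```python
from typing import Any, Dict

MINUTES_PER_DAY = 24 * 60


def start_query_kwargs(sql: str, workgroup: str,
                       max_age_minutes: int = MINUTES_PER_DAY) -> Dict[str, Any]:
    """Request arguments that let Athena reuse results younger than max_age_minutes."""
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


if __name__ == "__main__":
    kwargs = start_query_kwargs(
        "SELECT region, SUM(amount) FROM sales GROUP BY region",  # hypothetical query
        workgroup="bi-dashboards",                                # hypothetical workgroup
    )
    print(kwargs["ResultReuseConfiguration"])
```

With a 1,440-minute maximum age, an identical hourly query is served from the earlier result instead of rescanning S3, as long as that result is still younger than the limit.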

Scenario 3

Requirement: Serverless pipeline on EMR Serverless with job dependencies and automatic retries. Minimize operational overhead.
Solution: AWS Step Functions. Define a state machine orchestrating EMR Serverless jobs with built-in retry logic and branching.

Scenario 4

Requirement: Company needs to search application logs by error message content. Logs land in S3 daily. Two valid approaches:

Athena (SQL queries on S3 logs — if structured/semi-structured)
OpenSearch (full-text search, Kibana/OpenSearch Dashboards — if need real-time search and complex text queries)

Scenario 5

Requirement: Regional QuickSight analysts should see only data from their own region.
Solution: QuickSight Row-Level Security (RLS). Create a permissions dataset mapping analyst user/group to their allowed region values. Assign the RLS dataset to the main dataset.
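
The idea behind a permissions dataset can be shown with a toy filter. This is a conceptual sketch, not how QuickSight evaluates rules internally; the data shapes and names are hypothetical, and the toy denies access to users who are absent from the permissions mapping.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def visible_rows(user: str, permissions: Dict[str, Set[str]],
                 rows: List[Row], column: str = "region") -> List[Row]:
    """Return only the rows whose column value the user is allowed to see."""
    allowed = permissions.get(user, set())  # unknown users see nothing
    return [r for r in rows if r[column] in allowed]


if __name__ == "__main__":
    permissions = {"alice": {"EMEA"}, "bob": {"APAC", "AMER"}}
    sales = [
        {"region": "EMEA", "amount": 100},
        {"region": "APAC", "amount": 250},
        {"region": "AMER", "amount": 75},
    ]
    print(visible_rows("alice", permissions, sales))
    print(visible_rows("carol", permissions, sales))  # prints []
```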

DOMAIN 4: DATA SECURITY AND GOVERNANCE (18%)

4.1 Domain Overview

Task statements:

Task 4.1: Apply authentication mechanisms (VPC security groups, IAM roles, Secrets Manager, S3 Access Points, PrivateLink)
Task 4.2: Apply authorization mechanisms (custom IAM policies, Secrets Manager, Redshift RBAC, Lake Formation, RBAC/TBAC/ABAC)
Task 4.3: Ensure data encryption and masking (data masking, anonymization, compliance)
Task 4.4: Prepare logs for audit (CloudTrail, CloudWatch, logging config)
Task 4.5: Understand data privacy and governance (lineage, data catalog, data sharing, quality, profiling, lifecycle, auditing)

Core theme: Implement least privilege, encrypt everything, log everything, govern data at scale.

4.2 Services by Category

Authentication & Network Security

Service	Use	Key Detail
AWS IAM	Identity and access management	Users, groups, roles, policies; principle of least privilege
AWS IAM Identity Center	SSO + multi-account access	Integrates with AD/SAML apps; integrates with Lake Formation
VPC Security Groups	Network-level firewall	Stateful; allow rules only (no deny); control inbound/outbound
VPC Endpoints	Private connectivity to AWS services	Gateway (S3, DynamoDB) or Interface (PrivateLink) — no internet traffic
AWS PrivateLink	Expose services privately across VPCs	Interface VPC endpoint; secure cross-account service access
AWS Secrets Manager	Credential storage + auto-rotation	Store DB passwords, API keys; auto-rotate; integrates with RDS, Redshift
AWS Systems Manager Parameter Store	Configuration + secret storage	Simpler, cheaper alternative to Secrets Manager for non-rotating secrets

Authorization

Service	Mechanism	Granularity
AWS IAM Policies	RBAC	Service/resource level
AWS Lake Formation	TBAC (tag-based) + name-based	Table, column, row level on S3 data lake
Amazon Redshift RBAC	Database roles	Row-level security, column-level, dynamic data masking
QuickSight RLS	Row-level security dataset	Row level per user/group

Encryption

Service	Use	Key Detail
AWS KMS	Key management	KMS keys (formerly called CMKs), either AWS managed or customer managed; integrates with S3, Redshift, Glue, DynamoDB
SSE-S3	Server-side encryption with S3-managed keys	Simplest; AWS manages keys
SSE-KMS	Server-side encryption with KMS keys	Audit key usage via CloudTrail; customer controls key policy
DSSE-KMS	Dual-layer server-side encryption	Two layers of encryption in one operation (compliance requirement)
Client-side encryption	Encrypt before upload	Max control; customer manages encryption/decryption

Data Governance

Service	Role
AWS Glue Data Catalog	Technical metadata (schemas, partitions, lineage)
Amazon DataZone / SageMaker Catalog	Business catalog (data marketplace, sharing, governance)
Amazon Macie	PII/sensitive data detection in S3 using ML
AWS CloudTrail	API-level audit trail
Amazon CloudWatch	Operational monitoring + logging
AWS Lake Formation	Fine-grained access, data sharing, tagging

4.3 Authorization Deep Dive

IAM Policy Types

Type	When to Use	Notes
AWS Managed Policy	Quick setup; may be too broad	Broad permissions; not principle of least privilege
Customer Managed Policy	Precise control; reusable	Recommended; define exact actions + specific ARNs
Inline Policy	Role-specific, non-reusable	Attached directly to one role; avoid for most use cases
Resource-based Policy	S3 buckets, KMS keys, Lambda	Principal specified in policy; cross-account access

Principle of least privilege = grant only the minimum permissions needed for the task. Always prefer customer managed policies over AWS managed.
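As an illustration, a customer managed policy document that names exact actions and specific ARNs might be assembled like this. The bucket and prefix are placeholders; the point is that neither Action nor Resource is a bare wildcard.

```python
import json


def least_privilege_s3_read_policy(bucket, prefix):
    """Read-only access to one prefix of one bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Listing is limited to keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Placeholder bucket and prefix.
policy = least_privilege_s3_read_policy("example-data-lake", "curated/sales")
print(json.dumps(policy, indent=2))
```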

Lake Formation Access Control

Three levels:

Name-based — grant permissions on specific databases, tables, columns directly by name
Tag-based (LF-TBAC): assign LF-Tags to resources; grant permissions on LF-Tag expressions to principals; scalable for many resources
Row/column filtering — fine-grained row-level security + column exclusion

Setup flow:

Register S3 prefix as data lake location
Grant permissions on Glue Data Catalog resources (databases, tables)
Column/row filters for fine-grained access

Best practices:

Don't add bucket policies for S3 buckets registered with Lake Formation (Lake Formation controls access)
Don't use root AWS user as data lake admin
Use LF-TBAC for large numbers of tables (fewer grants to manage)
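The reason LF-TBAC needs fewer grants can be seen in a small local simulation. This is not the Lake Formation API; the table names, tag keys, and principal are made up, and the matching rule (all tag keys must match, any listed value is accepted) reflects my reading of LF-Tag expressions.

```python
# LF-Tags attached to each table (hypothetical catalog).
tables = {
    "sales.orders": {"domain": "sales", "sensitivity": "public"},
    "sales.returns": {"domain": "sales", "sensitivity": "public"},
    "sales.customers": {"domain": "sales", "sensitivity": "pii"},
    "hr.salaries": {"domain": "hr", "sensitivity": "pii"},
}

# One grant per principal: tag key -> set of accepted values.
grants = {
    "sales_analyst": {"domain": {"sales"}, "sensitivity": {"public"}},
}


def can_select(principal, table):
    """True if every tag key in the principal's expression matches the table."""
    expression = grants.get(principal)
    if expression is None:
        return False
    tags = tables[table]
    return all(tags.get(key) in values for key, values in expression.items())


readable = sorted(t for t in tables if can_select("sales_analyst", t))
print(readable)
```

A new table tagged domain=sales, sensitivity=public becomes readable without any additional grant, which is the scaling benefit over name-based grants.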

Authorization Methods

Method	Description	Best For
RBAC (Role-based)	Permissions based on role (job function)	Standard org structures
ABAC (Attribute-based)	Permissions based on tags/attributes on both principal and resource	Dynamic, scalable
TBAC (Tag-based, Lake Formation)	LF-Tags on data resources; permissions granted to principals on LF-Tag expressions	Large data lake with many tables

Redshift Database Security

Superuser — full access to all objects
Database users — created with CREATE USER; can have CREATEDB, CREATEUSER permissions
Roles — role-based access (CREATE ROLE; GRANT role TO user); roles can be nested
Row-level security (RLS) — define RLS policies filtering rows per user/role
Dynamic data masking (DDM) — mask column data at query time per user/role; no actual data change
Column-level security — GRANT SELECT (col1, col2) ON table TO role

4.4 Encryption Deep Dive

Encryption at Rest — S3 Options

Option	Key Management	Audit via CloudTrail	Compliance
SSE-S3	AWS managed	No key-level audit	Basic
SSE-KMS	AWS KMS (customer or AWS managed CMK)	Yes	Most common
DSSE-KMS	Two KMS layers	Yes	High compliance (e.g., CNSSP 15 two-layer encryption requirements)
Client-side	Customer	N/A	Max control

DSSE-KMS = Dual-layer Server-Side Encryption. Applies two independent layers of encryption in one S3 PUT operation. Least operational overhead for dual-layer compliance requirements.
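A short sketch of how the three server-side options differ in an S3 PutObject request, expressed as boto3-style keyword arguments. Bucket, key, and KMS key ARN are placeholders; the helper itself is illustrative and makes no AWS call.

```python
# ServerSideEncryption header value for each option.
SSE_MODES = {
    "SSE-S3": "AES256",
    "SSE-KMS": "aws:kms",
    "DSSE-KMS": "aws:kms:dsse",
}


def put_object_kwargs(bucket, key, body, mode, kms_key_id=None):
    """Build put_object kwargs for the chosen server-side encryption mode."""
    if mode not in SSE_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "SSE-S3" and kms_key_id is not None:
        raise ValueError("SSE-S3 uses S3-managed keys; no KMS key applies")
    kwargs = {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": SSE_MODES[mode],
    }
    if kms_key_id is not None:
        kwargs["SSEKMSKeyId"] = kms_key_id
    return kwargs


# Placeholder names; both encryption layers are applied in a single PUT.
dsse = put_object_kwargs(
    "example-phi-bucket",
    "records/2024/01/file.parquet",
    b"...",
    mode="DSSE-KMS",
    kms_key_id="arn:aws:kms:us-east-1:111122223333:key/example-key-id",
)
# With credentials configured: boto3.client("s3").put_object(**dsse)
print(dsse["ServerSideEncryption"])
```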

Encryption in Transit

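AWS data movement services protect data in transit, but not all in the same way: DataSync and AWS Backup use TLS by default, DMS supports SSL/TLS on its endpoints (set the endpoint SSL mode), and Site-to-Site VPN uses IPsec rather than TLS.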

Sensitive Data Detection & Masking

Service	Mechanism	Use
Amazon Macie	ML-based PII detection	Scan S3 buckets; detect names, SSN, credit card, email, etc.
Macie Custom Data Identifiers	Regex + keywords	Detect org-specific patterns (customer IDs, internal codes)
Glue Studio	Sensitive data detection + redaction	Detect and mask PII in Glue ETL pipelines; partial redaction
Redshift DDM	Dynamic data masking	Column-level masking at query time
DataBrew	Recipe-based PII masking	Visual, no-code masking for non-technical users

4.5 Logging & Audit

CloudTrail Data Events vs Management Events

Type	What it captures	Default?
Management events	API calls that manage AWS resources (create, delete, modify)	Yes
Data events	Object-level S3 operations (GetObject, PutObject), Lambda invocations	No — must enable separately

To audit S3 object access: Enable CloudTrail S3 data events for specific buckets.
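For illustration, the event selector that turns on S3 data events for specific buckets could be built as follows, in the shape boto3's CloudTrail put_event_selectors call expects. The trail and bucket names are placeholders.

```python
def s3_data_event_selectors(trail_name, bucket_names):
    """Build put_event_selectors kwargs that log object-level S3 activity."""
    return {
        "TrailName": trail_name,
        "EventSelectors": [
            {
                "ReadWriteType": "All",  # log both reads and writes
                "IncludeManagementEvents": True,
                "DataResources": [
                    {
                        "Type": "AWS::S3::Object",
                        # Trailing slash: all objects in the bucket.
                        "Values": [f"arn:aws:s3:::{name}/" for name in bucket_names],
                    }
                ],
            }
        ],
    }


# Placeholder trail and bucket names.
selectors = s3_data_event_selectors("example-audit-trail", ["example-data-lake"])
# With credentials configured:
#   boto3.client("cloudtrail").put_event_selectors(**selectors)
print(selectors["EventSelectors"][0]["DataResources"])
```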

Log Analysis Options

Scenario	Tool
Query structured logs in S3 with SQL	Amazon Athena
Real-time log analytics + dashboards	Amazon OpenSearch + OpenSearch Dashboards
Filter/search CloudWatch Logs	CloudWatch Logs Insights
Big data log processing	Amazon EMR
Monitor Glue/EMR job metrics	Amazon CloudWatch

4.6 Data Governance Pillars

Pillar	AWS Service(s)
Metadata management	AWS Glue Data Catalog, SageMaker Catalog
Data sharing	AWS Lake Formation cross-account sharing, DataZone
Data quality	AWS Glue Data Quality, DataBrew, Deequ
Data profiling	DataBrew, Glue Data Quality
Data lifecycle	S3 Lifecycle policies, DynamoDB TTL
Data lineage	SageMaker ML Lineage Tracking, SageMaker Catalog
Logging & auditing	CloudTrail, CloudWatch, S3 Access Logging

4.7 Exam Question Patterns

Common stems:

"Store DB credentials securely + auto-rotate" → AWS Secrets Manager
"Audit who accessed S3 objects" → CloudTrail data events (must be explicitly enabled)
"Column-level access control on data lake" → AWS Lake Formation
"PII detection in S3" → Amazon Macie
"Custom PII patterns (internal IDs)" → Macie custom data identifiers
"Two layers of encryption, least overhead" → DSSE-KMS
"Credentials in Glue job scripts (security risk)" → Move to AWS Secrets Manager, create Glue connection referencing the secret
"Fine-grained row access in QuickSight" → QuickSight Row-Level Security (RLS) dataset
"Large number of tables, scalable access control" → Lake Formation tag-based access control (LF-TBAC)
"Cross-account data lake access" → Lake Formation cross-account grants

Red herrings:

Using KMS encryption to implement column-level access control (KMS = encryption, not access control)
IAM policies alone for column/row-level permissions in data lake (IAM doesn't do column/row; Lake Formation does)
GuardDuty for PII classification (GuardDuty = threat detection; Macie = data classification)
AWS DMS for CDC when "open source" is required (DMS is proprietary)
Parameter Store for auto-rotating database credentials (use Secrets Manager; Parameter Store doesn't auto-rotate)

Keywords → Service mapping:

Keyword	Service
"auto-rotate credentials"	AWS Secrets Manager
"PII detection in S3"	Amazon Macie
"column-level / row-level on data lake"	AWS Lake Formation
"two layers encryption, least overhead"	DSSE-KMS
"audit API calls"	AWS CloudTrail
"SSO + multi-account"	IAM Identity Center
"private connectivity to AWS services"	VPC Endpoints / PrivateLink
"dynamic data masking at query time"	Redshift DDM
"data lineage"	SageMaker ML Lineage / SageMaker Catalog
"scalable tag-based access control"	Lake Formation LF-TBAC

4.8 Common Mistakes & Traps

Secrets Manager vs Parameter Store — Secrets Manager has auto-rotation; Parameter Store is simpler/cheaper but no built-in rotation
KMS vs Lake Formation — KMS encrypts data at rest; Lake Formation controls who can access which columns/rows
SSE-S3 vs SSE-KMS — SSE-KMS has CloudTrail audit for key usage; SSE-S3 doesn't
Macie managed vs custom identifiers — managed = standard PII (credit cards, SSN); custom = org-specific regex patterns
CloudTrail data events not enabled by default — management events are on by default; S3 data events (object reads/writes) require explicit enablement
Lake Formation bucket policies conflict — when S3 bucket is registered with Lake Formation, remove/don't use bucket policies (they conflict); Lake Formation manages access
Hardcoded credentials in Glue jobs — always store in Secrets Manager and reference via Glue connection

4.9 Practice Scenarios

Scenario 1

Requirement: Healthcare org must apply two independent layers of encryption to all S3 files. Least operational overhead.
Solution: DSSE-KMS, dual-layer server-side encryption with a single S3 operation.
Wrong answers: Client-side + SSE = two steps, more operational overhead. SSE-KMS alone = one layer. SSE-S3 + SSE-KMS = not a valid combination.

Scenario 2

Requirement: Financial company needs to automatically detect credit card numbers and internal customer reference IDs (custom format) in S3 data lake.
Solution: Enable Macie managed data identifiers (for credit cards) + define Macie custom data identifiers (for internal reference patterns using regex).
Wrong answers: GuardDuty (threat detection, not data classification), Lake Formation (access control, not data discovery), EMR with custom regex (too much operational overhead).
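A custom data identifier is essentially a regex plus optional keywords that must appear near the match. The sketch below approximates that behavior locally so a pattern can be tried before it is defined in Macie. The CR-######## format, the keywords, and the 50-character distance are invented for the example, and the proximity check is a simplification of how Macie measures distance.

```python
import re

# Hypothetical internal reference format: "CR-" followed by 8 digits.
CUSTOMER_REF = re.compile(r"\bCR-\d{8}\b")
KEYWORDS = ("customer ref", "cust id")
MAX_DISTANCE = 50  # characters before the match in which a keyword must appear


def find_customer_refs(text):
    """Return reference IDs that have a keyword shortly before them."""
    hits = []
    lowered = text.lower()
    for match in CUSTOMER_REF.finditer(text):
        window = lowered[max(0, match.start() - MAX_DISTANCE):match.start()]
        if any(keyword in window for keyword in KEYWORDS):
            hits.append(match.group())
    return hits


with_keyword = find_customer_refs("Customer ref: CR-12345678 shipped")
without_keyword = find_customer_refs("Tracking CR-12345678")
print(with_keyword, without_keyword)
```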

Scenario 3

Requirement: Glue job has MongoDB credentials hardcoded in the script. Fix the security vulnerability.
Solution: Store credentials in AWS Secrets Manager → create a Glue connection to MongoDB referencing the secret → Glue job uses the connection.
Wrong answers: Glue job parameters (still exposed in logs), IAM Identity Center (authenticates AWS users, not MongoDB), S3 config file (still exposed).
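Where a script does need to read a secret itself, rather than relying only on the Glue connection, the lookup might look like the sketch below. It takes the Secrets Manager client as a parameter so it can be exercised with a stand-in; the secret name and the "username"/"password" keys are assumptions about how the secret was stored.

```python
import json


def load_db_credentials(secrets_client, secret_id):
    """Fetch a JSON secret and return (username, password)."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # Key names are an assumption about the secret's JSON layout.
    return secret["username"], secret["password"]


class FakeSecretsClient:
    """Stand-in for boto3.client('secretsmanager'), for local runs only."""

    def get_secret_value(self, SecretId):
        payload = {"username": "etl_user", "password": "example-only"}
        return {"SecretString": json.dumps(payload)}


# In a real job, pass boto3.client("secretsmanager") instead of the fake.
user, password = load_db_credentials(FakeSecretsClient(), "prod/mongodb/etl")
print(user)
```

Nothing sensitive lives in the script: the job's IAM role needs permission to read the one secret, and rotation does not require a code change.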

Scenario 4

Requirement: Multiple analytics teams (Athena, Glue, Redshift Spectrum) accessing a large S3 data lake with 500+ tables. Need scalable fine-grained access control.
Solution: Register S3 with Lake Formation → define LF-Tags on tables/columns → grant permissions to IAM principals based on tags (LF-TBAC). Scalable, with fewer grants than name-based.
Wrong answers: IAM policies or S3 bucket policies (no column-level control, doesn't scale), KMS (encryption only, not access control).

Scenario 5

Requirement: Company processes patient records. Need to detect and redact PII (names, addresses) in Glue ETL pipelines while preserving data usability.
Solution: Use AWS Glue Studio sensitive data detection with partial redaction option (not full deletion, which preserves data utility for analytics).
Why not delete? Full deletion prevents downstream analytics that need the record structure. Partial redaction masks specific fields while keeping record integrity.
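The difference between partial redaction and deletion can be shown with a few lines of plain Python. This is a conceptual illustration only, not the Glue Studio transform; the field names, the sample record, and the "keep the last 4 characters" rule are assumptions.

```python
def partially_redact(value, keep_last=4, mask_char="*"):
    """Mask all but the last few characters of a string."""
    if len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]


def redact_record(record, sensitive_fields):
    """Mask only the sensitive fields; keep every key so the schema is unchanged."""
    return {
        key: partially_redact(value) if key in sensitive_fields else value
        for key, value in record.items()
    }


# Invented sample record.
patient = {"patient_id": "P-1001", "name": "Jane Example", "diagnosis_code": "J45"}
redacted = redact_record(patient, sensitive_fields={"name"})
print(redacted)
```

The record keeps its shape and its non-sensitive values, so joins and aggregations downstream still work.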

CROSS-DOMAIN QUICK REFERENCE

Zero-ETL Integrations (High-Value Exam Topic)

Zero-ETL = direct data movement with no ETL pipeline code to write. Always preferred when available.

Source	Destination	Integration
Amazon Aurora MySQL/PostgreSQL	Amazon Redshift	Aurora zero-ETL
Amazon RDS for MySQL	Amazon Redshift	RDS zero-ETL
Amazon DynamoDB	Amazon OpenSearch	DynamoDB zero-ETL
Amazon MSK	Amazon Redshift	Redshift streaming ingestion
Amazon KDS	Amazon Redshift	Redshift streaming ingestion

Service "Minimum Operational Overhead" Ranking

When the exam says "least operational overhead," rank options like this:

Zero-ETL / native integrations (always first if available)
Fully managed serverless (Glue, Athena, Lambda, Data Firehose, MSF)
Managed with some config (EMR Serverless, Redshift Serverless, MSK Serverless)
Provisioned managed (EMR on EC2, Redshift Provisioned, MSK Provisioned)
Custom code/scripts (always avoid in "least overhead" questions)

The "Open Source" Test

If the question mentions "open source" requirement:

MSK instead of KDS
EMR instead of Glue (EMR supports more open frameworks)
MSK Connect + Debezium instead of AWS DMS for CDC
Apache Iceberg/Hudi instead of proprietary formats
Athena (Presto/Trino) instead of Redshift for ad-hoc serverless SQL

Critical Service Limits

Service	Limit	Exam Relevance
Lambda timeout	15 minutes max	Eliminates Lambda for long-running transforms
KDS shard write	1 MB/s or 1,000 records/s per shard	Design for throughput
KDS shard read	2 MB/s per shard shared across standard consumers; 2 MB/s per shard per consumer with enhanced fan-out	Enhanced fan-out for multiple consumers
KDS retention	24h default, up to 365 days	Replay window
Glue Python shell DPU	1/16 DPU min	Cheapest compute
Glue Spark jobs DPU	2 DPU min	Minimum resource
S3 object size	5 TB max	Rarely tested but know it
Redshift COPY	Parallel from S3	Fastest way to load Redshift
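The KDS limits in the table translate into a simple shard-count estimate. The sketch below is a back-of-the-envelope calculation under the assumption that every consumer reads the whole stream; the function name and workload numbers are illustrative.

```python
import math

WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def required_shards(ingest_mb_s, records_s, consumers, enhanced_fan_out=False):
    """Estimate the shard count needed to satisfy write and read limits."""
    by_bytes = math.ceil(ingest_mb_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_s / WRITE_RECORDS_PER_SHARD)
    # Standard consumers share 2 MB/s per shard; with enhanced fan-out each
    # consumer gets its own 2 MB/s per shard, so reads stop scaling with
    # the number of consumers.
    read_mb_s = ingest_mb_s if enhanced_fan_out else ingest_mb_s * consumers
    by_reads = math.ceil(read_mb_s / READ_MB_PER_SHARD)
    return max(1, by_bytes, by_records, by_reads)


# 5 MB/s in, 3,000 records/s, 3 consumers each reading the full stream.
standard = required_shards(5, 3000, consumers=3)
fan_out = required_shards(5, 3000, consumers=3, enhanced_fan_out=True)
print(standard, fan_out)
```

With three standard consumers the read limit dominates and forces more shards; enhanced fan-out lets the write limit set the shard count instead.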

SageMaker Unified Studio (New in v1.1)

Combines DataBrew, SageMaker Studio, EMR Studio, Glue Studio into one unified experience
SageMaker Catalog = business data catalog (data discovery, governed sharing, lineage)
SageMaker Lakehouse = unified access layer across S3 data lakes and Redshift
Domain/domain units/projects = org structure for access control in SageMaker Unified Studio (Skill 4.1.7)