> **Sources:** AWS Certified Data Engineer Associate Exam Guide (DEA-C01, v1.1) · *AWS Certified Data Engineer* (Mishra et al., O'Reilly)
# AWS Certified Data Engineer Associate (DEA-C01)

# Complete Exam Cheat Sheet: All 4 Domains

> **Exam format:** 65 questions total (50 scored + 15 unscored) | 130 minutes | Passing score 720/1000 (scaled)

---

## EXAM WEIGHT SNAPSHOT

| Domain | Weight | Approx. Scored Questions |
|--------|--------|--------------------------|
| 1: Data Ingestion and Transformation | 34% | ~17 |
| 2: Data Store Management | 26% | ~13 |
| 3: Data Operations and Support | 22% | ~11 |
| 4: Data Security and Governance | 18% | ~9 |

> **Memory hook (ISOS):** **I**ngestion and transformation → **S**tore management → **O**perations → **S**ecurity. The weight drops with each domain; ingestion is the biggest.

---

---

# DOMAIN 1: DATA INGESTION AND TRANSFORMATION (34%)

## 1.1 Domain Overview

**Task statements (from exam guide v1.1):**

- **Task 1.1:** Perform data ingestion (streaming + batch, schedulers, event triggers, rate limits, fan-in/fan-out, stateful/stateless)
- **Task 1.2:** Transform and process data (containers, JDBC/ODBC, format conversion, cost optimization, troubleshooting)
- **Task 1.3:** Orchestrate data pipelines (Step Functions, MWAA, Glue Workflows, EventBridge, SNS/SQS, fault tolerance)
- **Task 1.4:** Apply programming concepts (Lambda concurrency, IaC, CI/CD, distributed computing, LLMs for data processing)

**Core theme:** Moving data from source to target, reliably and cost-efficiently, at any velocity.

---

## 1.2 Services by Category

### Streaming Ingestion & Storage

| Service | Primary Use | Key Exam Details | Choose When |
|---------|-------------|------------------|-------------|
| **Kinesis Data Streams (KDS)** | Real-time stream storage + replay | Shard = 1 MB/s in, 2 MB/s out; up to 365-day retention; 70ms latency with enhanced fan-out; 200-500ms without | AWS-native integrations, millisecond latency, replay needed |
| **Amazon MSK (Kafka)** | Open-source Kafka stream storage | Lowest latency; tiered storage to S3; longer configurable retention; MSK Connect for CDC; MSK Replicator for cross-region | Open-source ecosystem (Debezium, DuckDB), highest throughput, Kafka expertise on team |
| **Amazon Data Firehose** | Near-real-time delivery to data stores | Fully managed, no code; buffers to S3/Redshift/OpenSearch/Splunk/Snowflake; can invoke Lambda for transformation; converts JSON→Parquet/ORC | Delivery to S3/Redshift/OpenSearch with zero ops overhead |

**KDS vs MSK Decision:**

| Attribute | Kinesis Data Streams | Amazon MSK |
|-----------|---------------------|------------|
| Management | Low | Low (Serverless) to Medium (Provisioned) |
| Scalability | Seconds (one click) | Minutes (one click) |
| Throughput | On-demand (capped limits) | Highest of the two |
| Open source | No | Yes (Apache Kafka) |
| Data retention | Up to 365 days | Longer; configurable; S3 tiered storage |
| Latency | 70ms (enhanced fan-out) / 200-500ms | Lowest |
| CDC from databases | Via AWS DMS | Native via MSK Connect + Debezium |

### Batch Ingestion

| Service | Use Case | Key Details |
|---------|----------|-------------|
| **AWS Glue** | Serverless batch ETL | Spark jobs, Python shell, streaming ETL; Glue Bookmarks for incremental loads; connectors to RDS, Redshift, Salesforce, S3 |
| **Amazon EMR** | Large-scale big data | Hadoop, Spark, Hive, Flink, Presto; EMR on EC2/Serverless/EKS; use S3 not HDFS for persistent storage |
| **AWS DMS** | Database migration + CDC | Sources: Oracle, MySQL, PostgreSQL, MongoDB, S3; Targets: Redshift, S3, DynamoDB, Kafka, OpenSearch |
| **Amazon AppFlow** | SaaS-to-AWS batch ingestion | No code; Salesforce, Slack, SAP to S3/Redshift |
| **AWS DataSync** | File transfer (on-prem → AWS) | Copies only changed files after initial seed; preserves metadata/permissions |
| **AWS Data Exchange** | Third-party dataset ingestion | Marketplace for licensed data (S&P, weather, FINRA) |

### Data Transformation

| Service | Type | Best For | Not Good For |
|---------|------|----------|--------------|
| **AWS Glue** | Batch + Streaming Spark | Serverless ETL, format conversion, managed Spark | Complex custom frameworks, >1 GB files (split them) |
| **Amazon EMR** | Batch + Streaming | Complex transforms, Hadoop ecosystem, full control | Teams wanting zero cluster management |
| **Amazon Redshift** | SQL batch | Data warehouse transforms, sub-second analytics, COPY/UNLOAD | Unstructured data, Python-heavy transforms |
| **AWS Lambda** | Event-driven lightweight | <15 min transforms, enrichment, glue logic | Large datasets, stateful operations |
| **Amazon MSF (Managed Service for Apache Flink)** | Streaming stateful | Complex windowing, exactly-once, out-of-order events | Spark-ecosystem teams, schema evolution |
| **Glue DataBrew** | Visual no-code prep | Non-technical personas, data quality, PII detection | Production-scale Spark workloads |

### Orchestration

| Service | Best For | Key Detail |
|---------|----------|------------|
| **AWS Step Functions** | Complex multi-step workflows with branching, retries | State machine; Standard (long-running) vs Express (high-volume, short) |
| **Amazon MWAA (Airflow)** | Python DAG-based complex workflows, teams with Airflow expertise | Fully managed Apache Airflow; higher complexity |
| **AWS Glue Workflows** | Glue-specific job chains | Triggers: on-demand, scheduled, event-based; simpler than Step Functions |
| **Amazon EventBridge** | Event-driven scheduling and routing | Cron schedules, event rules, connects 200+ AWS services |
| **Redshift Scheduler** | SQL-only pipeline scheduling within Redshift | Built-in; no external orchestrator needed |

---

## 1.3 Decision Matrix

### "Use X when...": Batch Transformation

| Requirement | Answer |
|-------------|--------|
| Serverless + familiar Spark API | AWS Glue (Spark job) |
| Full framework control + custom libraries (Hudi, HBase) | Amazon EMR on EC2 |
| SQL transforms on structured warehouse data | Amazon Redshift |
| Lightweight <15 min event-driven transforms | AWS Lambda |
| Spark but no cluster management | EMR Serverless |
| Non-technical user doing data prep | AWS Glue DataBrew |

### "Use X when...": Streaming Transformation

| Requirement | Answer |
|-------------|--------|
| Stateful transforms, exactly-once, out-of-order events | Amazon MSF (Managed Flink) |
| Simple stateless Spark Structured Streaming | AWS Glue Streaming ETL |
| Lightweight transform before delivery to S3/Redshift | Amazon Data Firehose + Lambda |
| Low overhead near-real-time delivery | Amazon Data Firehose |

### "Use X when...": Orchestration

| Requirement | Answer |
|-------------|--------|
| Branching logic, error handling, retries | AWS Step Functions |
| Python DAGs, team has Airflow background | Amazon MWAA |
| Glue-only pipeline with triggers | Glue Workflows |
| Cron or event-based trigger | Amazon EventBridge |

---

## 1.4 Key Concepts Deep Dive

### Kinesis Data Streams Architecture

- **Shard** = fundamental unit: 1 MB/s (or 1,000 records/s) write, 2 MB/s read shared by standard consumers, up to 5 read transactions/s
- **Enhanced fan-out**: each consumer gets a dedicated 2 MB/s per shard → 70ms latency (standard polling = 200-500ms)
- **Record**: max 1 MB; consists of a partition key, a sequence number, and a data blob
- **Producers**: KPL (Kinesis Producer Library), Kinesis Agent, AWS SDK, IoT Core, DMS, CloudWatch
- **Consumers**: KCL (Kinesis Client Library), Lambda, Data Firehose, MSF, Glue Streaming

### Glue Worker Types (exam-tested)

| Worker | vCPU | Memory | Use |
|--------|------|--------|-----|
| G.025X | 2 | 4 GB | Low-volume streaming (Glue 3.0+ streaming jobs only) |
| G.1X | 4 | 16 GB | Standard batch jobs |
| G.2X | 8 | 32 GB | Lightweight transforms (default recommendation) |
| G.4X | 16 | 64 GB | Demanding aggregations, joins |
| G.8X | 32 | 128 GB | Most demanding transforms |

**DPU = 4 vCPU + 16 GB RAM. Minimum 2 DPUs for Spark jobs. Python shell = 1/16 DPU (cheapest).**

### Glue Job Types

- **Spark jobs**: batch, min 2 DPUs
- **Streaming ETL**: Spark Structured Streaming, processes in configurable time windows (default 100s), min 2 DPUs
- **Python shell**: lightweight single-node scripts, min 1/16 DPU; no Spark, so not for large datasets (unlike Lambda, it is not capped at 15 minutes)

### Glue Bookmarks

Stores information about previously processed data (S3 file paths, JDBC primary key ranges). Enables **incremental processing**: each run handles only new or modified data. Think of it as CDC for Glue batch jobs.
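The bookmark idea can be pictured as persisted state that filters each run's input. The sketch below is a conceptual model only, not how Glue implements bookmarks: `S3Object`, `select_new_files`, and the timestamp-based bookmark are hypothetical names chosen for illustration. In a real job, bookmarks are switched on through the `--job-bookmark-option` job parameter rather than hand-written code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for one entry of an S3 listing."""
    key: str
    last_modified: int  # Unix epoch seconds


def select_new_files(objects, bookmark_ts):
    """Return objects modified after the bookmark, plus the advanced bookmark."""
    new = [o for o in objects if o.last_modified > bookmark_ts]
    # If nothing is new, the bookmark stays where it was.
    next_bookmark = max((o.last_modified for o in new), default=bookmark_ts)
    return new, next_bookmark


listing = [
    S3Object("raw/2024-01-01.json", 100),
    S3Object("raw/2024-01-02.json", 200),
    S3Object("raw/2024-01-03.json", 300),
]

# First run: nothing has been processed yet, so every file is picked up.
first_batch, bookmark = select_new_files(listing, bookmark_ts=0)

# Second run: one new file has arrived, and only that file is processed.
listing.append(S3Object("raw/2024-01-04.json", 400))
second_batch, bookmark = select_new_files(listing, bookmark_ts=bookmark)
```

Without the stored bookmark, the second run would rescan all four files, which is the full re-scan cost that bookmarks are meant to avoid.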

### Data Firehose Buffering

- **Buffer size** (MB) and **buffer interval** (seconds) control when data is flushed
- 0-second interval (zero buffering) = near-immediate delivery for latency-sensitive use cases
- Larger buffer + longer interval = bigger batches for high-throughput S3/Redshift

### Stateful vs Stateless Transactions

- **Stateless**: Each event processed independently (filter, route, enrich). No memory of past events.
- **Stateful**: Aggregations, windowing, session detection. Requires state across events. Flink excels here.

### Fan-In and Fan-Out

- **Fan-in**: Multiple producers → single stream (data aggregation)
- **Fan-out**: Single stream → multiple consumers. Use enhanced fan-out (KDS) or consumer groups (MSK) to avoid read throttling.

---

## 1.5 Exam Question Patterns

**Common stems:**

- "...with the **least operational overhead**" → prefer managed/serverless: Glue > EMR, Data Firehose > custom consumer, Step Functions > custom retry logic
- "...requires **open source**" → MSK over KDS, EMR over Glue
- "...needs **exactly-once** processing" → MSF (Flink)
- "...needs to handle **out-of-order events**" → MSF (Flink)
- "...needs **CDC from relational database**" → AWS DMS or MSK Connect + Debezium
- "...needs to convert **JSON to Parquet**" → Data Firehose (built-in), Glue job, or Lambda

**Red herrings:**

- DMS for CDC to MSK when the question says "open source only" (DMS is a proprietary AWS service; MSK Connect + Debezium is open source)
- MSF for schema evolution (Flink doesn't support schema evolution well; Spark Streaming + Glue does)
- Lambda for jobs >15 minutes (Lambda max timeout = 15 min)
- KDS for lowest latency (MSK has lower latency than KDS)

**Keywords → Service mapping:**

| Keyword | Service |
|---------|---------|
| "replay data" | Kinesis Data Streams or MSK |
| "near-real-time delivery to S3" | Amazon Data Firehose |
| "incremental processing" / "job bookmarks" | AWS Glue |
| "exactly-once semantics" | Amazon MSF (Flink) |
| "open source Kafka" | Amazon MSK |
| "stateful stream processing" | MSF or Glue Streaming |
| "batch big data, full control" | Amazon EMR |
| "Debezium CDC" | Amazon MSK Connect |

---

## 1.6 Quick Reference Tables

### Lambda Limits (exam-relevant)

| Limit | Value |
|-------|-------|
| Max timeout | 15 minutes |
| Max memory | 10 GB |
| Max deployment package (zipped) | 50 MB |
| Concurrency default (per region) | 1,000 |

### Firehose Destinations

S3, Redshift, OpenSearch, Splunk, Datadog, Snowflake, MongoDB, New Relic, HTTP endpoint

### Format Support Matrix

| Format | Splittable | Columnar | Schema Evolution | Use |
|--------|-----------|----------|-----------------|-----|
| CSV | Yes (by line) | No | No | Raw ingest |
| JSON | No | No | No | APIs, web |
| Avro | Yes | No | **Yes** | Streaming serialization |
| Parquet | Yes | **Yes** | Limited | Analytics, data lake |
| ORC | Yes | **Yes** | No | Hadoop-heavy analytics |

---

## 1.7 Common Mistakes & Traps

1. **Thinking KDS delivers to S3 directly**: it doesn't. Firehose sits between KDS/MSK and S3.
2. **Using Lambda for long-running transforms**: the 15-minute maximum rules it out for anything long-running.
3. **Ignoring Glue Bookmarks**: without them, every run is a full re-scan, which is expensive and slow.
4. **Choosing EMR when the question says "least ops overhead"**: EMR on EC2 requires more management than Glue or EMR Serverless.
5. **MSF for schema evolution**: Flink does not handle schema evolution well; use Spark Streaming (Glue) for that.
6. **Forgetting enhanced fan-out**: standard KDS consumers share 2 MB/s per shard; enhanced fan-out gives each consumer a dedicated 2 MB/s per shard.
7. **Choosing G.8X workers for everything**: overkill increases cost; match worker to workload.

---

## 1.8 Practice Scenarios

### Scenario 1

**Requirement:** IoT devices publish MQTT data. Need to ingest to S3 data lake with lowest latency, least operational overhead, convert to Parquet.

**Solution:** IoT Core → Amazon MSK (lowest latency) → Amazon Data Firehose → S3 (Firehose converts JSON→Parquet natively)

**Why not KDS?** MSK has lower latency. **Why not Glue streaming?** More operational overhead than Firehose for simple delivery.

### Scenario 2

**Requirement:** Ingest CDC data from Aurora MySQL into a data lake using open-source tooling only.

**Solution:** MSK Connect + Debezium → Amazon MSK → S3 via Data Firehose

**Wrong answer trap:** AWS DMS. It is proprietary, so it violates the open-source requirement.

### Scenario 3

**Requirement:** Real-time transaction fraud detection: detect patterns across events in a 5-minute window with exactly-once guarantees.

**Solution:** KDS or MSK → Amazon MSF (Flink), which provides stateful windowing + exactly-once semantics

**Why not Glue Streaming?** Glue Streaming has limitations in windowing and no native exactly-once.

### Scenario 4

**Requirement:** Data team runs daily Spark ETL on S3 data. Jobs are intermittent, want zero cluster management.

**Solution:** AWS Glue (Spark jobs) or EMR Serverless

**Tie-breaker:** If they also need Hive → EMR Serverless (it runs Spark and Hive). HBase or other custom frameworks need EMR on EC2, which brings cluster management back. If Spark-only → AWS Glue wins on simplicity.

### Scenario 5

**Requirement:** Ingest Kinesis stream directly into Redshift with minimal operational overhead.

**Solution:** Amazon Redshift Streaming Ingestion (native feature, no intermediate S3)

**Wrong answer trap:** Firehose → S3 → COPY command adds operational steps and latency.

### Scenario 6

**Requirement:** Multi-step pipeline: crawl S3 → Glue ETL → quality check → load to Redshift. Need retries and branching.

**Solution:** AWS Step Functions orchestrating Glue crawler, Glue job, Lambda quality check, Redshift COPY

### Scenario 7

**Requirement:** Non-technical analyst needs to clean data, fill missing values, remove duplicates, with no code.

**Solution:** AWS Glue DataBrew, the visual, no-code data preparation tool for non-technical users.

---

---

# DOMAIN 2: DATA STORE MANAGEMENT (26%)

## 2.1 Domain Overview

**Task statements:**

- **Task 2.1:** Choose a data store (cost, performance, access patterns, migration, locks, open table formats, vector indexes)
- **Task 2.2:** Understand data cataloging systems (Glue Data Catalog, crawlers, partitions, Hive metastore, SageMaker Catalog)
- **Task 2.3:** Manage the lifecycle of data (S3 lifecycle policies, versioning, DynamoDB TTL, COPY/UNLOAD)
- **Task 2.4:** Design data models and schema evolution (Redshift schema, DynamoDB keys/indexes, Lake Formation, lineage, compression)

**Core theme:** Picking the right store, keeping data accessible and cost-efficient throughout its lifecycle.

---

## 2.2 Services by Category

### Data Warehouses

| Service | Key Details | Choose When |
|---------|-------------|-------------|
| **Amazon Redshift Provisioned** | MPP; RA3 nodes (managed storage); leader + compute nodes; WLM queues; full SQL | Complex OLAP, predictable high load, sub-second queries |
| **Amazon Redshift Serverless** | Auto-scales; RPU-based pricing; no cluster management | Variable/intermittent analytics workloads |
| **Redshift Spectrum** | Query S3 data directly from Redshift without loading | Lakehouse pattern: query cold data without ETL |
| **Redshift Federated Query** | Query live RDS/Aurora from Redshift | Avoid data movement, join warehouse with operational DB |
| **Redshift Materialized Views** | Pre-computed results, refresh on-demand or auto | Accelerate repeated complex queries |

### Data Lakes

| Service | Key Details |
|---------|-------------|
| **Amazon S3** | Primary data lake storage; 11 9s durability; unlimited scale; tiered storage classes |
| **AWS Lake Formation** | Governance layer over S3 + Glue Data Catalog; fine-grained permissions (table/column/row) |
| **Amazon S3 Tables** | Native managed Apache Iceberg tables in S3 (new feature) |

### Databases

| Service | Type | Exam Use Case |
|---------|------|---------------|
| **Amazon DynamoDB** | NoSQL key-value/document | High-throughput OLTP, millisecond latency, flexible schema; streams for CDC |
| **Amazon RDS / Aurora** | Relational OLTP | Traditional SQL apps; Aurora zero-ETL → Redshift |
| **Amazon Aurora PostgreSQL** | Relational | HNSW vector indexing (via pgvector) for AI/ML similarity search |
| **Amazon MemoryDB** | Redis-compatible in-memory | Ultra-fast key/value, microsecond read latency, durable |
| **Amazon ElastiCache** | In-memory cache | Cache layer, session data (not durable like MemoryDB) |
| **Amazon Neptune** | Graph database | Social graphs, fraud detection, knowledge graphs |
| **Amazon OpenSearch Service** | Search + log analytics | Full-text search, log aggregation, dashboards (OpenSearch Dashboards) |
| **Amazon Keyspaces** | Managed Cassandra | Wide-column, high-scale OLTP; Cassandra workloads |

### Open Table Formats

| Format | Key Features | Built On | Exam Signal |
|--------|-------------|----------|-------------|
| **Apache Iceberg** | ACID transactions, schema evolution, time-travel, snapshot isolation, concurrent writes | Any storage | "Schema evolution" or "concurrent writes" → Iceberg |
| **Apache Hudi** | Incremental processing, upserts, CDC from databases | Spark | "Upserts into data lake" |
| **Delta Lake** | ACID + metadata, built for Spark | Apache Spark | "Spark Delta" scenarios |

**Memory hook:** **IHD = I**ceberg (most flexible), **H**udi (upserts), **D**elta (Spark-native).

---

## 2.3 Decision Matrix

### Choose a Data Store

| Requirement | Answer |
|-------------|--------|
| Complex SQL analytics, BI dashboards | Amazon Redshift |
| OLTP millisecond reads/writes, flexible schema | Amazon DynamoDB |
| Full-text search, log analytics | Amazon OpenSearch |
| Graph relationships | Amazon Neptune |
| Key/value at microsecond latency, durable | Amazon MemoryDB |
| Vector similarity search (AI/ML) | Aurora PostgreSQL with HNSW |
| Data lake (raw + curated) | Amazon S3 + Lake Formation |
| Traditional relational OLTP | Amazon RDS / Aurora |

### S3 Storage Classes (lifecycle progression)

| Class | Access Pattern | Storage Cost | Retrieval |
|-------|---------------|------|-----------|
| S3 Standard | Frequent | Highest | Immediate |
| S3 Standard-IA | Infrequent | Medium | Immediate |
| S3 Intelligent-Tiering | Unknown pattern | Auto-optimizes | Immediate or delayed |
| S3 Glacier Instant Retrieval | Quarterly | Low | Milliseconds |
| S3 Glacier Flexible Retrieval | Archive, annual | Very low | Minutes-hours |
| S3 Glacier Deep Archive | Compliance archive | Lowest | 12-48 hours |

---

## 2.4 Key Concepts Deep Dive

### Redshift Data Modeling

**Star schema**: fact table + dimension tables (denormalized). Best for OLAP query performance.

**Snowflake schema**: normalized dimensions. Less redundancy, more joins.

**3NF**: fully normalized. Good for OLTP, bad for analytics (too many joins).
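
To make the star schema concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for Redshift (an assumption made purely so the example runs anywhere). The table and column names (`fact_sales`, `dim_product`, and so on) are hypothetical. One fact table joins to one dimension table, and the query aggregates a measure by a dimension attribute, which is the typical OLAP access pattern.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (10, 1, 12.5), (11, 1, 7.5), (12, 2, 30.0);
""")

# One join from the fact table to a denormalized dimension, then aggregate.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
conn.close()
```

In a snowflake schema, `category` would move into its own table referenced from `dim_product`, so the same report would need one more join.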

**Redshift physical model settings:**

- **Distribution style**: AUTO (default), EVEN (round-robin), KEY (join column), ALL (small dimension tables)
- **Sort key**: COMPOUND (range queries), INTERLEAVED (multiple filter columns)
- **Compression encoding**: ZSTD (best general), AZ64 (numeric), BYTEDICT (low cardinality)
- **COPY command**: loads from S3; parallel; supports CSV, JSON, Parquet, ORC
- **UNLOAD command**: exports to S3; parallel; supports Parquet, CSV

**WLM (Workload Management):**

- Manages query concurrency and memory allocation
- AutoWLM recommended; or define manual queues
- Watch for queries in a WLM queue spilling to disk, which indicates memory pressure

### DynamoDB Data Modeling

**Partition key (PK)**: determines data distribution. Choose high-cardinality values to avoid hot partitions.

**Sort key (SK)**: enables range queries within a partition. Optional.

**Indexes:**

- **GSI (Global Secondary Index)**: Different PK/SK; separate provisioned capacity; supports eventual consistency only; use for alternate access patterns
- **LSI (Local Secondary Index)**: Same PK, different SK; shares table capacity; must be defined at table creation; supports strong consistency

**DynamoDB Streams**: Ordered sequence of item-level changes → can feed into Lambda, KDS for CDC.

**DynamoDB TTL**: Auto-expire items by setting a TTL attribute (Unix timestamp in seconds). Deletion is free (consumes no write capacity). Use for session data, temporary records.
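
The TTL value has to be a Unix epoch timestamp in seconds, which is easy to get wrong (milliseconds and ISO strings are common mistakes). The sketch below shows one way to compute it with the standard library. `ttl_epoch_seconds`, the `session_id` key, and the `ttl` attribute name are assumptions for illustration; the attribute name must match whatever is configured as the TTL attribute on the table, and the item is shown as a plain dictionary rather than a real DynamoDB request.

```python
from datetime import datetime, timedelta, timezone


def ttl_epoch_seconds(created_at, keep_for):
    """Return the expiry time as whole Unix epoch seconds."""
    return int((created_at + keep_for).timestamp())


# Fixed, timezone-aware creation time so the result is reproducible.
created = datetime(2024, 1, 1, tzinfo=timezone.utc)

item = {
    "session_id": "abc123",  # hypothetical partition key
    "ttl": ttl_epoch_seconds(created, timedelta(days=7)),  # expires 2024-01-08 UTC
}
```

Using a timezone-aware `datetime` avoids the local-time offset that a naive `datetime` would silently introduce.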

### Data Catalog (Glue Data Catalog)

- Central **Hive Metastore-compatible** metadata repository
- Stores database, table, column, partition, statistics metadata
- **Glue Crawlers** auto-detect schema by scanning data samples; populate Data Catalog
- Integrated with Athena, Redshift Spectrum, EMR, Lake Formation, QuickSight
- **Technical metadata** = schema, data types, partitions, lineage
- **Business metadata** = SageMaker Catalog / DataZone (terms, ownership, policies)

### Apache Iceberg Key Features

- ACID transactions on data lake tables
- **Schema evolution**: add/rename/drop columns without breaking existing queries
- **Time travel**: query historical snapshots
- **Snapshot isolation**: concurrent reads and writes safely
- Native support in Athena, Glue, EMR, Redshift Spectrum

### Vector Index Types (Skill 2.1.8)

| Index | Full Name | Best For |
|-------|-----------|----------|
| **HNSW** | Hierarchical Navigable Small World | Fast approximate nearest neighbor search; Aurora PostgreSQL supports this |
| **IVF** | Inverted File Index | Large-scale vector search; clusters vectors into buckets |

---

## 2.5 Data Lifecycle Management

### S3 Lifecycle Policies

- Transition objects between storage classes after N days
- Expire (delete) objects after N days
- Apply to specific prefixes or tags
- Versioning → lifecycle policy can expire non-current versions

### Medallion Architecture (Bronze/Silver/Gold)

| Layer | Name | Content | Format |
|-------|------|---------|--------|
| Raw / Bronze | Landing Zone | Unprocessed original data | CSV, JSON, Avro |
| Stage / Silver | Cleansed & Conformed | Validated, deduplicated, standardized | Parquet, ORC |
| Curated / Gold | Analytics-ready | Aggregated, business-logic applied | Parquet, Iceberg |

### Lakehouse Pattern: S3 + Redshift

```
Raw data → S3 (cold, cheap) → ETL (Glue/EMR) → COPY → Redshift (hot, fast SQL)
Redshift old data → UNLOAD → S3 → query via Spectrum or Athena
```

### DynamoDB TTL

Set a `ttl` attribute (Unix epoch seconds) on items → DynamoDB deletes expired items in the background, typically within a few days of expiry rather than instantly, so filter out expired items at read time if exact expiry matters. Use case: session data, temporary cache, log entries with retention window.

---

## 2.6 Exam Question Patterns

**Common stems:**

- "ACID transactions on a data lake" → Apache Iceberg
- "Schema evolution without breaking workflows" → Apache Iceberg
- "Query S3 data from Redshift without loading" → Redshift Spectrum
- "Query operational DB from Redshift" → Redshift Federated Query
- "Near-real-time sync Aurora → Redshift, zero ETL" → Aurora zero-ETL integration
- "Cost-effective index for S3 data" → use Glue Data Catalog + Athena
- "Vector similarity search" → Aurora PostgreSQL with pgvector (HNSW index)
- "Microsecond key/value reads, durable" → Amazon MemoryDB

**Red herrings:**

- Choosing DynamoDB for complex SQL queries (it's NoSQL, no joins)
- Using EBS for big data on EMR (use S3; HDFS on EBS has operational cost)
- Assuming Glue Data Catalog is only for S3 (it also catalogs RDS, Redshift, DynamoDB)
- HNSW for large-scale batch search (IVF scales better for massive vector sets)

---

## 2.7 Common Mistakes & Traps

1. **HDFS on EMR vs S3**: always prefer S3 for persistent storage (HDFS dies with the cluster)
2. **GSI vs LSI confusion**: LSI = same partition key, must be created at table creation; GSI = different PK, can be added anytime, separate capacity
3. **Iceberg vs Parquet**: Parquet is a file format; Iceberg is a table format (adds ACID, schema evolution on top of Parquet/ORC)
4. **Redshift Spectrum vs Athena**: both query S3; Spectrum is Redshift-native (join with Redshift tables); Athena is standalone serverless SQL
5. **S3 Standard vs Intelligent-Tiering**: if access pattern is unknown, Intelligent-Tiering automates tier movement
6. **SageMaker Catalog vs Glue Data Catalog**: Glue Data Catalog = technical (schemas, partitions); SageMaker Catalog = business (governed data sharing, lineage) [NEW in v1.1]

---

## 2.8 Practice Scenarios

### Scenario 1

**Requirement:** Company migrates from HDFS on-prem + Spark analytics to AWS. Need scalable, cost-efficient storage.

**Solution:** Migrate data to Amazon S3; run Spark on Amazon EMR clusters using S3 as storage (not HDFS).

**Why?** S3 is durable, decoupled from compute, and effectively unlimited in scale.

### Scenario 2

**Requirement:** Data lake tables break frequently when source schema changes (new columns). Concurrent writes corrupt data.

**Solution:** Migrate to Apache Iceberg table format.

**Why Iceberg?** Native schema evolution (add/rename columns without breaking readers) + snapshot isolation for concurrent writes.

### Scenario 3

**Requirement:** Ecommerce app queries products by category and price range (different access patterns). DynamoDB primary table keyed by ProductID.

**Solution:** Create a GSI with Category as partition key and Price as sort key.

**Why GSI?** Supports different access patterns (category queries) across all partitions.

### Scenario 4

**Requirement:** Near-real-time analytics on MSK streaming data in Amazon Redshift with least operational overhead.

**Solution:** Redshift streaming ingestion directly from Amazon MSK → define Redshift materialized views to consume MSK topics.

**Wrong answer:** Firehose → S3 → COPY adds steps and latency.

### Scenario 5

**Requirement:** OpenSearch index growing too large; costs rising. Need to manage index lifecycle and reduce costs.

**Solution:** Use Index State Management (ISM) policies to automatically transition old indices to UltraWarm storage or delete them.
--- --- # DOMAIN 3: DATA OPERATIONS AND SUPPORT (22%) ## 3.1 Domain Overview **Task statements:** - **Task 3.1:** Automate data processing (MWAA, Step Functions, SDKs, EMR, Redshift, Glue, Athena, Lambda, EventBridge) - **Task 3.2:** Analyze data (QuickSight, Athena, Redshift SQL, DataBrew, SageMaker, Jupyter, data aggregations) - **Task 3.3:** Maintain and monitor pipelines (CloudWatch, CloudTrail, logs, alerts, SNS, troubleshooting Glue/EMR) - **Task 3.4:** Ensure data quality (Glue Data Quality, DataBrew rules, sampling, data skew, consistency checks) **Core theme:** Keep pipelines running, observable, and producing trustworthy data. --- ## 3.2 Services by Category ### Monitoring & Logging | Service | Use | Key Detail | |---------|-----|------------| | **Amazon CloudWatch** | Metrics, alarms, log groups, dashboards | Set alarms on custom metrics; CloudWatch Logs Insights for log analysis | | **AWS CloudTrail** | API call audit trail | Every AWS API call logged; use for security audits; data events = S3/Lambda | | **CloudWatch Logs Insights** | Query logs with ad-hoc SQL-like syntax | Analyze structured logs at scale | | **Amazon OpenSearch** | Log aggregation + full-text search | Store and search application logs via OpenSearch Dashboards | | **Amazon Athena** | Ad-hoc SQL log analysis on S3 | Query CloudTrail logs, application logs stored in S3 | ### Data Quality | Service | Use | Key Detail | |---------|-----|------------| | **AWS Glue Data Quality** | Define + run DQDL rules on datasets | Built into Glue; rules: completeness, uniqueness, referential integrity, freshness | | **AWS Glue DataBrew** | Visual data profiling + quality rules | No-code; for non-technical users; 250+ built-in transforms | | **Amazon Deequ** | Scala/Python data quality library | Open-source; runs on EMR Spark; define quality constraints in code | | **AWS Glue Crawlers** | Schema detection | Auto-populates Data Catalog; runs on schedule or event-triggered | ### Analysis & Query | 
Service | Use | Key Detail |
|---------|-----|------------|
| **Amazon Athena** | Serverless SQL on S3 | Pay-per-query (per TB scanned); Parquet/ORC = cheaper; federated queries supported |
| **Amazon QuickSight** | BI dashboards + visualizations | SPICE for in-memory caching; row-level security; ML Insights built-in |
| **Amazon Redshift** | Complex SQL analytics | Materialized views, stored procedures, UDFs, Spectrum for S3 |
| **Amazon EMR + Spark** | Large-scale data analysis | Athena notebooks use Apache Spark |

### Alerting

| Service | Use |
|---------|-----|
| **Amazon SNS** | Push notifications (email, Lambda, SQS, HTTP) for pipeline failures |
| **Amazon SQS** | Decouple pipeline stages, buffer messages, dead-letter queue for failures |
| **EventBridge** | Route events to targets based on rules; schedule triggers |

---

## 3.3 Key Concepts Deep Dive

### CloudWatch vs CloudTrail

| Aspect | CloudWatch | CloudTrail |
|--------|-----------|------------|
| What it monitors | AWS resource metrics + application logs | AWS API calls (who did what, when) |
| Use for | Performance monitoring, alarms | Security audits, compliance, access tracking |
| Log S3 access | Enable S3 metrics + access logging | Enable data events for S3 |
| Key output | Metric alarms → SNS | CloudTrail logs → S3 / CloudWatch Logs |

### Data Quality Concepts (DQDL, the Glue Data Quality Definition Language)

**Rules you can define:**

- **Completeness**: check for nulls/empty fields
- **Uniqueness**: detect duplicates
- **Referential integrity**: FK relationships
- **Freshness**: data recency checks
- **Custom rules**: regex patterns, value ranges

**DataBrew quality actions:**

- Fill missing values (mean, median, custom)
- Identify duplicate records
- Formatting functions (date standardization, case normalization)
- Nesting/unnesting JSON structures
- PII detection and redaction

### Data Sampling Techniques (Skill 3.4.4)

- **Simple random sampling**: each record has equal probability of selection
- **Stratified sampling**: sample proportionally from subgroups
- **Systematic sampling**: every Nth record
- **Reservoir sampling**: fixed-size sample from streaming data

### Data Skew in Spark (Skill 3.4.5)

**Skew** = some partitions hold far more data than others → one task runs much longer than the rest.

**Solutions:**

- Salt partition keys (add a random prefix)
- Repartition before joins
- Use Adaptive Query Execution (AQE) in Spark 3+
- Split large partitions

### Athena Key Details

- Serverless; no cluster to manage
- Charged per TB of data scanned
- **Partition pruning** reduces scan cost (partition by date, region, etc.)
- **Result reuse**: cache query results to avoid re-scanning for identical queries (use `--result-reuse-configuration`)
- **Federated query**: query RDS, DynamoDB, on-prem sources via Lambda connectors
- **Athena notebooks**: Apache Spark for interactive exploration

### QuickSight Key Details

- **SPICE** (Super-fast Parallel In-memory Calculation Engine): caches the dataset in memory for faster queries; not free (extra capacity cost)
- **Direct Query** mode: queries the data source live (no SPICE, slower)
- **Row-level security (RLS)**: create a permissions dataset mapping users/groups to the rows they can see
- **Column-level security**: via Lake Formation integration
- **ML Insights**: anomaly detection, forecasting, auto-narratives built-in

### Athena vs Redshift Spectrum

| Attribute | Amazon Athena | Redshift Spectrum |
|-----------|--------------|-------------------|
| Engine | Presto/Trino | Redshift (MPP) |
| Use | Standalone S3 queries | S3 queries joined with Redshift tables |
| Requires cluster | No | Yes (Redshift cluster or serverless) |
| Cost | Per TB scanned | Redshift RPU + per TB scanned |
| When to use | Ad-hoc, no Redshift needed | Need to join S3 cold data with Redshift hot data |

---

## 3.4 Exam Question Patterns

**Common stems:**

- "Reduce Athena query costs" → partition data by common filter columns + use Parquet/ORC format
- "Monitor pipeline failures and notify team" → CloudWatch Alarm → SNS → email/Lambda
- "Audit who accessed data" → AWS CloudTrail
- "Ensure data quality before ETL" → AWS Glue Data Quality rules
- "Non-technical analyst needs data quality" → AWS Glue DataBrew
- "Query logs stored in S3 with SQL" → Amazon Athena
- "Search logs with full-text search" → Amazon OpenSearch Service
- "Dashboards refresh often; reduce query costs" → Athena result reuse caching
- "Serverless vs provisioned" → provisioned = predictable high load; serverless = variable/intermittent

**Keywords → Service mapping:**

| Keyword | Service |
|---------|---------|
| "audit trail" / "who called API" | AWS CloudTrail |
| "pipeline metrics" / "alerts" | Amazon CloudWatch + SNS |
| "data quality rules" / "DQDL" | AWS Glue Data Quality |
| "visual data preparation" / "no-code" | AWS Glue DataBrew |
| "serverless SQL on S3" | Amazon Athena |
| "BI dashboard" / "SPICE" | Amazon QuickSight |
| "log full-text search" | Amazon OpenSearch |
| "data skew" | Salting / Spark AQE |

---

## 3.5 Common Mistakes & Traps

1. **CloudWatch vs CloudTrail confusion**: CloudWatch = what is happening (metrics); CloudTrail = who did it (API calls)
2. **Athena cost vs performance**: Parquet/ORC + partitioning can cut scanned data (and cost) by 90% or more vs raw CSV/JSON
3. **QuickSight SPICE**: SPICE caches data for performance but has a capacity cost; large datasets may prefer Direct Query
4. **Using Glue Data Quality for PII detection**: Glue Data Quality checks rules (completeness, uniqueness); PII detection uses Amazon Macie or Glue Studio sensitive data detection
5. **Thinking Step Functions = Airflow**: Step Functions is for AWS-service orchestration; MWAA is for Python DAG-based workflows
6. **Forgetting dead-letter queues**: an SQS DLQ captures messages that failed processing; essential for pipeline reliability

---

## 3.6 Practice Scenarios

### Scenario 1

**Requirement:** Financial services company ingests data to S3. Before Glue ETL, it must check for missing values, duplicates, and format issues. Least operational overhead.

**Solution:** AWS Glue Data Quality: define DQDL rules on source data; configure CloudWatch Alarms to notify the team of violations.

**Wrong answers:** Lambda custom script (more code/ops), EMR Spark (overkill, more ops), Athena daily check (not real-time, reactive not proactive).

### Scenario 2

**Requirement:** Data lake tables updated every 24h. BI layer queries via Athena hourly. Need to reduce query costs while ensuring fresh dashboards.

**Solution:** Use Athena result reuse (`--result-reuse-configuration`) with max age = 24h so hourly queries reuse cached results until the next ETL run.

### Scenario 3

**Requirement:** Serverless pipeline on EMR Serverless with job dependencies and automatic retries. Minimize operational overhead.

**Solution:** AWS Step Functions: define a state machine orchestrating EMR Serverless jobs with built-in retry logic and branching.

### Scenario 4

**Requirement:** Company needs to search application logs by error message content. Logs land in S3 daily.

**Two valid approaches:**

- **Athena** (SQL queries on S3 logs, if structured/semi-structured)
- **OpenSearch** (full-text search, Kibana/OpenSearch Dashboards, if real-time search and complex text queries are needed)

### Scenario 5

**Requirement:** Regional QuickSight analysts should see only data from their own region.

**Solution:** QuickSight Row-Level Security (RLS): create a permissions dataset mapping each analyst user/group to their allowed region values. Assign the RLS dataset to the main dataset.
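
As a rough sketch of Scenario 2 above, the result-reuse setting can be passed through the Athena `StartQueryExecution` API. The helper below only builds the request arguments, so it runs without AWS credentials. The function name, query, database, and bucket are hypothetical placeholders, and the commented-out call assumes `boto3` is installed and configured.

```python
def build_athena_request(sql, database, output_location, max_age_hours=24):
    """Build keyword arguments for Athena StartQueryExecution with result reuse."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
        # Reuse a previous result for the identical query if it is young enough.
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_hours * 60,
            }
        },
    }


# Hypothetical dashboard query; tables are refreshed every 24 hours.
request = build_athena_request(
    sql="SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    database="analytics_db",
    output_location="s3://example-athena-results/",
)
print(request["ResultReuseConfiguration"])

# With boto3 available (assumption), the request could be submitted as:
# import boto3
# boto3.client("athena").start_query_execution(**request)
```

With a 24-hour maximum age, the first hourly query after an ETL run scans the data and later identical queries can be served from the stored result, which is the cost saving the scenario asks for.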
---

---

# DOMAIN 4: DATA SECURITY AND GOVERNANCE (18%)

## 4.1 Domain Overview

**Task statements:**

- **Task 4.1:** Apply authentication mechanisms (VPC security groups, IAM roles, Secrets Manager, S3 Access Points, PrivateLink)
- **Task 4.2:** Apply authorization mechanisms (custom IAM policies, Secrets Manager, Redshift RBAC, Lake Formation, RBAC/TBAC/ABAC)
- **Task 4.3:** Ensure data encryption and masking (data masking, anonymization, compliance)
- **Task 4.4:** Prepare logs for audit (CloudTrail, CloudWatch, logging config)
- **Task 4.5:** Understand data privacy and governance (lineage, data catalog, data sharing, quality, profiling, lifecycle, auditing)

**Core theme:** Implement least privilege, encrypt everything, log everything, govern data at scale.

---

## 4.2 Services by Category

### Authentication & Network Security

| Service | Use | Key Detail |
|---------|-----|------------|
| **AWS IAM** | Identity and access management | Users, groups, roles, policies; principle of least privilege |
| **AWS IAM Identity Center** | SSO + multi-account access | Integrates with AD/SAML apps; integrates with Lake Formation |
| **VPC Security Groups** | Network-level firewall | Stateful; allow rules only (no deny); control inbound/outbound |
| **VPC Endpoints** | Private connectivity to AWS services | Gateway (S3, DynamoDB) or Interface (PrivateLink); traffic stays off the public internet |
| **AWS PrivateLink** | Expose services privately across VPCs | Interface VPC endpoint; secure cross-account service access |
| **AWS Secrets Manager** | Credential storage + auto-rotation | Store DB passwords, API keys; auto-rotate; integrates with RDS, Redshift |
| **AWS Systems Manager Parameter Store** | Configuration + secret storage | Simpler, cheaper alternative to Secrets Manager for non-rotating secrets |

### Authorization

| Service | Mechanism | Granularity |
|---------|-----------|-------------|
| **AWS IAM Policies** | RBAC | Service/resource level |
| **AWS Lake Formation** | TBAC (tag-based) + name-based | Table, column, row level on S3 data lake |
| **Amazon Redshift RBAC** | Database roles | Row-level security, column-level, dynamic data masking |
| **QuickSight RLS** | Row-level security dataset | Row level per user/group |

### Encryption

| Service | Use | Key Detail |
|---------|-----|------------|
| **AWS KMS** | Key management | KMS keys, including customer managed keys; integrates with S3, Redshift, Glue, DynamoDB |
| **SSE-S3** | Server-side encryption with S3-managed keys | Simplest; AWS manages keys |
| **SSE-KMS** | Server-side encryption with KMS keys | Audit key usage via CloudTrail; customer controls key policy |
| **DSSE-KMS** | Dual-layer server-side encryption | Two layers of encryption in one operation (compliance requirement) |
| **Client-side encryption** | Encrypt before upload | Max control; customer manages encryption/decryption |

### Data Governance

| Service | Role |
|---------|------|
| **AWS Glue Data Catalog** | Technical metadata (schemas, partitions, lineage) |
| **Amazon DataZone / SageMaker Catalog** | Business catalog (data marketplace, sharing, governance) |
| **Amazon Macie** | PII/sensitive data detection in S3 using ML |
| **AWS CloudTrail** | API-level audit trail |
| **Amazon CloudWatch** | Operational monitoring + logging |
| **AWS Lake Formation** | Fine-grained access, data sharing, tagging |

---

## 4.3 Authorization Deep Dive

### IAM Policy Types

| Type | When to Use | Notes |
|------|-------------|-------|
| **AWS Managed Policy** | Quick setup; may be too broad | Broad permissions; not principle of least privilege |
| **Customer Managed Policy** | Precise control; reusable | Recommended; define exact actions + specific ARNs |
| **Inline Policy** | Role-specific, non-reusable | Attached directly to one role; avoid for most use cases |
| **Resource-based Policy** | S3 buckets, KMS keys, Lambda | Principal specified in policy; cross-account access |

**Principle of least privilege** = grant only the minimum permissions needed for the task. Always prefer customer managed policies over AWS managed policies.

### Lake Formation Access Control

**Three levels:**

1. **Name-based**: grant permissions on specific databases, tables, columns directly by name
2. **Tag-based (LF-TBAC)**: assign LF-Tags to resources; grant permissions to tags; scalable for many resources
3. **Row/column filtering**: fine-grained row-level security + column exclusion

**Setup flow:**

1. Register the S3 prefix as a data lake location
2. Grant permissions on Glue Data Catalog resources (databases, tables)
3. Add column/row filters for fine-grained access

**Best practices:**

- Don't add bucket policies for S3 buckets registered with Lake Formation (Lake Formation controls access)
- Don't use the root AWS user as data lake admin
- Use LF-TBAC for large numbers of tables (fewer grants to manage)

### Authorization Methods

| Method | Description | Best For |
|--------|-------------|----------|
| **RBAC** (Role-based) | Permissions based on role (job function) | Standard org structures |
| **ABAC** (Attribute-based) | Permissions based on tags/attributes on both principal and resource | Dynamic, scalable |
| **TBAC** (Tag-based, Lake Formation) | LF-Tags on data resources; permissions granted to principals on LF-Tag expressions | Large data lake with many tables |

### Redshift Database Security

- **Superuser**: full access to all objects
- **Database users**: created with `CREATE USER`; can have `CREATEDB`, `CREATEUSER` permissions
- **Roles**: role-based access (`CREATE ROLE`; `GRANT role TO user`); roles can be nested
- **Row-level security (RLS)**: define RLS policies filtering rows per user/role
- **Dynamic data masking (DDM)**: mask column data at query time per user/role; no actual data change
- **Column-level security**: `GRANT SELECT (col1, col2) ON table TO role`

---

## 4.4 Encryption Deep Dive

### Encryption at Rest: S3 Options

| Option | Key Management | Audit via CloudTrail | Typical Use |
|--------|---------------|---------------------|------------|
| SSE-S3 | AWS managed | No key-level audit | Basic |
| SSE-KMS | AWS KMS (customer managed or AWS managed key) | Yes | Most common |
| DSSE-KMS | Two KMS layers | Yes | Workloads that mandate two encryption layers (e.g., CNSSP 15) |
| Client-side | Customer | N/A | Max control |

**DSSE-KMS** = Dual-layer Server-Side Encryption. Applies two independent layers of encryption in one S3 PUT operation. Least operational overhead for dual-layer compliance requirements.

### Encryption in Transit

AWS data movement services support encryption in transit, but not all by the same mechanism: DataSync and AWS Backup use TLS by default, DMS supports SSL/TLS that is configured per source and target endpoint, and Site-to-Site VPN protects traffic with IPsec.

### Sensitive Data Detection & Masking

| Service | Mechanism | Use |
|---------|-----------|-----|
| **Amazon Macie** | ML-based PII detection | Scan S3 buckets; detect names, SSN, credit card, email, etc. |
| **Macie Custom Data Identifiers** | Regex + keywords | Detect org-specific patterns (customer IDs, internal codes) |
| **Glue Studio** | Sensitive data detection + redaction | Detect and mask PII in Glue ETL pipelines; partial redaction |
| **Redshift DDM** | Dynamic data masking | Column-level masking at query time |
| **DataBrew** | Recipe-based PII masking | Visual, no-code masking for non-technical users |

---

## 4.5 Logging & Audit

### CloudTrail Data Events vs Management Events

| Type | What it captures | Default? |
|------|-----------------|---------|
| **Management events** | API calls that manage AWS resources (create, delete, modify) | Yes |
| **Data events** | Object-level S3 operations (GetObject, PutObject), Lambda invocations | No; must enable separately |

**To audit S3 object access:** Enable CloudTrail S3 data events for specific buckets.
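
To make the "must enable separately" step concrete: data events are switched on per trail through an event selector. The sketch below builds the selector structure accepted by the CloudTrail `PutEventSelectors` API for S3 object-level logging. The helper name, trail name, and bucket name are hypothetical, and the commented-out call assumes `boto3` with suitable permissions.

```python
def s3_data_event_selector(bucket_names, read_write_type="All"):
    """Return a CloudTrail event selector that logs S3 object-level events."""
    return {
        "ReadWriteType": read_write_type,  # "ReadOnly", "WriteOnly", or "All"
        "IncludeManagementEvents": True,
        "DataResources": [
            {
                "Type": "AWS::S3::Object",
                # The trailing slash scopes logging to all objects in the bucket.
                "Values": [f"arn:aws:s3:::{name}/" for name in bucket_names],
            }
        ],
    }


# Hypothetical bucket holding raw data lake objects.
selector = s3_data_event_selector(["example-data-lake-raw"])
print(selector["DataResources"][0]["Values"])

# With boto3 available (assumption), the selector could be applied as:
# import boto3
# boto3.client("cloudtrail").put_event_selectors(
#     TrailName="example-trail", EventSelectors=[selector]
# )
```

Listing specific buckets, as the guidance above suggests, keeps data-event volume and cost lower than logging every bucket in the account.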
### Log Analysis Options

| Scenario | Tool |
|----------|------|
| Query structured logs in S3 with SQL | Amazon Athena |
| Real-time log analytics + dashboards | Amazon OpenSearch + OpenSearch Dashboards |
| Filter/search CloudWatch Logs | CloudWatch Logs Insights |
| Big data log processing | Amazon EMR |
| Monitor Glue/EMR job metrics | Amazon CloudWatch |

---

## 4.6 Data Governance Pillars

| Pillar | AWS Service(s) |
|--------|---------------|
| Metadata management | AWS Glue Data Catalog, SageMaker Catalog |
| Data sharing | AWS Lake Formation cross-account sharing, DataZone |
| Data quality | AWS Glue Data Quality, DataBrew, Deequ |
| Data profiling | DataBrew, Glue Data Quality |
| Data lifecycle | S3 Lifecycle policies, DynamoDB TTL |
| Data lineage | SageMaker ML Lineage Tracking, SageMaker Catalog |
| Logging & auditing | CloudTrail, CloudWatch, S3 Access Logging |

---

## 4.7 Exam Question Patterns

**Common stems:**

- "Store DB credentials securely + auto-rotate" → AWS Secrets Manager
- "Audit who accessed S3 objects" → CloudTrail data events (must be explicitly enabled)
- "Column-level access control on data lake" → AWS Lake Formation
- "PII detection in S3" → Amazon Macie
- "Custom PII patterns (internal IDs)" → Macie custom data identifiers
- "Two layers of encryption, least overhead" → DSSE-KMS
- "Credentials in Glue job scripts (security risk)" → Move to AWS Secrets Manager, create a Glue connection referencing the secret
- "Fine-grained row access in QuickSight" → QuickSight Row-Level Security (RLS) dataset
- "Large number of tables, scalable access control" → Lake Formation tag-based access control (LF-TBAC)
- "Cross-account data lake access" → Lake Formation cross-account grants

**Red herrings:**

- Using KMS encryption to implement column-level access control (KMS = encryption, not access control)
- IAM policies alone for column/row-level permissions in a data lake (IAM doesn't do column/row; Lake Formation does)
- GuardDuty for PII classification (GuardDuty = threat detection; Macie = data classification)
- AWS DMS for CDC when "open source" is required (DMS is proprietary)
- Parameter Store for auto-rotating database credentials (use Secrets Manager; Parameter Store doesn't auto-rotate)

**Keywords → Service mapping:**

| Keyword | Service |
|---------|---------|
| "auto-rotate credentials" | AWS Secrets Manager |
| "PII detection in S3" | Amazon Macie |
| "column-level / row-level on data lake" | AWS Lake Formation |
| "two layers encryption, least overhead" | DSSE-KMS |
| "audit API calls" | AWS CloudTrail |
| "SSO + multi-account" | IAM Identity Center |
| "private connectivity to AWS services" | VPC Endpoints / PrivateLink |
| "dynamic data masking at query time" | Redshift DDM |
| "data lineage" | SageMaker ML Lineage / SageMaker Catalog |
| "scalable tag-based access control" | Lake Formation LF-TBAC |

---

## 4.8 Common Mistakes & Traps

1. **Secrets Manager vs Parameter Store**: Secrets Manager has auto-rotation; Parameter Store is simpler/cheaper but has no built-in rotation
2. **KMS vs Lake Formation**: KMS encrypts data at rest; Lake Formation controls who can access which columns/rows
3. **SSE-S3 vs SSE-KMS**: SSE-KMS has CloudTrail audit for key usage; SSE-S3 doesn't
4. **Macie managed vs custom identifiers**: managed = standard PII (credit cards, SSN); custom = org-specific regex patterns
5. **CloudTrail data events not enabled by default**: management events are on by default; S3 data events (object reads/writes) require explicit enablement
6. **Lake Formation bucket policies conflict**: when an S3 bucket is registered with Lake Formation, remove/don't use bucket policies (they conflict); Lake Formation manages access
7. **Hardcoded credentials in Glue jobs**: always store in Secrets Manager and reference via a Glue connection

---

## 4.9 Practice Scenarios

### Scenario 1

**Requirement:** Healthcare org must apply two independent layers of encryption to all S3 files. Least operational overhead.

**Solution:** DSSE-KMS: dual-layer server-side encryption with a single S3 operation.

**Wrong answers:** Client-side + SSE = two steps, more operational overhead. SSE-KMS alone = one layer. SSE-S3 + SSE-KMS = not a valid combination.

### Scenario 2

**Requirement:** Financial company needs to automatically detect credit card numbers and internal customer reference IDs (custom format) in its S3 data lake.

**Solution:** Enable Macie managed data identifiers (for credit cards) + define Macie custom data identifiers (for internal reference patterns using regex).

**Wrong answers:** GuardDuty (threat detection, not data classification), Lake Formation (access control, not data discovery), EMR with custom regex (too much operational overhead).

### Scenario 3

**Requirement:** Glue job has MongoDB credentials hardcoded in the script. Fix the security vulnerability.

**Solution:** Store credentials in AWS Secrets Manager → create a Glue connection to MongoDB referencing the secret → Glue job uses the connection.

**Wrong answers:** Glue job parameters (still exposed in logs), IAM Identity Center (authenticates AWS users, not MongoDB), S3 config file (still exposed).

### Scenario 4

**Requirement:** Multiple analytics teams (Athena, Glue, Redshift Spectrum) access a large S3 data lake with 500+ tables. Need scalable fine-grained access control.

**Solution:** Register S3 with Lake Formation → define LF-Tags on tables/columns → grant permissions to IAM principals based on tags (LF-TBAC). Scalable: fewer grants than name-based.

**Wrong answers:** IAM bucket policies (no column-level control, doesn't scale), KMS (encryption only, not access control).

### Scenario 5

**Requirement:** Company processes patient records. Need to detect and redact PII (names, addresses) in Glue ETL pipelines while preserving data usability.

**Solution:** Use AWS Glue Studio sensitive data detection with the partial redaction option (not full deletion, which preserves data utility for analytics).

**Why not delete?** Full deletion prevents downstream analytics that need the record structure. Partial redaction masks specific fields while keeping record integrity.

---

---

# CROSS-DOMAIN QUICK REFERENCE

## Zero-ETL Integrations (High-Value Exam Topic)

Zero-ETL = direct data movement with no ETL pipeline code to write. Always preferred when available.

| Source | Destination | Integration |
|--------|-------------|-------------|
| Amazon Aurora MySQL/PostgreSQL | Amazon Redshift | Aurora zero-ETL |
| Amazon RDS for MySQL | Amazon Redshift | RDS zero-ETL |
| Amazon DynamoDB | Amazon OpenSearch | DynamoDB zero-ETL |
| Amazon MSK | Amazon Redshift | Redshift streaming ingestion |
| Amazon KDS | Amazon Redshift | Redshift streaming ingestion |

## Service "Minimum Operational Overhead" Ranking

When the exam says "least operational overhead," rank options like this:

1. **Zero-ETL / native integrations** (always first if available)
2. **Fully managed serverless** (Glue, Athena, Lambda, Data Firehose, Managed Service for Apache Flink)
3. **Managed with some config** (EMR Serverless, Redshift Serverless, MSK Serverless)
4. **Provisioned managed** (EMR on EC2, Redshift Provisioned, MSK Provisioned)
5. **Custom code/scripts** (always avoid in "least overhead" questions)

## The "Open Source" Test

If the question mentions an "open source" requirement:

- MSK instead of KDS
- EMR instead of Glue (EMR supports more open frameworks)
- MSK Connect + Debezium instead of AWS DMS for CDC
- Apache Iceberg/Hudi instead of proprietary formats
- Athena (Presto/Trino) instead of Redshift for ad-hoc serverless SQL

## Critical Service Limits

| Service | Limit | Exam Relevance |
|---------|-------|----------------|
| Lambda timeout | 15 minutes max | Eliminates Lambda for long-running transforms |
| KDS shard write | 1 MB/s, 1000 records/s | Design for throughput |
| KDS shard read | 2 MB/s standard; 2 MB/s per consumer enhanced fan-out | Enhanced fan-out for multiple consumers |
| KDS retention | 24h default, up to 365 days | Replay window |
| Glue Python shell DPU | 1/16 DPU min | Cheapest compute |
| Glue Spark jobs DPU | 2 DPU min | Minimum resource |
| S3 object size | 5 TB max | Rarely tested but know it |
| Redshift COPY | Parallel from S3 | Fastest way to load Redshift |

## SageMaker Unified Studio (New in v1.1)

- Combines DataBrew, SageMaker Studio, EMR Studio, Glue Studio into one unified experience
- **SageMaker Catalog** = business data catalog (data discovery, governed sharing, lineage)
- **SageMaker Lakehouse** = unified access layer across S3 data lakes and Redshift
- **Domain/domain units/projects** = org structure for access control in SageMaker Unified Studio (Skill 4.1.7)

---

*Sources: AWS Certified Data Engineer Associate Exam Guide (DEA-C01), Version 1.1 (Copyright © 2026 Amazon Web Services, Inc.) | Mishra et al., "AWS Certified Data Engineer," O'Reilly*