Domain 5: Testing, Validation, and Troubleshooting

AIP-C01 Study Guide — Dr. Priya Ramanathan

Domain Weight: 11% (~7 of 65 scored questions) | Tasks: 5.1–5.2 | Skills: 14+ | Target Audience: Professional-level (2+ years AWS, 1+ year GenAI hands-on)

SECTION 1: DOMAIN OVERVIEW

Lines: ~90 | Priority: Read first

Domain Scope

Domain 5 covers 11% of the AIP-C01 exam — roughly 7 of the 65 scored questions. This domain tests a fundamentally different mindset than traditional ML evaluation. In traditional ML, evaluation is deterministic — you have labeled test data, you compute accuracy/F1/precision/recall, and you're done. In GenAI, evaluation is probabilistic and subjective — the same question can have multiple valid answers, quality depends on context, and automated metrics alone are insufficient.

Critical insight: The exam specifically tests whether you understand WHY GenAI evaluation is different, not just HOW to evaluate. Questions will contrast traditional ML metrics with GenAI evaluation approaches, and the correct answer requires understanding that open-ended generation needs different tooling.

The Fundamental Distinction

TRADITIONAL ML vs GENAI EVALUATION
=====================================

TRADITIONAL ML:
  ┌──────────────┐     ┌─────────────┐     ┌─────────────┐
  │ Test Dataset  ├────►│ Run Model   ├────►│ Compute     │
  │ (labeled)     │     │ (predict)   │     │ Accuracy/F1 │
  └──────────────┘     └─────────────┘     └─────────────┘
  ✓ Single correct answer per input
  ✓ Deterministic metrics
  ✓ Fully automated

GENERATIVE AI:
  ┌──────────────┐     ┌─────────────┐     ┌─────────────────────┐
  │ Test Dataset  ├────►│ Run Model   ├────►│ Automated Metrics   │
  │ (golden set)  │     │ (generate)  │     │ + LLM-as-Judge      │
  └──────────────┘     └─────────────┘     │ + Human Evaluation   │
                                            └─────────────────────┘
  ✗ Multiple valid outputs for same input
  ✗ Quality is subjective (style, helpfulness, coherence)
  ✗ No single "ground truth" for open-ended generation
  ✗ Requires combination of evaluation approaches

[EXAM TIP] If the exam presents a scenario asking "how to evaluate a summarization model" and one option is "compute accuracy on a test set," that is a traditional ML approach and is WRONG for GenAI. The correct answer will involve ROUGE scores, human evaluation, or LLM-as-a-Judge.

Task Dependency Map

DOMAIN 5 TASK DEPENDENCY MAP
===============================

Task 5.1: Evaluation Systems for GenAI ◄── HIGHEST YIELD
  ├── Automated metrics (ROUGE, BLEU, BERTScore, perplexity)
  ├── Bedrock Model Evaluation (automatic + human modes)
  ├── LLM-as-a-Judge (scoring, pairwise, biases)
  ├── RAG evaluation (faithfulness, relevance, precision, recall)
  ├── Agent evaluation (task completion, tool accuracy)
  ├── A/B testing and canary deployments
  └── Golden datasets and CI/CD quality gates
       │
       v
Task 5.2: Troubleshooting GenAI Applications ◄── PRACTICAL
  ├── Systematic debugging hierarchy (7 steps)
  ├── Content issues (hallucination, format drift, truncation)
  ├── API errors (throttling, timeout, validation, access)
  ├── Retrieval issues (relevance, staleness, proper nouns)
  ├── Prompt debugging (inconsistency, token limits)
  └── Monitoring stack (CloudWatch, X-Ray, CloudTrail, logging)

Cross-Domain Links

Domain 5 Topic	Also Tested In	Context
Model comparison	D1 (FM selection)	Choose best model for task
Evaluation metrics	D2 (deployment validation)	Quality gates before deploy
Monitoring stack	D4 (operational monitoring)	CloudWatch, X-Ray, logging
Guardrails evaluation	D3 (safety controls)	Testing safety filters
RAG evaluation	D1 (retrieval mechanisms)	Retrieval quality testing
Golden datasets	D4 (troubleshooting)	Regression testing

SECTION 2: EVALUATION SYSTEMS FOR GENAI (Task 5.1)

Lines: ~450 | Priority: HIGHEST — most testable topic in Domain 5

Evaluation Taxonomy

Three approaches, each with a clear use case:

EVALUATION APPROACHES
=======================

1. AUTOMATED METRICS
   Speed: ████████████ Fast
   Scale: ████████████ Unlimited
   Quality: ████░░░░░░ Limited for open-ended tasks
   Use for: CI/CD gates, regression testing, benchmarking

2. HUMAN EVALUATION
   Speed: ██░░░░░░░░ Slow
   Scale: ██░░░░░░░░ Limited
   Quality: ████████████ Gold standard
   Use for: Final quality validation, subjective assessment

3. LLM-AS-A-JUDGE
   Speed: ████████░░ Moderate
   Scale: ████████░░ Good
   Quality: ████████░░ Bridges automated + human
   Use for: Rapid iteration, scaling evaluation

[EXAM TIP] The exam tests WHEN to use each approach. Automated metrics for CI/CD and regression. Human evaluation for final quality validation and subjective judgment. LLM-as-a-Judge for rapid development iteration at scale. Know the trade-offs.

Automated Evaluation Metrics

Metric	Best For	Range	How It Works	Key Insight
ROUGE-1	Summarization	0-1	Unigram overlap with reference	Measures content coverage
ROUGE-2	Summarization	0-1	Bigram overlap with reference	Captures phrase-level similarity
ROUGE-L	Summarization	0-1	Longest common subsequence	Standard summarization metric
BLEU	Translation	0-1	N-gram precision vs reference	Measures precision (cleanness)
BERTScore	Semantic similarity	0-1	Contextual embedding similarity	Captures paraphrasing ROUGE misses
Perplexity	Language fluency	1-∞	How "surprised" the model is by text	Lower = better (only metric where lower wins)
F1 / Exact Match	Extractive QA	0-1	Token overlap / exact string match	For questions with definite answers

[EXAM TIP] Memorize the metric-to-task mapping: ROUGE = summarization. BLEU = translation. BERTScore = semantic similarity (catches paraphrasing). Perplexity = fluency. F1 = extractive QA. The exam loves testing this.
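
To make the perplexity row concrete: perplexity is the exponential of the average negative log-likelihood per token, which is why its floor is 1 and lower is better. A minimal sketch, assuming you already have per-token natural-log probabilities from a model (the values below are made up for illustration):

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-(1/N) * sum(log p(token_i)))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_negative_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_log_likelihood)


# Hypothetical log-probabilities: a model that assigns 0.9 to every token
# is far less "surprised" than one that assigns 0.25 to every token.
confident = [math.log(0.9)] * 4
uncertain = [math.log(0.25)] * 4

print(f"confident: {perplexity(confident):.3f}")
print(f"uncertain: {perplexity(uncertain):.3f}")
```

A model that spreads probability evenly over four choices at every step lands at a perplexity of about 4, which is a useful way to read the number: roughly how many equally likely options the model is choosing between.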

METRIC SELECTION DECISION TREE
================================

  "What is the task?"
         │
    ┌────┴────────────────────────────────┐
    │                                     │
  Summarization?                    Translation?
    │                                     │
    ▼                                     ▼
  ROUGE-L                               BLEU
  (ROUGE-1 for content,
   ROUGE-L for structure)

    │
  Semantic meaning preservation?
    │
    ▼
  BERTScore (captures "vehicle" ≈ "car")

    │
  Extractive QA with known answers?
    │
    ▼
  F1 + Exact Match

    │
  Open-ended generation quality?
    │
    ▼
  LLM-as-a-Judge + Human Evaluation
  (automated metrics are INSUFFICIENT alone)
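
For the extractive QA branch of the tree, Exact Match and token-level F1 can be sketched in a few lines. This is a simplified version, assuming lowercase plus punctuation stripping as the only normalization (benchmark implementations often also drop articles); the function names are illustrative:

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and token recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("30 days.", "30 Days"))
print(round(token_f1("within 30 days", "30 days"), 3))
```

Exact Match is all-or-nothing, while F1 gives partial credit when the prediction contains the answer plus extra tokens.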

[TRAP] ROUGE is recall-oriented: it asks how much of the reference appears in the output. BLEU is precision-oriented: it asks how much of the output appears in the reference. Students confuse these. For summarization, you want to know whether the key points were captured → ROUGE (recall). For translation, you want to know whether the output is accurate → BLEU (precision).
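
The recall/precision split is easiest to see on unigrams. The sketch below computes ROUGE-1 recall and plain unigram precision by hand. It is an illustration only: real BLEU combines clipped n-gram precisions (typically up to 4-grams) with a brevity penalty, and production code would normally use an established library rather than these hypothetical helpers.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (overlapping unigrams, candidate length, reference length)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts
    return overlap, sum(cand.values()), sum(ref.values())


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that appear in the candidate."""
    overlap, _, ref_len = unigram_overlap(candidate, reference)
    return overlap / ref_len if ref_len else 0.0


def unigram_precision(candidate, reference):
    """Share of candidate unigrams that appear in the reference."""
    overlap, cand_len, _ = unigram_overlap(candidate, reference)
    return overlap / cand_len if cand_len else 0.0


reference = "the cat sat on the mat"
candidate = "the cat sat"

print(rouge1_recall(candidate, reference))      # half of the reference is covered
print(unigram_precision(candidate, reference))  # everything generated is in the reference
```

The short candidate is perfectly precise but covers only half of the reference, which is exactly the gap a recall-oriented metric is meant to expose.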

Amazon Bedrock Model Evaluation

Bedrock Model Evaluation is the managed AWS service for comparing foundation models. Know both modes:

BEDROCK MODEL EVALUATION
===========================

┌─────────────────────────────────────────────────────────┐
│                 AUTOMATIC EVALUATION                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  SELECT MODELS ──► CHOOSE TASK TYPE ──► PROVIDE DATASET  │
│  (1 per job)      (summarization,      (JSONL in S3     │
│                    Q&A, generation,      or built-in)    │
│                    classification)                        │
│                         │                                │
│                         ▼                                │
│              AUTOMATIC SCORING                           │
│  Built-in metrics: Accuracy, Robustness, Toxicity        │
│  Custom metrics: Your own evaluation criteria             │
│  Side-by-side comparison of model outputs                │
│                                                          │
│  USE FOR: "Which model is better for my task?"           │
│  OUTPUT: Comparative scores + individual model metrics    │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│                  HUMAN EVALUATION                        │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  DEFINE CRITERIA ──► SELECT WORKFORCE ──► COLLECT RATINGS│
│  (relevance,         (AWS-managed      (1-5 scale,      │
│   accuracy,           or custom team)   ranking,         │
│   helpfulness)                          thumbs up/down)  │
│                                                          │
│  USE FOR: Subjective quality, nuanced tasks              │
│  OUTPUT: Aggregated human quality scores                 │
└─────────────────────────────────────────────────────────┘

Bedrock Model Evaluation — Code Pattern

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create automatic evaluation jobs, one per model.
# Automated jobs take a single model per job (human-based jobs take up to two),
# so run the same dataset against each candidate and compare the results.
candidate_models = {
    "sonnet": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "haiku": "anthropic.claude-3-5-haiku-20241022-v1:0",
}

job_arns = {}
for label, model_id in candidate_models.items():
    response = bedrock.create_evaluation_job(
        jobName=f"{label}-summarization-eval",
        roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "summarization-golden-set",
                        "datasetLocation": {
                            "s3Uri": "s3://my-eval-bucket/datasets/summarization.jsonl"
                        }
                    },
                    # Built-in metrics use the "Builtin." prefix
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"]
                }]
            }
        },
        inferenceConfig={
            "models": [{
                "bedrockModel": {
                    "modelIdentifier": model_id,
                    "inferenceParams": '{"maxTokens": 512, "temperature": 0.0}'
                }
            }]
        },
        outputDataConfig={
            "s3Uri": f"s3://my-eval-bucket/results/{label}/"
        }
    )
    job_arns[label] = response["jobArn"]

# Check job status for each model
for label, job_arn in job_arns.items():
    status = bedrock.get_evaluation_job(jobIdentifier=job_arn)
    print(f"{label}: {status['status']}")
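
Evaluation jobs run asynchronously, so a single status check is rarely enough. A minimal polling sketch follows. It takes the status lookup as a callable so it can be exercised without AWS access; the terminal status names are an assumption based on the job states the API reports, and `wait_for_evaluation_job` is a hypothetical helper, not part of boto3.

```python
import time

# Assumed terminal states; anything else (e.g. "InProgress") means keep waiting.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_evaluation_job(get_status, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or polls run out.

    get_status: zero-argument callable returning the job status string.
    sleep: injectable so tests and demos do not actually wait.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


# Demo with canned statuses instead of a real job.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_evaluation_job(lambda: next(fake_statuses), sleep=lambda _: None))

# With a real client, given a job ARN returned by create_evaluation_job:
#   wait_for_evaluation_job(
#       lambda: bedrock.get_evaluation_job(jobIdentifier=job_arn)["status"]
#   )
```

In a CI/CD pipeline, treat "Failed" and "Stopped" as a failed gate rather than retrying silently.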

[EXAM TIP] "Compare two foundation models" or "which model is best for my use case" → Bedrock Model Evaluation (automatic mode). "Get subjective quality feedback from domain experts" → Bedrock Model Evaluation (human mode). This is the AWS-managed answer for model comparison.

FMEval Library

FMEVAL (Foundation Model Evaluation Library)
==============================================

WHAT: Open-source Python library from AWS for FM evaluation.
INSTALL: pip install fmeval

KEY FEATURES:
  - Evaluate against standard benchmarks
  - Supports Bedrock, SageMaker, and custom endpoints
  - Built-in algorithms: factual_knowledge, qa_accuracy,
    summarization_accuracy, toxicity, prompt_stereotyping
  - Integrates with SageMaker for managed evaluation jobs
  - Runs standalone (notebook, Lambda, CI/CD pipeline)

WHEN TO USE:
  ✓ Programmatic evaluation in CI/CD pipelines
  ✓ Custom evaluation logic beyond Bedrock Model Eval
  ✓ Evaluate models not on Bedrock (SageMaker, self-hosted)

BEDROCK MODEL EVAL vs FMEVAL:
  Bedrock Model Eval = managed service, console or API, model comparison
  FMEval = open-source library, programmatic, CI/CD integration
  → They complement each other

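from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval import get_eval_algorithm
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Set up model runner for Bedrock.
# $prompt is left unquoted: FMEval substitutes the JSON-encoded prompt string.
# The Anthropic Messages API also requires anthropic_version and max_tokens.
model_runner = BedrockModelRunner(
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
    content_template=(
        '{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, '
        '"messages": [{"role": "user", "content": [{"type": "text", "text": $prompt}]}]}'
    ),
    output="content[0].text"
)

# Run QA accuracy evaluation
eval_algo = get_eval_algorithm("qa_accuracy")
eval_outputs = eval_algo.evaluate(
    model=model_runner,
    dataset_config=DataConfig(
        dataset_name="my_qa_dataset",
        dataset_uri="s3://my-bucket/eval/qa_dataset.jsonl",
        dataset_mime_type="application/jsonlines",
        model_input_location="question",
        target_output_location="answer"
    )
)

# evaluate() returns one EvalOutput per dataset; aggregate scores live in dataset_scores
for eval_output in eval_outputs:
    for score in eval_output.dataset_scores:
        print(f"Metric: {score.name}, Score: {score.value}")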

[EXAM TIP] FMEval is the answer for "evaluate models programmatically in a CI/CD pipeline" or "run automated evaluation as part of deployment." Bedrock Model Evaluation is the answer for "compare models using the AWS console or managed service."

LLM-as-a-Judge

LLM-as-a-Judge uses a powerful foundation model to evaluate another model's outputs. It bridges the gap between fast-but-shallow automated metrics and accurate-but-slow human evaluation.

LLM-AS-A-JUDGE: TWO APPROACHES
=================================

1. SCORING (Rate a single output)
   ┌───────────────┐     ┌───────────────┐     ┌──────────┐
   │ Question +    ├────►│ Judge Model   ├────►│ Score:   │
   │ Model Answer  │     │ (strong FM)   │     │ 1-5 per  │
   │ + Rubric      │     │               │     │ criterion│
   └───────────────┘     └───────────────┘     └──────────┘

2. PAIRWISE COMPARISON (Which is better?)
   ┌───────────────┐     ┌───────────────┐     ┌──────────┐
   │ Question +    ├────►│ Judge Model   ├────►│ Winner:  │
   │ Answer A +    │     │ (strong FM)   │     │ A or B   │
   │ Answer B      │     │               │     │          │
   └───────────────┘     └───────────────┘     └──────────┘

Implementation Pattern

import boto3, json

bedrock = boto3.client("bedrock-runtime")

judge_prompt = """You are an expert evaluator. Rate the following
answer on a scale of 1-5 for each criterion.

QUESTION: {question}
REFERENCE ANSWER: {reference}
MODEL ANSWER: {model_answer}

CRITERIA:
1. Accuracy (1-5): Is the answer factually correct?
2. Completeness (1-5): Does the answer fully address the question?
3. Relevance (1-5): Is the answer on-topic and not tangential?

Respond ONLY in JSON: {{"accuracy": N, "completeness": N, "relevance": N, "reasoning": "..."}}
"""

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{
        "role": "user",
        "content": [{"text": judge_prompt.format(
            question="What is Amazon Bedrock?",
            reference="Amazon Bedrock is a fully managed service...",
            model_answer="Bedrock is an AWS service for foundation models..."
        )}]
    }],
    inferenceConfig={"temperature": 0.0}  # Minimize sampling variance so scores are repeatable
)

scores = json.loads(response["output"]["message"]["content"][0]["text"])
print(f"Accuracy: {scores['accuracy']}, Completeness: {scores['completeness']}")
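
Judge models do not always return bare JSON, even when told to; a preamble or trailing remark will break a direct `json.loads`. A defensive parsing sketch, assuming the three rubric keys from the prompt above (`parse_judge_scores` is a hypothetical helper):

```python
import json
import re

REQUIRED_KEYS = ("accuracy", "completeness", "relevance")


def parse_judge_scores(raw_text):
    """Extract the first-to-last-brace JSON object and validate 1-5 scores."""
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(match.group(0))
    for key in REQUIRED_KEYS:
        value = scores.get(key)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    return scores


raw = 'Sure! {"accuracy": 4, "completeness": 5, "relevance": 3, "reasoning": "ok"} Hope that helps.'
print(parse_judge_scores(raw))
```

Rejecting out-of-range or missing scores keeps one malformed judge response from silently skewing an aggregate quality score.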

Known Biases in LLM-as-a-Judge

Bias	Description	Mitigation
Self-preference	Model rates its own outputs higher	Use a different model family as judge
Order bias	In pairwise comparison, position affects choice	Run comparisons in BOTH orders (A vs B, B vs A)
Verbosity bias	Longer answers rated higher regardless of quality	Explicit rubric penalizing unnecessary length
Cost (a limitation, not a bias)	Using expensive model as judge adds inference cost	Sample evaluation (judge a subset, not all)

[EXAM TIP] "Evaluate model outputs at scale without human reviewers" → LLM-as-a-Judge. "Bias in LLM-as-a-Judge" → self-preference bias (most tested), order bias. Know both the TECHNIQUE and its LIMITATIONS. The exam tests whether you can identify these biases by description.
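
The order-bias mitigation from the table (run both orders) can be sketched as a wrapper around any pairwise judge. The `judge` callable here is an assumption: it stands in for a model call and must return "first" or "second" for the answer it prefers in the order shown to it.

```python
def pairwise_with_order_swap(judge, question, answer_a, answer_b):
    """Run the comparison in both orders and accept only consistent verdicts.

    judge(question, first, second) -> "first" or "second"
    Returns "A", "B", or "tie" when the verdict flips with position.
    """
    a_wins_when_first = judge(question, answer_a, answer_b) == "first"
    a_wins_when_second = judge(question, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # preference followed position, not content


def position_biased_judge(question, first, second):
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def length_judge(question, first, second):
    """Toy judge that prefers the longer answer regardless of position."""
    return "first" if len(first) >= len(second) else "second"


print(pairwise_with_order_swap(position_biased_judge, "q", "short", "a longer answer"))
print(pairwise_with_order_swap(length_judge, "q", "short", "a longer answer"))
```

A purely position-driven judge collapses to "tie", so its bias cannot decide the comparison. Note that the second toy judge shows verbosity bias surviving the swap, which is why that bias needs a rubric-level mitigation instead.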

RAG Evaluation

RAG evaluation is unique because you must evaluate BOTH the retrieval step AND the generation step:

RAG EVALUATION FRAMEWORK
===========================

              ┌──────────────┐
  Query ─────►│  RETRIEVAL   │──── Evaluate retrieval quality
              │  (vector DB) │     Precision@K, Recall@K, MRR, NDCG
              └──────┬───────┘
                     │
              ┌──────▼───────┐
              │  GENERATION  │──── Evaluate generation quality
              │  (LLM)       │     Faithfulness, Answer Relevance
              └──────┬───────┘
                     │
              ┌──────▼───────┐
              │  END-TO-END  │──── Evaluate complete pipeline
              │  (citations) │     Citation accuracy, Completeness
              └──────────────┘

Retrieval Metrics

Metric	What It Measures	Interpretation
Precision@K	Relevant docs in top K / K	Higher = less noise in results
Recall@K	Relevant docs in top K / total relevant	Higher = fewer missed relevant docs
MRR (Mean Reciprocal Rank)	Mean over queries of 1 / rank of first relevant result	Higher = relevant docs ranked higher
NDCG (Normalized Discounted Cumulative Gain)	Ranking quality with position weighting	Higher = better overall ranking
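
The four retrieval metrics in the table are short enough to compute by hand. A minimal sketch with binary relevance (a document is either relevant or not); the document IDs are made up, and NDCG here uses the common log2 position discount:

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Relevant docs in the top K, divided by K."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved, relevant, k):
    """Relevant docs in the top K, divided by all relevant docs."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """DCG of the actual ranking divided by DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(precision_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))
print(round(ndcg_at_k(retrieved, relevant, 3), 3))
```

In this example only one of the top three results is relevant, and it sits at rank 2, so precision and recall are both one third while the reciprocal rank is one half.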

Generation Metrics (RAGAS Framework)

RAGAS EVALUATION DIMENSIONS
==============================

1. FAITHFULNESS (Grounding)
   Question: Is the answer supported by the retrieved context?
   Detects: Hallucination beyond retrieved documents
   AWS tool: Bedrock Guardrails contextual grounding check
   Score: 0-1 (1 = fully grounded)

2. ANSWER RELEVANCE
   Question: Does the answer address the user's question?
   Detects: Off-topic or tangential responses
   Score: 0-1 (1 = fully relevant)

3. CONTEXT PRECISION
   Question: Are the retrieved chunks actually relevant?
   Detects: Noisy retrieval (irrelevant chunks diluting context)
   Score: 0-1 (1 = all retrieved chunks are relevant)

4. CONTEXT RECALL
   Question: Were ALL relevant chunks retrieved?
   Detects: Missed information (retrieval gaps)
   Score: 0-1 (1 = all relevant information retrieved)

[EXAM TIP] "RAG application generating answers not in the source documents" → low faithfulness score → enable Guardrails contextual grounding check. "RAG returning irrelevant documents" → low context precision → improve chunking, add metadata filters, enable re-ranking. Evaluate RAG end-to-end: retrieval + generation + citations.

RAG Evaluation Code Pattern

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

# Evaluate RAG quality with a golden dataset
golden_qa_pairs = [
    {
        "question": "What is the refund policy?",
        "expected_answer": "30-day full refund for unused products...",
        "expected_source": "refund-policy.pdf"
    },
    {
        "question": "How to contact support?",
        "expected_answer": "Email support@example.com or call 1-800...",
        "expected_source": "contact-info.pdf"
    }
]

results = []
for pair in golden_qa_pairs:
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": pair["question"]},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB123",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
            }
        }
    )

    generated = response["output"]["text"]
    citations = response.get("citations", [])

    # Collect the S3 URI of every retrieved reference (S3-backed sources only)
    citation_sources = [
        ref.get("location", {}).get("s3Location", {}).get("uri", "")
        for c in citations
        for ref in c.get("retrievedReferences", [])
    ]

    results.append({
        "question": pair["question"],
        "generated_answer": generated,
        "has_citations": len(citation_sources) > 0,
        "citation_sources": citation_sources,
        # Citation check: did the answer cite the document we expected?
        "expected_source_cited": any(
            uri.endswith(pair["expected_source"]) for uri in citation_sources
        )
    })
    # Compare generated with pair["expected_answer"] using ROUGE or LLM-as-Judge

Agent Evaluation

AGENT EVALUATION METRICS
===========================

1. TASK COMPLETION RATE
   Did the agent successfully complete the task?
   Binary: success/failure per test case
   Target: > 90% for production readiness

2. TOOL SELECTION ACCURACY
   Did the agent choose the CORRECT tool/action group?
   Measures understanding of available tools
   Test with: golden test cases with expected tool calls

3. STEP EFFICIENCY
   How many steps did the agent take?
   Fewer steps = better (less cost, less latency)
   Compare: actual steps vs optimal steps

4. REASONING TRACE QUALITY
   Is the agent's reasoning logical and coherent?
   Review via: enableTrace=True in InvokeAgent
   Assessment: LLM-as-Judge on the trace output

5. ERROR RECOVERY
   Does the agent handle tool failures gracefully?
   Does it retry or choose alternative paths?
   Test with: intentionally failing action groups

AGENT EVALUATION PIPELINE
============================

Step 1: Define golden test cases
  ┌─────────────────────────────────────┐
  │ Input: "What are our Q3 sales?"     │
  │ Expected tools: [QuerySalesDB]      │
  │ Expected answer: contains "$X.XM"   │
  └─────────────────────────────────────┘

Step 2: Run agent with enableTrace=True
  → Capture full reasoning trace
  → Record tools invoked + parameters passed
  → Record step count and total latency

Step 3: Compare actual vs expected
  → Correct tools selected? (tool selection accuracy)
  → Correct parameters passed? (parameter accuracy)
  → Final answer matches expected? (task completion)

Step 4: Measure efficiency
  → Steps taken vs optimal steps
  → Total latency and token consumption
  → Cost per completed task
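
Steps 3 and 4 of the pipeline reduce to a few ratios once each run has been recorded. A minimal sketch, assuming you have already extracted the tool calls and step count from the trace; `GoldenCase`, `AgentRun`, and the tool names are hypothetical, and tool selection is scored as an exact match on the ordered tool list:

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    prompt: str
    expected_tools: list
    optimal_steps: int


@dataclass
class AgentRun:
    tools_called: list
    steps_taken: int
    completed: bool


def summarize_agent_runs(cases, runs):
    """Aggregate tool selection accuracy, task completion, and step efficiency."""
    if not cases or len(cases) != len(runs):
        raise ValueError("need one recorded run per golden case")
    n = len(cases)
    tool_hits = sum(run.tools_called == case.expected_tools
                    for case, run in zip(cases, runs))
    completed = sum(run.completed for run in runs)
    # Efficiency is optimal/actual steps, capped at 1.0 per case.
    efficiency = sum(
        min(1.0, case.optimal_steps / run.steps_taken) if run.steps_taken else 0.0
        for case, run in zip(cases, runs)
    )
    return {
        "tool_selection_accuracy": tool_hits / n,
        "task_completion_rate": completed / n,
        "step_efficiency": efficiency / n,
    }


cases = [
    GoldenCase("What are our Q3 sales?", ["QuerySalesDB"], optimal_steps=2),
    GoldenCase("Who owns account 42?", ["LookupCustomer"], optimal_steps=2),
]
runs = [
    AgentRun(tools_called=["QuerySalesDB"], steps_taken=2, completed=True),
    AgentRun(tools_called=["QuerySalesDB"], steps_taken=4, completed=False),
]
print(summarize_agent_runs(cases, runs))
```

Reporting the three numbers separately matters: a drop in tool selection accuracy points at action group descriptions, while a drop in step efficiency alone points at orchestration.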

[PRODUCTION INSIGHT] In practice, the most common agent failure mode is tool selection errors — the agent picks the wrong action group or passes incorrect parameters. I always start agent evaluation with tool selection accuracy testing before worrying about answer quality. If the agent is calling the wrong tools, no amount of prompt engineering will improve the final answer.

User-Centered Evaluation

USER FEEDBACK PATTERNS
========================

1. IMPLICIT FEEDBACK (inferred from user behavior)
   - Regeneration requests (user asks same question again)
   - Session abandonment (user leaves without achieving goal)

2. EXPLICIT FEEDBACK (user deliberately rates or comments)
   - Thumbs up / thumbs down on responses
   - "Was this helpful?" Y/N
   - Star rating (1-5 scale)
   - Free-text feedback
   - Report issues ("This is wrong," "Inappropriate content")
   - Domain expert annotations

3. A/B TESTING PATTERNS
   ┌─────────────────────────────────────────┐
   │ PROMPT A/B TEST:                         │
   │  50/50 traffic split → compare quality   │
   │  API Gateway weighted routing             │
   │                                           │
   │ MODEL A/B TEST:                           │
   │  Route between Sonnet vs Haiku            │
   │  Measure: quality, latency, cost          │
   │                                           │
   │ CANARY DEPLOYMENT:                        │
   │  5% → 25% → 50% → 100%                   │
   │  Monitor for quality regression            │
   │  Roll back if metrics degrade              │
   └─────────────────────────────────────────┘
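
Whatever service performs the routing, an A/B test or canary needs each user to land in the same variant on every request. One common way to get that is deterministic hash bucketing, sketched below. This is an application-level illustration, not a description of how any AWS routing feature works internally; the function names and experiment label are hypothetical.

```python
import hashlib

CANARY_STAGES = [5, 25, 50, 100]  # percent of traffic sent to the candidate


def bucket_for(user_id, experiment="prompt-ab-test"):
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def assign_variant(user_id, candidate_percent, experiment="prompt-ab-test"):
    """Users whose bucket falls below the percentage get the candidate."""
    if bucket_for(user_id, experiment) < candidate_percent:
        return "candidate"
    return "baseline"


users = [f"user-{i}" for i in range(2000)]
for percent in CANARY_STAGES:
    share = sum(assign_variant(u, percent) == "candidate" for u in users) / len(users)
    print(f"stage {percent:>3}% -> observed candidate share {share:.2%}")
```

Because the bucket is fixed per user, widening the canary from 5% to 25% only adds users to the candidate group; nobody flips back to baseline mid-rollout, which keeps before/after quality comparisons clean.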

Quality Assurance Processes

Golden Dataset Management

GOLDEN DATASET BEST PRACTICES
================================

CREATION:
  - Start with 50-100 curated question-answer pairs
  - Cover all major use cases AND edge cases
  - Include adversarial/tricky inputs
  - Version in S3 with lifecycle tags

MAINTENANCE:
  - Review and update quarterly
  - Add new cases from production failures
  - Remove outdated questions
  - Track dataset version alongside model/prompt versions

REGRESSION TESTING:
  - Run golden dataset after EVERY prompt change
  - Run after EVERY knowledge base update
  - Run after guardrail configuration changes
  - Fail CI/CD if quality drops below threshold

CI/CD Quality Gates for GenAI

CI/CD QUALITY GATES
=====================

  ┌─────────────────────────────────────────────┐
  │ Gate 1: PROMPT LINT                          │
  │   Check syntax, variable names, structure    │
  │   → BLOCK deployment if fails                │
  ├─────────────────────────────────────────────┤
  │ Gate 2: UNIT TESTS                           │
  │   Mocked model responses, logic validation   │
  │   → BLOCK deployment if fails                │
  ├─────────────────────────────────────────────┤
  │ Gate 3: GOLDEN DATASET EVALUATION            │
  │   Run against golden Q&A pairs               │
  │   Quality score must exceed threshold         │
  │   → BLOCK deployment if below threshold       │
  ├─────────────────────────────────────────────┤
  │ Gate 4: GUARDRAIL TESTS                      │
  │   Adversarial inputs, safety scenarios        │
  │   Verify guardrails block as expected         │
  │   → BLOCK deployment if fails                │
  ├─────────────────────────────────────────────┤
  │ Gate 5: LATENCY BENCHMARK                    │
  │   P99 must be under SLA limit                │
  │   → WARN if exceeds (may not block)          │
  ├─────────────────────────────────────────────┤
  │ Gate 6: COST ESTIMATION                      │
  │   Per-request cost under budget               │
  │   → WARN if exceeds (may not block)          │
  └─────────────────────────────────────────────┘
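
The distinction between BLOCK and WARN gates can be expressed as a small decision function that a pipeline step calls after all checks have run. A minimal sketch; the gate names mirror the diagram, and the pass/fail values are made up:

```python
def evaluate_gates(results):
    """Decide whether to deploy from a list of gate results.

    Each result is a dict with "name", "passed" (bool), and "blocking" (bool).
    Failed blocking gates stop the deployment; failed non-blocking gates warn.
    """
    blocked_by = [r["name"] for r in results if not r["passed"] and r["blocking"]]
    warnings = [r["name"] for r in results if not r["passed"] and not r["blocking"]]
    return {"deploy": not blocked_by, "blocked_by": blocked_by, "warnings": warnings}


gate_results = [
    {"name": "prompt_lint", "passed": True, "blocking": True},
    {"name": "unit_tests", "passed": True, "blocking": True},
    {"name": "golden_dataset", "passed": False, "blocking": True},
    {"name": "guardrail_tests", "passed": True, "blocking": True},
    {"name": "latency_benchmark", "passed": False, "blocking": False},
    {"name": "cost_estimation", "passed": True, "blocking": False},
]

decision = evaluate_gates(gate_results)
print(decision)
```

In a real pipeline the step would exit non-zero when `deploy` is false, so the failed golden dataset gate stops the release while the latency warning is only surfaced to reviewers.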

[EXAM TIP] "Ensure quality doesn't regress after a prompt change" → golden dataset regression testing in CI/CD pipeline. "Automate quality checks before deployment" → CI/CD quality gates. The exam specifically tests whether you know to gate deployments on quality, not just functional tests.

Semantic Drift Detection

SEMANTIC DRIFT DETECTION
===========================

WHAT: Model outputs gradually change in meaning/style over time,
      even though the prompt hasn't changed.

CAUSES:
  - Model version updates (provider-side changes)
  - Data distribution shifts (different user queries)
  - Knowledge base content changes

DETECTION:
  - Schedule golden dataset runs (daily/weekly)
  - Compare output embeddings over time
  - Track quality metrics trends in CloudWatch
  - Alert when cosine similarity drops below threshold

RESPONSE:
  - Investigate root cause (model update? data change?)
  - Update golden dataset if business requirements changed
  - Re-tune prompts to compensate for model changes
  - Consider prompt versioning with Bedrock Prompt Management
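
The "compare output embeddings over time" check can be sketched as a mean cosine similarity between a stored baseline and the current run of the same golden prompts. The embeddings below are toy vectors, and the 0.85 threshold is an arbitrary placeholder; in practice the vectors would come from an embedding model and the threshold from your own baseline variance.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_drift(baseline_embeddings, current_embeddings, threshold=0.85):
    """Compare each golden prompt's current output embedding to its baseline."""
    if not baseline_embeddings or len(baseline_embeddings) != len(current_embeddings):
        raise ValueError("need one current embedding per baseline embedding")
    similarities = [cosine_similarity(b, c)
                    for b, c in zip(baseline_embeddings, current_embeddings)]
    mean_similarity = sum(similarities) / len(similarities)
    return {"mean_similarity": mean_similarity, "drifted": mean_similarity < threshold}


baseline = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
stable = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
shifted = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

print(detect_drift(baseline, stable))
print(detect_drift(baseline, shifted))
```

Publishing `mean_similarity` as a custom metric after each scheduled golden dataset run is one way to get the trend line and alarm described in the DETECTION list.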

Bedrock Prompt Management

BEDROCK PROMPT MANAGEMENT
============================

PURPOSE: Version control for prompts — like Git for your GenAI prompts.

FEATURES:
  - Create and store prompt templates
  - Version management (v1, v2, v3...)
  - Test prompts in Bedrock playground
  - Rollback to previous versions
  - Share across team members

WORKFLOW:
  1. Create prompt template with variables
  2. Test in playground
  3. Publish as new version
  4. Reference version in application code
  5. Roll back if quality degrades

USE FOR: "How to manage prompt changes across environments"
         "Version control for prompt templates"
         "Roll back a prompt change that degraded quality"

Practice Questions — Section 2: Evaluation Systems

Q1. A company wants to evaluate whether their summarization model captures the key points of source documents. They have reference summaries written by domain experts. Which metric is MOST appropriate?

A) BLEU
B) ROUGE-L
C) Perplexity
D) F1 Score

Answer: B
ROUGE-L measures the longest common subsequence between the generated summary and reference summary, making it the standard metric for summarization evaluation. It captures content overlap while allowing for some flexibility in phrasing. BLEU (A) is designed for translation tasks and measures precision. Perplexity (C) measures language fluency, not content quality. F1 (D) is for extractive QA where exact answers exist.

Q2. A team needs to detect when their RAG application generates answers that include information NOT present in the retrieved context. They want automated checking at scale in their CI/CD pipeline. Which approach is MOST effective?

A) Compute ROUGE scores against reference answers
B) Use Bedrock Guardrails contextual grounding check
C) Have human reviewers rate each answer
D) Measure retrieval latency

Answer: B
Contextual grounding checks validate whether the generated response is supported by the retrieved context, which is exactly the faithfulness dimension. It runs automatically and can be integrated into CI/CD pipelines. ROUGE (A) checks overlap with a reference answer but doesn't verify grounding against retrieved context. Human review (C) is accurate but doesn't scale for CI/CD. Latency (D) measures performance, not content quality.

Q3. A company uses LLM-as-a-Judge to evaluate their chatbot. They notice the judge model consistently rates its own model family's outputs higher than outputs from other models. What is this issue and how should they mitigate it?

A) Order bias: run comparisons in both orders
B) Self-preference bias: use a different model family as judge
C) Verbosity bias: penalize long answers in the rubric
D) Confirmation bias: increase the temperature of the judge

Answer: B
Self-preference bias occurs when a model rates its own model family's outputs higher. The mitigation is to use a different model family as judge (e.g., use Claude to judge Llama outputs, not Claude judging Claude). Order bias (A) relates to position in pairwise comparison, not model preference. Verbosity bias (C) relates to length preference. Confirmation bias (D) is not a standard LLM evaluation bias term.

Q4. A team has deployed a Bedrock Agent that sometimes calls the wrong action group for user requests. They want to systematically identify and fix these routing errors. What should they implement FIRST?

A) Fine-tune the agent's foundation model
B) Enable trace logging and create a golden test dataset with expected tool calls
C) Switch to a more powerful model
D) Add more action groups with overlapping functionality

Answer: B
Enable trace logging (enableTrace=True) to see the agent's reasoning and tool selection decisions, then create a golden test dataset with known inputs and expected tool selections. This identifies exactly which queries cause routing failures. Fine-tuning (A) is premature without diagnosis: you must understand the problem before attempting to fix it. Switching models (C) may not fix routing logic issues. Adding overlapping action groups (D) would make routing ambiguity worse.

Q5. A team wants to compare Claude 3.5 Sonnet and Claude 3 Haiku for their customer support chatbot. They need both automated metric scores and subjective quality ratings from domain experts. Which AWS service supports both?

A) SageMaker Model Monitor
B) Amazon Bedrock Model Evaluation
C) FMEval library
D) CloudWatch custom metrics

Answer: B
Bedrock Model Evaluation supports both automatic evaluation (automated metric scores, side-by-side comparison) and human evaluation (domain experts rate outputs on custom criteria). It's the managed AWS service designed specifically for model comparison. Model Monitor (A) detects drift in deployed models, not comparison. FMEval (C) supports programmatic automated evaluation but not managed human evaluation workflows. CloudWatch (D) monitors operational metrics, not model quality.

SECTION 3: TROUBLESHOOTING GENAI APPLICATIONS (Task 5.2)

Lines: ~350 | Priority: HIGH — practical troubleshooting tested as scenarios

Systematic Debugging Hierarchy

When a GenAI application fails or degrades, follow this 7-step hierarchy:

GENAI TROUBLESHOOTING — 7-STEP HIERARCHY
==========================================

Step 1: CHECK CLOUDWATCH METRICS
  → Error rates, latency spikes, throttling counts
  → Is this an infrastructure problem or a content problem?

Step 2: VALIDATE INPUTS
  → Check request format, token counts, model parameters
  → Is the input valid and within model limits?

Step 3: TRACE WITH X-RAY
  → End-to-end request path, subsegment latencies
  → WHERE in the pipeline is the issue?

Step 4: CHECK AGENT TRACES (if using agents)
  → enableTrace=True: reasoning, tool selection, parameters
  → Is the agent choosing the right tools?

Step 5: ISOLATE WITH PLAYGROUND
  → Test the exact prompt in Bedrock playground
  → Remove variables: is it the prompt or the system?

Step 6: CHECK VECTOR STORE HEALTH (if RAG)
  → Sync status, index health, retrieval relevance
  → Is the retrieval returning correct context?

Step 7: REVIEW GUARDRAILS LOGS
  → Is content being blocked unexpectedly?
  → Are filters too aggressive or too permissive?

[EXAM TIP] The exam presents troubleshooting scenarios and asks "what should you check FIRST?" The answer follows this hierarchy — start with CloudWatch (infrastructure), then validate inputs, then trace the request path. Don't jump to prompt changes before checking infrastructure.
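Step 2 (validate inputs) can be automated as a pre-flight check that runs before any Bedrock call. The sketch below is a minimal illustration, not an AWS API: `ModelLimits`, `estimate_tokens`, and `validate_request` are hypothetical names, the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer, and the 0 to 1 parameter ranges should be checked against the target model's documentation.

```python
from dataclasses import dataclass


@dataclass
class ModelLimits:
    """Hypothetical per-model limits; look up the real values for your model."""
    context_window_tokens: int
    max_output_tokens: int


def estimate_tokens(text):
    # Assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def validate_request(prompt, max_tokens, temperature, top_p, limits):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if not prompt.strip():
        problems.append("prompt is empty")
    if not 0.0 <= temperature <= 1.0:
        problems.append(f"temperature {temperature} is outside [0, 1]")
    if not 0.0 <= top_p <= 1.0:
        problems.append(f"top_p {top_p} is outside [0, 1]")
    if not 1 <= max_tokens <= limits.max_output_tokens:
        problems.append(
            f"max_tokens {max_tokens} is outside [1, {limits.max_output_tokens}]"
        )
    # Input and requested output must fit in the context window together.
    needed = estimate_tokens(prompt) + max_tokens
    if needed > limits.context_window_tokens:
        problems.append(
            f"estimated {needed} tokens exceeds the context window "
            f"of {limits.context_window_tokens}"
        )
    return problems
```

Running a check like this first separates "the request was never valid" from "the service is misbehaving", which is the distinction Steps 1 and 2 are meant to draw.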

Content Issues

Hallucination

SYMPTOM: Model generates information not in source documents

DIAGNOSIS:
  → Is RAG enabled? If not, enable it.
  → Is contextual grounding enabled? If not, add it.
  → What is the temperature setting?

SOLUTIONS (ordered by impact):
  1. Enable RAG with Bedrock Knowledge Bases
     (ground responses in actual documents)
  2. Enable Guardrails contextual grounding check
     (filter responses not supported by context)
  3. Lower temperature to 0.0-0.2
     (reduce creative/inventive responses)
  4. Add explicit instruction in system prompt:
     "Only answer based on the provided context.
      If the context doesn't contain the answer, say 'I don't know.'"
  5. Improve retrieval quality (better chunks, re-ranking)
     (give the model better context to ground on)

[TRAP] The exam may describe hallucination and offer "fine-tune the model" as an option. Fine-tuning does NOT fix hallucination — it teaches the model style/task patterns, not factual knowledge. RAG + contextual grounding is the correct approach for grounding in facts.
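Solutions 3 and 4 above can be combined in a single request. The following is a sketch under stated assumptions: `build_grounded_request` is a hypothetical helper, the XML-style tags are just one way to delimit context, and the returned dictionary is shaped for the Bedrock Converse API (`modelId`, `system`, `messages`, `inferenceConfig`). The guardrail contextual grounding check is configured separately and is left out here.

```python
GROUNDING_INSTRUCTION = (
    "Only answer based on the provided context. "
    "If the context doesn't contain the answer, say 'I don't know.'"
)


def build_grounded_request(model_id, question, context_chunks, temperature=0.1):
    """Build Converse-style keyword arguments that keep the model on the context."""
    # Number each retrieved chunk so the model (and a reviewer) can refer to it.
    context = "\n\n".join(
        f'<chunk id="{i}">\n{chunk}\n</chunk>'
        for i, chunk in enumerate(context_chunks, start=1)
    )
    user_text = (
        f"<context>\n{context}\n</context>\n\n"
        f"<question>\n{question}\n</question>"
    )
    return {
        "modelId": model_id,
        "system": [{"text": GROUNDING_INSTRUCTION}],
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        # Low temperature reduces inventive completions; 512 is an arbitrary cap.
        "inferenceConfig": {"temperature": temperature, "maxTokens": 512},
    }
```

With boto3 installed and credentials configured, the result would typically be sent with `boto3.client("bedrock-runtime").converse(**request)`.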

Context Window Overflow

SYMPTOM: Model returns errors or truncated/degraded output
         with long inputs

DIAGNOSIS:
  → Check input token count vs model's context window limit
  → Is the RAG context too large (too many chunks)?
  → Is conversation history accumulating?

SOLUTIONS:
  1. DYNAMIC CHUNKING: Retrieve fewer, more relevant chunks
     (top-3 instead of top-10)
  2. SUMMARIZATION: Summarize long documents before injection
     (hierarchical summarization for very long docs)
  3. SLIDING WINDOW: For conversations, keep only last N turns
     (summarize older turns)
  4. METADATA FILTERS: Pre-filter to reduce retrieval set
     (filter by date, department, document type)
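The sliding-window approach (solution 3) can be sketched as a small helper that keeps the most recent turns within both a turn limit and a token budget, and hands back the older turns so they can be summarized. `trim_history` and `estimate_tokens` are hypothetical names, and the token estimate is a rough 4-characters-per-token assumption, not a real tokenizer.

```python
def estimate_tokens(text):
    # Assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def trim_history(turns, max_turns, token_budget):
    """Split conversation turns (oldest first) into (kept, dropped).

    kept    : the newest turns that fit within max_turns and token_budget
    dropped : the older turns, candidates for summarization
    """
    recent = turns[-max_turns:] if max_turns > 0 else []
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that
    # would exceed the budget, so the kept turns stay contiguous.
    for turn in reversed(recent):
        cost = estimate_tokens(turn["text"])
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    dropped = turns[: len(turns) - len(kept)]
    return kept, dropped
```

A caller would usually summarize `dropped` into one short message and place it ahead of `kept` in the next request.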

Output Format Issues

SYMPTOM: Model output format is inconsistent or wrong

SOLUTIONS:
  1. Set temperature=0 for near-deterministic output
  2. Add explicit output format in system prompt
     ("Respond ONLY in valid JSON with this schema: {...}")
  3. Provide few-shot examples showing exact format
  4. Add stopSequences to terminate at the right point
  5. Use structured output parsing (e.g., JSON schema validation)
  6. Consider fine-tuning for consistent format adherence
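Solution 5 (structured output parsing) can be illustrated with the standard library alone. This is a minimal sketch, assuming the model was asked for a single JSON object: `parse_json_output` is a hypothetical helper, and the type check is a simplified stand-in for full JSON Schema validation.

```python
import json


def parse_json_output(raw, required):
    """Extract and check a JSON object from model output.

    required maps field name -> expected Python type.
    Returns (data, errors); data is None when no valid JSON was found.
    """
    # Models often wrap JSON in extra prose, so take the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None, ["no JSON object found"]
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]

    errors = []
    for field, expected_type in required.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return data, errors
```

When `errors` is non-empty, a common pattern is to retry the request with the error list appended to the prompt, up to a small fixed number of attempts.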

API Errors and Resolutions

Error	Cause	Resolution
ThrottlingException	Exceeded API rate limits	Exponential backoff with jitter; request quota increase; consider Provisioned Throughput; Cross-Region Inference
ModelTimeoutException	Response took too long	Reduce input tokens; use ConverseStream for long responses; increase Lambda timeout (Bedrock can take 30s+)
ValidationException	Invalid request format	Check request body matches model requirements; verify modelId correct and enabled; check maxTokens doesn't exceed limit
AccessDeniedException	Insufficient permissions	Verify IAM policy allows `bedrock:InvokeModel`; check model access enabled in console; verify region; check ARN format
ResourceNotFoundException	Model/resource not found	Verify modelId exists; check model available in region; verify Knowledge Base ID correct
ServiceUnavailableException	Bedrock service issue	Retry with backoff; check AWS Health Dashboard; failover via Cross-Region Inference profile

[EXAM TIP] ThrottlingException is the most tested API error. The correct response is ALWAYS exponential backoff with jitter. If the scenario describes "immediate retry on throttling" as the current behavior, the answer is to implement exponential backoff. Do NOT select "increase Lambda memory" or "switch to a smaller model" — these don't address the retry strategy.
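Exponential backoff with jitter can be sketched as follows. This is an illustrative wrapper, not the SDK's built-in retry logic: `call_with_backoff` and `error_code` are hypothetical names, the delays are arbitrary defaults, and the error code lookup assumes the exception carries a botocore-style `response["Error"]["Code"]` dictionary. Only the two errors the table marks as retryable are retried.

```python
import random
import time

RETRYABLE_CODES = {"ThrottlingException", "ServiceUnavailableException"}


def error_code(exc):
    # botocore's ClientError stores the code at exc.response["Error"]["Code"].
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        return response.get("Error", {}).get("Code", "")
    return ""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Run call(), retrying retryable errors with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if error_code(exc) not in RETRYABLE_CODES or last_attempt:
                raise
            # Exponential growth (0.5s, 1s, 2s, ...) capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random fraction of the cap so that many
            # clients do not retry at the same instant.
            sleep(rng() * cap)
```

Usage would look like `call_with_backoff(lambda: client.converse(**request))`. The `sleep` and `rng` parameters exist only so the behavior can be tested without waiting.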

Retrieval Issues (RAG Troubleshooting)

RAG RETRIEVAL TROUBLESHOOTING DECISION TREE
=============================================

  "RAG results are poor"
         │
    ┌────┴───────────────────────────┐
    │                                │
  Low relevance?                   Missing results?
    │                                │
    ▼                                ▼
  ┌──────────────────┐         ┌──────────────────┐
  │ - Tune chunking  │         │ - Check sync      │
  │   strategy       │         │   completed?      │
  │ - Enable hybrid  │         │ - Verify source   │
  │   search         │         │   permissions     │
  │ - Add metadata   │         │ - Supported doc   │
  │   filters        │         │   format?         │
  │ - Enable         │         │ - Trigger manual  │
  │   re-ranking     │         │   sync            │
  └──────────────────┘         └──────────────────┘

    │                                │
  Proper nouns                    Stale results?
  not found?                         │
    │                                ▼
    ▼                          ┌──────────────────┐
  ┌──────────────────┐         │ - Enable         │
  │ Enable HYBRID    │         │   incremental    │
  │ search (keyword  │         │   sync           │
  │ + semantic)      │         │ - Set up sync    │
  │                  │         │   schedule       │
  │ Keyword handles  │         │ - Monitor with   │
  │ proper nouns     │         │   CloudWatch     │
  │ better than      │         └──────────────────┘
  │ vector search    │
  └──────────────────┘

[EXAM TIP] "Proper nouns not found in RAG search" → enable hybrid search. Semantic (vector) search matches meaning but struggles with specific names, product codes, and identifiers. Keyword search handles these precisely. Combine both for comprehensive retrieval.
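Hybrid search and metadata filtering are set per query in the retrieval configuration. The sketch below builds the request for the Knowledge Bases `Retrieve` operation on the `bedrock-agent-runtime` client. `build_retrieve_request` is a hypothetical helper, the field names are intended to match that API and should be confirmed against the current boto3 reference, and hybrid search is only available when the underlying vector store supports it.

```python
def build_retrieve_request(kb_id, query, top_k=5, hybrid=True, metadata_equals=None):
    """Build keyword arguments for a Knowledge Base retrieve call.

    metadata_equals: optional (key, value) pair used as an equality filter.
    """
    vector_config = {
        "numberOfResults": top_k,
        # HYBRID combines keyword and semantic matching; SEMANTIC is vector only.
        "overrideSearchType": "HYBRID" if hybrid else "SEMANTIC",
    }
    if metadata_equals is not None:
        key, value = metadata_equals
        vector_config["filter"] = {"equals": {"key": key, "value": value}}
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {"vectorSearchConfiguration": vector_config},
    }
```

With boto3, the request would typically be sent as `boto3.client("bedrock-agent-runtime").retrieve(**request)`. Comparing results for the same query with `hybrid=True` and `hybrid=False` is a quick way to confirm whether exact identifiers are the problem.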

Prompt Debugging

PROMPT DEBUGGING PATTERNS
============================

SYMPTOM: Inconsistent outputs across runs
  Problem→ Non-deterministic generation
  Solution→ Set temperature=0, add output format, use few-shot examples

SYMPTOM: Model misinterprets instructions
  Problem→ Ambiguous or complex prompt
  Solution→ Simplify language, use XML tags to structure sections,
            add explicit constraints ("Do NOT include...")

SYMPTOM: Output too long / too short
  Problem→ maxTokens misconfigured or instructions unclear
  Solution→ Right-size maxTokens, add length guidance in prompt,
            use stopSequences for early termination

SYMPTOM: Token limit exceeded
  Problem→ Input too large for context window
  Solution→ Compress context (summarize), reduce few-shot examples,
            split into multiple requests, hierarchical summarization

SYMPTOM: Format drift over conversation
  Problem→ Model "forgets" format in long conversations
  Solution→ Reinforce format in system prompt (persists across turns),
            use stopSequences, reset context periodically

Prompt Maintenance and Version Control

PROMPT MAINTENANCE BEST PRACTICES
====================================

1. VERSION CONTROL (Bedrock Prompt Management)
   - Store prompt templates with version numbers
   - Test in playground before publishing
   - Roll back to previous versions if quality degrades
   - Track which version is deployed in each environment

2. TEMPLATE TESTING
   - CloudWatch Logs Insights: query output patterns
   - X-Ray: trace prompt performance over time
   - Schema validation: verify output matches expected schema

3. OBSERVABILITY
   - Log prompt version alongside model outputs
   - Track quality metrics per prompt version
   - Alert on quality regression after version change

4. CHANGE MANAGEMENT
   - Golden dataset regression test before deployment
   - Canary deployment (5% → 25% → 50% → 100%)
   - Automated rollback if quality metrics drop
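The golden dataset regression test in item 4 reduces to comparing per-metric scores from the current prompt version against the candidate. A minimal sketch, assuming each metric is a mean score on the golden dataset where higher is better: `regression_gate` is a hypothetical function, and the 0.02 tolerance is an arbitrary example threshold.

```python
def regression_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Compare candidate metrics with the baseline.

    Both arguments map metric name -> mean score (higher is better).
    Returns (passed, failures), where failures lists each metric that is
    missing or dropped by more than max_drop.
    """
    failures = []
    for metric, base in baseline_scores.items():
        cand = candidate_scores.get(metric)
        if cand is None:
            failures.append(f"{metric}: missing from candidate run")
        elif base - cand > max_drop:
            failures.append(f"{metric}: dropped {base - cand:.3f} (limit {max_drop})")
    return (not failures, failures)
```

In a pipeline, a failed gate would block the deployment step. The same comparison can run at each canary stage to drive the automated rollback.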

Practice Questions — Section 3: Troubleshooting

Q6. After updating their knowledge base documents, a team notices that the RAG application returns stale information for some queries. The data source sync shows as completed. What is the MOST likely cause?

A) The embedding model has a knowledge cutoff date
B) The vector index needs to be completely rebuilt
C) The incremental sync didn't process all updated documents
D) The foundation model caches previous responses

Answer: C Incremental sync may not have detected or processed all updated documents, even though the job reports as completed. The team should check the sync job details and verify that the updated documents were re-chunked and re-embedded, or trigger a full sync. The embedding model (A) doesn't have a knowledge cutoff; it processes whatever text it receives. Vector indexes (B) update incrementally and don't need full rebuilds. Foundation models (D) don't cache responses between separate requests.

Q7. A production GenAI application starts returning ThrottlingException errors during peak hours. The application currently retries immediately on failure, causing more throttling. What should the team implement?

A) Switch to Provisioned Throughput immediately
B) Implement exponential backoff with jitter
C) Increase Lambda memory allocation
D) Switch to a smaller, faster model

Answer: B The immediate retry strategy is causing a "thundering herd" problem — failed requests retry immediately, adding to the congestion. Exponential backoff with jitter spaces out retries with randomized delays, reducing pressure on the service. Provisioned Throughput (A) is a longer-term solution but doesn't fix the retry storm happening now. Lambda memory (C) is unrelated to API throttling. A smaller model (D) may have different quotas but doesn't address the retry strategy that's amplifying the problem.

Q8. A customer support chatbot using RAG consistently fails to find product serial numbers when users ask about specific products. The knowledge base contains documents with serial numbers. What should the team change?

A) Use a larger embedding model with more dimensions
B) Enable hybrid search combining semantic and keyword search
C) Increase the number of retrieved chunks from 5 to 20
D) Fine-tune the foundation model on product data

Answer: B Serial numbers and product identifiers are exact terms that semantic (vector) search handles poorly — vector embeddings capture meaning, not exact strings. Hybrid search adds keyword search alongside semantic search, enabling precise matching for proper nouns, codes, and identifiers. A larger embedding model (A) still encodes meaning, not exact strings. More chunks (C) increases noise without improving relevance. Fine-tuning (D) teaches the model style, not retrieval capabilities.

Q9. A development team notices their GenAI application produces correct responses in the Bedrock playground but gives incorrect responses in production. The prompt is identical. What should they investigate FIRST?

A) The foundation model was updated by the provider
B) Differences in inference parameters between playground and production code
C) Network latency causing response truncation
D) CloudWatch alarm configuration

Answer: B When the same prompt works in playground but fails in production, the most likely cause is a discrepancy in inference parameters (temperature, maxTokens, topP) or system prompt differences between the two environments. The playground may use different defaults than the application code. A model update (A) would affect both playground and production equally. Network latency (C) doesn't cause incorrect content. CloudWatch (D) monitors but doesn't affect output quality.
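The investigation in Q9 amounts to a side-by-side comparison of settings. A small sketch, assuming both sets of inference parameters have been written down as dictionaries: `diff_inference_settings` is a hypothetical helper, and the parameter names are only examples.

```python
def diff_inference_settings(playground, production):
    """Return {name: (playground_value, production_value)} for every mismatch."""
    diffs = {}
    for key in sorted(set(playground) | set(production)):
        # A parameter left unset falls back to a default, which may differ
        # between the console and application code, so report it explicitly.
        left = playground.get(key, "<unset>")
        right = production.get(key, "<unset>")
        if left != right:
            diffs[key] = (left, right)
    return diffs
```

The system prompt can be added to both dictionaries as another key, so that a difference there is reported alongside the numeric parameters.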

SECTION 4: DOMAIN 5 COMPREHENSIVE REVIEW

Lines: ~250 | Priority: Review before exam

Top 8 Exam Traps

#	Trap	Correct Understanding
1	"Use accuracy/F1 for GenAI evaluation"	These are traditional ML metrics; open-ended GenAI output needs ROUGE, LLM-as-Judge, human eval (F1/Exact Match still fit extractive QA)
2	"Fine-tune to fix hallucination"	Fine-tuning teaches style/task patterns; RAG + contextual grounding fixes hallucination
3	"LLM-as-Judge is unbiased"	Has self-preference, order, and verbosity biases — use different model family as judge
4	"Model invocation logging is on by default"	It's OFF by default — must explicitly enable to see prompt/response content
5	"ROUGE measures precision"	ROUGE is RECALL-oriented; BLEU is PRECISION-oriented
6	"Retry immediately on throttling"	Use exponential backoff with jitter — immediate retry makes throttling worse
7	"Perplexity: higher is better"	Perplexity: LOWER is better (less "surprised" = more fluent)
8	"Vector search finds product codes"	Semantic search matches meaning, not exact strings — use hybrid search for identifiers

Troubleshooting Decision Tree

GENAI APPLICATION TROUBLESHOOTING
====================================

  "Application is not working correctly"
         │
    ┌────┴────────────────────────────────┐
    │                                     │
  Error responses?                 Poor quality output?
    │                                     │
    ├─ ThrottlingException               ├─ Hallucination
    │   → Exponential backoff            │   → RAG + contextual grounding
    │   → Request quota increase          │   → Lower temperature
    │                                     │
    ├─ ModelTimeoutException             ├─ Format inconsistency
    │   → Reduce input tokens            │   → temp=0 + few-shot + schema
    │   → Use streaming                  │   → stopSequences
    │                                     │
    ├─ ValidationException               ├─ Irrelevant/off-topic
    │   → Check request format           │   → Improve system prompt
    │   → Verify model ID + access       │   → Add guardrail denied topics
    │                                     │
    ├─ AccessDeniedException             ├─ RAG: wrong results
    │   → Check IAM policy               │   → Hybrid search + re-ranking
    │   → Verify model access enabled    │   → Better chunking + metadata
    │                                     │
    └─ ServiceUnavailableException       └─ RAG: stale results
        → Retry with backoff                 → Check sync status
        → Cross-Region Inference             → Enable incremental sync

Evaluation Metric Selection Guide

QUICK REFERENCE: WHICH EVALUATION METHOD?
============================================

"Compare two models on my task"
  └──► Bedrock Model Evaluation (automatic mode)

"Get expert quality feedback on outputs"
  └──► Bedrock Model Evaluation (human mode)

"Evaluate in CI/CD pipeline programmatically"
  └──► FMEval library

"Scale evaluation without human reviewers"
  └──► LLM-as-a-Judge

"Measure summarization quality"
  └──► ROUGE-L (against reference summaries)

"Detect hallucination in RAG"
  └──► Faithfulness check (RAGAS) or Guardrails contextual grounding

"Check if translation is accurate"
  └──► BLEU score

"Verify semantic meaning preserved"
  └──► BERTScore

"Test agent tool selection"
  └──► Golden test cases with enableTrace=True

"Detect quality regression after changes"
  └──► Golden dataset regression testing in CI/CD

GenAI vs Traditional ML Evaluation Comparison

Aspect	Traditional ML	Generative AI
Ground truth	Single correct answer	Multiple valid outputs
Metrics	Accuracy, F1, precision, recall	ROUGE, BLEU, BERTScore, LLM-as-Judge
Evaluation type	Deterministic	Probabilistic + subjective
Automation	Fully automated	Automated + human + LLM-as-Judge
Test data	Labeled datasets	Golden datasets + evaluation rubrics
Quality dimensions	Accuracy	Accuracy + fluency + relevance + groundedness + safety
Drift detection	Statistical tests on features	Semantic drift detection on outputs
CI/CD gates	Accuracy threshold	Multi-metric quality gates
AWS tool	SageMaker Model Monitor	Bedrock Model Evaluation + FMEval

Rapid-Fire Exam Questions

Q10. Which metric is specifically designed for evaluating machine translation quality?

A) ROUGE-L
B) BLEU
C) BERTScore
D) Perplexity

Answer: B — BLEU (Bilingual Evaluation Understudy) was designed for translation. It measures n-gram precision against reference translations. ROUGE (A) is for summarization. BERTScore (C) measures semantic similarity. Perplexity (D) measures fluency.

Q11. A team wants to ensure that their production RAG application doesn't degrade in quality when they update the system prompt. What should they run before deploying the change?

A) CloudWatch Synthetics canary
B) Golden dataset regression test
C) X-Ray trace comparison
D) SageMaker Model Monitor baseline

Answer: B — Golden dataset regression testing compares output quality before and after the prompt change on a curated set of test cases. This catches quality regressions before they reach production. Synthetics (A) tests availability, not quality. X-Ray (C) compares performance traces. Model Monitor (D) detects drift in deployed models, not prompt changes.

Q12. A company's GenAI application works correctly in the development environment but returns AccessDeniedException in production. What should they check FIRST?

A) The model's context window limit
B) The IAM execution role's permissions for bedrock:InvokeModel
C) The model invocation logging configuration
D) The CloudWatch alarm thresholds

Answer: B — AccessDeniedException is an IAM permissions issue. The production environment's IAM execution role likely lacks the bedrock:InvokeModel permission, or the model access hasn't been enabled in the production account/region. Context window (A) causes ValidationException. Logging (C) is unrelated to access. Alarms (D) monitor but don't affect access.

Q13. When using LLM-as-a-Judge for pairwise comparison, a team notices the judge consistently prefers whichever answer is presented second. What bias is this, and how should they mitigate it?

A) Self-preference bias: use a different judge model
B) Verbosity bias: add length constraints to the rubric
C) Order bias: run each comparison in both orders and average
D) Anchoring bias: randomize the evaluation rubric

Answer: C — Position/order bias means the judge model favors a particular position (first or last). The standard mitigation is to run each comparison in both orders (A vs B, then B vs A) and average the results. Self-preference (A) is about model family, not position. Verbosity (B) is about length. Anchoring (D) is not a standard LLM evaluation bias.
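The both-orders mitigation from Q13 can be written as a short function. This is a sketch, assuming the judge is wrapped in a callable that returns a score between 0 and 1 for "the first answer shown is better": `debiased_preference` and the `judge` signature are hypothetical, not part of any AWS API.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Return a position-independent score in [0, 1] that answer_a is better.

    judge(question, first, second) -> score in [0, 1] that `first` is better.
    """
    # Round 1: A is shown first, so the judge's score is directly "A wins".
    a_when_first = judge(question, answer_a, answer_b)
    # Round 2: B is shown first, so "A wins" is the complement of the score.
    a_when_second = 1.0 - judge(question, answer_b, answer_a)
    # Averaging cancels any constant preference for one position.
    return (a_when_first + a_when_second) / 2.0
```

A judge that always favors the second position scores every pair at 0.5 after averaging, which correctly signals that it expressed no real preference between the answers.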

Q14. A deployed Bedrock Knowledge Base returns correct information but the generated response doesn't cite its sources. What should the team check?

A) The guardrails configuration
B) The citation configuration in the RetrieveAndGenerate API response parsing
C) The OpenSearch index settings
D) The model invocation logging configuration

Answer: B — Bedrock Knowledge Bases include citations in the RetrieveAndGenerate API response. If citations aren't showing, the team should check whether they're properly extracting and displaying citation data from the response object (citations → retrievedReferences → location). Guardrails (A) filter content, not citations. Index settings (C) affect retrieval. Logging (D) captures content for audit.
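The citation path named in Q14 (citations → retrievedReferences → location) can be walked with plain dictionary access. The sketch below is illustrative: `extract_citation_uris` is a hypothetical helper, and it assumes S3-backed data sources whose location carries an `s3Location.uri` field. Other source types use different location fields and are skipped here.

```python
def extract_citation_uris(response):
    """Collect unique source URIs from a RetrieveAndGenerate-style response."""
    sources = []
    for citation in response.get("citations", []):
        for reference in citation.get("retrievedReferences", []):
            location = reference.get("location", {})
            uri = location.get("s3Location", {}).get("uri")
            # Keep first-seen order and drop duplicates across citations.
            if uri and uri not in sources:
                sources.append(uri)
    return sources
```

If this returns an empty list for a response that clearly used retrieved content, the parsing code rather than the Knowledge Base is the first place to look.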

Q15. Which AWS tool detects that a deployed SageMaker model's input data distribution has changed compared to the training data?

A) CloudWatch Metrics
B) AWS X-Ray
C) SageMaker Model Monitor (data quality monitoring)
D) Bedrock Model Evaluation

Answer: C — SageMaker Model Monitor with data quality monitoring detects input distribution drift — changes in the statistical properties of input data compared to the training baseline. CloudWatch (A) monitors operational metrics. X-Ray (B) traces request flow. Bedrock Model Evaluation (D) compares models on Bedrock, not SageMaker drift.

EXAM DAY QUICK REFERENCE — DOMAIN 5

Evaluation: Remember the Taxonomy

Automated metrics → CI/CD, regression testing (fast, scalable, limited depth)
Human evaluation → final quality validation (slow, expensive, gold standard)
LLM-as-a-Judge → rapid iteration at scale (watch for biases)

Metrics: Remember Task-to-Metric

Summarization → ROUGE
Translation → BLEU
Semantic similarity → BERTScore
Fluency → Perplexity (LOWER is better)
Extractive QA → F1 / Exact Match
Open-ended → LLM-as-Judge + Human

RAG: Remember the Four Dimensions

Faithfulness = is the answer grounded in context?
Answer relevance = does it address the question?
Context precision = were retrieved chunks relevant?
Context recall = were all relevant chunks retrieved?

Troubleshooting: Remember the Hierarchy

1. CloudWatch → 2. Input validation → 3. X-Ray → 4. Agent traces → 5. Playground → 6. Vector store → 7. Guardrails

AWS Tools: Remember the Map

Compare models → Bedrock Model Evaluation
Programmatic evaluation → FMEval
Prompt version control → Bedrock Prompt Management
Deployed model drift → SageMaker Model Monitor

Study guide generated by Dr. Priya Ramanathan
Domain 5: Testing, Validation, and Troubleshooting (11% of AIP-C01)
Lines: ~1,350 | Practice Questions: 15 (5 + 4 + 6 rapid-fire)
Skills referenced: 12_EVALUATION_AND_TESTING, 08_MLOPS_AND_DEPLOYMENT, 06_RESPONSIBLE_AI
