**AIP-C01 Study Guide — Dr. Priya Ramanathan**
# Domain 5: Testing, Validation, and Troubleshooting
> **Domain Weight**: 11% (~7 of 65 scored questions)
> **Tasks**: 5.1–5.2 | **Skills**: 14+
> **Target Audience**: Professional-level (2+ years AWS, 1+ year GenAI hands-on)
---
## SECTION 1: DOMAIN OVERVIEW
*Lines: ~90 | Priority: Read first*
### Domain Scope
Domain 5 covers 11% of the AIP-C01 exam — roughly 7 of the 65 scored questions. This domain tests a fundamentally different mindset than traditional ML evaluation. In traditional ML, evaluation is deterministic — you have labeled test data, you compute accuracy/F1/precision/recall, and you're done. In GenAI, evaluation is **probabilistic and subjective** — the same question can have multiple valid answers, quality depends on context, and automated metrics alone are insufficient.
**Critical insight**: The exam specifically tests whether you understand WHY GenAI evaluation is different, not just HOW to evaluate. Questions will contrast traditional ML metrics with GenAI evaluation approaches, and the correct answer requires understanding that open-ended generation needs different tooling.
### The Fundamental Distinction
```
TRADITIONAL ML vs GENAI EVALUATION
=====================================
TRADITIONAL ML:
┌──────────────┐ ┌─────────────┐ ┌─────────────┐
│ Test Dataset ├────►│ Run Model ├────►│ Compute │
│ (labeled) │ │ (predict) │ │ Accuracy/F1 │
└──────────────┘ └─────────────┘ └─────────────┘
✓ Single correct answer per input
✓ Deterministic metrics
✓ Fully automated
GENERATIVE AI:
┌──────────────┐ ┌─────────────┐ ┌─────────────────────┐
│ Test Dataset ├────►│ Run Model ├────►│ Automated Metrics │
│ (golden set) │ │ (generate) │ │ + LLM-as-Judge │
└──────────────┘ └─────────────┘ │ + Human Evaluation │
└─────────────────────┘
✗ Multiple valid outputs for same input
✗ Quality is subjective (style, helpfulness, coherence)
✗ No single "ground truth" for open-ended generation
✗ Requires combination of evaluation approaches
```
**[EXAM TIP]** If the exam presents a scenario asking "how to evaluate a summarization model" and one option is "compute accuracy on a test set," that is a traditional ML approach and is WRONG for GenAI. The correct answer will involve ROUGE scores, human evaluation, or LLM-as-a-Judge.
### Task Dependency Map
```
DOMAIN 5 TASK DEPENDENCY MAP
===============================
Task 5.1: Evaluation Systems for GenAI ◄── HIGHEST YIELD
├── Automated metrics (ROUGE, BLEU, BERTScore, perplexity)
├── Bedrock Model Evaluation (automatic + human modes)
├── LLM-as-a-Judge (scoring, pairwise, biases)
├── RAG evaluation (faithfulness, relevance, precision, recall)
├── Agent evaluation (task completion, tool accuracy)
├── A/B testing and canary deployments
└── Golden datasets and CI/CD quality gates
│
v
Task 5.2: Troubleshooting GenAI Applications ◄── PRACTICAL
├── Systematic debugging hierarchy (7 steps)
├── Content issues (hallucination, format drift, truncation)
├── API errors (throttling, timeout, validation, access)
├── Retrieval issues (relevance, staleness, proper nouns)
├── Prompt debugging (inconsistency, token limits)
└── Monitoring stack (CloudWatch, X-Ray, CloudTrail, logging)
```
### Cross-Domain Links
| Domain 5 Topic | Also Tested In | Context |
|----------------|---------------|---------|
| Model comparison | D1 (FM selection) | Choose best model for task |
| Evaluation metrics | D2 (deployment validation) | Quality gates before deploy |
| Monitoring stack | D4 (operational monitoring) | CloudWatch, X-Ray, logging |
| Guardrails evaluation | D3 (safety controls) | Testing safety filters |
| RAG evaluation | D1 (retrieval mechanisms) | Retrieval quality testing |
| Golden datasets | D4 (troubleshooting) | Regression testing |
---
## SECTION 2: EVALUATION SYSTEMS FOR GENAI (Task 5.1)
*Lines: ~450 | Priority: HIGHEST — most testable topic in Domain 5*
### Evaluation Taxonomy
Three approaches, each with a clear use case:
```
EVALUATION APPROACHES
=======================
1. AUTOMATED METRICS
Speed: ████████████ Fast
Scale: ████████████ Unlimited
Quality: ████░░░░░░ Limited for open-ended tasks
Use for: CI/CD gates, regression testing, benchmarking
2. HUMAN EVALUATION
Speed: ██░░░░░░░░ Slow
Scale: ██░░░░░░░░ Limited
Quality: ████████████ Gold standard
Use for: Final quality validation, subjective assessment
3. LLM-AS-A-JUDGE
Speed: ████████░░ Moderate
Scale: ████████░░ Good
Quality: ████████░░ Bridges automated + human
Use for: Rapid iteration, scaling evaluation
```
**[EXAM TIP]** The exam tests WHEN to use each approach. Automated metrics for CI/CD and regression. Human evaluation for final quality validation and subjective judgment. LLM-as-a-Judge for rapid development iteration at scale. Know the trade-offs.
### Automated Evaluation Metrics
| Metric | Best For | Range | How It Works | Key Insight |
|--------|---------|-------|-------------|-------------|
| **ROUGE-1** | Summarization | 0-1 | Unigram overlap with reference | Measures content coverage |
| **ROUGE-2** | Summarization | 0-1 | Bigram overlap with reference | Captures phrase-level similarity |
| **ROUGE-L** | Summarization | 0-1 | Longest common subsequence | Standard summarization metric |
| **BLEU** | Translation | 0-1 | Clipped n-gram precision vs reference, with a brevity penalty | Precision-oriented: penalizes output n-grams that are absent from the reference |
| **BERTScore** | Semantic similarity | 0-1 | Contextual embedding similarity | Captures paraphrasing ROUGE misses |
| **Perplexity** | Language fluency | 1-∞ | How "surprised" the model is by text | **Lower = better** (the only metric in this table where lower wins) |
| **F1 / Exact Match** | Extractive QA | 0-1 | Token overlap / exact string match | For questions with definite answers |
**[EXAM TIP]** Memorize the metric-to-task mapping: ROUGE = summarization. BLEU = translation. BERTScore = semantic similarity (catches paraphrasing). Perplexity = fluency. F1 = extractive QA. The exam loves testing this.
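To make the "lower = better" point for perplexity concrete, here is a minimal sketch that computes it as the exponential of the average negative log-likelihood per token. The per-token probabilities are made-up illustration values, not output from any real model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_log_likelihood)

# Hypothetical per-token log-probabilities for two texts of four tokens each
confident = [math.log(0.5)] * 4   # model assigns p=0.5 to every token
uncertain = [math.log(0.1)] * 4   # model assigns p=0.1 to every token

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(ppl_confident < ppl_uncertain)  # the less "surprised" model has lower perplexity
```

A useful intuition: a perplexity of N means the model is, on average, as uncertain as if it were choosing uniformly among N tokens at each step.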
```
METRIC SELECTION DECISION TREE
================================
"What is the task?"
│
┌────┴────────────────────────────────┐
│ │
Summarization? Translation?
│ │
▼ ▼
ROUGE-L BLEU
(ROUGE-1 for content,
ROUGE-L for structure)
│
Semantic meaning preservation?
│
▼
BERTScore (captures "vehicle" ≈ "car")
│
Extractive QA with known answers?
│
▼
F1 + Exact Match
│
Open-ended generation quality?
│
▼
LLM-as-a-Judge + Human Evaluation
(automated metrics are INSUFFICIENT alone)
```
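For the extractive QA branch of the tree, the two metrics can be sketched in a few lines. This is a simplified illustration: it normalizes only by lowercasing and whitespace splitting, whereas benchmark scripts usually also strip punctuation and articles.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()

def exact_match(prediction, reference):
    """True only when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("30 days", "within 30 days"))  # partial answers fail exact match
print(token_f1("30 days", "within 30 days"))     # but still earn partial F1 credit
```

Reporting both is common because exact match is strict while F1 rewards partially correct answers.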
**[TRAP]** ROUGE is RECALL-oriented (how much of the reference appears in the output), although most implementations also report precision and F1. BLEU is PRECISION-oriented (how much of the output appears in the reference), with a brevity penalty against very short outputs. Students confuse these. For summarization, you want to know if key points were captured → ROUGE (recall). For translation, you want to know if the output is accurate → BLEU (precision).
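The recall-versus-precision distinction is easiest to see on a toy example. The sketch below computes unigram recall (the core of ROUGE-1 recall) and clipped unigram precision (the building block of BLEU, here without the brevity penalty or higher-order n-grams). It is an illustration of the idea, not a replacement for a full ROUGE or BLEU implementation.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (clipped overlap count, candidate length, reference length)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum((cand & ref).values()), sum(cand.values()), sum(ref.values())

def rouge1_recall(candidate, reference):
    """Share of reference unigrams that the candidate recovered."""
    overlap, _, ref_len = unigram_overlap(candidate, reference)
    return overlap / ref_len if ref_len else 0.0

def unigram_precision(candidate, reference):
    """Share of candidate unigrams that appear in the reference."""
    overlap, cand_len, _ = unigram_overlap(candidate, reference)
    return overlap / cand_len if cand_len else 0.0

reference = "the cat sat on the mat"
short_candidate = "the cat"

recall = rouge1_recall(short_candidate, reference)         # misses most of the reference
precision = unigram_precision(short_candidate, reference)  # but every word it has is right
print(recall, precision)
```

A very short output can score perfect precision while capturing little of the reference, which is why summarization leans on recall and why BLEU adds a brevity penalty.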
### Amazon Bedrock Model Evaluation
Bedrock Model Evaluation is the **managed AWS service** for comparing foundation models. Know both modes:
```
BEDROCK MODEL EVALUATION
===========================
┌─────────────────────────────────────────────────────────┐
│ AUTOMATIC EVALUATION │
├─────────────────────────────────────────────────────────┤
│ │
│ SELECT MODELS ──► CHOOSE TASK TYPE ──► PROVIDE DATASET │
│ (1 per job) (summarization, (JSONL in S3 │
│ Q&A, generation, or built-in) │
│ classification) │
│ │ │
│ ▼ │
│ AUTOMATIC SCORING │
│ Built-in metrics: Accuracy, Robustness, Toxicity │
│ Custom metrics: Your own evaluation criteria │
│ Side-by-side comparison of model outputs │
│ │
│ USE FOR: "Which model is better for my task?" │
│ OUTPUT: Comparative scores + individual model metrics │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ HUMAN EVALUATION │
├─────────────────────────────────────────────────────────┤
│ │
│ DEFINE CRITERIA ──► SELECT WORKFORCE ──► COLLECT RATINGS│
│ (relevance, (AWS-managed (1-5 scale, │
│ accuracy, or custom team) ranking, │
│ helpfulness) thumbs up/down) │
│ │
│ USE FOR: Subjective quality, nuanced tasks │
│ OUTPUT: Aggregated human quality scores │
└─────────────────────────────────────────────────────────┘
```
#### Bedrock Model Evaluation — Code Pattern
```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Create automatic evaluation jobs: one per model, same dataset, then compare results.
# Per the CreateEvaluationJob API reference, automated jobs take a single model per job;
# jobs that use human workers can compare two models in one job.
MODEL_IDS = {
    "sonnet": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "haiku": "anthropic.claude-3-5-haiku-20241022-v1:0",
}

job_arns = {}
for label, model_id in MODEL_IDS.items():
    response = bedrock.create_evaluation_job(
        jobName=f"{label}-summarization-eval",
        roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "Summarization",
                    "dataset": {
                        "name": "summarization-golden-set",
                        "datasetLocation": {
                            "s3Uri": "s3://my-eval-bucket/datasets/summarization.jsonl"
                        }
                    },
                    # Built-in metrics are referenced with the "Builtin." prefix
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"]
                }]
            }
        },
        inferenceConfig={
            "models": [{
                "bedrockModel": {
                    "modelIdentifier": model_id,
                    # Parameter key names are model-specific; verify them for your model
                    "inferenceParams": '{"maxTokens": 512, "temperature": 0.0}'
                }
            }]
        },
        outputDataConfig={"s3Uri": f"s3://my-eval-bucket/results/{label}/"}
    )
    job_arns[label] = response["jobArn"]

# Check job status
for label, arn in job_arns.items():
    status = bedrock.get_evaluation_job(jobIdentifier=arn)
    print(f"{label}: {status['status']}")
```
**[EXAM TIP]** "Compare two foundation models" or "which model is best for my use case" → Bedrock Model Evaluation (automatic mode). "Get subjective quality feedback from domain experts" → Bedrock Model Evaluation (human mode). This is the AWS-managed answer for model comparison.
### FMEval Library
```
FMEVAL (Foundation Model Evaluation Library)
==============================================
WHAT: Open-source Python library from AWS for FM evaluation.
INSTALL: pip install fmeval
KEY FEATURES:
- Evaluate against standard benchmarks
- Supports Bedrock, SageMaker, and custom endpoints
- Built-in algorithms: factual_knowledge, qa_accuracy,
summarization_accuracy, toxicity, stereotyping
- Integrates with SageMaker for managed evaluation jobs
- Runs standalone (notebook, Lambda, CI/CD pipeline)
WHEN TO USE:
✓ Programmatic evaluation in CI/CD pipelines
✓ Custom evaluation logic beyond Bedrock Model Eval
✓ Evaluate models not on Bedrock (SageMaker, self-hosted)
BEDROCK MODEL EVAL vs FMEVAL:
Bedrock Model Eval = managed service, console or API, model comparison
FMEval = open-source library, programmatic, CI/CD integration
→ They complement each other
```
```python
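from fmeval.data_loaders.data_config import DataConfig
from fmeval.eval import get_eval_algorithm
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Set up model runner for Bedrock.
# $prompt is left unquoted on the assumption that fmeval JSON-encodes the prompt
# when it fills the template; the Anthropic Messages API body also needs
# anthropic_version and max_tokens.
model_runner = BedrockModelRunner(
    model_id="anthropic.claude-3-5-sonnet-20241022-v2:0",
    content_template=(
        '{"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, '
        '"messages": [{"role": "user", "content": $prompt}]}'
    ),
    output="content[0].text",
)

# Run QA accuracy evaluation
eval_algo = get_eval_algorithm("qa_accuracy")
eval_outputs = eval_algo.evaluate(
    model=model_runner,
    dataset_config=DataConfig(
        dataset_name="my_qa_dataset",
        dataset_uri="s3://my-bucket/eval/qa_dataset.jsonl",
        dataset_mime_type="application/jsonlines",
        model_input_location="question",
        target_output_location="answer",
    ),
)

# evaluate() returns one result per dataset; aggregate scores live in dataset_scores
for eval_output in eval_outputs:
    for eval_score in eval_output.dataset_scores:
        print(f"Metric: {eval_score.name}, Score: {eval_score.value}")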
```
**[EXAM TIP]** FMEval is the answer for "evaluate models programmatically in a CI/CD pipeline" or "run automated evaluation as part of deployment." Bedrock Model Evaluation is the answer for "compare models using the AWS console or managed service."
### LLM-as-a-Judge
LLM-as-a-Judge uses a powerful foundation model to evaluate another model's outputs. It bridges the gap between fast-but-shallow automated metrics and accurate-but-slow human evaluation.
```
LLM-AS-A-JUDGE: TWO APPROACHES
=================================
1. SCORING (Rate a single output)
┌───────────────┐ ┌───────────────┐ ┌──────────┐
│ Question + ├────►│ Judge Model ├────►│ Score: │
│ Model Answer │ │ (strong FM) │ │ 1-5 per │
│ + Rubric │ │ │ │ criterion│
└───────────────┘ └───────────────┘ └──────────┘
2. PAIRWISE COMPARISON (Which is better?)
┌───────────────┐ ┌───────────────┐ ┌──────────┐
│ Question + ├────►│ Judge Model ├────►│ Winner: │
│ Answer A + │ │ (strong FM) │ │ A or B │
│ Answer B │ │ │ │ │
└───────────────┘ └───────────────┘ └──────────┘
```
#### Implementation Pattern
```python
import boto3, json
bedrock = boto3.client("bedrock-runtime")
judge_prompt = """You are an expert evaluator. Rate the following
answer on a scale of 1-5 for each criterion.
QUESTION: {question}
REFERENCE ANSWER: {reference}
MODEL ANSWER: {model_answer}
CRITERIA:
1. Accuracy (1-5): Is the answer factually correct?
2. Completeness (1-5): Does the answer fully address the question?
3. Relevance (1-5): Is the answer on-topic and not tangential?
Respond ONLY in JSON: {{"accuracy": N, "completeness": N, "relevance": N, "reasoning": "..."}}
"""
response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{
        "role": "user",
        "content": [{"text": judge_prompt.format(
            question="What is Amazon Bedrock?",
            reference="Amazon Bedrock is a fully managed service...",
            model_answer="Bedrock is an AWS service for foundation models..."
        )}]
    }],
    # Temperature 0 minimizes sampling randomness so scores are more repeatable
    inferenceConfig={"temperature": 0.0}
)
# The judge is instructed to return JSON only; json.loads raises if it adds extra text
scores = json.loads(response["output"]["message"]["content"][0]["text"])
print(f"Accuracy: {scores['accuracy']}, Completeness: {scores['completeness']}")
```
#### Known Biases in LLM-as-a-Judge
| Bias | Description | Mitigation |
|------|-------------|------------|
| **Self-preference** | Model rates its own outputs higher | Use a different model family as judge |
| **Order bias** | In pairwise comparison, position affects choice | Run comparisons in BOTH orders (A vs B, B vs A) |
| **Verbosity bias** | Longer answers rated higher regardless of quality | Explicit rubric penalizing unnecessary length |
| **Cost** (a limitation rather than a bias) | Using expensive model as judge adds inference cost | Sample evaluation (judge a subset, not all) |
**[EXAM TIP]** "Evaluate model outputs at scale without human reviewers" → LLM-as-a-Judge. "Bias in LLM-as-a-Judge" → self-preference bias (most tested), order bias. Know both the TECHNIQUE and its LIMITATIONS. The exam tests whether you can identify these biases by description.
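The order-bias mitigation from the table (run each comparison in both orders) can be sketched as follows. The `judge` callable is hypothetical: it stands in for a call to a judge model and is assumed to return `"first"` or `"second"`. The two stub judges exist only to exercise the logic.

```python
def debiased_pairwise(judge, question, answer_a, answer_b):
    """Ask the judge twice with positions swapped; keep only consistent verdicts."""
    verdict_ab = judge(question, answer_a, answer_b)  # A shown first
    verdict_ba = judge(question, answer_b, answer_a)  # B shown first
    a_wins_ab = verdict_ab == "first"
    a_wins_ba = verdict_ba == "second"
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"  # the verdict flipped with position, so treat it as inconclusive

# Stub judges for illustration only (a real judge would call a model)
def position_biased_judge(question, first, second):
    return "first"  # always prefers whatever is shown first

def longer_answer_judge(question, first, second):
    # Position-independent stub: picks the longer answer regardless of order
    return "first" if len(first) >= len(second) else "second"

print(debiased_pairwise(position_biased_judge, "q", "x", "y"))
print(debiased_pairwise(longer_answer_judge, "q", "long answer", "short"))
```

Tracking the share of "tie" outcomes is itself informative: a high rate suggests the judge's verdicts depend heavily on position.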
### RAG Evaluation
RAG evaluation is unique because you must evaluate BOTH the retrieval step AND the generation step:
```
RAG EVALUATION FRAMEWORK
===========================
┌──────────────┐
Query ─────►│ RETRIEVAL │──── Evaluate retrieval quality
│ (vector DB) │ Precision@K, Recall@K, MRR, NDCG
└──────┬───────┘
│
┌──────▼───────┐
│ GENERATION │──── Evaluate generation quality
│ (LLM) │ Faithfulness, Answer Relevance
└──────┬───────┘
│
┌──────▼───────┐
│ END-TO-END │──── Evaluate complete pipeline
│ (citations) │ Citation accuracy, Completeness
└──────────────┘
```
#### Retrieval Metrics
| Metric | What It Measures | Interpretation |
|--------|-----------------|----------------|
| **Precision@K** | Relevant docs in top K / K | Higher = less noise in results |
| **Recall@K** | Relevant docs in top K / total relevant | Higher = fewer missed relevant docs |
| **MRR** (Mean Reciprocal Rank) | Average across queries of 1 / rank of the first relevant result | Higher = relevant docs ranked higher |
| **NDCG** (Normalized Discounted Cumulative Gain) | Ranking quality with position weighting | Higher = better overall ranking |
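The four retrieval metrics in the table can be sketched with binary relevance labels (a document is either relevant or not). The document IDs below are invented for illustration, and graded-relevance NDCG would need per-document gain values instead of a set.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Relevant docs in the top k, divided by k (assumes k > 0)."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    """Relevant docs in the top k, divided by all relevant docs."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """`runs` is a list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked results for one query
relevant = {"d1", "d2", "d9"}          # ground-truth relevant docs

print(precision_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 4))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 4))
```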
#### Generation Metrics (RAGAS Framework)
```
RAGAS EVALUATION DIMENSIONS
==============================
1. FAITHFULNESS (Grounding)
Question: Is the answer supported by the retrieved context?
Detects: Hallucination beyond retrieved documents
AWS tool: Bedrock Guardrails contextual grounding check
Score: 0-1 (1 = fully grounded)
2. ANSWER RELEVANCE
Question: Does the answer address the user's question?
Detects: Off-topic or tangential responses
Score: 0-1 (1 = fully relevant)
3. CONTEXT PRECISION
Question: Are the retrieved chunks actually relevant?
Detects: Noisy retrieval (irrelevant chunks diluting context)
Score: 0-1 (1 = all retrieved chunks are relevant)
4. CONTEXT RECALL
Question: Were ALL relevant chunks retrieved?
Detects: Missed information (retrieval gaps)
Score: 0-1 (1 = all relevant information retrieved)
```
**[EXAM TIP]** "RAG application generating answers not in the source documents" → low faithfulness score → enable Guardrails contextual grounding check. "RAG returning irrelevant documents" → low context precision → improve chunking, add metadata filters, enable re-ranking. Evaluate RAG end-to-end: retrieval + generation + citations.
#### RAG Evaluation Code Pattern
```python
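import boto3

# Runtime client for Knowledge Base queries (RetrieveAndGenerate)
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Evaluate RAG quality with a golden dataset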
golden_qa_pairs = [
{
"question": "What is the refund policy?",
"expected_answer": "30-day full refund for unused products...",
"expected_source": "refund-policy.pdf"
},
{
"question": "How to contact support?",
"expected_answer": "Email [email protected] or call 1-800...",
"expected_source": "contact-info.pdf"
}
]
results = []
for pair in golden_qa_pairs:
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": pair["question"]},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB123",
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
            }
        }
    )
    generated = response["output"]["text"]
    citations = response.get("citations", [])
    results.append({
        "question": pair["question"],
        "generated_answer": generated,
        "has_citations": len(citations) > 0,
        "citation_sources": [c["retrievedReferences"][0]["location"]
                             for c in citations if c.get("retrievedReferences")]
    })
# Compare with expected_answer using ROUGE or LLM-as-Judge
```
### Agent Evaluation
```
AGENT EVALUATION METRICS
===========================
1. TASK COMPLETION RATE
Did the agent successfully complete the task?
Binary: success/failure per test case
Target: > 90% for production readiness
2. TOOL SELECTION ACCURACY
Did the agent choose the CORRECT tool/action group?
Measures understanding of available tools
Test with: golden test cases with expected tool calls
3. STEP EFFICIENCY
How many steps did the agent take?
Fewer steps = better (less cost, less latency)
Compare: actual steps vs optimal steps
4. REASONING TRACE QUALITY
Is the agent's reasoning logical and coherent?
Review via: enableTrace=True in InvokeAgent
Assessment: LLM-as-Judge on the trace output
5. ERROR RECOVERY
Does the agent handle tool failures gracefully?
Does it retry or choose alternative paths?
Test with: intentionally failing action groups
```
```
AGENT EVALUATION PIPELINE
============================
Step 1: Define golden test cases
┌─────────────────────────────────────┐
│ Input: "What are our Q3 sales?" │
│ Expected tools: [QuerySalesDB] │
│ Expected answer: contains "$X.XM" │
└─────────────────────────────────────┘
Step 2: Run agent with enableTrace=True
→ Capture full reasoning trace
→ Record tools invoked + parameters passed
→ Record step count and total latency
Step 3: Compare actual vs expected
→ Correct tools selected? (tool selection accuracy)
→ Correct parameters passed? (parameter accuracy)
→ Final answer matches expected? (task completion)
Step 4: Measure efficiency
→ Steps taken vs optimal steps
→ Total latency and token consumption
→ Cost per completed task
```
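Steps 3 and 4 of the pipeline reduce to comparing recorded runs against golden expectations. The sketch below assumes the traces have already been parsed into plain dictionaries. The field names (`expected_tools`, `tools_called`, `optimal_steps`, and so on) are hypothetical and would need to be mapped from your actual trace format.

```python
def evaluate_agent_runs(test_cases, runs):
    """Compare recorded agent runs against golden expectations.

    Each test case: {"id", "expected_tools", "optimal_steps"}.
    Each run:       {"id", "tools_called", "steps", "succeeded"}.
    """
    runs_by_id = {run["id"]: run for run in runs}
    tool_hits = completed = 0
    step_ratios = []
    for case in test_cases:
        run = runs_by_id[case["id"]]
        if run["tools_called"] == case["expected_tools"]:
            tool_hits += 1          # exact tool sequence matched
        if run["succeeded"]:
            completed += 1
        step_ratios.append(run["steps"] / case["optimal_steps"])
    n = len(test_cases)
    return {
        "tool_selection_accuracy": tool_hits / n,
        "task_completion_rate": completed / n,
        "avg_step_overhead": sum(step_ratios) / n,  # 1.0 means optimal
    }

# Invented golden cases and recorded runs, for illustration
golden_cases = [
    {"id": "q3-sales", "expected_tools": ["QuerySalesDB"], "optimal_steps": 2},
    {"id": "refund", "expected_tools": ["LookupOrder", "IssueRefund"], "optimal_steps": 3},
]
recorded_runs = [
    {"id": "q3-sales", "tools_called": ["QuerySalesDB"], "steps": 2, "succeeded": True},
    {"id": "refund", "tools_called": ["LookupOrder"], "steps": 6, "succeeded": False},
]

report = evaluate_agent_runs(golden_cases, recorded_runs)
print(report)
```

Matching the exact tool sequence is the strictest choice. If call order does not matter for your agent, comparing sets of tool names would be a reasonable relaxation.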
**[PRODUCTION INSIGHT]** In practice, the most common agent failure mode is tool selection errors — the agent picks the wrong action group or passes incorrect parameters. I always start agent evaluation with tool selection accuracy testing before worrying about answer quality. If the agent is calling the wrong tools, no amount of prompt engineering will improve the final answer.
### User-Centered Evaluation
```
USER FEEDBACK PATTERNS
========================
1. IMPLICIT FEEDBACK (inferred from user behavior)
- Regeneration requests (user asks same question again)
- Session abandonment (user leaves without achieving goal)
2. EXPLICIT FEEDBACK (user deliberately rates the response)
- Thumbs up / thumbs down on responses
- "Was this helpful?" Y/N
- Star rating (1-5 scale)
- Free-text feedback
- Report issues ("This is wrong," "Inappropriate content")
- Domain expert annotations
3. A/B TESTING PATTERNS
┌─────────────────────────────────────────┐
│ PROMPT A/B TEST: │
│ 50/50 traffic split → compare quality │
│ API Gateway weighted routing │
│ │
│ MODEL A/B TEST: │
│ Route between Sonnet vs Haiku │
│ Measure: quality, latency, cost │
│ │
│ CANARY DEPLOYMENT: │
│ 5% → 25% → 50% → 100% │
│ Monitor for quality regression │
│ Roll back if metrics degrade │
└─────────────────────────────────────────┘
```
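The diagram routes traffic at the infrastructure layer. As an application-level alternative (a swapped-in technique, not what the diagram shows), a traffic split can be sketched with deterministic hashing so that each user consistently sees the same variant. The function name and user ID format are assumptions for illustration.

```python
import hashlib

def assign_variant(user_id, treatment_percent):
    """Deterministically bucket a user into 'treatment' or 'control'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in [0, 100)
    return "treatment" if bucket < treatment_percent else "control"

# Canary rollout: widen exposure in stages; users already in treatment stay there
CANARY_STAGES = [5, 25, 50, 100]
for percent in CANARY_STAGES:
    exposed = sum(assign_variant(f"user-{i}", percent) == "treatment"
                  for i in range(1000))
    print(f"{percent}% stage: {exposed} of 1000 simulated users in treatment")
```

Because a user's bucket never changes, raising the percentage only adds users to the treatment group, which keeps before-and-after quality comparisons clean during a rollout or rollback.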
### Quality Assurance Processes
#### Golden Dataset Management
```
GOLDEN DATASET BEST PRACTICES
================================
CREATION:
- Start with 50-100 curated question-answer pairs
- Cover all major use cases AND edge cases
- Include adversarial/tricky inputs
- Version in S3 with lifecycle tags
MAINTENANCE:
- Review and update quarterly
- Add new cases from production failures
- Remove outdated questions
- Track dataset version alongside model/prompt versions
REGRESSION TESTING:
- Run golden dataset after EVERY prompt change
- Run after EVERY knowledge base update
- Run after guardrail configuration changes
- Fail CI/CD if quality drops below threshold
```
#### CI/CD Quality Gates for GenAI
```
CI/CD QUALITY GATES
=====================
┌─────────────────────────────────────────────┐
│ Gate 1: PROMPT LINT │
│ Check syntax, variable names, structure │
│ → BLOCK deployment if fails │
├─────────────────────────────────────────────┤
│ Gate 2: UNIT TESTS │
│ Mocked model responses, logic validation │
│ → BLOCK deployment if fails │
├─────────────────────────────────────────────┤
│ Gate 3: GOLDEN DATASET EVALUATION │
│ Run against golden Q&A pairs │
│ Quality score must exceed threshold │
│ → BLOCK deployment if below threshold │
├─────────────────────────────────────────────┤
│ Gate 4: GUARDRAIL TESTS │
│ Adversarial inputs, safety scenarios │
│ Verify guardrails block as expected │
│ → BLOCK deployment if fails │
├─────────────────────────────────────────────┤
│ Gate 5: LATENCY BENCHMARK │
│ P99 must be under SLA limit │
│ → WARN if exceeds (may not block) │
├─────────────────────────────────────────────┤
│ Gate 6: COST ESTIMATION │
│ Per-request cost under budget │
│ → WARN if exceeds (may not block) │
└─────────────────────────────────────────────┘
```
**[EXAM TIP]** "Ensure quality doesn't regress after a prompt change" → golden dataset regression testing in CI/CD pipeline. "Automate quality checks before deployment" → CI/CD quality gates. The exam specifically tests whether you know to gate deployments on quality, not just functional tests.
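A blocking gate such as Gate 3 comes down to computing a score over the golden dataset and returning a non-zero exit code when it falls below the threshold. The sketch below assumes the per-example scores (from ROUGE, an LLM judge, or similar) have already been computed. The 0.8 threshold is an arbitrary placeholder.

```python
def run_quality_gate(scores, threshold):
    """Return (passed, mean_score) for a list of per-example quality scores."""
    if not scores:
        return False, 0.0   # no evidence of quality: fail closed
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score

def gate_exit_code(scores, threshold=0.8):
    """Exit code for a CI step: 0 lets the pipeline continue, 1 blocks it."""
    passed, mean_score = run_quality_gate(scores, threshold)
    print(f"golden-set mean score: {mean_score:.3f} (threshold {threshold})")
    return 0 if passed else 1

exit_code = gate_exit_code([0.92, 0.85, 0.78])
```

In a pipeline step, the script would end with `sys.exit(exit_code)` so the CI system treats a failed gate as a failed build. Warning-only gates (latency, cost) would log the result and return 0 regardless.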
#### Semantic Drift Detection
```
SEMANTIC DRIFT DETECTION
===========================
WHAT: Model outputs gradually change in meaning/style over time,
even though the prompt hasn't changed.
CAUSES:
- Model version updates (provider-side changes)
- Data distribution shifts (different user queries)
- Knowledge base content changes
DETECTION:
- Schedule golden dataset runs (daily/weekly)
- Compare output embeddings over time
- Track quality metrics trends in CloudWatch
- Alert when cosine similarity drops below threshold
RESPONSE:
- Investigate root cause (model update? data change?)
- Update golden dataset if business requirements changed
- Re-tune prompts to compensate for model changes
- Consider prompt versioning with Bedrock Prompt Management
```
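The embedding-comparison step under DETECTION can be sketched as a cosine-similarity check between a stored baseline and the current run. The two-dimensional vectors and the 0.85 threshold are toy values. In practice the embeddings would come from an embedding model and the threshold would be tuned on historical runs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def detect_drift(baseline_embeddings, current_embeddings, threshold=0.85):
    """Return (item_id, similarity) for golden-set items that moved below threshold.

    Both arguments map a golden-set item id to the embedding of the model's output.
    """
    drifted = []
    for item_id, baseline in baseline_embeddings.items():
        similarity = cosine_similarity(baseline, current_embeddings[item_id])
        if similarity < threshold:
            drifted.append((item_id, similarity))
    return drifted

# Toy embeddings: q1 barely moved, q2 changed direction completely
baseline = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
current = {"q1": [1.0, 0.1], "q2": [1.0, 0.0]}

drifted_items = detect_drift(baseline, current)
print(drifted_items)
```

Publishing the count of drifted items as a custom metric is one way to connect this check to the CloudWatch trend tracking and alerting listed above.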
### Bedrock Prompt Management
```
BEDROCK PROMPT MANAGEMENT
============================
PURPOSE: Version control for prompts — like Git for your GenAI prompts.
FEATURES:
- Create and store prompt templates
- Version management (v1, v2, v3...)
- Test prompts in Bedrock playground
- Rollback to previous versions
- Share across team members
WORKFLOW:
1. Create prompt template with variables
2. Test in playground
3. Publish as new version
4. Reference version in application code
5. Roll back if quality degrades
USE FOR: "How to manage prompt changes across environments"
"Version control for prompt templates"
"Roll back a prompt change that degraded quality"
```
### Practice Questions — Section 2: Evaluation Systems
**Q1.** A company wants to evaluate whether their summarization model captures the key points of source documents. They have reference summaries written by domain experts. Which metric is MOST appropriate?
A) BLEU
B) ROUGE-L
C) Perplexity
D) F1 Score
**Answer: B**
ROUGE-L measures the longest common subsequence between the generated summary and reference summary, making it the standard metric for summarization evaluation. It captures content overlap while allowing for some flexibility in phrasing. BLEU (A) is designed for translation tasks and measures precision. Perplexity (C) measures language fluency, not content quality. F1 (D) is for extractive QA where exact answers exist.
---
**Q2.** A team needs to detect when their RAG application generates answers that include information NOT present in the retrieved context. They want automated checking at scale in their CI/CD pipeline. Which approach is MOST effective?
A) Compute ROUGE scores against reference answers
B) Use Bedrock Guardrails contextual grounding check
C) Have human reviewers rate each answer
D) Measure retrieval latency
**Answer: B**
Contextual grounding checks validate whether the generated response is supported by the retrieved context — this is exactly the faithfulness dimension. It runs automatically and can be integrated into CI/CD pipelines. ROUGE (A) checks overlap with a reference answer but doesn't verify grounding against retrieved context. Human review (C) is accurate but doesn't scale for CI/CD. Latency (D) measures performance, not content quality.
---
**Q3.** A company uses LLM-as-a-Judge to evaluate their chatbot. They notice the judge model consistently rates its own model family's outputs higher than outputs from other models. What is this issue and how should they mitigate it?
A) Order bias — run comparisons in both orders
B) Self-preference bias — use a different model family as judge
C) Verbosity bias — penalize long answers in the rubric
D) Confirmation bias — increase the temperature of the judge
**Answer: B**
Self-preference bias occurs when a model rates its own model family's outputs higher. The mitigation is to use a different model family as judge (e.g., use Claude to judge Llama outputs, not Claude judging Claude). Order bias (A) relates to position in pairwise comparison, not model preference. Verbosity bias (C) relates to length preference. Confirmation bias (D) is not a standard LLM evaluation bias term.
---
**Q4.** A team has deployed a Bedrock Agent that sometimes calls the wrong action group for user requests. They want to systematically identify and fix these routing errors. What should they implement FIRST?
A) Fine-tune the agent's foundation model
B) Enable trace logging and create a golden test dataset with expected tool calls
C) Switch to a more powerful model
D) Add more action groups with overlapping functionality
**Answer: B**
Enable trace logging (`enableTrace=True`) to see the agent's reasoning and tool selection decisions, then create a golden test dataset with known inputs and expected tool selections. This identifies exactly which queries cause routing failures. Fine-tuning (A) is premature without diagnosis — you must understand the problem before attempting to fix it. Switching models (C) may not fix routing logic issues. Adding overlapping action groups (D) would make routing ambiguity worse.
---
**Q5.** A team wants to compare Claude 3.5 Sonnet and Claude 3 Haiku for their customer support chatbot. They need both automated metric scores and subjective quality ratings from domain experts. Which AWS service supports both?
A) SageMaker Model Monitor
B) Amazon Bedrock Model Evaluation
C) FMEval library
D) CloudWatch custom metrics
**Answer: B**
Bedrock Model Evaluation supports both automatic evaluation (automated metric scores, side-by-side comparison) and human evaluation (domain experts rate outputs on custom criteria). It's the managed AWS service designed specifically for model comparison. Model Monitor (A) detects drift in deployed models, not comparison. FMEval (C) supports programmatic automated evaluation but not managed human evaluation workflows. CloudWatch (D) monitors operational metrics, not model quality.
---
## SECTION 3: TROUBLESHOOTING GENAI APPLICATIONS (Task 5.2)
*Lines: ~350 | Priority: HIGH — practical troubleshooting tested as scenarios*
### Systematic Debugging Hierarchy
When a GenAI application fails or degrades, follow this 7-step hierarchy:
```
GENAI TROUBLESHOOTING — 7-STEP HIERARCHY
==========================================
Step 1: CHECK CLOUDWATCH METRICS
→ Error rates, latency spikes, throttling counts
→ Is this an infrastructure problem or a content problem?
Step 2: VALIDATE INPUTS
→ Check request format, token counts, model parameters
→ Is the input valid and within model limits?
Step 3: TRACE WITH X-RAY
→ End-to-end request path, subsegment latencies
→ WHERE in the pipeline is the issue?
Step 4: CHECK AGENT TRACES (if using agents)
→ enableTrace=True: reasoning, tool selection, parameters
→ Is the agent choosing the right tools?
Step 5: ISOLATE WITH PLAYGROUND
→ Test the exact prompt in Bedrock playground
→ Remove variables: is it the prompt or the system?
Step 6: CHECK VECTOR STORE HEALTH (if RAG)
→ Sync status, index health, retrieval relevance
→ Is the retrieval returning correct context?
Step 7: REVIEW GUARDRAILS LOGS
→ Is content being blocked unexpectedly?
→ Are filters too aggressive or too permissive?
```
**[EXAM TIP]** The exam presents troubleshooting scenarios and asks "what should you check FIRST?" The answer follows this hierarchy — start with CloudWatch (infrastructure), then validate inputs, then trace the request path. Don't jump to prompt changes before checking infrastructure.
### Content Issues
#### Hallucination
```
SYMPTOM: Model generates information not in source documents
DIAGNOSIS:
→ Is RAG enabled? If not, enable it.
→ Is contextual grounding enabled? If not, add it.
→ What is the temperature setting?
SOLUTIONS (ordered by impact):
1. Enable RAG with Bedrock Knowledge Bases
(ground responses in actual documents)
2. Enable Guardrails contextual grounding check
(filter responses not supported by context)
3. Lower temperature to 0.0-0.2
(reduce creative/inventive responses)
4. Add explicit instruction in system prompt:
"Only answer based on the provided context.
If the context doesn't contain the answer, say 'I don't know.'"
5. Improve retrieval quality (better chunks, re-ranking)
(give the model better context to ground on)
```
**[TRAP]** The exam may describe hallucination and offer "fine-tune the model" as an option. Fine-tuning does NOT fix hallucination — it teaches the model style/task patterns, not factual knowledge. RAG + contextual grounding is the correct approach for grounding in facts.
#### Context Window Overflow
```
SYMPTOM: Model returns errors or truncated/degraded output
with long inputs
DIAGNOSIS:
→ Check input token count vs model's context window limit
→ Is the RAG context too large (too many chunks)?
→ Is conversation history accumulating?
SOLUTIONS:
1. DYNAMIC CHUNKING: Retrieve fewer, more relevant chunks
(top-3 instead of top-10)
2. SUMMARIZATION: Summarize long documents before injection
(hierarchical summarization for very long docs)
3. SLIDING WINDOW: For conversations, keep only last N turns
(summarize older turns)
4. METADATA FILTERS: Pre-filter to reduce retrieval set
(filter by date, department, document type)
```
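The sliding-window solution can be sketched as trimming history to a token budget rather than a fixed turn count. The token estimate here (about four characters per token) is a rough assumption for illustration. A real implementation would use the model's tokenizer or the token counts returned by the API.

```python
def estimate_tokens(text):
    """Crude token estimate (~4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)

def trim_history(turns, max_tokens):
    """Keep the most recent turns whose combined estimate fits the budget."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(turn["text"])
        if used + cost > max_tokens:
            break                         # older turns would overflow the budget
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = [{"role": "user", "text": "x" * 40} for _ in range(5)]  # ~10 tokens each
recent = trim_history(history, max_tokens=25)
print(len(recent))
```

Turns that fall outside the window are the candidates for summarization: replacing them with one short summary turn preserves context at a fraction of the token cost.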
#### Output Format Issues
```
SYMPTOM: Model output format is inconsistent or wrong
SOLUTIONS:
1. Set temperature=0 for more consistent output (reduces, but does not fully eliminate, run-to-run variation)
2. Add explicit output format in system prompt
("Respond ONLY in valid JSON with this schema: {...}")
3. Provide few-shot examples showing exact format
4. Add stopSequences to terminate at the right point
5. Use structured output parsing (e.g., JSON schema validation)
6. Consider fine-tuning for consistent format adherence
```
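Solution 5 (structured output parsing) might look like the sketch below. It uses only the standard library; the `REQUIRED_FIELDS` schema and the function name `parse_model_json` are hypothetical, chosen purely for illustration. A dedicated JSON Schema validator would be a reasonable substitute for the manual checks.

```python
import json

# Hypothetical schema for this sketch: field name -> expected Python type.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}


def parse_model_json(raw_text, required_fields=REQUIRED_FIELDS):
    """Parse model output as JSON and check required fields.

    Returns the parsed dict, or raises ValueError describing the format problem.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Output must be a JSON object")
    for name, expected_type in required_fields.items():
        if name not in data:
            raise ValueError(f"Missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"Field {name} should be {expected_type.__name__}")
    return data
```

A caller can catch the `ValueError` and retry the request, optionally appending the error message to the prompt so the model can correct its format.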
### API Errors and Resolutions
| Error | Cause | Resolution |
|-------|-------|-----------|
| **ThrottlingException** | Exceeded API rate limits | Exponential backoff with jitter; request quota increase; consider Provisioned Throughput; Cross-Region Inference |
| **ModelTimeoutException** | Response took too long | Reduce input tokens; use ConverseStream for long responses; increase Lambda timeout (Bedrock can take 30s+) |
| **ValidationException** | Invalid request format | Check request body matches model requirements; verify modelId correct and enabled; check maxTokens doesn't exceed limit |
| **AccessDeniedException** | Insufficient permissions | Verify IAM policy allows `bedrock:InvokeModel`; check model access enabled in console; verify region; check ARN format |
| **ResourceNotFoundException** | Model/resource not found | Verify modelId exists; check model available in region; verify Knowledge Base ID correct |
| **ServiceUnavailableException** | Bedrock service issue | Retry with backoff; check AWS Health Dashboard; failover via Cross-Region Inference profile |
**[EXAM TIP]** ThrottlingException is the most tested API error. The correct response is ALWAYS exponential backoff with jitter. If the scenario describes "immediate retry on throttling" as the current behavior, the answer is to implement exponential backoff. Do NOT select "increase Lambda memory" or "switch to a smaller model" — these don't address the retry strategy.
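To make the retry strategy concrete, here is a minimal sketch of exponential backoff with full jitter. `ThrottledError` is a stand-in exception defined only for this example, and the default delays are arbitrary assumptions. The `sleep` and `rng` parameters are injectable so the logic can be tested without waiting.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error (hypothetical, defined for this sketch)."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random fraction of the ceiling.
            sleep(rng() * cap)
```

With boto3, the equivalent check would catch `botocore.exceptions.ClientError` and inspect `err.response["Error"]["Code"]` for `ThrottlingException`. The SDK also ships built-in retry modes that can be set through `botocore.config.Config(retries={"max_attempts": ..., "mode": "adaptive"})`, which may remove the need for hand-written retry loops.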
### Retrieval Issues (RAG Troubleshooting)
```
RAG RETRIEVAL TROUBLESHOOTING DECISION TREE
=============================================
"RAG results are poor"
│
┌────┴───────────────────────────┐
│ │
Low relevance? Missing results?
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ - Tune chunking │ │ - Check sync │
│ strategy │ │ completed? │
│ - Enable hybrid │ │ - Verify source │
│ search │ │ permissions │
│ - Add metadata │ │ - Supported doc │
│ filters │ │ format? │
│ - Enable │ │ - Trigger manual │
│ re-ranking │ │ sync │
└──────────────────┘ └──────────────────┘
│ │
Proper nouns Stale results?
not found? │
│ ▼
▼ ┌──────────────────┐
┌──────────────────┐ │ - Enable │
│ Enable HYBRID │ │ incremental │
│ search (keyword │ │ sync │
│ + semantic) │ │ - Set up sync │
│ │ │ schedule │
│ Keyword handles │ │ - Monitor with │
│ proper nouns │ │ CloudWatch │
│ better than │ └──────────────────┘
│ vector search │
└──────────────────┘
```
**[EXAM TIP]** "Proper nouns not found in RAG search" → enable hybrid search. Semantic (vector) search matches meaning but struggles with specific names, product codes, and identifiers. Keyword search handles these precisely. Combine both for comprehensive retrieval.
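In the Knowledge Bases Retrieve API, the search type and metadata filter are set inside `retrievalConfiguration`. The sketch below builds that argument as a plain dict; the helper name `build_retrieval_config` and its defaults are assumptions, and whether hybrid search is available depends on the vector store behind the knowledge base.

```python
def build_retrieval_config(top_k=5, hybrid=True, metadata_filter=None):
    """Build a retrievalConfiguration dict for a Knowledge Bases Retrieve call.

    The function only assembles a dict; it makes no network call.
    """
    vector_config = {
        "numberOfResults": top_k,
        # HYBRID combines keyword and semantic search; SEMANTIC is vector-only.
        "overrideSearchType": "HYBRID" if hybrid else "SEMANTIC",
    }
    if metadata_filter is not None:
        # Example filter shape: {"equals": {"key": "department", "value": "support"}}
        vector_config["filter"] = metadata_filter
    return {"vectorSearchConfiguration": vector_config}
```

The result would be passed as `retrievalConfiguration=` to `boto3.client("bedrock-agent-runtime").retrieve(...)`, alongside `knowledgeBaseId` and `retrievalQuery={"text": ...}`.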
### Prompt Debugging
```
PROMPT DEBUGGING PATTERNS
============================
SYMPTOM: Inconsistent outputs across runs
Problem→ Non-deterministic generation
Solution→ Set temperature=0, add output format, use few-shot examples
SYMPTOM: Model misinterprets instructions
Problem→ Ambiguous or complex prompt
Solution→ Simplify language, use XML tags to structure sections,
add explicit constraints ("Do NOT include...")
SYMPTOM: Output too long / too short
Problem→ maxTokens misconfigured or instructions unclear
Solution→ Right-size maxTokens, add length guidance in prompt,
use stopSequences for early termination
SYMPTOM: Token limit exceeded
Problem→ Input too large for context window
Solution→ Compress context (summarize), reduce few-shot examples,
split into multiple requests, hierarchical summarization
SYMPTOM: Format drift over conversation
Problem→ Model "forgets" format in long conversations
Solution→ Reinforce format in system prompt (persists across turns),
use stopSequences, reset context periodically
```
### Prompt Maintenance and Version Control
```
PROMPT MAINTENANCE BEST PRACTICES
====================================
1. VERSION CONTROL (Bedrock Prompt Management)
- Store prompt templates with version numbers
- Test in playground before publishing
- Roll back to previous versions if quality degrades
- Track which version is deployed in each environment
2. TEMPLATE TESTING
- CloudWatch Logs Insights: query output patterns
- X-Ray: trace prompt performance over time
- Schema validation: verify output matches expected schema
3. OBSERVABILITY
- Log prompt version alongside model outputs
- Track quality metrics per prompt version
- Alert on quality regression after version change
4. CHANGE MANAGEMENT
- Golden dataset regression test before deployment
- Canary deployment (5% → 25% → 50% → 100%)
- Automated rollback if quality metrics drop
```
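The golden dataset regression test in step 4 reduces to a simple gate once each example has a quality score. The following sketch assumes scores between 0.0 and 1.0 produced by whatever metric the team uses (ROUGE, LLM-as-Judge, and so on); the function name `regression_gate` and the `max_drop` threshold are illustrative assumptions.

```python
from statistics import mean


def regression_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Compare per-example quality scores for two prompt versions.

    Both lists must score the same golden examples in the same order.
    Returns the mean scores and whether the candidate may be deployed.
    """
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("Both runs must score the same golden examples")
    baseline = mean(baseline_scores)
    candidate = mean(candidate_scores)
    return {
        "baseline_mean": baseline,
        "candidate_mean": candidate,
        # Fail the gate if average quality drops by more than max_drop.
        "passed": candidate >= baseline - max_drop,
    }
```

A CI/CD job could fail the build when `passed` is false, and the same comparison can drive automated rollback during a canary rollout. For small golden sets, a per-example check or a significance test is worth adding, since a mean can hide a few severe regressions.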
### Practice Questions — Section 3: Troubleshooting
**Q6.** After updating their knowledge base documents, a team notices that the RAG application returns stale information for some queries. The data source sync shows as completed. What is the MOST likely cause?
A) The embedding model has a knowledge cutoff date
B) The vector index needs to be completely rebuilt
C) The incremental sync didn't process all updated documents
D) The foundation model caches previous responses
**Answer: C**
Incremental sync may not have detected or processed all updated documents. The team should check the sync job details and verify that the updated documents were re-chunked and re-embedded, or trigger a full sync. The embedding model (A) doesn't have a knowledge cutoff; it processes whatever text it receives. Vector indexes (B) update incrementally and don't need full rebuilds. Foundation models (D) don't cache responses between separate requests.
---
**Q7.** A production GenAI application starts returning ThrottlingException errors during peak hours. The application currently retries immediately on failure, causing more throttling. What should the team implement?
A) Switch to Provisioned Throughput immediately
B) Implement exponential backoff with jitter
C) Increase Lambda memory allocation
D) Switch to a smaller, faster model
**Answer: B**
The immediate retry strategy is causing a "thundering herd" problem — failed requests retry immediately, adding to the congestion. Exponential backoff with jitter spaces out retries with randomized delays, reducing pressure on the service. Provisioned Throughput (A) is a longer-term solution but doesn't fix the retry storm happening now. Lambda memory (C) is unrelated to API throttling. A smaller model (D) may have different quotas but doesn't address the retry strategy that's amplifying the problem.
---
**Q8.** A customer support chatbot using RAG consistently fails to find product serial numbers when users ask about specific products. The knowledge base contains documents with serial numbers. What should the team change?
A) Use a larger embedding model with more dimensions
B) Enable hybrid search combining semantic and keyword search
C) Increase the number of retrieved chunks from 5 to 20
D) Fine-tune the foundation model on product data
**Answer: B**
Serial numbers and product identifiers are exact terms that semantic (vector) search handles poorly, because vector embeddings capture meaning rather than exact strings. Hybrid search adds keyword search alongside semantic search, enabling precise matching for proper nouns, codes, and identifiers. A larger embedding model (A) still encodes meaning, not exact strings. More chunks (C) adds noise without improving relevance. Fine-tuning (D) changes how the model writes its answer, not what the retriever finds.
---
**Q9.** A development team notices their GenAI application produces correct responses in the Bedrock playground but gives incorrect responses in production. The prompt is identical. What should they investigate FIRST?
A) The foundation model was updated by the provider
B) Differences in inference parameters between playground and production code
C) Network latency causing response truncation
D) CloudWatch alarm configuration
**Answer: B**
When the same prompt works in playground but fails in production, the most likely cause is a discrepancy in inference parameters (temperature, maxTokens, topP) or system prompt differences between the two environments. The playground may use different defaults than the application code. A model update (A) would affect both playground and production equally. Network latency (C) doesn't cause incorrect content. CloudWatch (D) monitors but doesn't affect output quality.
---
## SECTION 4: DOMAIN 5 COMPREHENSIVE REVIEW
*Lines: ~250 | Priority: Review before exam*
### Top 8 Exam Traps
| # | Trap | Correct Understanding |
|---|------|-----------------------|
| 1 | "Use accuracy/F1 for GenAI evaluation" | These are traditional classification metrics; open-ended generation needs ROUGE/BLEU, LLM-as-Judge, and human eval (F1/Exact Match still fits extractive QA) |
| 2 | "Fine-tune to fix hallucination" | Fine-tuning teaches style/task patterns; RAG + contextual grounding fixes hallucination |
| 3 | "LLM-as-Judge is unbiased" | Has self-preference, order, and verbosity biases — use different model family as judge |
| 4 | "Model invocation logging is on by default" | It's OFF by default — must explicitly enable to see prompt/response content |
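| 5 | "ROUGE measures precision" | ROUGE is RECALL-oriented (how much of the reference the output covers); BLEU is PRECISION-oriented (how much of the output appears in the reference) |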
| 6 | "Retry immediately on throttling" | Use exponential backoff with jitter — immediate retry makes throttling worse |
| 7 | "Perplexity: higher is better" | Perplexity: LOWER is better (less "surprised" = more fluent) |
| 8 | "Vector search finds product codes" | Semantic search matches meaning, not exact strings — use hybrid search for identifiers |
### Troubleshooting Decision Tree
```
GENAI APPLICATION TROUBLESHOOTING
====================================
"Application is not working correctly"
│
┌────┴────────────────────────────────┐
│ │
Error responses? Poor quality output?
│ │
├─ ThrottlingException ├─ Hallucination
│ → Exponential backoff │ → RAG + contextual grounding
│ → Request quota increase │ → Lower temperature
│ │
├─ ModelTimeoutException ├─ Format inconsistency
│ → Reduce input tokens │ → temp=0 + few-shot + schema
│ → Use streaming │ → stopSequences
│ │
├─ ValidationException ├─ Irrelevant/off-topic
│ → Check request format │ → Improve system prompt
│ → Verify model ID + access │ → Add guardrail denied topics
│ │
├─ AccessDeniedException ├─ RAG: wrong results
│ → Check IAM policy │ → Hybrid search + re-ranking
│ → Verify model access enabled │ → Better chunking + metadata
│ │
└─ ServiceUnavailableException └─ RAG: stale results
→ Retry with backoff → Check sync status
→ Cross-Region Inference → Enable incremental sync
```
### Evaluation Metric Selection Guide
```
QUICK REFERENCE: WHICH EVALUATION METHOD?
============================================
"Compare two models on my task"
└──► Bedrock Model Evaluation (automatic mode)
"Get expert quality feedback on outputs"
└──► Bedrock Model Evaluation (human mode)
"Evaluate in CI/CD pipeline programmatically"
└──► FMEval library
"Scale evaluation without human reviewers"
└──► LLM-as-a-Judge
"Measure summarization quality"
└──► ROUGE-L (against reference summaries)
"Detect hallucination in RAG"
└──► Faithfulness check (RAGAS) or Guardrails contextual grounding
"Check if translation is accurate"
└──► BLEU score
"Verify semantic meaning preserved"
└──► BERTScore
"Test agent tool selection"
└──► Golden test cases with enableTrace=True
"Detect quality regression after changes"
└──► Golden dataset regression testing in CI/CD
```
### GenAI vs Traditional ML Evaluation Comparison
| Aspect | Traditional ML | Generative AI |
|--------|---------------|---------------|
| **Ground truth** | Single correct answer | Multiple valid outputs |
| **Metrics** | Accuracy, F1, precision, recall | ROUGE, BLEU, BERTScore, LLM-as-Judge |
| **Evaluation type** | Deterministic | Probabilistic + subjective |
| **Automation** | Fully automated | Automated + human + LLM-as-Judge |
| **Test data** | Labeled datasets | Golden datasets + evaluation rubrics |
| **Quality dimensions** | Accuracy | Accuracy + fluency + relevance + groundedness + safety |
| **Drift detection** | Statistical tests on features | Semantic drift detection on outputs |
| **CI/CD gates** | Accuracy threshold | Multi-metric quality gates |
| **AWS tool** | SageMaker Model Monitor | Bedrock Model Evaluation + FMEval |
### Rapid-Fire Exam Questions
**Q10.** Which metric is specifically designed for evaluating machine translation quality?
A) ROUGE-L
B) BLEU
C) BERTScore
D) Perplexity
**Answer: B** — BLEU (Bilingual Evaluation Understudy) was designed for translation. It measures n-gram precision against reference translations. ROUGE (A) is for summarization. BERTScore (C) measures semantic similarity. Perplexity (D) measures fluency.
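The precision/recall distinction between BLEU and ROUGE is easier to remember with numbers. The sketch below computes clipped unigram overlap only; it is a simplification for intuition. Real BLEU combines several n-gram orders with a brevity penalty, and ROUGE has multiple variants (ROUGE-N, ROUGE-L). The function name `unigram_overlap` is hypothetical.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(1, sum(cand.values()))  # BLEU-style view
    recall = overlap / max(1, sum(ref.values()))      # ROUGE-style view
    return precision, recall


precision, recall = unigram_overlap("the cat sat", "the cat sat on the mat")
print(precision, recall)  # 1.0 0.5
```

Every word of the short candidate appears in the reference (precision 1.0), but it covers only half of the reference (recall 0.5). A precision-oriented metric rewards this output, while a recall-oriented metric penalizes the missing content.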
---
**Q11.** A team wants to ensure that their production RAG application doesn't degrade in quality when they update the system prompt. What should they run before deploying the change?
A) CloudWatch Synthetics canary
B) Golden dataset regression test
C) X-Ray trace comparison
D) SageMaker Model Monitor baseline
**Answer: B** — Golden dataset regression testing compares output quality before and after the prompt change on a curated set of test cases. This catches quality regressions before they reach production. Synthetics (A) tests availability, not quality. X-Ray (C) compares performance traces. Model Monitor (D) detects drift in deployed models, not prompt changes.
---
**Q12.** A company's GenAI application works correctly in the development environment but returns AccessDeniedException in production. What should they check FIRST?
A) The model's context window limit
B) The IAM execution role's permissions for bedrock:InvokeModel
C) The model invocation logging configuration
D) The CloudWatch alarm thresholds
**Answer: B** — AccessDeniedException is an IAM permissions issue. The production environment's IAM execution role likely lacks the `bedrock:InvokeModel` permission, or the model access hasn't been enabled in the production account/region. Context window (A) causes ValidationException. Logging (C) is unrelated to access. Alarms (D) monitor but don't affect access.
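For reference, an identity policy granting the permission might be shaped like the sketch below. It assumes the foundation-model ARN format `arn:aws:bedrock:REGION::foundation-model/MODEL_ID`; the helper name `invoke_model_policy` is hypothetical. Applications that call the model through an inference profile typically need that profile's ARN in `Resource` as well.

```python
import json


def invoke_model_policy(region, model_id):
    """Return an IAM policy document allowing invocation of one foundation model."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    # Needed only if the application streams responses.
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
            }
        ],
    }


# "example-model-id" is a placeholder, not a real model identifier.
print(json.dumps(invoke_model_policy("us-east-1", "example-model-id"), indent=2))
```

Comparing the development and production roles against a policy like this is a quick first check: a missing action, a different region in the ARN, or a different model ID each produce AccessDeniedException.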
---
**Q13.** When using LLM-as-a-Judge for pairwise comparison, a team notices the judge consistently prefers whichever answer is presented second. What bias is this, and how should they mitigate it?
A) Self-preference bias — use a different judge model
B) Verbosity bias — add length constraints to the rubric
C) Order bias — run each comparison in both orders and average
D) Anchoring bias — randomize the evaluation rubric
**Answer: C** — Position/order bias means the judge model favors a particular position (first or last). The standard mitigation is to run each comparison in both orders (A vs B, then B vs A) and average the results. Self-preference (A) is about model family, not position. Verbosity (B) is about length. Anchoring (D) is not a standard LLM evaluation bias.
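The both-orders mitigation can be sketched in a few lines. The `judge` callable here is an assumption: it stands for any function that shows two answers to a judge model and returns the probability (0.0 to 1.0) that the first answer shown is better. How that probability is obtained from the model is outside this sketch.

```python
def debiased_preference(judge, question, answer_a, answer_b):
    """Score answer_a against answer_b in both presentation orders and average.

    judge(question, first, second) returns the probability that the FIRST
    answer shown is better. The result is the averaged preference for answer_a.
    """
    # Order 1: A is shown first, so the judge's score is A's preference directly.
    a_shown_first = judge(question, answer_a, answer_b)
    # Order 2: A is shown second, so A's preference is the complement.
    a_shown_second = 1.0 - judge(question, answer_b, answer_a)
    return (a_shown_first + a_shown_second) / 2.0
```

A judge that always favors the second position, regardless of content, averages out to 0.5 (no preference), which is exactly the signal that the comparison carried position bias rather than a real quality difference.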
---
**Q14.** A deployed Bedrock Knowledge Base returns correct information but the generated response doesn't cite its sources. What should the team check?
A) The guardrails configuration
B) The citation configuration in the RetrieveAndGenerate API response parsing
C) The OpenSearch index settings
D) The model invocation logging configuration
**Answer: B** — Bedrock Knowledge Bases include citations in the `RetrieveAndGenerate` API response. If citations aren't showing, the team should check whether they're properly extracting and displaying citation data from the response object (`citations` → `retrievedReferences` → `location`). Guardrails (A) filter content, not citations. Index settings (C) affect retrieval. Logging (D) captures content for audit.
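Walking the `citations` → `retrievedReferences` → `location` path might look like the sketch below. It operates on the response dict only and handles S3 sources; other data source types use different location keys, which this example does not cover. The function name `extract_citations` is hypothetical.

```python
def extract_citations(response):
    """Flatten RetrieveAndGenerate citations into (cited_text, source_uri) pairs."""
    pairs = []
    for citation in response.get("citations", []):
        # The span of generated text that this citation supports.
        part = citation.get("generatedResponsePart", {}).get("textResponsePart", {})
        cited_text = part.get("text", "")
        for ref in citation.get("retrievedReferences", []):
            location = ref.get("location", {})
            # Only S3 locations are handled here; uri is None for other types.
            uri = location.get("s3Location", {}).get("uri")
            pairs.append((cited_text, uri))
    return pairs
```

If this returns an empty list for a response that has `output.text`, the application is receiving no citation data at all, which points back to retrieval rather than response parsing.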
---
**Q15.** Which AWS tool detects that a deployed SageMaker model's input data distribution has changed compared to the training data?
A) CloudWatch Metrics
B) AWS X-Ray
C) SageMaker Model Monitor (data quality monitoring)
D) Bedrock Model Evaluation
**Answer: C** — SageMaker Model Monitor with data quality monitoring detects input distribution drift — changes in the statistical properties of input data compared to the training baseline. CloudWatch (A) monitors operational metrics. X-Ray (B) traces request flow. Bedrock Model Evaluation (D) compares models on Bedrock, not SageMaker drift.
---
## EXAM DAY QUICK REFERENCE — DOMAIN 5
### Evaluation: Remember the Taxonomy
- Automated metrics → CI/CD, regression testing (fast, scalable, limited depth)
- Human evaluation → final quality validation (slow, expensive, gold standard)
- LLM-as-a-Judge → rapid iteration at scale (watch for biases)
### Metrics: Remember Task-to-Metric
- Summarization → ROUGE
- Translation → BLEU
- Semantic similarity → BERTScore
- Fluency → Perplexity (LOWER is better)
- Extractive QA → F1 / Exact Match
- Open-ended → LLM-as-Judge + Human
### RAG: Remember the Four Dimensions
- Faithfulness = is the answer grounded in context?
- Answer relevance = does it address the question?
- Context precision = were retrieved chunks relevant?
- Context recall = were all relevant chunks retrieved?
### Troubleshooting: Remember the Hierarchy
1. CloudWatch → 2. Input validation → 3. X-Ray → 4. Agent traces → 5. Playground → 6. Vector store → 7. Guardrails
### AWS Tools: Remember the Map
- Compare models → Bedrock Model Evaluation
- Programmatic evaluation → FMEval
- Prompt version control → Bedrock Prompt Management
- Deployed model drift → SageMaker Model Monitor
---
*Study guide generated by Dr. Priya Ramanathan*
*Domain 5: Testing, Validation, and Troubleshooting — 11% of AIP-C01*
*Lines: ~1,350 | Practice Questions: 15 (5 + 4 + 6 rapid-fire)*
*Skills referenced: 12_EVALUATION_AND_TESTING, 08_MLOPS_AND_DEPLOYMENT, 06_RESPONSIBLE_AI*