Glossary

Terms and concepts used in the experiment documents. Grows with each phase.

Evaluation Dataset (Golden Dataset)

The evaluation dataset (also "golden dataset" or "ground truth") is the foundation of all retrieval metrics. It defines which documents are considered "relevant" for a given query. All Precision, Recall, MRR, and nDCG values are only as meaningful as the dataset they are measured against.

Structure

An evaluation dataset consists of query-document pairs:

Query: A natural-language question or search query
Expected Sources: List of documents considered relevant
Must-Contain Keywords: Keywords that must appear in the results
Reference Chunk Text: Reference text for semantic comparison

Quality Criteria

Criterion	Description	Risk if Violated
Coverage	Queries cover various topics and difficulty levels	Blind spots for certain document types
Completeness	All relevant documents per query are annotated	Recall is measured too high — unannotated relevant docs are ignored
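Correctness	Only actually relevant documents are marked	Distorted metrics: wrongly marked documents count as hits when retrieved (inflated Precision) and as misses when not retrieved (deflated Recall)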
Size	Enough queries for statistically reliable results	Random fluctuations dominate the metrics
Negative Tests	Queries for which no relevant documents exist	System robustness is not tested

Our Dataset

data/evaluation/aws_cert_eval_v1.json: 53 queries (50 positive, 3 negative) covering AWS Certification Docs. Validation uses three levels: Source Matching (OR) Keyword Matching (OR) Semantic Similarity.

Limitation: 53 queries are sufficient for relative comparisons between configurations, but too few for statistically significant statements about absolute quality. Each query carries 1/53 (about 1.9%) of the average, so small changes to a few queries can shift metrics by 1-2 percentage points.

Retrieval Metrics

Precision@k

Proportion of relevant documents among the top-k results.

Formula: Precision@k = |relevant documents in top-k| / k

Example: At k=10, 10 documents are returned. 8 of them are relevant. Precision@10 = 8/10 = 0.80

Interpretation: High Precision = little noise in the results. Important when every returned document serves as context for the LLM and irrelevant documents can degrade answer quality.
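
A minimal sketch of the formula, assuming retrieved results are an ordered list of document IDs and the relevant documents are a set (the function name and the IDs are illustrative, not taken from the project code):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Mirrors the example above: 8 of the 10 returned documents are relevant.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = {f"doc{i}" for i in range(1, 9)}
print(precision_at_k(retrieved, relevant, k=10))  # 0.8
```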

Recall@k

Proportion of found relevant documents out of all relevant documents in the index.

Formula: Recall@k = |relevant documents in top-k| / |all relevant documents|

Example: There are 12 relevant documents in the index. At k=10, 9 of them are found. Recall@10 = 9/12 = 0.75

Interpretation: High Recall = few relevant documents are missed. In RAG systems particularly critical: what is not found cannot be compensated downstream. Precision can be improved through reranking, Recall cannot.
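
A minimal sketch of the formula. The function name, the IDs, and the choice to return 0.0 for queries without relevant documents are assumptions for illustration:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0  # assumption: negative queries contribute 0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Mirrors the example above: 12 relevant documents, 9 of them in the top 10.
relevant = {f"rel{i}" for i in range(12)}
retrieved = [f"rel{i}" for i in range(9)] + ["noise"]
print(recall_at_k(retrieved, relevant, k=10))  # 0.75
```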

MRR (Mean Reciprocal Rank)

Average of the reciprocal rank of the first relevant result across all queries.

Formula: MRR = (1/N) * sum(1/rank_i) where rank_i is the rank of the first relevant result for query i.

Example:

Query 1: First relevant result at rank 1 → 1/1 = 1.0
Query 2: First relevant result at rank 3 → 1/3 = 0.33
Query 3: First relevant result at rank 1 → 1/1 = 1.0
MRR = (1.0 + 0.33 + 1.0) / 3 = 0.78

Interpretation: MRR close to 1.0 means the first relevant result almost always appears at rank 1. Particularly important for user-facing systems, where the user expects the first result to be relevant.
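
A minimal sketch, assuming each query is represented by the 1-based rank of its first relevant result, with None when nothing relevant was returned (that convention is an assumption; it contributes 0 to the sum):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over all queries; None counts as 0."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = sum(1.0 / rank if rank else 0.0 for rank in first_relevant_ranks)
    return total / len(first_relevant_ranks)


# Mirrors the example above: ranks 1, 3, 1.
print(round(mean_reciprocal_rank([1, 3, 1]), 2))  # 0.78
```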

nDCG@k (Normalized Discounted Cumulative Gain)

Measures ranking quality: Are relevant documents ranked higher than irrelevant ones?

Idea: A relevant document at rank 1 is more valuable than at rank 10. The "gain" is logarithmically discounted with increasing position. Normalized to [0, 1] by dividing by the ideal ranking.

Formula:

DCG@k = sum(rel_i / log2(i + 1))  for i = 1..k
nDCG@k = DCG@k / IDCG@k           (IDCG = DCG of perfect ranking)

Example: At k=5 with relevance labels [1, 0, 1, 1, 0]:

DCG = 1/log2(2) + 0/log2(3) + 1/log2(4) + 1/log2(5) + 0/log2(6) = 1.0 + 0 + 0.5 + 0.43 + 0 = 1.93
Ideal ranking [1, 1, 1, 0, 0]: IDCG = 1.0 + 0.63 + 0.5 + 0 + 0 = 2.13
nDCG = 1.93 / 2.13 = 0.91

Interpretation: nDCG@k close to 1.0 means relevant documents are optimally placed. Combines Precision (how many are relevant?) with ranking quality (where are they placed?). Considered the best single metric for retrieval quality.
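
The worked example can be reproduced with a short sketch. It assumes the ideal ranking is obtained by sorting the labels of the returned list, as in the example above; a stricter variant would build the ideal ranking from all relevant documents in the dataset:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same labels sorted best-first."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0


labels = [1, 0, 1, 1, 0]
print(round(dcg_at_k(labels, 5), 2))   # 1.93
print(round(ndcg_at_k(labels, 5), 2))  # 0.91
```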

Metrics Comparison

Metric	Measures	Strength	Weakness
Precision@k	Proportion of relevant results	Simple, intuitive	Ignores ranking order
Recall@k	Coverage of all relevant docs	Important for RAG completeness	Ignores noise
MRR	Position of the first hit	Good for "need only 1 result"	Ignores all other hits
nDCG@k	Overall ranking quality	Best overall metric	More complex to interpret

Chunking

Recursive Chunking

Splits text along natural boundaries: first paragraphs, then sentences, then words. Attempts to preserve semantically coherent units.

Advantages: Respects document structure, fewer context breaks.

Disadvantages: Uneven chunk sizes.

Fixed Chunking

Splits text into pieces of fixed length (in tokens), regardless of text structure.

Advantages: Even chunk sizes, more consistent embeddings.

Disadvantages: Can split in the middle of a sentence or paragraph.

Overlap

Number of tokens that overlap between consecutive chunks. Prevents relevant information from being lost at chunk boundaries.

Typical values: 10-20% of chunk size (e.g., 50 for chunk size 512).
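
A minimal sketch of fixed chunking with overlap, assuming the text is already tokenized into a list (the function name is illustrative). Each chunk starts chunk_size - overlap tokens after the previous one, so the last tokens of one chunk reappear at the start of the next:

```python
def fixed_chunks(tokens, chunk_size, overlap):
    """Split tokens into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


chunks = fixed_chunks(list(range(10)), chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```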

Embeddings

Embedding Dimension

Length of the vector that the embedding model produces per text chunk.

768d (e.g., BGE-base): Standard BERT size. Good for shorter texts.
1024d (e.g., BGE-large, E5-large): More capacity for semantic nuances. Benefits especially with longer chunks (1024+ tokens), as more context needs to be encoded.

Trade-offs: Higher dimensions can improve retrieval quality but also increase embedding latency (larger model), memory consumption in the vector index (each vector occupies more bytes), and search latency (more dimensions per distance computation). At 100k chunks: 768d ~ 300 MB, 1024d ~ 400 MB (float32).
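
The memory figures follow from vectors * dimensions * bytes per value. A small sketch of that arithmetic (raw vector data only; index overhead such as HNSW graph links is not included, and the function name is illustrative):

```python
def raw_vector_size_mb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense float vectors in MB (1 MB = 10**6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1_000_000


print(raw_vector_size_mb(100_000, 768))   # 307.2
print(raw_vector_size_mb(100_000, 1024))  # 409.6
```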

BGE (BAAI General Embedding)

Embedding models from BAAI (Beijing Academy of Artificial Intelligence). Variants: bge-base (768d), bge-large (1024d). Type prefix: "bge".

E5 (EmbEddings from bidirEctional Encoder rEpresentations)

Embedding models from Microsoft Research. E5-large-v2 has 1024 dimensions. Type prefix: "e5". Expects the text prefixes "query: " and "passage: " for asymmetric search.

Vector Search

HNSW (Hierarchical Navigable Small World)

Graph-based algorithm for Approximate Nearest Neighbor (ANN) search. Standard index in ChromaDB, Qdrant, Weaviate.

Key parameters:

ef_construction: Accuracy during index construction (higher = more accurate, slower)
M: Number of connections per node (higher = more accurate, more memory)

Cosine Similarity

Similarity measure between two vectors. Measures the angle, not the magnitude. Values from -1 (opposite direction) to 1 (same direction). Standard metric for embedding search.

Dot Product (Scalar Product)

Mathematical operation between two vectors: a . b = sum(a_i * b_i). For sparse vectors, only the shared non-zero positions are multiplied and summed. In the context of BM25: the dot product between query sparse vector and document sparse vector yields the BM25 score.

Difference from Cosine Similarity: Cosine normalizes by vector length (measures direction), Dot Product also considers length (measures direction + magnitude).
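
A small sketch that makes the difference concrete. The two vectors point in the same direction but differ in length, so cosine similarity is 1 while the dot product reflects the magnitudes (plain Python lists are used here for illustration; real systems would use array libraries):

```python
import math


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / norm


a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the length
print(dot(a, b))               # 10.0
print(round(cosine(a, b), 6))  # 1.0
```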

Search Strategies

Dense Vector Search

Semantic search over embedding vectors. Each text is converted by an embedding model into a dense vector (fully-populated vector, all dimensions have values). Similar texts have similar vectors (measured via Cosine Similarity).

Strengths: Captures semantic similarity ("car" also finds "vehicle").

Weaknesses: Can miss exact keyword matches, especially with technical terms or proper nouns not seen during training.

Dense Vector

Fully-populated vector of fixed length (e.g., 1024 dimensions), where every position has a float value. Produced by embedding models like E5-large-v2 or BGE. Encodes the semantic meaning of the entire text.

Storage: 1024d * 4 bytes (float32) = 4 KB per vector. At 150k chunks: ~600 MB.

Sparse Vector

Sparsely-populated vector with high dimensionality (vocabulary size, e.g., 50,000+), but only a few non-zero entries (typically 20-100 per document). Stored as pairs of indices (which positions) and values (which weights).

Example: A document with the terms "Lambda", "serverless", "function" has a sparse vector with 3 non-zero entries at the positions of these terms in the vocabulary.

Storage: Only the non-zero entries are stored. At 50 terms per chunk: 50 * (4 + 4) bytes = 400 bytes per vector. At 150k chunks: ~60 MB.
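
A minimal sketch of the index/value representation, using a dict from vocabulary index to weight (the indices and weights below are made up for illustration). The dot product only touches positions present in both vectors:

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors given as {index: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(weight * b[idx] for idx, weight in a.items() if idx in b)


def sparse_size_bytes(non_zero, index_bytes=4, value_bytes=4):
    """Storage for one sparse vector: one index plus one value per entry."""
    return non_zero * (index_bytes + value_bytes)


doc = {17: 1.2, 4021: 0.8, 30555: 0.5}
query = {4021: 2.0, 999: 1.0}
print(sparse_dot(doc, query))   # only index 4021 is shared
print(sparse_size_bytes(50))    # 400
```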

BM25 (Best Match 25)

Classic keyword ranking algorithm (Robertson et al., 1994). Scores documents based on overlap with query terms, considering:

Term Frequency (TF): How often a query term appears in the document. BM25 uses saturated TF — the benefit of each additional occurrence diminishes:
```
tf_component = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```
- k1 (default 1.5): Controls TF saturation. Higher = more weight for frequent terms. k1=0 ignores TF completely.
- b (default 0.75): Controls length normalization. b=1 normalizes fully, b=0 ignores document length.
- dl: Document length (number of tokens)
- avgdl: Average document length in the corpus
Inverse Document Frequency (IDF): Weighting of rare terms. The rarer a term in the entire corpus, the more informative it is:
```
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
```
- N: Total number of documents in the corpus
- df(t): Number of documents containing term t

BM25 Score for a document d and query q:

BM25(d, q) = sum(IDF(t) * tf_component(t, d))  for all t in q

Strengths: Excellent for exact keyword matches, technical terms, proper nouns.

Weaknesses: No semantic understanding ("car" does not find "vehicle").

In our system: BM25 weights are stored as sparse vectors in Qdrant. Document vectors contain the TF component (without IDF), query vectors contain the IDF values. The dot product yields the BM25 score.
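
The split into document-side TF weights and query-side IDF weights can be sketched in plain Python. This is an illustration of the formulas above, not the project's implementation: documents are assumed to be pre-tokenized lists, terms are used directly as keys instead of vocabulary indices, and all function names are hypothetical.

```python
import math
from collections import Counter


def build_bm25_vectors(docs, k1=1.5, b=0.75):
    """Return per-document TF-component vectors and the corpus IDF table."""
    if not docs:
        raise ValueError("corpus must not be empty")
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1) for t, c in df.items()}
    doc_vectors = []
    for d in docs:
        length_norm = k1 * (1 - b + b * len(d) / avgdl)
        doc_vectors.append({t: tf * (k1 + 1) / (tf + length_norm)
                            for t, tf in Counter(d).items()})
    return doc_vectors, idf


def bm25_score(query_terms, doc_vector, idf):
    """Dot product of the query vector (IDF) and a document vector (TF)."""
    query_vector = {t: idf[t] for t in set(query_terms) if t in idf}
    return sum(w * doc_vector.get(t, 0.0) for t, w in query_vector.items())


docs = [
    ["lambda", "runs", "serverless", "function"],
    ["s3", "stores", "objects"],
    ["lambda", "lambda", "function", "timeout"],
]
doc_vectors, idf = build_bm25_vectors(docs)
scores = [bm25_score(["lambda", "timeout"], v, idf) for v in doc_vectors]
```

In this toy corpus the third document scores highest (it contains both query terms), and the second scores 0 because it shares no term with the query.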

Named Vectors (Qdrant)

Qdrant feature that allows multiple vector spaces per collection (sparse vectors are supported since v1.7). Each vector space has a name and its own parameters (dimension, distance metric).

Example of our hybrid collection:

Collection: "recursive_1024_100__e5-large-v2__hybrid"
├── "dense": 1024-dim, Cosine Similarity (E5-large-v2 embeddings)
└── "bm25":  Sparse Vector (BM25 TF weights)

Each document has both vectors. Queries can target a specific vector space (using="dense" or using="bm25") or combine both.

Hybrid Search

Combination of Dense (semantic) and Sparse (keyword-based) search. Goal: Combine the strengths of both approaches — semantic understanding (Dense) plus exact keyword matches (Sparse).

Process in our system:

Send query to dense index → top 2·k results with scores (twice the final k)
Send query to sparse index → top 2·k results with scores (twice the final k)
Normalize scores (Min-Max to [0,1])
Weighted combination: combined = α * dense + (1-α) * sparse
Sort by combined score, return top-k
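
The steps above can be sketched as follows. The inputs are assumed to be dicts from document ID to raw score, one per index. How the project treats a document that appears in only one result list is not stated here; this sketch assumes it gets 0 for the missing component. Function names are illustrative.

```python
def min_max(scores):
    """Scale a {doc_id: score} dict to [0, 1]; equal scores all become 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_fuse(dense, sparse, alpha, k):
    """Alpha-blend normalized dense and sparse scores, return the top-k."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    combined = {
        # Assumption: a document missing from one list scores 0 there.
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:k]


dense = {"a": 0.9, "b": 0.6, "c": 0.3}     # cosine similarities
sparse = {"b": 12.0, "d": 30.0, "a": 3.0}  # BM25 scores
top = hybrid_fuse(dense, sparse, alpha=0.7, k=2)
print([doc_id for doc_id, _ in top])  # ['a', 'b']
```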

Alpha (Hybrid Weight)

Weighting parameter for Hybrid Search. Determines the ratio between Dense (semantic) and Sparse (keyword) components.

Alpha	Dense Share	Sparse Share	Description
1.0	100%	0%	Pure Dense (= pure semantic search)
0.7	70%	30%	Semantics-dominant
0.5	50%	50%	Balanced
0.3	30%	70%	Keyword-dominant
0.0	0%	100%	Pure Sparse (= pure BM25)

Intuition: For specialized domains with specific terminology (like AWS certifications), a keyword component can help find exact service names and technical terms more reliably, while the semantic component covers conceptual similarities.

Score Fusion

Merging results from two (or more) search methods into a unified ranking list.

Alpha Blending (score-based): Normalizes the scores of both methods to [0,1] and weights them: α * score_A + (1-α) * score_B. Requires score normalization, since Dense (0-1) and Sparse (0-30+) have different scales.

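RRF (Reciprocal Rank Fusion): Rank-based alternative: score = sum(1 / (k + rank_i)), where k is a smoothing constant (commonly 60), not the number of results. No alpha parameter; the standard form treats all sources equally. Independent of score scales, but offers no weighting between sources.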

Our choice: Alpha Blending, because we want to experimentally control and compare the influence of Dense vs. Sparse.
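
For comparison, a minimal sketch of RRF, which is not what this project uses. Inputs are ranked lists of document IDs; k=60 is the commonly used smoothing constant, and the function name is illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; returns IDs sorted by summed 1/(k+rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# ['b', 'a', 'd', 'c']
```

Only ranks enter the computation, so the different score scales of Dense and Sparse never need to be normalized.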

Min-Max Normalization

Scaling values to the range [0, 1]:

normalized = (x - min) / (max - min)

Necessary for Score Fusion, since raw scores from different search methods have different scales:

Dense (Cosine Similarity): typically 0.3 - 0.95
Sparse (BM25 Dot Product): typically 0 - 30+

After normalization, both are on [0, 1] and can be combined with weights.

Edge case: If all scores are equal (max = min), all are set to 1.0.

Reranking

Reranking (Second-Stage Scoring)

Post-processing of initial retrieval results with a more precise model. Retrieval (Stage 1) quickly delivers a candidate list (e.g., top-100 from 150k), the reranker (Stage 2) scores these candidates more accurately and re-sorts them.

Impact on metrics:

Improves Precision (irrelevant results are pushed down)
Improves nDCG and MRR (better ranking of relevant results)
Does not improve Recall — what was not found in retrieval cannot be added by the reranker. Therefore, over-retrieval (fetching more candidates than finally needed) is critical.

Bi-Encoder

Model that encodes query and document independently — one separate embedding each. Similarity is computed via vector operations (e.g., Cosine Similarity).

Advantages: Document embeddings can be precomputed and stored in a vector index. Search over 150k chunks in milliseconds (ANN algorithms).

Disadvantages: The independent encoding cannot capture fine interactions between query and document terms.

In our system: E5-large-v2 (Phase 2) is a Bi-Encoder for retrieval.

Cross-Encoder

Model that encodes query and document jointly — both are concatenated as a pair and passed through a transformer. The model can directly model interactions between query and document terms.

Advantages: Significantly more accurate than Bi-Encoders, since the model can see which query terms appear in the document and in what context.

Disadvantages: Cost grows as O(n) in the number of candidates, because each (query, document) pair requires a separate forward pass. With 100 candidates = 100 inference calls per query. Therefore only usable as a reranker for pre-selected candidates, not for full-index search.

Typical architecture: AutoModelForSequenceClassification (e.g., BGE-reranker-v2-m3). Input: [CLS] query [SEP] document [SEP] → Output: relevance score.

Bi-Encoder vs. Cross-Encoder

The standard pattern in modern RAG systems:

Full Index (150k Chunks)
    |
    v  Bi-Encoder (fast, sublinear via ANN, roughly O(log N) for HNSW)
Top-100 Candidates
    |
    v  Cross-Encoder (accurate, O(n))
Top-5 Results
    |
    v  LLM (Synthesis)
Answer

The Bi-Encoder quickly filters out the majority of irrelevant documents. The Cross-Encoder optimizes the ranking of the remaining candidates. This two-stage approach combines the speed of the Bi-Encoder with the accuracy of the Cross-Encoder.

Generative Reranker

Reranker based on a language model (e.g., Qwen2, Qwen3). Scores relevance through token generation instead of sequence classification.

Typical approach: The model receives query + document as a chat prompt and decides whether the document is relevant. The score is the probability of the "yes" token relative to the "no" token.

Difference from Cross-Encoder: Cross-Encoders have a dedicated classification head (linear layer). Generative rerankers use the language modeling head and extract scores from token probabilities.

Examples: Mixedbread mxbai-rerank-v2 (Qwen2-based), Qwen3-Reranker.
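
A minimal sketch of the scoring step only, assuming the logits of the "yes" and "no" tokens have already been read from the model's output at the answer position (prompting and model access are omitted; the function name is illustrative). The score is a softmax restricted to these two tokens:

```python
import math


def yes_no_score(yes_logit, no_logit):
    """Probability of 'yes' relative to 'no' (two-way softmax)."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


print(round(yes_no_score(2.0, 0.0), 3))  # 0.881
print(yes_no_score(1.5, 1.5))            # 0.5
```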

Over-Retrieval (retrieve_k)

Strategy of fetching more candidates from the index than finally needed, to give the reranker a broader selection.

Example: For k=5 final results, retrieve_k=100 candidates are fetched. The reranker scores all 100 and returns the best 5.

Typical ratios: 3-20x the final k. Too few candidates limit the reranker, too many increase latency (especially with LLM-based rerankers).

In our system: --retrieve-k 100 (CLI parameter in run_eval.py).
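
A minimal sketch of the pattern with stand-in components. The retrieve and score callables are hypothetical placeholders for the vector search and the reranker; only the k and retrieve_k roles are taken from the text above:

```python
def retrieve_then_rerank(query, retrieve, score, k=5, retrieve_k=100):
    """Fetch retrieve_k candidates, rescore them, return the best k."""
    candidates = retrieve(query, retrieve_k)
    ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


# Toy stand-ins: stage 1 returns the first n documents in index order,
# stage 2 scores by the number of shared words.
corpus = ["lambda pricing", "s3 storage classes",
          "lambda timeout limits", "ec2 instance types"]


def toy_retrieve(query, n):
    return corpus[:n]


def toy_score(query, doc):
    return len(set(query.split()) & set(doc.split()))


wide = retrieve_then_rerank("lambda timeout", toy_retrieve, toy_score,
                            k=1, retrieve_k=4)
narrow = retrieve_then_rerank("lambda timeout", toy_retrieve, toy_score,
                              k=1, retrieve_k=2)
print(wide)    # ['lambda timeout limits']
print(narrow)  # ['lambda pricing']
```

With retrieve_k=2 the best document never reaches the reranker, which illustrates why too few candidates cap the achievable quality.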

Response Synthesis

The LLM generates a natural-language answer from retrieved chunks. Different modes determine how chunks are presented to the LLM. The mode affects quality, latency, and token consumption.

Response Mode: tree_summarize

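Hierarchical summarization: Chunks are summarized pairwise, the summaries summarized again → final answer. Requires multiple LLM calls: the tree has O(log n) levels, while the total number of calls grows linearly with the number of chunks (n - 1 for strictly pairwise merging). Delivers the best quality with many chunks, since every chunk is considered.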

Response Mode: refine

Iterative refinement: First chunk → first answer. Each subsequent chunk refines the answer. N LLM calls (N = number of chunks). Good when the order of chunks matters.

Response Mode: compact

Fits as many chunks as possible into one prompt (up to context window limit). Typically 1-2 LLM calls. Faster and cheaper, but less thorough with many chunks.

Temperature

Sampling parameter of the LLM. Controls how deterministic or creative the answers are.

Value	Effect	Recommendation
0.0	Deterministic (always the most likely token)	Factual answers
0.3	Slight variation	Good compromise
0.7+	Creative/random	Not recommended for RAG

For RAG systems with factual answers: 0.0-0.3 recommended.
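
How temperature reshapes the token distribution can be sketched with a temperature-scaled softmax. The logits below are made up, and treating temperature 0 as a pure argmax is a common convention assumed here:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Token probabilities after dividing the logits by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the most likely token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.0, 0.3, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, and higher temperatures flatten the distribution.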

Synthesis Metrics

Faithfulness (LLM-as-Judge)

Measures whether the answer is based exclusively on the source chunks (no fabrications). LLM-as-Judge: An evaluator LLM checks every claim in the answer against the source texts.

Values: 0.0 (all claims fabricated) to 1.0 (all claims verifiable).

Status: Implemented in LLMMetrics.faithfulness(). Not invoked in Phase 6 (embedding-based metrics instead). Activatable from Phase 7 via --enable-faithfulness flag. Uses Claude Haiku 4.5 as evaluator.

LLM-as-Judge

Evaluation approach where one LLM assesses the quality of another LLM's output. Scalable and automatable — no manual annotations needed.

Risk: Self-evaluation bias — a model rates itself more favorably. Therefore: Use a different model as evaluator (e.g., Claude Haiku 4.5 evaluates GPT-4o-mini outputs).

In Phase 6: LLM-as-Judge (Faithfulness) was planned but deferred in favor of deterministic embedding-based metrics.

Answer Relevance (Embedding-Based)

Semantic similarity between query and answer, measured as embedding Cosine Similarity. A high value means the answer actually addresses the question, not just contains related material.

Method: Query embedding vs. Response embedding via E5-large-v2. Type: Deterministic, no LLM call.

Hallucination Score (Embedding-Based)

Per-sentence embedding comparison of the answer against all source chunks. Each sentence of the answer is encoded as an embedding and compared against the source chunk embeddings. Sentences with low similarity to all sources are potentially hallucinated.

Values: 0.0 (strong hallucination) to 1.0 (no hallucination detected). Type: Deterministic, no LLM call. Complementary to Faithfulness — measures the same aspect (source fidelity), but embedding-based instead of LLM-based.

Keyword Coverage (Rule-Based, Custom)

Proportion of expected keywords (must_contain_keywords from the eval dataset) in the generated answer. Custom metric for this project, since the eval dataset has no reference answers but defines keywords per query.

Formula: keyword_coverage = |found keywords| / |expected keywords|

Method: Case-insensitive substring match.

Example: Keywords = ["SageMaker", "machine learning", "managed service"]. Answer contains "SageMaker" and "machine learning" → Coverage = 2/3 = 0.67.

Context: More practical than purely semantic metrics, because concrete facts are checked. At the same time limited: Synonyms and paraphrases are not recognized ("fully managed" does not match "managed service").
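
A minimal sketch of the metric as described. The function name, the sample answer, and the behaviour for an empty keyword list (returns 0.0) are assumptions, not taken from the project code:

```python
def keyword_coverage(answer, expected_keywords):
    """Share of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0  # assumption: no keywords defined means no coverage
    answer_lower = answer.lower()
    found = [kw for kw in expected_keywords if kw.lower() in answer_lower]
    return len(found) / len(expected_keywords)


keywords = ["SageMaker", "machine learning", "managed service"]
answer = "Amazon SageMaker is a fully managed machine learning platform."
print(round(keyword_coverage(answer, keywords), 2))  # 0.67
```

The sample answer says "fully managed", which does not contain the substring "managed service", so the third keyword is not counted.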

Source Attribution (Embedding-Based)

Measures which source chunks actually contributed to the answer. Embedding similarity between the answer and each source chunk.

Values: attribution_rate from 0.0 (no sources used) to 1.0 (all sources contributed). A high value means the LLM effectively uses the provided sources rather than relying on its own knowledge. Type: Deterministic, no LLM call.

Latency Percentiles (p50, p95, p99)

Statistical distribution measures for response times across all queries:

Percentile	Meaning
p50 (Median)	50% of queries are faster
p95	95% of queries are faster (tail latency)
p99	99% of queries are faster (near worst case)

Why not just avg? The average obscures outliers. p95/p99 expose the slow tail of the distribution that the average hides. For user-facing systems, p95 is often a more relevant metric than avg.
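
A small sketch with made-up latencies shows how one outlier affects the average and the percentiles. It uses the nearest-rank definition of a percentile, which is one of several common definitions and an assumption here:

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 126, 2400]
print(percentile(latencies_ms, 50))           # 130
print(percentile(latencies_ms, 95))           # 2400
print(sum(latencies_ms) / len(latencies_ms))  # 357.4
```

The average of 357.4 ms describes neither the typical query (about 130 ms) nor the outlier (2400 ms); p50 and p95 report both separately.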

Quality Composite Score

Weighted score from all synthesis metrics. Normalized to [0, 1].

Weights (with Faithfulness):

Answer Relevance: 35%
Keyword Coverage: 25%
Source Attribution: 20%
Hallucination Penalty (1 - halluc_score): 10%
Faithfulness: 10%

Without Faithfulness: Relevance 40%, Keywords 25%, Attribution 20%, Halluc 15%.

Cost Efficiency

Ratio of quality to cost: quality_score / total_cost_usd. Higher = better price-performance ratio.

LLM Hosting

vLLM

Open-source LLM inference engine. Hosts open-source models on your own GPU infrastructure with an OpenAI-compatible API. Instead of token-based billing, you only pay for GPU compute time.

Key Features:

PagedAttention for efficient KV-Cache management
Continuous Batching for high throughput
Tensor Parallelism for large models across multiple GPUs
OpenAI-compatible API (/v1/chat/completions)

In our system: vLLM runs as a Kubernetes Deployment on GPU nodes. The model is configured via --served-model-name and accessible through an internal service (vllm-svc.ml-models.svc.cluster.local:8000). Important: The service is named vllm-svc (not vllm). For a service named vllm, Kubernetes automatically injects the environment variable VLLM_PORT=tcp://... into pods in the same namespace, which conflicts with vLLM's own VLLM_PORT env var.

GPU-Time Pricing

Cost model for self-hosted LLMs. Instead of token-based billing (like OpenAI), the GPU usage duration is charged:

cost = latency_ms / 1000 / 3600 * gpu_hourly_rate

Example: 500ms inference on g6.xlarge ($0.98/h): 0.5 / 3600 * 0.98 = $0.000136 per query

Advantage: Costs scale with inference time, not token count. Large prompts are not more expensive than small ones (at the same inference time).
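
The formula as a small function, reproducing the example above (the function name is illustrative):

```python
def gpu_cost_usd(latency_ms, gpu_hourly_rate):
    """Cost of one query when billing GPU time at an hourly rate."""
    return latency_ms / 1000 / 3600 * gpu_hourly_rate


print(round(gpu_cost_usd(500, 0.98), 6))  # 0.000136
```

This assumes the GPU is fully attributed to one query at a time; batching several queries on the same GPU would lower the effective cost per query.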

Tensor Parallelism (TP)

Distribution of a model across multiple GPUs by splitting weight matrices along columns/rows. Each GPU holds only a part of the model.

When needed: When the model does not fit on a single GPU.

70B-Q4 (~35GB) does not fit on 1x A10G (24GB), but on 4x A10G (96GB) with TP=4
27B/32B quantized to INT4 (~14-16GB) fits on 1x L40S (48GB) without TP

Overhead: TP requires inter-GPU communication at every transformer layer. 1x L40S without TP is therefore often faster than 4x A10G with TP=4 for models that fit on a single GPU.

Quantization (AWQ, GPTQ)

Reduction of weight precision from BF16/FP16 (16 bit) to INT4 (4 bit). Cuts the VRAM needed for weights to roughly a quarter, usually with minor quality loss.

AWQ (Activation-aware Weight Quantization): Uses activation statistics to identify the small fraction of salient weight channels and protects them through per-channel scaling before quantizing the weights to low-bit integers. Natively supported by vLLM.

GPTQ (GPT Quantization): Older method, quantizes weights per layer with error correction. Also natively supported by vLLM.

TorchAO: PyTorch-native quantization format. Requires torchao>=0.10.0 — not pre-installed in vLLM v0.14. Models with TorchAO quantization (e.g., pytorch/gemma-3-27b-it-AWQ-INT4) do not work out-of-the-box.

Practical experience from Phase 7:

Model	BF16 VRAM	Quantization	Quantized VRAM (est.)	GPU
Phi-4 (14B)	~28GB	AWQ-INT4	~8GB	L4 (24GB)
Gemma-3-27B	~54GB	GPTQ-INT4	~14GB	L40S (48GB)
Qwen3-32B	~64GB	AWQ	~16GB	L40S (48GB)

Without quantization, mid-tier models (27B, 32B) do not fit on single GPUs. Quantization is not optional in practice, but a prerequisite.
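
The VRAM columns can be approximated from parameter count and bits per weight. This back-of-the-envelope sketch covers weights only; KV cache, activations, and quantization metadata (scales, zero points) come on top, which is one reason real figures such as the ~8GB for Phi-4 sit above the raw estimate:

```python
def weight_vram_gb(params_billion, bits_per_weight):
    """Approximate VRAM for model weights in GB (1 GB = 10**9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params in [("Phi-4", 14), ("Gemma-3-27B", 27), ("Qwen3-32B", 32)]:
    print(name, weight_vram_gb(params, 16), weight_vram_gb(params, 4))
# Phi-4 28.0 7.0
# Gemma-3-27B 54.0 13.5
# Qwen3-32B 64.0 16.0
```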

Thinking Mode (Qwen3)

Qwen3 models generate <think>...</think> reasoning tokens by default before the actual answer. This leads to massively increased latency (~21s/query for 8B on L4 instead of expected ~2-3s).

Impact on quality: Despite high latency, Qwen3 delivers the best Keyword Coverage (0.929 for 32B-AWQ) — the thinking tokens appear to actually improve answer quality.

--override-generation-config '{"enable_thinking": false}' did NOT fix the issue in our tests. Latency remained unchanged.

End-to-End Evaluation (Phase 8)

Naive Baseline

The intentionally unoptimized reference pipeline, serving as the starting point. All components at default/beginner settings:

Chunking: fixed_256_25 (fixed 256-token chunks, 25 token overlap)
Embedding: bge-base-en-v1.5 (768 dimensions)
VectorDB: ChromaDB (in-process)
Search: Dense
Reranking: None

Per-Layer Impact Ranking

Method for quantifying each optimization layer's contribution. Compares retrieval metrics (MRR, nDCG, Precision, Recall) before and after each phase optimization. Sorted by absolute delta of the primary metric (e.g., MRR).

Cost-at-Volume Modelling

Cost projection for different query volumes (100/1k/10k/100k per month):

API costs: cost_per_query * volume (from Phase 7 eval results)
GPU costs: (avg_ms / 3_600_000) * gpu_hourly_rate * volume

Insight from Phase 8: GPT-4o-mini ($0.00025/query) is cheaper per query than self-hosted Qwen3-8B on L4 ($0.006/query at 21s latency due to Thinking Mode). The GPU advantage lies in data privacy and quality, not in cost.
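
A sketch of the projection using the two formulas above. The per-query API cost and the 21s latency are the figures quoted in this section; the $0.98/h rate is the g6.xlarge price from the GPU-Time Pricing entry, and using it for the L4 node here is an assumption:

```python
def api_cost(cost_per_query, volume):
    """Monthly API cost for a given number of queries."""
    return cost_per_query * volume


def gpu_cost(avg_ms, gpu_hourly_rate, volume):
    """Monthly GPU-time cost for a given number of queries."""
    return (avg_ms / 3_600_000) * gpu_hourly_rate * volume


for volume in (100, 1_000, 10_000, 100_000):
    api = api_cost(0.00025, volume)
    gpu = gpu_cost(21_000, 0.98, volume)
    print(f"{volume:>7} queries: API ${api:.2f}  GPU ${gpu:.2f}")
```

The model bills only the GPU time consumed by queries; a GPU node that runs around the clock regardless of traffic would add idle cost that this projection does not capture.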
