*[Deutsche Version](GLOSSARY_DE.md)*
# Glossary
Terms and concepts used in the experiment documents.
Grows with each phase.
---
## Evaluation Dataset (Golden Dataset)
The evaluation dataset (also "golden dataset" or "ground truth") is the foundation
of all retrieval metrics. It defines which documents are considered "relevant" for
a given query. All Precision, Recall, MRR, and nDCG values are only as meaningful
as the dataset they are measured against.
### Structure
An evaluation dataset consists of query-document pairs:
- **Query:** A natural-language question or search query
- **Expected Sources:** List of documents considered relevant
- **Must-Contain Keywords:** Keywords that must appear in the results
- **Reference Chunk Text:** Reference text for semantic comparison
### Quality Criteria
| Criterion | Description | Risk if Violated |
|-----------|-------------|------------------|
| Coverage | Queries cover various topics and difficulty levels | Blind spots for certain document types |
| Completeness | All relevant documents per query are annotated | Recall is measured too high — unannotated relevant docs are ignored |
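| Correctness | Only actually relevant documents are marked | Distorted metrics: wrongly marked documents count as hits when retrieved (inflated Precision) and as misses when not retrieved (deflated Recall) |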
| Size | Enough queries for statistically reliable results | Random fluctuations dominate the metrics |
| Negative Tests | Queries for which no relevant documents exist | System robustness is not tested |
### Our Dataset
`data/evaluation/aws_cert_eval_v1.json`: 53 queries (50 positive, 3 negative)
covering AWS Certification Docs. Validation uses three levels:
Source Matching (OR) Keyword Matching (OR) Semantic Similarity.
**Limitation:** 53 queries are sufficient for relative comparisons between
configurations, but too few for statistically significant statements about absolute
quality. Small changes to a few queries can shift metrics by 1-2 percentage points (pp).
---
## Retrieval Metrics
### Precision@k
Proportion of relevant documents among the top-k results.
**Formula:** `Precision@k = |relevant documents in top-k| / k`
**Example:** At k=10, 10 documents are returned. 8 of them are relevant.
Precision@10 = 8/10 = 0.80
**Interpretation:** High Precision = little noise in the results. Important when
every returned document serves as context for the LLM and irrelevant documents
can degrade answer quality.
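
A minimal sketch of the formula above, reproducing the example numbers. The function name and the document IDs are illustrative assumptions, not taken from the project code.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved document IDs that are annotated as relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


# Example from the text: 10 results returned, 8 of them relevant.
retrieved = [f"doc{i}" for i in range(1, 11)]   # hypothetical IDs doc1..doc10
relevant = {f"doc{i}" for i in range(1, 9)}     # doc1..doc8 are annotated as relevant
print(precision_at_k(retrieved, relevant, k=10))  # 0.8
```

Note that the denominator is always k, so a query with fewer than k relevant documents in the index can never reach 1.0.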
### Recall@k
Proportion of found relevant documents out of all relevant documents in the index.
**Formula:** `Recall@k = |relevant documents in top-k| / |all relevant documents|`
**Example:** There are 12 relevant documents in the index. At k=10, 9 of them are found.
Recall@10 = 9/12 = 0.75
**Interpretation:** High Recall = few relevant documents are missed. In RAG systems
particularly critical: what is not found cannot be compensated downstream.
Precision can be improved through reranking, Recall cannot.
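
The Recall formula can be sketched the same way. The function name, the IDs, and the convention of returning 0.0 for queries without annotated documents are assumptions made for this illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of all annotated relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0  # assumed convention for negative queries (nothing to find)
    found = len(set(retrieved[:k]) & set(relevant))
    return found / len(relevant)


# Example from the text: 12 relevant documents exist, 9 of them are in the top-10.
relevant = {f"rel{i}" for i in range(1, 13)}
retrieved = [f"rel{i}" for i in range(1, 10)] + ["other"]
print(recall_at_k(retrieved, relevant, k=10))  # 0.75
```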
### MRR (Mean Reciprocal Rank)
Average of the reciprocal rank of the first relevant result across all queries.
**Formula:** `MRR = (1/N) * sum(1/rank_i)` where `rank_i` is the rank of the first
relevant result for query i.
**Example:**
- Query 1: First relevant result at rank 1 → 1/1 = 1.0
- Query 2: First relevant result at rank 3 → 1/3 = 0.33
- Query 3: First relevant result at rank 1 → 1/1 = 1.0
- MRR = (1.0 + 0.33 + 1.0) / 3 = 0.78
**Interpretation:** MRR close to 1.0 means the first relevant result almost always
appears at rank 1. Particularly important for user-facing systems, where the user
expects the first result to be relevant.
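
As a small sketch of the MRR computation: the function below takes the 1-based rank of the first relevant result per query. Treating a query with no relevant result as a reciprocal rank of 0 is an assumption of this example.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over all queries; None means no relevant result was found."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    reciprocal = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


# Example from the text: first relevant results at ranks 1, 3 and 1.
print(round(mean_reciprocal_rank([1, 3, 1]), 2))  # 0.78
```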
### nDCG@k (Normalized Discounted Cumulative Gain)
Measures ranking quality: Are relevant documents ranked higher than irrelevant ones?
**Idea:** A relevant document at rank 1 is more valuable than at rank 10.
The "gain" is logarithmically discounted with increasing position.
Normalized to [0, 1] by dividing by the ideal ranking.
**Formula:**
```
DCG@k = sum(rel_i / log2(i + 1)) for i = 1..k
nDCG@k = DCG@k / IDCG@k (IDCG = DCG of perfect ranking)
```
**Example:** At k=5 with relevance labels [1, 0, 1, 1, 0]:
- DCG = 1/log2(2) + 0/log2(3) + 1/log2(4) + 1/log2(5) + 0/log2(6) = 1.0 + 0 + 0.5 + 0.43 + 0 = 1.93
- Ideal ranking [1, 1, 1, 0, 0]: IDCG = 1.0 + 0.63 + 0.5 + 0 + 0 = 2.13
- nDCG = 1.93 / 2.13 = 0.91
**Interpretation:** nDCG@k close to 1.0 means relevant documents are placed
near the top, in close to ideal order. Combines Precision (how many are relevant?)
with ranking quality (where are they placed?). Often used as the primary single
metric for retrieval quality.
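
The DCG and nDCG formulas above can be checked with a few lines of Python. This sketch assumes that the ideal ranking is obtained by sorting the labels of the returned list, which matches the worked example but ignores relevant documents that were not retrieved at all.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: rel_i / log2(i + 1) summed over ranks 1..k."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0


# Example from the text: relevance labels [1, 0, 1, 1, 0] at k=5.
labels = [1, 0, 1, 1, 0]
print(round(dcg_at_k(labels, 5), 2))   # 1.93
print(round(ndcg_at_k(labels, 5), 2))  # 0.91
```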
### Metrics Comparison
| Metric | Measures | Strength | Weakness |
|--------|----------|----------|----------|
| Precision@k | Proportion of relevant results | Simple, intuitive | Ignores ranking order |
| Recall@k | Coverage of all relevant docs | Important for RAG completeness | Ignores noise |
| MRR | Position of the first hit | Good for "need only 1 result" | Ignores all other hits |
| nDCG@k | Overall ranking quality | Best overall metric | More complex to interpret |
---
## Chunking
### Recursive Chunking
Splits text along natural boundaries: first paragraphs, then sentences, then words.
Attempts to preserve semantically coherent units.
**Advantages:** Respects document structure, fewer context breaks.
**Disadvantages:** Uneven chunk sizes.
### Fixed Chunking
Splits text into pieces of fixed length (in tokens), regardless of text structure.
**Advantages:** Even chunk sizes, more consistent embeddings.
**Disadvantages:** Can split in the middle of a sentence or paragraph.
### Overlap
Number of tokens that overlap between consecutive chunks.
Prevents relevant information from being lost at chunk boundaries.
**Typical values:** 10-20% of chunk size (e.g., 50 for chunk size 512).
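
To illustrate how chunk size and overlap interact, here is a minimal sliding-window sketch over a token list. It is not the project's chunker; the function name and the toy token list are assumptions.

```python
def fixed_chunks(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


tokens = list(range(10))  # stand-in for real token IDs
print(fixed_chunks(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk starts with the last token of its predecessor, so content at a boundary appears in both neighbors.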
---
## Embeddings
### Embedding Dimension
Length of the vector that the embedding model produces per text chunk.
- **768d** (e.g., BGE-base): Standard BERT size. Good for shorter texts.
- **1024d** (e.g., BGE-large, E5-large): More capacity for semantic nuances.
Benefits especially with longer chunks (1024+ tokens), as more context
needs to be encoded.
**Trade-offs:** Higher dimensions can improve retrieval quality but also increase
embedding latency (larger model), memory consumption in the vector index
(each vector occupies more bytes), and search latency (more dimensions per
distance computation). At 100k chunks: 768d ~ 300 MB, 1024d ~ 400 MB (float32, raw vectors only).
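
The memory figures follow from a simple multiplication. The sketch below counts raw vector data only; index overhead such as the HNSW graph is deliberately left out, and the function name is an assumption.

```python
def raw_vector_size_mb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors in MB (float32 = 4 bytes per dimension)."""
    return num_vectors * dim * bytes_per_value / 1_000_000


print(raw_vector_size_mb(100_000, 768))   # 307.2
print(raw_vector_size_mb(100_000, 1024))  # 409.6
```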
### BGE (BAAI General Embedding)
Embedding models from BAAI (Beijing Academy of Artificial Intelligence).
Variants: bge-base (768d), bge-large (1024d). Type prefix: "bge".
### E5 (EmbEddings from bidirEctional Encoder rEpresentations)
Embedding models from Microsoft Research. E5-large-v2 has 1024 dimensions.
Type prefix: "e5". Uses "query: "/"passage: " prefixes for asymmetric search.
---
## Vector Search
### HNSW (Hierarchical Navigable Small World)
Graph-based algorithm for Approximate Nearest Neighbor (ANN) search.
Standard index in ChromaDB, Qdrant, Weaviate.
**Key parameters:**
- **ef_construction:** Accuracy during index construction (higher = more accurate, slower)
- **M:** Number of connections per node (higher = more accurate, more memory)
### Cosine Similarity
Similarity measure between two vectors. Measures the angle, not the magnitude.
Values from -1 (opposite direction) to 1 (same direction). Standard metric for embedding search.
### Dot Product (Scalar Product)
Mathematical operation between two vectors: `a . b = sum(a_i * b_i)`.
For sparse vectors, only the shared non-zero positions are multiplied and
summed. In the context of BM25: the dot product between query sparse vector and
document sparse vector yields the BM25 score.
**Difference from Cosine Similarity:** Cosine normalizes by vector length (measures
direction), Dot Product also considers length (measures direction + magnitude).
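
A short sketch contrasting the two measures, plus a sparse dot product over the shared non-zero positions. Representing a sparse vector as a dict from index to value is an assumption of this example.

```python
import math


def dot(a, b):
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by both vector lengths: direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def sparse_dot(a, b):
    """Dot product of sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(value * b[index] for index, value in a.items() if index in b)


a, b = [1.0, 2.0], [2.0, 4.0]   # same direction, different magnitude
print(dot(a, b))                # 10.0, grows with magnitude
print(cosine(a, b))             # approximately 1.0, magnitude is normalized away
print(sparse_dot({3: 1.0, 7: 2.0}, {7: 0.5, 9: 4.0}))  # 1.0, only index 7 is shared
```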
---
## Search Strategies
### Dense Vector Search
Semantic search over embedding vectors. Each text is converted by an embedding model
into a dense vector (fully-populated vector, all dimensions have values).
Similar texts have similar vectors (measured via Cosine Similarity).
**Strengths:** Captures semantic similarity ("car" also finds "vehicle").
**Weaknesses:** Can miss exact keyword matches, especially with technical terms
or proper nouns not seen during training.
### Dense Vector
Fully-populated vector of fixed length (e.g., 1024 dimensions), where every position
has a float value. Produced by embedding models like E5-large-v2 or BGE.
Encodes the semantic meaning of the entire text.
**Storage:** 1024d * 4 bytes (float32) = 4 KB per vector. At 150k chunks: ~600 MB.
### Sparse Vector
Sparsely-populated vector with high dimensionality (vocabulary size, e.g., 50,000+),
but only a few non-zero entries (typically 20-100 per document). Stored as
pairs of `indices` (which positions) and `values` (which weights).
**Example:** A document with the terms "Lambda", "serverless", "function" has a
sparse vector with 3 non-zero entries at the positions of these terms in the vocabulary.
**Storage:** Only the non-zero entries are stored. At 50 terms per chunk:
50 * (4 + 4) bytes = 400 bytes per vector. At 150k chunks: ~60 MB.
### BM25 (Best Match 25)
Classic keyword ranking algorithm (Robertson et al., 1994). Scores documents
based on overlap with query terms, considering:
1. **Term Frequency (TF):** How often a query term appears in the document.
BM25 uses saturated TF — the benefit of each additional occurrence diminishes:
```
tf_component = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```
- `k1` (default 1.5): Controls TF saturation. Higher = more weight for
frequent terms. k1=0 ignores TF completely.
- `b` (default 0.75): Controls length normalization. b=1 normalizes fully,
b=0 ignores document length.
- `dl`: Document length (number of tokens)
- `avgdl`: Average document length in the corpus
2. **Inverse Document Frequency (IDF):** Weighting of rare terms. The rarer
a term in the entire corpus, the more informative it is:
```
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1)
```
- `N`: Total number of documents in the corpus
- `df(t)`: Number of documents containing term t
3. **BM25 Score** for a document d and query q:
```
BM25(d, q) = sum(IDF(t) * tf_component(t, d)) for all t in q
```
**Strengths:** Excellent for exact keyword matches, technical terms, proper nouns.
**Weaknesses:** No semantic understanding ("car" does not find "vehicle").
**In our system:** BM25 weights are stored as sparse vectors in Qdrant.
Document vectors contain the TF component (without IDF), query vectors contain
the IDF values. The dot product yields the BM25 score.
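
The split into document-side TF weights and query-side IDF weights can be sketched as follows. This is an illustration under simplifying assumptions, not the project's implementation: terms serve directly as dictionary keys instead of integer vocabulary indices, documents are pre-tokenized lists, and all function names are made up for the example.

```python
import math
from collections import Counter


def idf(term, docs):
    """IDF(t) = log((N - df + 0.5) / (df + 0.5) + 1)."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((n - df + 0.5) / (df + 0.5) + 1)


def doc_sparse_vector(doc, avgdl, k1=1.5, b=0.75):
    """Document side: saturated, length-normalized TF component per term (no IDF)."""
    length_norm = k1 * (1 - b + b * len(doc) / avgdl)
    return {term: tf * (k1 + 1) / (tf + length_norm) for term, tf in Counter(doc).items()}


def query_sparse_vector(query, docs):
    """Query side: one IDF weight per distinct query term."""
    return {term: idf(term, docs) for term in set(query)}


def bm25_score(query_vec, doc_vec):
    """Sparse dot product over the terms shared by query and document."""
    return sum(weight * doc_vec[term] for term, weight in query_vec.items() if term in doc_vec)


docs = [
    ["lambda", "serverless", "function"],
    ["s3", "bucket", "storage"],
    ["lambda", "function", "timeout", "lambda"],
]
avgdl = sum(len(doc) for doc in docs) / len(docs)
query_vec = query_sparse_vector(["lambda", "timeout"], docs)
scores = [bm25_score(query_vec, doc_sparse_vector(doc, avgdl)) for doc in docs]
print(scores)  # the third document matches both query terms and scores highest
```

Because the IDF values live only in the query vector, the document vectors depend on the corpus solely through `avgdl`.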
### Named Vectors (Qdrant)
Qdrant feature that allows multiple vector spaces per collection (sparse vectors are supported since v1.7).
Each vector space has a name and its own parameters (dimension, distance metric).
**Example of our hybrid collection:**
```
Collection: "recursive_1024_100__e5-large-v2__hybrid"
├── "dense": 1024-dim, Cosine Similarity (E5-large-v2 embeddings)
└── "bm25": Sparse Vector (BM25 TF weights)
```
Each document has both vectors. Queries can target a specific vector space
(`using="dense"` or `using="bm25"`) or combine both.
### Hybrid Search
Combination of Dense (semantic) and Sparse (keyword-based) search.
Goal: Combine the strengths of both approaches — semantic understanding (Dense)
plus exact keyword matches (Sparse).
**Process in our system:**
1. Send query to dense index → top 2×k results (twice the final k) with scores
2. Send query to sparse index → top 2×k results (twice the final k) with scores
3. Normalize scores (Min-Max to [0,1])
4. Weighted combination: `combined = α * dense + (1-α) * sparse`
5. Sort by combined score, return top-k
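
Steps 3 to 5 can be sketched in plain Python. The inputs stand in for the dense and sparse result lists as `{doc_id: score}` dicts. Giving a document that is missing from one list a normalized score of 0.0 on that side is an assumption of this sketch; the source does not state how that case is handled.

```python
def min_max(scores):
    """Scale scores to [0, 1]; if all scores are equal, every score becomes 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_fuse(dense_scores, sparse_scores, alpha, k):
    """combined = alpha * dense + (1 - alpha) * sparse, then top-k by combined score."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    combined = {
        doc_id: alpha * dense.get(doc_id, 0.0) + (1 - alpha) * sparse.get(doc_id, 0.0)
        for doc_id in set(dense) | set(sparse)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


dense_hits = {"a": 0.9, "b": 0.6, "c": 0.3}      # cosine similarities
sparse_hits = {"b": 30.0, "c": 10.0, "d": 0.0}   # BM25 scores
print(hybrid_fuse(dense_hits, sparse_hits, alpha=0.5, k=2))
# [('b', 0.75), ('a', 0.5)]
```

Document `b` wins at alpha=0.5 because it is ranked well by both methods, while `a` appears only in the dense list.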
### Alpha (Hybrid Weight)
Weighting parameter for Hybrid Search. Determines the ratio between
Dense (semantic) and Sparse (keyword) components.
| Alpha | Dense Share | Sparse Share | Description |
|-------|------------|-------------|-------------|
| 1.0 | 100% | 0% | Pure Dense (= pure semantic search) |
| 0.7 | 70% | 30% | Semantics-dominant |
| 0.5 | 50% | 50% | Balanced |
| 0.3 | 30% | 70% | Keyword-dominant |
| 0.0 | 0% | 100% | Pure Sparse (= pure BM25) |
**Intuition:** For specialized domains with specific terminology (like AWS certifications),
a keyword component can help find exact service names and technical terms more reliably,
while the semantic component covers conceptual similarities.
### Score Fusion
Merging results from two (or more) search methods into a unified ranking list.
**Alpha Blending (score-based):** Normalizes the scores of both methods to
[0,1] and weights them: `α * score_A + (1-α) * score_B`. Requires score normalization,
since Dense (0-1) and Sparse (0-30+) have different scales.
**RRF (Reciprocal Rank Fusion):** Rank-based alternative: `score = sum(1 / (k + rank_i))`,
where `k` is a smoothing constant (commonly 60), not the top-k cutoff.
No alpha parameter, treats all sources equally. Independent of score scales,
but no weighting possible in this basic form.
**Our choice:** Alpha Blending, because we want to experimentally control and
compare the influence of Dense vs. Sparse.
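
For comparison, a minimal sketch of RRF, the alternative that was not chosen. The constant `k=60` is the commonly used default and an assumption here; it is unrelated to the top-k cutoff.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "c", "a"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))  # ['b', 'a', 'c']
```

Only ranks enter the computation, so no score normalization is needed, but there is also no knob comparable to alpha.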
### Min-Max Normalization
Scaling values to the range [0, 1]:
```
normalized = (x - min) / (max - min)
```
Necessary for Score Fusion, since raw scores from different search methods
have different scales:
- Dense (Cosine Similarity): typically 0.3 - 0.95
- Sparse (BM25 Dot Product): typically 0 - 30+
After normalization, both are on [0, 1] and can be combined with weights.
**Edge case:** If all scores are equal (`max = min`), all are set to 1.0.
---
## Reranking
### Reranking (Second-Stage Scoring)
Post-processing of initial retrieval results with a more precise model.
Retrieval (Stage 1) quickly delivers a candidate list (e.g., top-100 from 150k),
the reranker (Stage 2) scores these candidates more accurately and re-sorts them.
**Impact on metrics:**
- Improves **Precision** (irrelevant results are pushed down)
- Improves **nDCG and MRR** (better ranking of relevant results)
- Does **not improve Recall** — what was not found in retrieval cannot be
added by the reranker. Therefore, over-retrieval (fetching more candidates
than finally needed) is critical.
### Bi-Encoder
Model that encodes query and document **independently** — one separate embedding each.
Similarity is computed via vector operations (e.g., Cosine Similarity).
**Advantages:** Document embeddings can be precomputed and stored in a vector index.
Search over 150k chunks in milliseconds (ANN algorithms).
**Disadvantages:** The independent encoding cannot capture fine interactions between
query and document terms.
**In our system:** E5-large-v2 (Phase 2) is a Bi-Encoder for retrieval.
### Cross-Encoder
Model that encodes query and document **jointly** — both are concatenated as a pair
and passed through a transformer. The model can directly model interactions
between query and document terms.
**Advantages:** Significantly more accurate than Bi-Encoders, since the model can see
which query terms appear in the document and in what context.
**Disadvantages:** Cost grows linearly with the number of candidates (O(n)): each
(query, document) pair requires a separate forward pass, so 100 candidates mean
100 forward passes per query, and nothing can be precomputed. Therefore only
usable as a reranker for pre-selected candidates, not for full-index search.
**Typical architecture:** `AutoModelForSequenceClassification` (e.g., BGE-reranker-v2-m3).
Input: `[CLS] query [SEP] document [SEP]` → Output: relevance score.
### Bi-Encoder vs. Cross-Encoder
The standard pattern in modern RAG systems:
```
Full Index (150k Chunks)
|
v Bi-Encoder (fast, sublinear via ANN)
Top-100 Candidates
|
v Cross-Encoder (accurate, O(n))
Top-5 Results
|
v LLM (Synthesis)
Answer
```
The Bi-Encoder quickly filters out the majority of irrelevant documents.
The Cross-Encoder optimizes the ranking of the remaining candidates.
This two-stage approach combines the speed of the Bi-Encoder with the
accuracy of the Cross-Encoder.
### Generative Reranker
Reranker based on a language model (e.g., Qwen2, Qwen3). Scores relevance
through token generation instead of sequence classification.
**Typical approach:** The model receives query + document as a chat prompt and
decides whether the document is relevant. The score is the probability of the
"yes" token relative to the "no" token.
**Difference from Cross-Encoder:** Cross-Encoders have a dedicated classification
head (linear layer). Generative rerankers use the language modeling head and
extract scores from token probabilities.
**Examples:** Mixedbread mxbai-rerank-v2 (Qwen2-based), Qwen3-Reranker.
### Over-Retrieval (retrieve_k)
Strategy of fetching more candidates from the index than finally needed,
to give the reranker a broader selection.
**Example:** For k=5 final results, retrieve_k=100 candidates are fetched.
The reranker scores all 100 and returns the best 5.
**Typical ratios:** 3-20x the final k. Too few candidates limit the reranker,
too many increase latency (especially with LLM-based rerankers).
**In our system:** `--retrieve-k 100` (CLI parameter in `run_eval.py`).
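
The control flow of over-retrieval can be sketched independently of any concrete model. The two callables below are toy stand-ins for the vector search and the reranker; their names and the scoring rule are purely hypothetical.

```python
def retrieve_and_rerank(query, retrieve, rerank_score, k=5, retrieve_k=100):
    """Fetch retrieve_k candidates, rescore each one, and keep the best k."""
    if retrieve_k < k:
        raise ValueError("retrieve_k should be at least k")
    candidates = retrieve(query, retrieve_k)  # Stage 1: fast candidate search
    ranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return ranked[:k]                          # Stage 2: reranked top-k


corpus = [f"chunk-{i}" for i in range(150)]


def toy_retrieve(query, n):
    """Stand-in for the vector search: returns the first n chunks."""
    return corpus[:n]


def toy_score(query, doc):
    """Stand-in for a reranker: an arbitrary score derived from the chunk number."""
    return int(doc.split("-")[1]) % 7


top = retrieve_and_rerank("example query", toy_retrieve, toy_score, k=5, retrieve_k=100)
print(top)  # five of the 100 candidates; chunks beyond the first 100 can never appear
```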
---
## Response Synthesis
### Response Synthesis
The LLM generates a natural-language answer from retrieved chunks. Different
modes determine how chunks are presented to the LLM. The mode affects
quality, latency, and token consumption.
### Response Mode: tree_summarize
Hierarchical summarization: Chunks are summarized pairwise, then the summaries
are summarized again until one final answer remains. Requires multiple LLM calls
(about n - 1 for n chunks, spread over O(log n) levels), but delivers the best
quality with many chunks, since every chunk is considered.
### Response Mode: refine
Iterative refinement: First chunk → first answer. Each subsequent chunk
refines the answer. N LLM calls (N = number of chunks). Good when the
order of chunks matters.
### Response Mode: compact
Fits as many chunks as possible into one prompt (up to context window limit).
Typically 1-2 LLM calls. Faster and cheaper, but less thorough with
many chunks.
### Temperature
Sampling parameter of the LLM. Controls how deterministic or creative the
answers are.
| Value | Effect | Recommendation |
|-------|--------|---------------|
| 0.0 | Deterministic (always the most likely token) | Factual answers |
| 0.3 | Slight variation | Good compromise |
| 0.7+ | Creative/random | Not recommended for RAG |
For RAG systems with factual answers: 0.0-0.3 recommended.
---
## Synthesis Metrics
### Faithfulness (LLM-as-Judge)
Measures whether the answer is based exclusively on the source chunks (no
fabrications). LLM-as-Judge: An evaluator LLM checks every claim in the
answer against the source texts.
**Values:** 0.0 (all claims fabricated) to 1.0 (all claims verifiable).
**Status:** Implemented in `LLMMetrics.faithfulness()`. Not invoked in Phase 6
(embedding-based metrics instead). Activatable from Phase 7 via
`--enable-faithfulness` flag. Uses Claude Haiku 4.5 as evaluator.
### LLM-as-Judge
Evaluation approach where one LLM assesses the quality of another LLM's output.
Scalable and automatable — no manual annotations needed.
**Risk:** Self-evaluation bias — a model rates itself more favorably.
Therefore: Use a different model as evaluator (e.g., Claude Haiku 4.5 evaluates
GPT-4o-mini outputs).
**In Phase 6:** LLM-as-Judge (Faithfulness) was planned but deferred in favor
of deterministic embedding-based metrics.
### Answer Relevance (Embedding-Based)
Semantic similarity between query and answer, measured as embedding
Cosine Similarity. A high value means the answer actually addresses the
question, not just contains related material.
**Method:** Query embedding vs. Response embedding via E5-large-v2.
**Type:** Deterministic, no LLM call.
### Hallucination Score (Embedding-Based)
Per-sentence embedding comparison of the answer against all source chunks. Each
sentence of the answer is encoded as an embedding and compared against the source
chunk embeddings. Sentences with low similarity to all sources are potentially
hallucinated.
**Values:** 0.0 (strong hallucination) to 1.0 (no hallucination detected).
**Type:** Deterministic, no LLM call. Complementary to Faithfulness — measures
the same aspect (source fidelity), but embedding-based instead of LLM-based.
### Keyword Coverage (Rule-Based, Custom)
Proportion of expected keywords (`must_contain_keywords` from the eval dataset)
in the generated answer. Custom metric for this project, since the eval dataset
has no reference answers but defines keywords per query.
**Formula:** `keyword_coverage = |found keywords| / |expected keywords|`
**Method:** Case-insensitive substring match.
**Example:** Keywords = ["SageMaker", "machine learning", "managed service"].
Answer contains "SageMaker" and "machine learning" → Coverage = 2/3 = 0.67.
**Context:** More practical than purely semantic metrics, because concrete facts
are checked. At the same time limited: Synonyms and paraphrases are not
recognized ("fully managed" does not match "managed service").
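
A minimal sketch of the metric as described (case-insensitive substring match). The function name, the sample answer, and the return value of 0.0 for an empty keyword list are assumptions.

```python
def keyword_coverage(answer, expected_keywords):
    """Share of expected keywords that occur in the answer (case-insensitive substring)."""
    if not expected_keywords:
        return 0.0  # assumed convention when a query defines no keywords
    text = answer.lower()
    found = [kw for kw in expected_keywords if kw.lower() in text]
    return len(found) / len(expected_keywords)


keywords = ["SageMaker", "machine learning", "managed service"]
answer = "Amazon SageMaker is a fully managed machine learning platform."
print(round(keyword_coverage(answer, keywords), 2))  # 0.67
```

The sample answer says "fully managed", which does not contain the substring "managed service", so the third keyword is missed exactly as described above.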
### Source Attribution (Embedding-Based)
Measures which source chunks actually contributed to the answer.
Embedding similarity between the answer and each source chunk.
**Values:** `attribution_rate` from 0.0 (no sources used) to 1.0 (all sources
contributed). A high value means the LLM effectively uses the provided sources
rather than relying on its own knowledge.
**Type:** Deterministic, no LLM call.
### Latency Percentiles (p50, p95, p99)
Statistical distribution measures for response times across all queries:
| Percentile | Meaning |
|------------|---------|
| p50 (Median) | 50% of queries are faster |
| p95 | 95% of queries are faster (tail latency) |
| p99 | 99% of queries are faster (worst case) |
**Why not just avg?** The average obscures outliers. p95/p99 show how slow the
slowest queries are, which is what users notice most. For user-facing systems,
p95 is often a more relevant metric than avg.
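
The difference between average and percentiles shows up quickly on skewed data. This sketch uses the standard library's `statistics.quantiles` with its default interpolation method; the latency values are made up for illustration.

```python
import statistics


def latency_summary(latencies_ms):
    """Average plus p50/p95/p99, taken from the 99 percentile cut points."""
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        "avg": statistics.mean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# 95 fast queries and 5 slow outliers (hypothetical values in milliseconds).
latencies = [100] * 95 + [2000] * 5
summary = latency_summary(latencies)
print(summary)
```

Here the average (195 ms) describes neither the typical query (p50 = 100 ms) nor the slow tail (p99 = 2000 ms).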
### Quality Composite Score
Weighted score from all synthesis metrics. Normalized to [0, 1].
**Weights (with Faithfulness):**
- Answer Relevance: 35%
- Keyword Coverage: 25%
- Source Attribution: 20%
- Hallucination Penalty (1 - halluc_score): 10%
- Faithfulness: 10%
**Without Faithfulness:** Relevance 40%, Keywords 25%, Attribution 20%, Halluc 15%.
### Cost Efficiency
Ratio of quality to cost: `quality_score / total_cost_usd`.
Higher = better price-performance ratio.
---
## LLM Hosting
### vLLM
Open-source LLM inference engine. Hosts open-source models on your own GPU
infrastructure with an OpenAI-compatible API. Instead of token-based billing,
you only pay for GPU compute time.
**Key Features:**
- PagedAttention for efficient KV-Cache management
- Continuous Batching for high throughput
- Tensor Parallelism for large models across multiple GPUs
- OpenAI-compatible API (`/v1/chat/completions`)
**In our system:** vLLM runs as a Kubernetes Deployment on GPU nodes.
The model is configured via `--served-model-name` and accessible through
an internal service (`vllm-svc.ml-models.svc.cluster.local:8000`).
**Important:** The service is named `vllm-svc` (not `vllm`). For a service named
`vllm`, Kubernetes automatically injects the environment variable
`VLLM_PORT=tcp://...` into pods, which conflicts with vLLM's own `VLLM_PORT` env var.
### GPU-Time Pricing
Cost model for self-hosted LLMs. Instead of token-based billing (like OpenAI),
the GPU usage duration is charged:
```
cost = latency_ms / 1000 / 3600 * gpu_hourly_rate
```
**Example:** 500ms inference on g6.xlarge ($0.98/h):
`0.5 / 3600 * 0.98 = $0.000136` per query
**Advantage:** Costs scale with inference time, not token count.
Large prompts are not more expensive than small ones (at the same inference time).
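
The cost formula translates directly into code. The function name is an assumption; the numbers are those of the example above.

```python
def gpu_cost_usd(latency_ms, gpu_hourly_rate):
    """cost = latency_ms / 1000 / 3600 * gpu_hourly_rate."""
    return latency_ms / 1000 / 3600 * gpu_hourly_rate


# 500 ms inference at $0.98 per GPU hour.
print(round(gpu_cost_usd(500, 0.98), 6))  # 0.000136
```

The formula bills only the time a query occupies the GPU; idle GPU hours are not included.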
### Tensor Parallelism (TP)
Distribution of a model across multiple GPUs by splitting weight matrices along
columns/rows. Each GPU holds only a part of the model.
**When needed:** When the model does not fit on a single GPU.
- 70B-Q4 (~35GB) does not fit on 1x A10G (24GB), but on 4x A10G (96GB) with TP=4
- 27B/32B at INT4 (~14-16GB, see the table under Quantization) fits on 1x L40S (48GB) without TP
**Overhead:** TP requires inter-GPU communication at every transformer layer.
1x L40S without TP is therefore often faster than 4x A10G with TP=4 for models
that fit on a single GPU.
### Quantization (AWQ, GPTQ)
Reduction of weight precision from BF16/FP16 (16 bit) to INT4 (4 bit).
Cuts the VRAM needed for weights to roughly a quarter with minimal quality loss.
**AWQ (Activation-aware Weight Quantization):**
Identifies important weight channels from the activation distribution and
protects them by scaling before quantization, which reduces quantization error
without mixed precision. Natively supported by vLLM.
**GPTQ (GPT Quantization):**
Older method, quantizes weights per layer with error correction.
Also natively supported by vLLM.
**TorchAO:**
PyTorch-native quantization format. Requires `torchao>=0.10.0` — not
pre-installed in vLLM v0.14. Models with TorchAO quantization (e.g.,
`pytorch/gemma-3-27b-it-AWQ-INT4`) do not work out-of-the-box.
**Practical experience from Phase 7:**
| Model | BF16 VRAM | Quantized | VRAM (est.) | GPU |
|-------|-----------|-----------|-------------|-----|
| Phi-4 (14B) | ~28GB | AWQ-INT4 | ~8GB | L4 (24GB) |
| Gemma-3-27B | ~54GB | GPTQ-INT4 | ~14GB | L40S (48GB) |
| Qwen3-32B | ~64GB | AWQ | ~16GB | L40S (48GB) |
Without quantization, mid-tier models (27B, 32B) do not fit on the single GPUs used here (24-48GB).
Quantization is not optional in practice, but a prerequisite.
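
The BF16 column of the table, and the rough size of the INT4 estimates, follow from parameter count times bits per weight. The sketch below counts weights only; KV cache, activations, and quantization metadata come on top, which is one reason the table's estimates are slightly higher. The function name is an assumption.

```python
def weight_vram_gb(params_billion, bits_per_weight):
    """Approximate VRAM for model weights only, in GB."""
    return params_billion * bits_per_weight / 8


for name, params in [("Phi-4", 14), ("Gemma-3-27B", 27), ("Qwen3-32B", 32)]:
    bf16 = weight_vram_gb(params, 16)
    int4 = weight_vram_gb(params, 4)
    print(f"{name}: BF16 ~{bf16:.1f} GB, INT4 ~{int4:.1f} GB")
```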
### Thinking Mode (Qwen3)
Qwen3 models generate `<think>...</think>` reasoning tokens by default
before the actual answer. This leads to massively increased latency
(~21s/query for 8B on L4 instead of expected ~2-3s).
**Impact on quality:** Despite high latency, Qwen3 delivers the best
Keyword Coverage (0.929 for 32B-AWQ) — the thinking tokens appear to
actually improve answer quality.
**`--override-generation-config '{"enable_thinking": false}'` did NOT fix
the issue in our tests.** Latency remained unchanged.
---
## End-to-End Evaluation (Phase 8)
### Naive Baseline
The intentionally unoptimized reference pipeline, serving as the starting point.
All components at default/beginner settings:
- Chunking: `fixed_256_25` (fixed 256-token chunks, 25 token overlap)
- Embedding: `bge-base-en-v1.5` (768 dimensions)
- VectorDB: ChromaDB (in-process)
- Search: Dense
- Reranking: None
### Per-Layer Impact Ranking
Method for quantifying each optimization layer's contribution.
Compares retrieval metrics (MRR, nDCG, Precision, Recall) before and
after each phase optimization. Sorted by absolute delta of the
primary metric (e.g., MRR).
### Cost-at-Volume Modelling
Cost projection for different query volumes (100/1k/10k/100k per month):
- **API costs:** `cost_per_query * volume` (from Phase 7 eval results)
- **GPU costs:** `(avg_ms / 3_600_000) * gpu_hourly_rate * volume`
Insight from Phase 8: GPT-4o-mini ($0.00025/query) is cheaper per query
than self-hosted Qwen3-8B on L4 ($0.006/query at 21s latency due to Thinking Mode).
The GPU advantage lies in data privacy and quality, not in cost.
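
The projection can be reproduced with the two formulas above. The per-query API price and the 21s latency come from this section; the GPU rate of $0.98/h is taken from the GPU-Time Pricing example and is assumed to apply to the L4 setup here. Function names are illustrative.

```python
def api_cost(cost_per_query, volume):
    """API costs: cost_per_query * volume."""
    return cost_per_query * volume


def gpu_cost(avg_ms, gpu_hourly_rate, volume):
    """GPU costs: (avg_ms / 3_600_000) * gpu_hourly_rate * volume."""
    return (avg_ms / 3_600_000) * gpu_hourly_rate * volume


for volume in (100, 1_000, 10_000, 100_000):
    api = api_cost(0.00025, volume)        # GPT-4o-mini, per-query price from Phase 7
    gpu = gpu_cost(21_000, 0.98, volume)   # Qwen3-8B at 21s/query, assumed $0.98/h
    print(f"{volume:>7} queries/month: API ${api:,.2f} vs GPU ${gpu:,.2f}")
```

At 21s per query, the GPU cost per query is about $0.0057, which matches the rounded $0.006 quoted above.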