My First RAG System Had No Evals. 40% of Answers Were Wrong.

---
title: My First RAG System Had No Evals. 40% of Answers Were Wrong.
published: true
description: How to actually measure and improve your RAG pipeline. Retrieval metrics, synthetic evals, hybrid search, and the fixes
tags: ai, rag, machinelearning, programming
cover_image: https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tv8stu1zwsjzvku972y2.png
# Use a ratio of 100:42 for best results.
# published_at: 2026-04-13 16:00 +0000
---

When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality.

Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.

So I started measuring everything. And the first thing I discovered: **most RAG failures aren't LLM failures. They're retrieval failures.**

The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.

Here's what I've learned about measuring and fixing RAG systems after building them for B2B SaaS companies.

---

## The metric that actually matters: Recall@k

Before I measure anything else on a new RAG system, I measure **Recall@k**.

Recall@k answers a simple question: "Of all the documents that *should* have been retrieved, what percentage actually made it into the top k results?"

```python
def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """What % of relevant docs are in the top k results?"""
    # Only the first k retrieved IDs count toward the score
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    # Nothing to find means nothing was missed
    if not relevant:
        return 1.0
    return len(top_k & relevant) / len(relevant)
```

On systems I've audited, Recall@10 is often around 60%. When each question has a single relevant document, that means 40% of the time the document that could answer the question isn't even in the context. The LLM never had a chance.

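
To turn the per-query function into a single number for a whole eval set, a minimal sketch could look like the following. The eval item shape (`question`, `relevant_ids`), the `mean_recall_at_k` helper, and the dict-backed fake retriever are all assumptions for illustration, not part of any particular library.

```python
from typing import Callable


def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """Fraction of relevant docs that appear in the top k results."""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    eval_set: list[dict],
    retrieve: Callable[[str], list],
    k: int = 10,
) -> float:
    """Average Recall@k over all eval questions.

    `retrieve` is a hypothetical callable: question in, ranked doc IDs out.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    total = 0.0
    for item in eval_set:
        retrieved = retrieve(item["question"])
        total += recall_at_k(retrieved, item["relevant_ids"], k)
    return total / len(eval_set)


# Toy data: a fake retriever backed by a dict (illustrative only)
fake_results = {
    "How do I reset my password?": ["c7", "c2", "c9"],
    "What is the refund window?": ["c4", "c1", "c8"],
}
eval_set = [
    {"question": "How do I reset my password?", "relevant_ids": ["c2"]},
    {"question": "What is the refund window?", "relevant_ids": ["c5"]},
]

score = mean_recall_at_k(eval_set, lambda q: fake_results[q], k=3)
print(f"Recall@3 = {score:.2f}")
```

The first question finds its chunk and the second doesn't, so the toy score is 0.50. Swap the lambda for your real retriever and keep the eval set fixed between runs so the numbers stay comparable.
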
Here's the math that drives everything:

**P(correct answer) ≈ P(correct context retrieved)**

If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.

---

## You can start measuring today

You don't need production traffic to build evals. Generate synthetic test data from your corpus:

```python
def generate_synthetic_evals(chunks: list) -> list:
    """Generate (question, source chunk_id) pairs from your chunks."""
    # `llm` and `parse_json` are placeholders for your own client and JSON parser
    eval_pairs = []

    for chunk in chunks:
        response = llm.generate(f"""
        Generate 3 questions that this text can answer.
        Make them specific. "What is this about?" doesn't test retrieval.

        Text: {chunk.text}

        Return JSON: [{{"question": "...", "chunk_id": "{chunk.id}"}}]
        """)
        eval_pairs.extend(parse_json(response))

    return eval_pairs
```

50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number.

Now you can actually tell if changes help.

---

## The two fixes that consistently move the needle

I've tried a lot of retrieval improvements. Most make marginal differences. Two consistently deliver results.

### Fix 1: Hybrid search

Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.

But embeddings are weak on:

- **Numbers**: They don't understand that 49 is close to 50
- **Exact match**: Product codes, IDs, ticker symbols
- **Rare terms**: Domain jargon not in the training data

BM25 (keyword search) catches what embeddings miss.
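
To make the keyword side concrete, here is a minimal BM25 sketch over a toy corpus. It uses the common Okapi BM25 formula with `k1=1.5` and `b=0.75`. The whitespace tokenizer, the `BM25` class, and the sample documents are assumptions for illustration; in a real system you'd more likely use your search engine's built-in BM25.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase + whitespace split; real systems use a proper analyzer
    return text.lower().split()


class BM25:
    """Tiny in-memory BM25 index (illustrative, not production code)."""

    def __init__(self, docs: dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        # Term frequencies and lengths per document
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.doc_len = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avg_len = sum(self.doc_len.values()) / len(docs)
        # Document frequency: in how many docs each term appears
        df = Counter()
        for counts in self.tf.values():
            df.update(counts.keys())
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, doc_id: str) -> float:
        counts = self.tf[doc_id]
        # Length normalization: longer docs need more matches to score the same
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = counts.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query: str, k: int = 10) -> list[str]:
        ranked = sorted(self.tf, key=lambda d: self.score(query, d), reverse=True)
        return ranked[:k]


docs = {
    "a": "Pricing for plan SKU-4921 starts at 49 dollars per month",
    "b": "Steps to recover account access when you are locked out",
    "c": "Pricing overview for all plans and billing cycles",
}
index = BM25(docs)
results = index.search("SKU-4921 pricing", k=2)
print(results)
```

The product code appears in only one document, so that rare exact token carries most of the score and document `a` ranks first. That is the kind of match an embedding-only search can miss.
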
Combine them:

```python
def hybrid_search(query: str, k: int = 10) -> list:
    """Combine embedding search and BM25 using RRF."""
    embedding_results = embedding_index.search(query, k=20)
    bm25_results = bm25_index.search(query, k=20)

    # Reciprocal Rank Fusion
    scores = {}
    rrf_k = 60

    for rank, doc_id in enumerate(embedding_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    ranked = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return ranked[:k]
```

Typical improvement: **5-15% recall boost** depending on query mix.

### Fix 2: Add a reranker

Embedding models are bi-encoders. They encode query and documents separately, then compare. Fast, but imprecise.

Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:

```python
def search_with_rerank(query: str, k: int = 5) -> list:
    """Retrieve broadly, then rerank precisely."""
    # Cast a wide net
    candidates = hybrid_search(query, k=20)

    # Rerank with cross-encoder
    pairs = [(query, get_content(doc_id)) for doc_id in candidates]
    scores = reranker.score(pairs)

    # Return top k after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in ranked[:k]]
```

Typical improvement: **another 5-10%** on top of hybrid search.

Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."

---

## Chunking decisions that make or break retrieval

Your chunking strategy matters more than your embedding model choice. A few things I always check:

### The "it" problem

Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.

**Fix: Prepend context to every chunk.**

```python
def chunk_with_context(doc) -> list:
    chunks = []

    for section in doc.sections:
        # Prepend document and section info
        context = f"Document: {doc.title}\nSection: {section.header}\n\n"

        for chunk_text in split_section(section.content):
            chunks.append({
                "content": context + chunk_text,
                "metadata": {
                    "doc_title": doc.title,
                    "section": section.header
                }
            })

    return chunks
```

### Other chunking rules I follow

1. **Never split mid-table.** A row without headers is meaningless.
2. **10-20% overlap** between consecutive chunks.
3. **Test multiple chunk sizes** (256, 512, 1024 tokens). Optimal depends on your queries.

---

## The workflow I use on every RAG project

**Week 1-2: Establish baseline**

1. Parse documents (test multiple parsers for PDFs)
2. Chunk with context headers
3. Generate 50-100 synthetic eval questions
4. Build basic retriever
5. Measure Recall@10
6. Write down the number

**Week 2-4: Apply standard fixes**

7. Add hybrid search (BM25 + embeddings)
8. Add reranker
9. Measure again
10. Compare to baseline

**Week 4+: Debug specific failures**

11. Break down recall by query type
12. Find worst-performing segment
13. Fix that segment
14. Measure again

The key: measure after every change. If you can't see improvement in numbers, you're guessing.

---

## When to measure answer quality

Only after retrieval is solid.

Once Recall@10 is above 80%, start measuring end-to-end:

```python
def eval_answer(question: str, answer: str, context: list) -> dict:
    """Use LLM-as-judge for answer evaluation."""
    result = llm.generate(f"""
    Evaluate this answer. Return JSON:
    - correct: true/false (factually accurate)
    - grounded: true/false (supported by the context)
    - complete: true/false (addresses the full question)

    Context: {format_context(context)}
    Question: {question}
    Answer: {answer}
    """)
    return parse_json(result)
```

But if retrieval is broken, this eval is noise.
You're just measuring how well your LLM fills in gaps it shouldn't have to fill.

---

## The takeaway

RAG quality is retrieval quality. Before you touch your prompts:

1. Generate synthetic evals from your corpus
2. Measure Recall@10
3. Add hybrid search
4. Add a reranker
5. Fix your chunking
6. Measure again

The fixes are straightforward. The impact is not.

---

*This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.*

---

## About me

I help B2B SaaS companies ship production AI in 6 weeks.

If you're building RAG and want a second set of eyes, I do free AI Teardowns: a 30-45 min video showing exactly where your pipeline is breaking and how to fix it. No pitch. Just clarity.

{% embed https://animanovalabs.com %}
