My First RAG System Had No Evals. 40% of Answers Were Wrong. — DeepSeek Blog | Neura Market

    My First RAG System Had No Evals. 40% of Answers Were Wrong.

    Serhii Panchyshyn April 13, 2026

    How to actually measure and improve your RAG pipeline. Retrieval metrics, synthetic evals, hybrid search, and the fixes

When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality. Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.

So I started measuring everything. And the first thing I discovered: **most RAG failures aren't LLM failures. They're retrieval failures.**

The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.

Here's what I've learned about measuring and fixing RAG systems after building them for B2B SaaS companies.

---

## The metric that actually matters: Recall@k

Before I measure anything else on a new RAG system, I measure **Recall@k**.

Recall@k answers a simple question: "Of all the documents that *should* have been retrieved, what percentage actually made it into the top k results?"

```python
def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """What fraction (0.0 to 1.0) of relevant docs are in the top k results?"""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    if not relevant:
        # Nothing to find: count the query as a perfect score by convention.
        return 1.0
    return len(top_k & relevant) / len(relevant)
```

On systems I've audited, Recall@10 is often around 60%. When each question has a single relevant document, that means 40% of the time the document that could answer the question isn't even in the context. The LLM never had a chance.
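To turn the per-query function into a single number you can track, one option is to average it over an eval set. The sketch below is a minimal illustration, not a fixed recipe: `mean_recall_at_k`, the `eval_set` layout, and the chunk IDs are all hypothetical, and `recall_at_k` is repeated only so the snippet runs on its own.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Same definition as in the article, repeated so this snippet is self-contained."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(eval_set, k):
    """Average Recall@k over a list of (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in eval_set]
    return sum(scores) / len(scores)


# Hypothetical toy data: three queries, one relevant chunk each.
eval_set = [
    (["c1", "c7", "c3"], ["c1"]),  # relevant chunk at rank 1
    (["c4", "c2", "c9"], ["c9"]),  # relevant chunk at rank 3
    (["c5", "c6", "c8"], ["c0"]),  # relevant chunk never retrieved
]

print(mean_recall_at_k(eval_set, k=3))  # two of the three queries hit
print(mean_recall_at_k(eval_set, k=2))  # only the first query hits
```

Comparing the same eval set at different values of k also shows whether the right chunks are being retrieved but ranked too low, which is a different problem from not being retrieved at all.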
Here's the math that drives everything:

**P(correct answer) ≈ P(correct context retrieved)**

If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.

---

## You can start measuring today

You don't need production traffic to build evals. Generate synthetic test data from your corpus:

```python
def generate_synthetic_evals(chunks: list) -> list:
    """Generate question/chunk_id pairs from your chunks."""
    # `llm` and `parse_json` stand in for your own LLM client and JSON parser.
    eval_pairs = []
    for chunk in chunks:
        response = llm.generate(f"""
            Generate 3 questions that this text can answer.
            Make them specific. "What is this about?" doesn't test retrieval.

            Text: {chunk.text}

            Return JSON: [{{"question": "...", "chunk_id": "{chunk.id}"}}]
        """)
        eval_pairs.extend(parse_json(response))
    return eval_pairs
```

50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number. Now you can actually tell if changes help.

---

## The two fixes that consistently move the needle

I've tried a lot of retrieval improvements. Most make marginal differences. Two consistently deliver results.

### Fix 1: Hybrid search

Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.

But embeddings are weak on:

- **Numbers**: They don't understand that 49 is close to 50
- **Exact match**: Product codes, IDs, ticker symbols
- **Rare terms**: Domain jargon not in the training data

BM25 (keyword search) catches what embeddings miss.
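To make the keyword side concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is a textbook-style formula, not any particular library's implementation. The whitespace tokenizer, the `k1` and `b` defaults, the function name `bm25_scores`, and the toy documents (including the product code `SKU-4417`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the first doc contains the exact product code.
docs = [
    "Pricing for SKU-4417 starts at 49 dollars per seat",
    "Steps to recover account access after a lockout",
    "Pricing overview for all enterprise plans",
]
scores = bm25_scores("SKU-4417 pricing", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

The rare token `sku-4417` appears in a single document, so it carries most of the score, and a document sharing no query terms scores zero. A production system would normally use a search engine or an established BM25 library rather than this loop.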
Combine the two with Reciprocal Rank Fusion (RRF), which scores each document by summing 1 / (rrf_k + rank) over the result lists it appears in, where rank starts at 1:

```python
def hybrid_search(query: str, k: int = 10) -> list:
    """Combine embedding search and BM25 using RRF."""
    # `embedding_index` and `bm25_index` stand in for your own indexes;
    # each search is assumed to return doc IDs ordered best-first.
    embedding_results = embedding_index.search(query, k=20)
    bm25_results = bm25_index.search(query, k=20)

    # Reciprocal Rank Fusion
    scores = {}
    rrf_k = 60
    for results in (embedding_results, bm25_results):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank)

    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]
```

Typical improvement: **5-15% recall boost** depending on query mix.

### Fix 2: Add a reranker

Embedding models are bi-encoders. They encode query and documents separately, then compare. Fast, but imprecise.

Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:

```python
def search_with_rerank(query: str, k: int = 5) -> list:
    """Retrieve broadly, then rerank precisely."""
    # Cast a wide net
    candidates = hybrid_search(query, k=20)

    # Rerank with cross-encoder
    # (`reranker` and `get_content` stand in for your model and document store.)
    pairs = [(query, get_content(doc_id)) for doc_id in candidates]
    scores = reranker.score(pairs)

    # Return top k after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Typical improvement: **another 5-10%** on top of hybrid search.

Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."

---

## Chunking decisions that make or break retrieval

Your chunking strategy matters more than your embedding model choice. A few things I always check:

### The "it" problem

Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.
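One rough way to gauge how common this is in a corpus is to flag chunks whose first word points back at earlier text. The sketch below is only a heuristic under stated assumptions: chunks are plain strings, and the `DANGLING_OPENERS` word list, the function name, and the sample chunks are hypothetical and would need tuning for real content.

```python
# Hypothetical list of openers that usually refer back to earlier text.
DANGLING_OPENERS = {"it", "this", "that", "these", "those", "they"}


def starts_with_dangling_reference(chunk_text):
    """Return True if the chunk's first word is a pronoun or demonstrative."""
    words = chunk_text.strip().split()
    if not words:
        return False
    # Drop leading quotes or brackets before comparing.
    first = words[0].strip("\"'([").lower()
    return first in DANGLING_OPENERS


# Hypothetical chunks for illustration.
chunks = [
    "It also supports single sign-on.",
    "This feature allows bulk export.",
    "The billing API supports single sign-on.",
]
flagged = [c for c in chunks if starts_with_dangling_reference(c)]
print(f"{len(flagged)} of {len(chunks)} chunks start with a dangling reference")
```

A high share of flagged chunks suggests the splitter is cutting paragraphs away from the sentences that give them meaning.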
**Fix: Prepend context to every chunk.**

```python
def chunk_with_context(doc) -> list:
    """Split each section into chunks that carry their document and section titles."""
    chunks = []
    for section in doc.sections:
        # Prepend document and section info
        context = f"Document: {doc.title}\nSection: {section.header}\n\n"
        # `split_section` stands in for your own text splitter.
        for chunk_text in split_section(section.content):
            chunks.append({
                "content": context + chunk_text,
                "metadata": {
                    "doc_title": doc.title,
                    "section": section.header,
                },
            })
    return chunks
```

### Other chunking rules I follow

1. **Never split mid-table.** A row without headers is meaningless.
2. **10-20% overlap** between consecutive chunks.
3. **Test multiple chunk sizes** (256, 512, 1024 tokens). Optimal depends on your queries.

---

## The workflow I use on every RAG project

**Week 1-2: Establish baseline**

1. Parse documents (test multiple parsers for PDFs)
2. Chunk with context headers
3. Generate 50-100 synthetic eval questions
4. Build basic retriever
5. Measure Recall@10
6. Write down the number

**Week 2-4: Apply standard fixes**

7. Add hybrid search (BM25 + embeddings)
8. Add reranker
9. Measure again
10. Compare to baseline

**Week 4+: Debug specific failures**

11. Break down recall by query type
12. Find worst-performing segment
13. Fix that segment
14. Measure again

The key: measure after every change. If you can't see improvement in numbers, you're guessing.

---

## When to measure answer quality

Only after retrieval is solid. Once Recall@10 is above 80%, start measuring end-to-end:

```python
def eval_answer(question: str, answer: str, context: list) -> dict:
    """Use LLM-as-judge for answer evaluation."""
    # `llm`, `format_context`, and `parse_json` stand in for your own helpers.
    result = llm.generate(f"""
        Evaluate this answer. Return JSON:
        - correct: true/false (factually accurate)
        - grounded: true/false (supported by the context)
        - complete: true/false (addresses the full question)

        Context: {format_context(context)}
        Question: {question}
        Answer: {answer}
    """)
    return parse_json(result)
```

But if retrieval is broken, this eval is noise.
You're just measuring how well your LLM fills in gaps it shouldn't have to fill.

---

## The takeaway

RAG quality is retrieval quality. Before you touch your prompts:

1. Generate synthetic evals from your corpus
2. Measure Recall@10
3. Add hybrid search
4. Add a reranker
5. Fix your chunking
6. Measure again

The fixes are straightforward. The impact is not.

---

*This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.*

---

## About me

I help B2B SaaS companies ship production AI in 6 weeks.

If you're building RAG and want a second set of eyes, I do free AI Teardowns: a 30-45 min video showing exactly where your pipeline is breaking and how to fix it. No pitch. Just clarity.

https://animanovalabs.com

    Tags

    ai, rag, machinelearning, programming
