3,528 documents available
Before launching a large-scale training or tuning run, verify the following gates are closed.
STATUS: Non-authoritative research notes. Superseded by architecture_source_of_truth.md and architecture_decisions_and_naming.md wherever they conflict.
**Project:** ChatBot Application with LLM and RAG Integration
**The most effective code indexing systems combine hierarchical LLM-generated summaries with AST structural data and vector embeddings through hybrid retrieval—achieving up to 80% codebase reduction while maintaining high accuracy for AI coding agents.** Leading tools like Cursor, Sourcegraph Cody, and Continue.dev demonstrate that no single retrieval method suffices; production systems require semantic search, keyword matching, and structural queries working together. For evaluation, the field
**Authors:** [Author Names]
> This is the most practically important section for your current MLOps → AI Engineer transition. RAG powers 80%+ of enterprise LLM applications.
> Design document analyzing how user actions feed back into ML predictions,
title: lib-ai-app-community-rag
Welcome back everyone. Today we're jumping into something really fundamental for anyone building autonomous systems.
**Version:** 3.1.0-Hybrid
Undergraduate thesis, USAT (Escuela de Ingeniería de Sistemas y Computación). Intelligent tutoring system (STI) with private RAG for the **Aplicaciones Móviles** (Mobile Applications) course (Android/Kotlin) at IESTP "República Federal de Alemania", Chiclayo.
**Purpose:** Establish the problem, introduce the solution, build immediate credibility
Replace "sounds right" with "passes checks."
**What does this module do?**
layer: 06_ai_engineering
Evaluation is how you know whether a model is **fit for deployment** and whether a **new checkpoint** actually improves the behaviors you care about. Unlike classic supervised learning with a single held-out label distribution, LLMs are judged on **open-ended generation**, **multi-turn dialogue**, **tool use**, and **subjective** qualities like helpfulness. Without disciplined benchmarks, teams ship models that ace **proxy metrics** while failing **real users**—or regress silently when data mixt
title: "Hello Agents, Chapter 12: Is Your Agent Really Any Good? A Complete Guide to Agent Evaluation Systems"
name: 'step-08-llm-evaluator'
> **TL;DR**: You can't improve what you can't measure, and measuring LLM quality is genuinely hard. Build an eval pipeline before you build your AI product, not after. Start with a 50-100 example golden dataset and a simple LLM-as-judge setup. Vibes-based evaluation is how products ship regressions silently.
**Authors**: Daniel Commey
Sources: Huyen (AI Engineering, ch. 3–4, 10), Brousseau & Sharp (LLMs in Production), Pydantic Evals documentation analysis, RAGAS framework documentation analysis, 2025–2026 production patterns
> **Last updated**: 2026-02-20
Source: https://potpie.ai/blog/the-agent-evaluation-gap
title: "Model Evaluation"