3,528 documents available
Replace "sounds right" with "passes checks."
Sources: Huyen (AI Engineering, ch. 3–4, 10), Brousseau & Sharp (LLMs in Production), Pydantic Evals documentation analysis, RAGAS framework documentation analysis, 2025–2026 production patterns
> Research compiled February 2026 for the **aiai** self-improving AI infrastructure project.
document_id: AIDHA-TASK-004
Summary of a single run of both pipelines for presentation. Re-run the evals to refresh numbers.
phase: 13-retrieval-evaluation-framework
**What does this module do?**
Superpipe Studio is a free and open-source observability and experimentation app for the Superpipe SDK. It can help you:
> Part of [research/](./README.md). Covers LLM-as-Judge, reasoning models, evaluation dimensions, and judge design principles.
title: "Model Evaluation"
> **Last updated**: 2026-02-20
Source: https://potpie.ai/blog/the-agent-evaluation-gap
> **Track**: Shared | **Difficulty**: 🟢 Junior → 🔴 Senior
> Frontier-lab interview-grade reference on evaluating LLMs and LLM-powered products.
title: "Air-Gapped RAG: Grounding, Citations, and Evaluation"
Generative models produce **open-ended text** — there is rarely a single “correct” string. Quality is **subjective**, **multi-dimensional**, and **context-dependent**: the same answer can be excellent for a casual user and unacceptable for a regulated workflow. Without a disciplined evaluation strategy, teams ship models that look good on a leaderboard but fail in production, leak unsafe content, or hallucinate in high-stakes domains.
This document provides a practical reference for using the Knowledge MCP to research evaluation placement, methods, and anti-patterns. It shows which queries to run and what to expect from each.
**Participants**: Ryotaro, Masaki Adachi
**Target folder:** `testing/judges/`, `testing/fixtures/`, CI via `just check`
> 70+ active-recall questions. Pair with `LLM_EVALUATION_DEEP_DIVE.md`.
**Status**: 🟡 In Progress
[Task Definitions] → [Model Interface] → [Inference Execution] → [Scoring] → [Reports]
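The five stages in the pipeline above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular framework's implementation: `Task`, `exact_match`, `run_eval`, and `report` are hypothetical names, the model is assumed to be any callable from prompt string to output string, and exact match stands in for whatever scorer a real harness would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Task definition: a prompt plus the reference answer (hypothetical shape)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Scoring: 1.0 if the stripped strings are identical, else 0.0."""
    return float(output.strip() == expected.strip())


def run_eval(
    tasks: list[Task],
    model: Callable[[str], str],
    score: Callable[[str, str], float],
) -> list[dict]:
    """Inference execution and scoring: call the model once per task."""
    results = []
    for task in tasks:
        output = model(task.prompt)  # model interface: prompt in, text out
        results.append(
            {
                "prompt": task.prompt,
                "output": output,
                "score": score(output, task.expected),
            }
        )
    return results


def report(results: list[dict]) -> dict:
    """Reports: aggregate per-task scores into a small summary."""
    n = len(results)
    mean = sum(r["score"] for r in results) / n if n else 0.0
    return {"n": n, "mean_score": mean}
```

Keeping the model and the scorer as plain callables means either stage can be swapped (for example, an LLM judge in place of `exact_match`) without touching the task definitions or the report step.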
A comprehensive deep-dive into RAG evaluation metrics, frameworks, and best practices for production deployment.
**Research Focus**: Grounding verification metrics and evaluation methodologies for retrieval-augmented generation systems