All Documents

Chapter 12 — Verification: evaluation inside the loop

Replace "sounds right" with "passes checks."

ai agent prompt

dustinober1-archive

MONITORING.md

Evaluation and Observability

Sources: Huyen (AI Engineering, ch. 3–4, 10), Brousseau & Sharp (LLMs in Production), Pydantic Evals documentation analysis, RAGAS framework documentation analysis, 2025–2026 production patterns

tankpkg

Evaluating AI Agent Systems: Metrics, Benchmarks, and Quality Assurance (2024-2026)

> Research compiled February 2026 for the **aiai** self-improving AI infrastructure project.

oddurs

Claim Extraction Evaluation Matrix

document_id: AIDHA-TASK-004

GitCmurf

Evaluation Findings (Sample Run)

Summary of a single run of both pipelines for presentation. Re-run the evals to refresh numbers.

interactive-decision-support-system

13-02-PLAN

phase: 13-retrieval-evaluation-framework

ai eval claude

sebc-dev

ARCHITECTURE.md

Study Guide: RAG Evaluation (RAGAS-Lite)

**What does this module do?**

jadenitishraj

Superpipe Studio

Superpipe Studio is a free and open-source observability and experimentation app for the Superpipe SDK. It can help you:

eval

villagecomputing

MONITORING.md

Evaluation System

> Part of [research/](./README.md). Covers LLM-as-Judge, Reasoning Model, evaluation dimensions, and Judge design principles.

ai llm eval

sosreader

Traditional function - easy to test

title: "Model Evaluation"

ai llm eval

josephstreeter

Voice AI Leaderboards, Benchmarks, and Evaluation Gaps (Jan 2025 -- Feb 2026)

> **Last updated**: 2026-02-20

petteriTeikari

Repository Intelligence: Building the Next Generation of Agent Evaluation Data

Source: https://potpie.ai/blog/the-agent-evaluation-gap

ai agent eval

kriegcloud

AI System Evaluation & Testing / Đánh Giá và Kiểm Thử Hệ Thống AI

> **Track**: Shared | **Difficulty**: 🟢 Junior → 🔴 Senior

Nhi4912

LLM Evaluation — Deep Dive

> Frontier-lab interview-grade reference on evaluating LLMs and LLM-powered products.

ffaisal93

Air-Gapped RAG: Grounding, Citations, and Evaluation

title: "Air-Gapped RAG: Grounding, Citations, and Evaluation"

agentpatterns-ai

Generative models produce **open-ended text** — there is rarely a single “correct” string. Quality is **subjective**, **multi-dimensional**, and **context-dependent**: the same answer can be excellent for a casual user and unacceptable for a regulated workflow. Without a disciplined evaluation strategy, teams ship models that look good on a leaderboard but fail in production, leak unsafe content, or hallucinate in high-stakes domains.

spawn08

Knowledge MCP Query Reference for Evaluation Timing

This document provides a practical reference for using the Knowledge MCP to research evaluation placement, methods, and anti-patterns. It shows which queries to run and what to expect from each.

ai llm eval

philbeliveau