Recently Added

Evaluation of RAG Systems + Presentation Outline

How do we evaluate our RAG system?

ayanahye

RAG.md

Glossary

Terms and concepts used in the experiment documents.

hanasobi

Project Memory

**Last Updated:** 2026-01-29 22:00

shah-data-scientist

Golden Dataset Guidelines

A golden dataset is a curated collection of examples with known-correct answers that you use to:

aieval

natnew

SKILL.md

LLM Evaluation

LLM output evaluation — automated metrics, LLM-as-judge, A/B testing, regression testing. Use when measuring LLM output quality, comparing prompt or model versions, building an automated eval pipeline, setting up regression tests for prompt changes, or evaluating RAG systems and bias/safety.

projectious-work

DEPLOYMENT.md

SunCube AI - Comprehensive Documentation

- [Overview](#overview)

aiworkflowsafety

SunenaB3504

CLAUDE.md

PROJECT PLAN STARTER PACK - Complete Index

**Last Updated:** January 20, 2026

airag

owenlim225

Understanding the Sources of Uncertainty - and Why Our Evals are Biased

Part 4 of *Optimizing in the Dark:

aiagentrag

reliableai

Using Performance Metrics to Evaluate RAG Systems

Data-Driven RAG Evaluation: Testing Qdrant Apps with Relari AI

AlexisBalayre

MONITORING.md

🧠 Big Picture

Question 18/75: A digital content company is building a generative AI (GenAI) application that summarizes news articles. The application needs to route requests to different LLMs based on language and content types. For regulatory compliance, certain content types must use specific model providers. A GenAI developer must create a solution that can switch between model providers without code changes. The model providers include Amazon Bedrock and third-party APIs. The solution must securely store

emilyg888

AI Tester Interview Preparation Guide

1. [Core Competencies Overview](#core-competencies)

aillmprompt

k21academyuk

Domain 5: Testing, Validation, and Troubleshooting

**AIP-C01 Study Guide — Dr. Priya Ramanathan**

aiagentllm

rahulbhavani-il

Evaluation Framework

This document describes how Agent Invest measures quality, detects regressions, and ensures safety. The system uses three evaluation layers: online scoring (every production run), offline evaluation (golden dataset), and guardrails (real-time safety checks).

aiagentllm

yussaaa

Research Report: Using LLMs as Oracle for Entity Matching Ground Truth

Comprehensive research on using Large Language Models (particularly DeepSeek, GPT-4, and Claude) for entity matching ground truth generation. This report covers LLM accuracy benchmarks, prompt engineering best practices, multi-LLM ensemble approaches, cost-benefit analysis, validation strategies, and patterns for converting LLM labels into regression tests.

ClaudioLutz

[BEE-30004] Evaluating and Testing LLM Applications


alivedise

RAG System Testing Methodologies: A Comprehensive Guide

**Document Version:** 1.0