All Documents — .md Directory | Neura Market

All Documents

3,528 documents available

SKILL.md

Evaluation Framework Reference

[Task Definitions] → [Model Interface] → [Inference Execution] → [Scoring] → [Reports]

ai agent llm

LauraFlorentin

EVALS.md

Evalyn Roadmap

This document tracks planned features and completed work. Future roadmap items are listed first, followed by completed features.

ai agent llm

shihongDev

EVALS.md

Design an Evaluation Pipeline for an LLM-Based Product

Design an **end-to-end evaluation pipeline** for a production **LLM-based product** (assistant, RAG app, code copilot, or agent). The pipeline must answer: **“Did this model / prompt / retrieval change make the product better, safer, or cheaper — and can we prove it?”** It spans **offline** lab benchmarks, **task-specific metrics**, **LLM-as-judge**, **human preference** studies, **safety** testing, **golden-set regression**, and **online** A/B experimentation — with **dashboards and alerting**

ai agent llm

spawn08

EVALS.md

Air-Gapped RAG: Grounding, Citations, and Evaluation

title: "Air-Gapped RAG: Grounding, Citations, and Evaluation"

ai llm rag

agentpatterns-ai

EVALS.md

LLM Evaluation & Benchmarking

Generative models produce **open-ended text** — there is rarely a single “correct” string. Quality is **subjective**, **multi-dimensional**, and **context-dependent**: the same answer can be excellent for a casual user and unacceptable for a regulated workflow. Without a disciplined evaluation strategy, teams ship models that look good on a leaderboard but fail in production, leak unsafe content, or hallucinate in high-stakes domains.

ai llm rag

spawn08

EVALS.md

Discussion Report: Flat Band Evaluation Methodology & Strategic Direction

**Participants**: Ryotaro, Masaki Adachi

ai llm eval

RyotaroOKabe

EVALS.md

Exact match

title: "LLM Evaluation Cheat Sheet"

ai llm prompt

tslateman

EVALS.md

Knowledge MCP Query Reference for Evaluation Timing

This document provides a practical reference for using the Knowledge MCP to research evaluation placement, methods, and anti-patterns. It shows which queries to run and what to expect from each.

ai llm eval

philbeliveau

EVALS.md

Lesson 01: Evaluation Frameworks Overview

**Module 07: Evaluation and Testing**

ai llm rag

ribatshepo

EVALS.md

LLM Evaluation — Interview Grill

> 70+ active-recall questions. Pair with `LLM_EVALUATION_DEEP_DIVE.md`.

ai agent llm

ffaisal93

EVALS.md

Evaluation & Safety

**Target folder:** `testing/judges/`, `testing/fixtures/`, CI via `just check`

ai agent llm

dmooney

EVALS.md

Evaluation Discipline: The Missing Loss Function of the Humanities

In the Second Renaissance, the greatest failure of the amateur is the **fetishization of the first completion.** We reject the culture of ninety-percent building and ten-percent evaluation. This ratio is a recipe for **institutional model collapse.** Building a system that produces a plausible-looking output is a trivial act. Building a system whose failure modes are bounded, quantified, and recoverable is the concretion of **engineering sovereignty.**

ai agent llm

kaw393939

EVALS.md

evaluation/ — Evaluation Skill

> If it's not measured, it's not accurate. Ship blind and children pay the price.

ai prompt eval

Vimalk0703

EVALS.md

5_Evaluation

1. [Importance and Challenges of Evaluation](#why-is-evaluation-so-critical-when-developing-search-and-rag-systems-with-embeddings-and-rerankers-and-what-are-the-main-challenges-involved)

ai rag eval

navneetkrc

EVALS.md

Context-Aware RAG Agent: Flow & Evaluation Plan

This document details the exact execution flow of the system and the offline validation and evaluation framework implemented using RAGAs.

ai agent llm

nithin-pulla

EVALS.md

ProtoExtract — Evaluation Approach Using OmniDocBench Methodology

Define a rigorous, evidence-based evaluation framework for the ProtoExtract

ai eval

cryogenic22

RAG.md

Evaluating Retrieval Augmented Generation - a framework for assessment

*In this first article of a three-part (monthly) series, we introduce RAG evaluation, outline its challenges, propose an effective evaluation framework, and provide a rough overview of the various tools and approaches you can use to evaluate your RAG application.*

ai llm rag

superlinked

EVALS.md

EVAL-001: Evaluation Contract — Flash Evaluations (FEATURE-053)

**Author:** Mike Chavez

ai eval

mikechavez

EVALS.md

Evaluation and Testing LLM Systems

> Testing LLMs is not like unit testing software. There's no single correct output, no simple assertion to make, and behavior drifts without any code change. Here's how to build a testing culture that actually catches real problems.

ai llm eval

NirajKulkarnii

EVALS.md

Comprehensive Evaluation Plan - MVP

This document outlines the complete evaluation strategy for Aegis AI Video Censoring Platform MVP, covering core metrics, test suites, user testing protocols, regression testing, and a measurement timeline from Weeks 4-15.

ai eval safety

daatoo

CHECKLIST.md

UX Review Guide for SaaS

This guide provides a systematic approach for auditing User Experience (UX) in commercial SaaS applications, rooted in heuristic evaluation and modern design systems.

ai eval workflow

Lithium-Prime

PROMPTS.md

🤖 Machine Learning and Data Visualization Preview Guide

- ✅ TensorFlow MNIST CNN model training

suetaketakaya

RUNBOOK.md

UI Preview Guide

This guide explains different ways to preview the UI components before deploying your application.

BLKOUTUK

EVALS.md

LLM as a Judge

title: 'LLM as a Judge Evaluation Guide'

ai llm rag

promptfoo
