AgentVendorVerifier

Author: Prism-Shadow

Prism-Shadow December 24, 2025

Agent Vendor Verifier is a tool for validating the fidelity and reliability of tool-calling LLMs across different vendors.

Agent Vendor Verifier

Agent Vendor Verifier is a benchmarking framework for evaluating how effectively LLM agents make tool calls.

It measures several dimensions of tool-call behavior and aggregates them into a single, comparable fusion score (IRF), enabling fair cross-vendor comparison of agent-style tool usage.

Agent Vendor Verifier is built upon K2-Vendor-Verifier.

Why Agent Vendor Verifier?

- Multi-dimensional metrics: go beyond “was a tool called” by measuring
  - correctness
  - schema compliance
  - request success and stability
  - latency and throughput
- Comparable fusion score: combine heterogeneous metrics into a single score for ranking and model/vendor selection.

Metrics

For each sample, the benchmark records the finish_reason (e.g. tool_calls, stop, or another value) and, optionally, the tool-call validation results.
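To make the per-sample record concrete, here is a minimal sketch of what one might look like, together with a deliberately simplified argument check. The `SampleResult` fields and the `arguments_match_schema` helper are hypothetical names chosen for illustration, not the tool's actual output format or validator. The check only covers required keys and top-level property types, which is far less than a full JSON Schema validator does.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleResult:
    """Hypothetical per-sample record; field names are illustrative only."""
    sample_id: str
    finish_reason: str            # e.g. "tool_calls", "stop", or another value
    schema_valid: Optional[bool]  # None when there was no tool call to validate


# Mapping from JSON Schema primitive type names to Python types.
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def arguments_match_schema(raw_arguments: str, schema: dict) -> bool:
    """Simplified stand-in for JSON Schema validation (assumption, not the
    tool's real validator): the arguments must parse as a JSON object,
    contain every required key, and match declared top-level types."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    if any(key not in args for key in schema.get("required", [])):
        return False
    properties = schema.get("properties", {})
    for key, value in args.items():
        type_name = properties.get(key, {}).get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # undeclared property or unsupported type: not checked here
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and type_name in ("integer", "number"):
            return False
        if not isinstance(value, expected):
            return False
    return True
```

A record for a sample that ended with a tool call could then be built as `SampleResult("s1", "tool_calls", arguments_match_schema(raw, schema))`, where `raw` is the argument string returned by the model and `schema` is the tool's declared parameter schema.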

Metric	What it Evaluates	Direction
F1 Score	Whether a model triggers tool calls on the right samples, compared against a designated baseline vendor	Higher is better
Success Rate	Whether requests successfully complete without API or runtime errors	Higher is better
Schema Accuracy	Whether generated tool-call arguments conform to the declared JSON Schema	Higher is better
Avg Token	Token usage efficiency per request (prompt + completion)	Lower is better
Avg TTFT	Responsiveness: time from request to first token (ms)	Lower is better
TPS	Generation performance during decoding (e.g. tokens/s)	Higher is better
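The non-F1 rows of the table can be illustrated with a small aggregation sketch. The `RequestLog` fields and the `summarize` function are hypothetical, and several choices are assumptions rather than the benchmark's documented definitions: averages are taken over successful requests only, Schema Accuracy is computed over requests that produced a validated tool call, and TPS is taken as completion tokens divided by the decode time after the first token.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestLog:
    """Hypothetical per-request measurements; not the tool's real log format."""
    ok: bool                             # request finished without API/runtime error
    prompt_tokens: int = 0
    completion_tokens: int = 0
    ttft_ms: float = 0.0                 # time from request to first token
    total_ms: float = 0.0                # time from request to last token
    schema_valid: Optional[bool] = None  # None if no tool call was validated


def summarize(logs: List[RequestLog]) -> dict:
    succeeded = [r for r in logs if r.ok]
    validated = [r for r in succeeded if r.schema_valid is not None]
    n = len(succeeded)

    # Assumed TPS definition: completion tokens over time spent decoding.
    decode_seconds = sum(max(r.total_ms - r.ttft_ms, 0.0) for r in succeeded) / 1000.0
    completion_tokens = sum(r.completion_tokens for r in succeeded)

    return {
        "success_rate": n / len(logs) if logs else 0.0,
        "schema_accuracy": (
            sum(1 for r in validated if r.schema_valid) / len(validated)
            if validated else 0.0
        ),
        "avg_token": (
            sum(r.prompt_tokens + r.completion_tokens for r in succeeded) / n
            if n else 0.0
        ),
        "avg_ttft_ms": sum(r.ttft_ms for r in succeeded) / n if n else 0.0,
        "tps": completion_tokens / decode_seconds if decode_seconds > 0 else 0.0,
    }
```

Keeping the raw per-request measurements and deriving the metrics in one place makes it easy to swap in a different definition (for example, averaging TPS per request instead of pooling tokens and time) without re-running the benchmark.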

F1 Score

F1 score measures whether a model triggers tool calls on the correct samples, compared against a designated baseline vendor for the same model.

A higher F1 indicates closer alignment with the baseline on when to issue tool calls.
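As a minimal sketch of this comparison, the function below treats the baseline vendor's decision as the reference label for each sample and counts a finish_reason of tool_calls as the positive class. The function name `tool_call_f1` and the assumption that the two lists are aligned by sample are illustrative; the benchmark's own implementation may differ in details such as how errored requests are handled.

```python
from typing import Dict, Sequence


def tool_call_f1(
    baseline_reasons: Sequence[str],
    candidate_reasons: Sequence[str],
) -> Dict[str, float]:
    """Compare per-sample finish_reason values from a candidate vendor against
    the baseline vendor. Both sequences must be aligned by sample."""
    if len(baseline_reasons) != len(candidate_reasons):
        raise ValueError("baseline and candidate must cover the same samples")

    tp = fp = fn = 0
    for base, cand in zip(baseline_reasons, candidate_reasons):
        base_called = base == "tool_calls"
        cand_called = cand == "tool_calls"
        if base_called and cand_called:
            tp += 1  # both triggered a tool call
        elif cand_called:
            fp += 1  # candidate called a tool where the baseline did not
        elif base_called:
            fn += 1  # candidate missed a tool call the baseline made

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, with a baseline of `["tool_calls", "tool_calls", "stop", "tool_calls", "stop"]` and a candidate of `["tool_calls", "stop", "stop", "tool_calls", "tool_calls"]`, the candidate has two matching tool calls, one missed call, and one extra call, so precision, recall, and F1 all equal 2/3.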

Model	Baseline Vendor
Gemini	Anthropic
