Agent Vendor Verifier is a tool for validating the fidelity and reliability of tool-calling LLMs across different vendors.
# Agent Vendor Verifier

**Agent Vendor Verifier** is a benchmarking framework for evaluating **Agent tool-call effectiveness**. The benchmark aggregates these dimensions into a single, comparable **fusion score (IRF)**, enabling fair **cross-vendor comparison** for agent-style tool usage.

Agent Vendor Verifier is built upon [K2-Vendor-Verifier](https://github.com/MoonshotAI/K2-Vendor-Verifier).

## Why Agent Vendor Verifier?

* **Multi-dimensional metrics**: Go beyond “was a tool called” by measuring:
  - correctness,
  - schema compliance,
  - request success and stability,
  - latency and throughput.
* **Comparable fusion score**: combine heterogeneous metrics into a single score for ranking and model/vendor selection.

## Metrics

For each sample, the benchmark records the `finish_reason` (e.g. `tool_calls`, `stop`, `others`) and optional tool-call validation results.

| Metric | What it Evaluates | Direction |
|------|------------------|------------------------|
| **F1 Score** | Whether a model **triggers tool calls on the right samples**, compared against a designated baseline vendor | Higher is better |
| **Success Rate** | Whether requests **successfully complete without API or runtime errors** | Higher is better |
| **Schema Accuracy** | Whether **generated tool-call arguments conform to the declared JSON Schema** | Higher is better |
| **Avg Token** | **Token usage efficiency** per request (prompt + completion) | Lower is better |
| **Avg TTFT** | **Responsiveness**: time from request to first token (ms) | Lower is better |
| **TPS** | **Generation performance** during decoding (e.g. tokens/s) | Higher is better |

### F1 Score

F1 score measures whether a model **triggers tool calls on the correct samples**, compared against a designated **baseline vendor** for the same model.

- A higher F1 indicates closer alignment with the baseline on **when to issue tool calls**.

| Model | Baseline Vendor |
| ----- | --------------- |
| Gemini | Anthropic |
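
As a rough illustration of the F1 Score described above, the sketch below compares per-sample `finish_reason` values from a candidate vendor against those from the baseline vendor. It is not the benchmark's actual implementation: the function name `tool_call_f1`, the list-of-strings input format, and the choice to treat the baseline's decisions as ground truth are all assumptions made for this example.

```python
from typing import Sequence


def tool_call_f1(baseline: Sequence[str], candidate: Sequence[str]) -> dict[str, float]:
    """Precision, recall and F1 for tool-call triggering (hypothetical helper).

    A sample counts as positive when its finish_reason is "tool_calls".
    Assumption: the baseline vendor's decision is treated as the reference label.
    """
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must cover the same samples")

    tp = fp = fn = 0
    for base_reason, cand_reason in zip(baseline, candidate):
        base_called = base_reason == "tool_calls"
        cand_called = cand_reason == "tool_calls"
        if base_called and cand_called:
            tp += 1  # both issued a tool call
        elif cand_called:
            fp += 1  # candidate called a tool, baseline did not
        elif base_called:
            fn += 1  # candidate missed a tool call the baseline made

    # Guard every division so empty or all-negative runs yield 0.0 instead of an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up finish_reason values for five samples, for illustration only.
baseline_reasons = ["tool_calls", "tool_calls", "stop", "tool_calls", "stop"]
candidate_reasons = ["tool_calls", "stop", "stop", "tool_calls", "tool_calls"]
scores = tool_call_f1(baseline_reasons, candidate_reasons)
print(scores)
```

In this toy run the candidate matches the baseline on two tool calls, adds one extra call, and misses one, so precision, recall, and F1 all come out to 2/3. Samples where neither side calls a tool do not affect the score.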