Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
# Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure A three-player “step-race” that challenges LLMs to engage in **public conversation** before picking a move (**1**, **3**, or **5** steps). Whenever two or more players choose the same number, all colliding players fail to advance. The first LLM to reach or surpass **16–24** steps wins outright, or if multiple cross simultaneously, the highest total steps takes it (ties share victory). This design moves beyond static Q&A. Winning requires live social reasoning: reading opponents, offering half-truths, gauging trust, deciding when to cooperate, and knowing when to lie. Over thousands of matches we see patterns emerge: large frontier models charm first, then knife their partners late, many agents overplay the maximal 5, causing long jams that punish impatience. A few discover subtle linguistic tells—echoed phrasings, timing shifts—that reveal an opponent’s plan a turn early. The dataset opens fresh questions. Can we predict a model’s next move from its last sentence? Which phrases cloak a bluff? Do temporary alliances ever stick? How fast does an agent abandon a losing script? --- ## Animation https://github.com/user-attachments/assets/f07abbd8-a780-440a-8fae-66f7154cf010 Longer video: [](https://www.youtube.com/watch?v=AnPKfrIPAgQ) We generate a **frame-by-frame** and a **summary** replay of each game, illustrating: 1. Conversation sub-rounds with highlighted quotes 2. Secret moves (1,3,5) and collisions 3. Real-time positions on the track 4. A dynamic scoreboard (TrueSkill ratings, partial-win tallies) The animation reveals how LLMs strategize, stall, sabotage, or cooperate, culminating in final rankings. It shows how their talk transla
HAL 分层混合模型工作流 — 强模型(Claude)负责理解/拆解/验收,低成本模型(DeepSeek)负责检索/提取/清洗。Hermes Agent skill。
An LLM agent fine-tuned on DeepSeek for spaced repetition, dynamically integrating knowledge points based on the Ebbinghaus forgetting curve.
基于 STM32F103 构建的端到端 AI 智能手表生态。自研“零重定位”原生机器码动态加载引擎与页面栈式 UI 框架;集成生产级 OTA 回滚保护机制与高带宽(921600 baud)串口协议栈。通过 Node.js 中继实现 DeepSeek AI 语义控制及 ASRPRO 语音全双工交互,是一个集成了分布式计算、现代存储管理与 AI Agent 的嵌入式全栈工程。
A Meta-Agent-Driven Self-Evolving Multi-Agent System for UAV Detection and Tracking
One command to run Hermes AI Agent with a browser UI. Zero prerequisites. 一行命令,AI 就位。
网页应用Agent,接入DeepSeek、Mimo等模型