step_game

Name: step_game
Author: lechmazur

lechmazur January 21, 2025

84 copies 0 downloads

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure

A three-player “step-race” that challenges LLMs to engage in public conversation before picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance. The first LLM to reach or surpass the finish line (16–24 steps) wins outright; if several players cross in the same round, the one with the highest total wins, and ties share the victory.
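The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the benchmark's own code: the function names `resolve_round` and `winners`, the dictionary layout, and the example target of 16 are assumptions made here for clarity, and every player is assumed to submit exactly one move per round.

```python
from collections import Counter

VALID_MOVES = (1, 3, 5)  # the three step sizes named in the rules


def resolve_round(positions, moves):
    """Apply one round of secret moves; players who picked the same number stay put.

    Hypothetical helper: `positions` and `moves` are dicts keyed by player name.
    """
    for player, move in moves.items():
        if move not in VALID_MOVES:
            raise ValueError(f"{player} chose invalid move {move}")
    counts = Counter(moves.values())
    return {
        player: pos + (moves[player] if counts[moves[player]] == 1 else 0)
        for player, pos in positions.items()
    }


def winners(positions, target):
    """Return the winning players, or [] if nobody has reached the target yet.

    If several players cross in the same round, the highest total wins;
    equal totals share the victory.
    """
    finished = {p: s for p, s in positions.items() if s >= target}
    if not finished:
        return []
    best = max(finished.values())
    return sorted(p for p, s in finished.items() if s == best)


# Round 1: A and B both pick 5 and collide, so only C advances.
state = resolve_round({"A": 14, "B": 15, "C": 10}, {"A": 5, "B": 5, "C": 3})
# state == {"A": 14, "B": 15, "C": 13}; nobody has reached 16 yet.

# Round 2: all moves differ; A and B both cross 16, and B has the higher total.
state = resolve_round(state, {"A": 3, "B": 5, "C": 1})
# state == {"A": 17, "B": 20, "C": 14}; winners(state, 16) == ["B"]
```

The collision check counts how often each number was chosen and advances only the players whose choice is unique, which is why a round where everyone picks 5 moves nobody.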

This design moves beyond static Q&A. Winning requires live social reasoning: reading opponents, offering half-truths, gauging trust, deciding when to cooperate, and knowing when to lie. Over thousands of matches we see patterns emerge: large frontier models charm first, then knife their partners late; many agents overplay the maximal 5, causing long jams that punish impatience. A few discover subtle linguistic tells (echoed phrasings, timing shifts) that reveal an opponent’s plan a turn early.

The dataset opens fresh questions. Can we predict a model’s next move from its last sentence? Which phrases cloak a bluff? Do temporary alliances ever stick? How fast does an agent abandon a losing script?

Animation

https://github.com/user-attachments/assets/f07abbd8-a780-440a-8fae-66f7154cf010

Longer video:

We generate both a frame-by-frame replay and a summary replay of each game, illustrating:

Conversation sub-rounds with highlighted quotes
Secret moves (1, 3, 5) and collisions
Real-time positions on the track
A dynamic scoreboard (TrueSkill ratings, partial-win tallies)

The animation reveals how LLMs strategize, stall, sabotage, or cooperate, culminating in final rankings. It shows how their talk transla
