Agents — DeepSeek Directory

agent

TIME

[NeurIPS 2025 D&B (Spotlight🌟)] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

sylvain-wei

30

benchmark

step_game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.
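
The collision rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the benchmark's own code: the function name `resolve_round`, the dictionary layout, and the player labels are assumptions, and the conversation phase and win condition are left out.

```python
from collections import Counter

# Moves named in the description: 1, 3, or 5 steps.
ALLOWED_MOVES = (1, 3, 5)


def resolve_round(positions, moves):
    """Return new positions after one round (hypothetical helper).

    positions: dict mapping player name -> current position
    moves:     dict mapping player name -> secretly chosen move
    A player advances only if no other player chose the same move.
    """
    for player, move in moves.items():
        if move not in ALLOWED_MOVES:
            raise ValueError(f"{player} chose an illegal move: {move}")

    # Count how many players picked each number.
    counts = Counter(moves.values())

    # Colliding players (count > 1) stay where they are.
    return {
        player: positions[player] + (moves[player] if counts[moves[player]] == 1 else 0)
        for player in positions
    }


if __name__ == "__main__":
    start = {"A": 0, "B": 0, "C": 0}
    chosen = {"A": 5, "B": 5, "C": 3}
    print(resolve_round(start, chosen))  # {'A': 0, 'B': 0, 'C': 3}
```

In this example A and B both pick 5 and collide, so only C moves, which is why a player who announces a large move in the public conversation may be bluffing or inviting a collision.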

lechmazur

84

benchmark

Step Game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.

lechmazur

88

agent

bigcodebench

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

bigcode-project

495

agent

Bigcodebench

[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI

bigcode-project

515