Discover Plaid, the new benchmark from Berkeley AI Research that tests LLMs on ultra-long contexts up to 1M tokens. It reveals critical weaknesses in popular models through diverse tasks like retrieval and reasoning.
## The Rise of Long-Context Language Models and Why We Need Better Benchmarks
Imagine feeding an entire novel, a massive codebase, or a year's worth of company emails into an AI model and having it understand and reason over it all flawlessly. That's the promise of long-context large language models (LLMs), which have exploded in capability over the past year. Models like Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o are marketed with ever-larger context windows, in some cases stretching to 1 million or even 2 million tokens. But here's the catch: are they really as good as they claim when handling these marathon-length inputs?
As researchers at Berkeley Artificial Intelligence Research (BAIR), we've seen firsthand how existing benchmarks fall short. Traditional tests like Needle-In-A-Haystack (NIAH) are great for spotting basic retrieval failures, but they don't capture the full spectrum of real-world challenges. Models often ace short-context tasks yet crumble under length due to issues like attention drift, positional biases, or hallucinations. Enter **Plaid**—our new benchmark designed to rigorously evaluate LLMs on contexts up to 1 million tokens across four key tasks. It's open-source, reproducible, and already powering leaderboards that expose the good, the bad, and the buggy in top models.
## What Makes Plaid Different? A Journey Through Its Design
Plaid isn't just another test suite; it's a comprehensive framework built to mimic real-world long-document scenarios. We curated datasets from diverse domains—Wikipedia, ArXiv papers, GitHub repos, and books—ensuring no contamination with model training data. Every task emphasizes **aggressive scaling**: difficulties ramp up with context length, forcing models to grapple with signal dilution, interference, and the sheer computational burden of million-token inputs.
Let's walk through the four pillars of Plaid, each tackling a unique aspect of long-context prowess:
### 1. Needle-In-Haystack (NIHA): The Classic Retrieval Stress Test
You know the drill: hide a "needle" (a short fact) in a gigantic "haystack" (random text) and see if the model can retrieve it. Plaid's NIHA pushes this to the extreme: needle positions are randomized across contexts of 128K to 1M tokens, with multiple needles per haystack for compounded difficulty.
**Key Insights from Results:**
- Short-context kings like GPT-4o shine early but plummet past 500K tokens.
- Gemini 1.5 Pro holds steady up to 1M, but even it falters with multi-needle setups.
- A fun real-world example: Imagine searching for a single line of code in a 1M-token repo dump. NIHA reveals whether the model "forgets" that line as the context grows.
We provide the eval script in the [Plaid GitHub repo](https://github.com/Berkeley-LLMs/plaid/blob/main/plaid_eval.py) to run this yourself.
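To make the multi-needle setup concrete, here is a minimal sketch of how such a haystack could be assembled and scored. It is not the code in `plaid_eval.py`: the function names `build_haystack` and `recall` are hypothetical, lengths are counted in whitespace-separated words rather than tokens, and scoring is a simple verbatim-substring check.

```python
import random


def build_haystack(filler_words, needles, total_words, seed=0):
    """Return (text, depths): filler text with each needle inserted at a random offset.

    Assumption: length is measured in words, not model tokens.
    """
    rng = random.Random(seed)
    # Filler: sample words with replacement until the requested length is reached.
    words = [rng.choice(filler_words) for _ in range(total_words)]
    # One distinct insertion offset per needle.
    offsets = rng.sample(range(total_words + 1), len(needles))
    depths = {}
    # Insert from the largest offset down so earlier offsets stay valid.
    for offset, needle in sorted(zip(offsets, needles), reverse=True):
        words.insert(offset, needle)
        depths[needle] = offset / total_words  # approximate relative depth in [0, 1]
    return " ".join(words), depths


def recall(model_answer, needles):
    """Fraction of needles whose text appears verbatim in the model's answer."""
    found = sum(1 for needle in needles if needle in model_answer)
    return found / len(needles)
```

Recording each needle's relative depth alongside the text makes it possible to analyze later whether misses cluster at particular positions.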
### 2. Multi-Document Question Answering (MDQA): Synthesizing Across Documents
Here, models face 1K to 10K Wikipedia articles (totaling up to 1M tokens) and answer questions requiring info from multiple docs. No single document holds the full answer—it's a test of cross-document synthesis.
**Challenges Exposed:**
- **Attention Sink Trap:** Models fixate on early or flashy documents, ignoring later ones.
- **Hallucination Spike:** Factual accuracy drops sharply beyond 512K tokens.
**Practical Tip:** For RAG (Retrieval-Augmented Generation) systems in enterprise search, MDQA is your go-to validator. Run your pipeline against the datasets in our [data repo](https://github.com/Berkeley-LLMs/plaid/tree/main/data) to benchmark it, and keep that data out of any training set so the scores stay meaningful.
Example question: "What is the primary cause of the 1918 flu pandemic and how does it compare to COVID-19 origins?" (Spanning virology articles from different eras.)
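One way to picture the MDQA setup is as a packing problem: the documents that jointly contain the answer are mixed with distractors and shuffled, so they do not always appear first. The sketch below illustrates that idea under stated assumptions. `assemble_context` is a hypothetical helper, not part of the Plaid repo, and the budget is counted in words rather than tokens.

```python
import random


def assemble_context(documents, gold_ids, budget_words, seed=0):
    """Pack gold documents plus random distractors into a shuffled context.

    documents: dict mapping doc id -> text
    gold_ids:  ids of the documents that together contain the answer
    """
    rng = random.Random(seed)
    gold = set(gold_ids)
    chosen = list(gold_ids)
    used = sum(len(documents[d].split()) for d in chosen)

    distractors = [d for d in documents if d not in gold]
    rng.shuffle(distractors)
    for doc_id in distractors:
        size = len(documents[doc_id].split())
        if used + size > budget_words:
            continue  # skip documents that would exceed the budget
        chosen.append(doc_id)
        used += size

    # Shuffle the final order so gold documents land at arbitrary positions.
    rng.shuffle(chosen)
    context = "\n\n".join(f"[{d}] {documents[d]}" for d in chosen)
    return context, chosen
```

Varying `seed` across trials moves the gold documents around, which helps separate genuine synthesis failures from a bias toward early documents.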
### 3. Key-Value Retrieval (KVRet): Precision in Noisy Pairs
Picture a 1M-token soup of 10K key-value pairs (e.g., "CEO: Elon Musk" mixed with distractions). The model must retrieve the value for a specific key, exactly and without false positives.
**Why It Matters:** This mirrors config files, databases, or legal docs where exact matches are crucial.
**Notable Findings:**
- Llama 3.1 405B surprises with strong performance, outperforming some closed models.
- Positional bias is rampant: Models prefer keys near the start or end.
Code snippet to get started:
```bash
git clone https://github.com/Berkeley-LLMs/plaid
cd plaid
python download_data.py  # Fetches KVRet datasets
python plaid_eval.py --task kvret --model your_model
```
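For intuition about what a KVRet instance looks like, the following sketch generates a synthetic key-value context and scores a prediction by exact match. It is an illustration only: the real datasets come from `download_data.py`, and the random UUID keys, the `key: value` line format, and the helper names `make_kv_context` and `exact_match` are all assumptions.

```python
import random
import uuid


def make_kv_context(num_pairs, seed=0):
    """Return (context, pairs): one 'key: value' line per pair, keys and values random UUIDs."""
    rng = random.Random(seed)
    pairs = {}
    while len(pairs) < num_pairs:
        # Seeded UUIDs keep the instance reproducible.
        key = str(uuid.UUID(int=rng.getrandbits(128)))
        pairs[key] = str(uuid.UUID(int=rng.getrandbits(128)))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    return context, pairs


def exact_match(prediction, expected):
    """Strict scoring: surrounding whitespace is ignored, everything else must match."""
    return prediction.strip() == expected
```

Exact-match scoring is deliberately unforgiving here, since a value that is almost right is still a false positive in a config file or contract.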
### 4. Multi-Hop Retrieval (MHRet): Chaining Reasoning Over Length
The toughest nut: Extract 200 "atomic facts" from a 1M-token context, then answer multi-hop questions chaining 2-5 facts (e.g., "Who directed the film starring the actor who won an Oscar for...").
**Breakdown:**
- **Step 1:** Fact extraction accuracy.
- **Step 2:** Hop-wise retrieval.
**Eye-Opening Stats:** No model exceeds 20% on 5-hop at 1M tokens. Claude 3.5 Sonnet leads but still hallucinates chains.
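The two-step breakdown above can be sketched as follows: facts are stored as `(entity, relation) -> entity` triples, extraction is scored against a gold set, and a multi-hop question is answered by following relations one hop at a time. The triple representation and the names `extraction_accuracy` and `resolve_chain` are illustrative assumptions, not Plaid's actual data format.

```python
def extraction_accuracy(extracted, gold):
    """Step 1: share of gold facts that the model recovered exactly."""
    hits = sum(1 for key, value in gold.items() if extracted.get(key) == value)
    return hits / len(gold)


def resolve_chain(facts, start, relations):
    """Step 2: follow the relations hop by hop; return None if any hop is missing."""
    entity = start
    for relation in relations:
        entity = facts.get((entity, relation))
        if entity is None:
            return None  # a single missing fact breaks the whole chain
    return entity


# Placeholder entities, for illustration only.
gold_facts = {
    ("film_x", "lead_actor"): "actor_y",
    ("actor_y", "birthplace"): "city_z",
    ("city_z", "country"): "country_w",
}
answer = resolve_chain(gold_facts, "film_x", ["lead_actor", "birthplace", "country"])
```

Because every hop must succeed, per-hop errors compound, which is one plausible reason accuracy falls so steeply as the number of hops grows.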
## Leaderboards and Model Breakdowns: Who's Winning?
Check the live [Plaid Leaderboard on Hugging Face](https://huggingface.co/spaces/Berkeley-LLMs/plaid-leaderboard) for up-to-date scores. Highlights:
| Model | NIHA@1M | MDQA@1M | KVRet@1M | MHRet@1M |
|--------------------|---------|---------|----------|-----------|
| Gemini 1.5 Pro | 95% | 62% | 88% | 15% |
| Claude 3.5 Sonnet | 92% | 58% | 85% | 18% |
| GPT-4o | 78% | 45% | 72% | 12% |
| Llama 3.1 405B | 85% | 55% | 90% | 14% |
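Proprietary models edge out open-weight ones on most tasks, although Llama 3.1 405B posts the top KVRet score, and every model struggles at 1M tokens on the harder tasks. Common failure modes:
- **Lost in the Middle:** Recall is highest when the relevant text sits in the first or last 10% of the context and lowest near the center.
- **Context Rot:** Accuracy degrades faster than linearly as context length grows.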
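A simple way to check for the "lost in the middle" pattern in your own runs is to bucket results by where the relevant text sat in the context. The sketch below assumes each result has been logged as a `(depth, correct)` pair, with `depth` a relative position in [0, 1]. That logging format and the function name `recall_by_depth` are assumptions, not something the Plaid scripts are known to emit.

```python
from collections import defaultdict


def recall_by_depth(results, num_buckets=10):
    """Group (depth, correct) pairs into equal-width depth buckets and return recall per bucket."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for depth, correct in results:
        # depth == 1.0 falls into the last bucket rather than a bucket of its own.
        bucket = min(int(depth * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}
```

With ten buckets, a U-shaped profile (high recall in buckets 0 and 9, low recall around buckets 4 and 5) would match the pattern described above.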
## How to Use Plaid in Your Workflow
Plaid is plug-and-play for developers and researchers:
1. Clone the [main repo](https://github.com/Berkeley-LLMs/plaid).
2. Download datasets: `python download_data.py`.
3. Run evals: Customize `plaid_eval.py` for your API keys.
4. Submit to leaderboard via [README instructions](https://github.com/Berkeley-LLMs/plaid/blob/main/README.md).
**Real-World Applications:**
- **Code Analysis:** Feed GitHub repos to debug long-context IDE assistants.
- **Legal/Compliance:** QA over massive contract corpora.
- **Research:** Track progress toward 10M+ contexts.
The repo also includes contamination checks and standardized prompt templates, so results stay comparable across models.
## The Road Ahead: Scaling to True Long-Context Mastery
Plaid reveals that while context windows grow, effective utilization lags. Future work? Dynamic needle positions, adversarial distractors, and 10M-token frontiers. Join us in pushing LLMs toward reliable long-context reasoning.
Dive into the full details and contribute at the [Plaid GitHub](https://github.com/Berkeley-LLMs/plaid). Your benchmarks could reshape the field!