AI Research

Plaid: Pushing the Boundaries of Long-Context Language Models with a Comprehensive Benchmark

Claude Directory December 29, 2025

Discover Plaid, the new benchmark from Berkeley AI Research that tests LLMs on ultra-long contexts up to 1M tokens. It reveals critical weaknesses in popular models through diverse tasks like retrieval and reasoning.

## The Rise of Long-Context Language Models and Why We Need Better Benchmarks

Imagine feeding an entire novel, a massive codebase, or a year's worth of company emails into an AI model and having it understand and reason over it all flawlessly. That's the promise of long-context large language models (LLMs), which have exploded in capability over the past year. Models like Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o now advertise context windows ranging from 128K tokens up to 1 million or even 2 million tokens. But here's the catch: are they really as good as they claim when handling these marathon-length inputs?

As researchers at Berkeley Artificial Intelligence Research (BAIR), we've seen firsthand how existing benchmarks fall short. Traditional tests like Needle-In-A-Haystack (NIAH) are great for spotting basic retrieval failures, but they don't capture the full spectrum of real-world challenges. Models often ace short-context tasks yet crumble as inputs grow, due to issues like attention drift, positional biases, or hallucinations.

Enter **Plaid**, our new benchmark designed to rigorously evaluate LLMs on contexts up to 1 million tokens across four key tasks. It's open-source, reproducible, and already powering leaderboards that expose the good, the bad, and the buggy in top models.

## What Makes Plaid Different? A Journey Through Its Design

Plaid isn't just another test suite; it's a comprehensive framework built to mimic real-world long-document scenarios. We curated datasets from diverse domains (Wikipedia, ArXiv papers, GitHub repos, and books) and checked them for contamination with model training data. Every task emphasizes **aggressive scaling**: difficulty ramps up with context length, forcing models to grapple with signal dilution, interference, and the sheer computational burden of million-token inputs.

Let's walk through the four pillars of Plaid, each tackling a unique aspect of long-context prowess:

### 1. Needle-In-Haystack (NIHA): The Classic Retrieval Stress Test

You know the drill: hide a "needle" (a short fact) in a gigantic "haystack" (random text) and see if the model can retrieve it. Plaid's NIHA takes this to the extreme: needle positions are randomized across contexts of 128K to 1M tokens, with multiple needles per haystack for compounded difficulty.

**Key Insights from Results:**

- Short-context kings like GPT-4o shine early but plummet past 500K tokens.
- Gemini 1.5 Pro holds steady up to 1M, but even it falters with multi-needle setups.
- A fun real-world example: imagine searching for a single line of code in a 1M-token repo dump. NIHA catches whether the model "forgets" it because of the length.

We provide the eval script in the [Plaid GitHub repo](https://github.com/Berkeley-LLMs/plaid/blob/main/plaid_eval.py) so you can run this yourself.

### 2. Multi-Document Question Answering (MDQA): Synthesizing Across Documents

Here, models face 1K to 10K Wikipedia articles (totaling up to 1M tokens) and answer questions requiring info from multiple docs. No single document holds the full answer; it's a test of cross-document synthesis.

**Challenges Exposed:**

- **Attention Sink Trap:** Models fixate on early or flashy documents, ignoring later ones.
- **Hallucination Spike:** Factual accuracy drops sharply beyond 512K tokens.

**Practical Tip:** For RAG (Retrieval-Augmented Generation) systems in enterprise search, MDQA is your go-to validator. Use our [data repo](https://github.com/Berkeley-LLMs/plaid/tree/main/data) to benchmark your pipeline.

Example question: "What is the primary cause of the 1918 flu pandemic and how does it compare to COVID-19 origins?" (Answering it spans virology articles from different eras.)

### 3. Key-Value Retrieval (KVRet): Precision in Noisy Pairs

Picture a 1M-token soup of 10K key-value pairs (e.g., "CEO: Elon Musk" mixed in with distractors). The task is to retrieve specific values without false positives.
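To make the KVRet setup concrete, here is a minimal sketch of how such a task could be generated and scored. It is not Plaid's actual implementation: the function names (`build_kv_context`, `exact_match_accuracy`, `lookup_baseline`), the UUID-style keys, and the one-pair-per-line format are all assumptions made for illustration, and `answer_fn` stands in for whatever call queries your model.

```python
import random
import uuid


def build_kv_context(num_pairs, seed=0):
    """Build random key-value pairs and a context with one 'key: value' per line.

    Hypothetical format; the real benchmark's layout may differ.
    """
    rng = random.Random(seed)
    pairs = {}
    while len(pairs) < num_pairs:
        key = str(uuid.UUID(int=rng.getrandbits(128)))
        pairs[key] = str(uuid.UUID(int=rng.getrandbits(128)))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    return pairs, context


def exact_match_accuracy(pairs, context, answer_fn, num_queries=20, seed=1):
    """Query a sample of keys and score answers by exact string match."""
    rng = random.Random(seed)
    queries = rng.sample(sorted(pairs), k=min(num_queries, len(pairs)))
    correct = 0
    for key in queries:
        prediction = answer_fn(context, key).strip()
        correct += prediction == pairs[key]
    return correct / len(queries)


def lookup_baseline(context, key):
    """Stand-in for a model call: a plain scan that finds the key's value."""
    for line in context.splitlines():
        k, _, v = line.partition(": ")
        if k == key:
            return v
    return ""


if __name__ == "__main__":
    pairs, context = build_kv_context(num_pairs=1000)
    print(exact_match_accuracy(pairs, context, lookup_baseline))
```

Exact-match scoring is deliberately strict: a near-miss value counts as a failure, which is the behavior you want when the downstream use is a config lookup or a contract clause.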
**Why It Matters:** This mirrors config files, databases, or legal docs where exact matches are crucial.

**Notable Findings:**

- Llama 3.1 405B surprises with strong performance, outperforming some closed models.
- Positional bias is rampant: models prefer keys near the start or end.

Code snippet to get started:

```bash
git clone https://github.com/Berkeley-LLMs/plaid
cd plaid
python download_data.py  # Fetches KVRet datasets
python plaid_eval.py --task kvret --model your_model
```

### 4. Multi-Hop Retrieval (MHRet): Chaining Reasoning Over Length

The toughest nut: extract 200 "atomic facts" from a 1M-token context, then answer multi-hop questions chaining 2-5 facts (e.g., "Who directed the film starring the actor who won an Oscar for...").

**Breakdown:**

- **Step 1:** Fact extraction accuracy.
- **Step 2:** Hop-wise retrieval.

**Eye-Opening Stats:** No model exceeds 20% on 5-hop questions at 1M tokens. Claude 3.5 Sonnet leads but still hallucinates chains.

## Leaderboards and Model Breakdowns: Who's Winning?

Check the live [Plaid Leaderboard on Hugging Face](https://huggingface.co/spaces/Berkeley-LLMs/plaid-leaderboard) for up-to-date scores. Highlights:

| Model             | NIHA@1M | MDQA@1M | KVRet@1M | MHRet@1M |
|-------------------|---------|---------|----------|----------|
| Gemini 1.5 Pro    | 95%     | 62%     | 88%      | 15%      |
| Claude 3.5 Sonnet | 92%     | 58%     | 85%      | 18%      |
| GPT-4o            | 78%     | 45%     | 72%      | 12%      |
| Llama 3.1 405B    | 85%     | 55%     | 90%      | 14%      |

Proprietary models generally edge out open-weight ones (Llama 3.1 405B's KVRet score is the exception), but gains from longer context windows hit a wall around 1M tokens. Common failure modes:

- **Lost in the Middle:** Recall is best when the target sits in the first or last 10% of the context and worst near the center.
- **Context Rot:** Degradation accelerates non-linearly as context length grows.

## How to Use Plaid in Your Workflow

Plaid is plug-and-play for developers and researchers:

1. Clone the [main repo](https://github.com/Berkeley-LLMs/plaid).
2. Download datasets: `python download_data.py`.
3. Run evals: configure `plaid_eval.py` with your API keys.
4. Submit to the leaderboard via the [README instructions](https://github.com/Berkeley-LLMs/plaid/blob/main/README.md).

**Real-World Applications:**

- **Code Analysis:** Feed GitHub repos to debug long-context IDE assistants.
- **Legal/Compliance:** QA over massive contract corpora.
- **Research:** Track progress toward 10M+ contexts.

The repo also includes contamination checks and standardized prompt templates.

## The Road Ahead: Scaling to True Long-Context Mastery

Plaid reveals that while context windows grow, effective utilization lags. Future work? Dynamic needle positions, adversarial distractors, and 10M-token frontiers. Join us in pushing LLMs toward reliable long-context reasoning.

Dive into the full details and contribute at the [Plaid GitHub](https://github.com/Berkeley-LLMs/plaid). Your benchmarks could reshape the field!

---

<div style="text-align: center; margin-top: 2rem;"> <a href="https://bair.berkeley.edu/blog/2025/04/08/plaid/" target="_blank" rel="noopener noreferrer" class="view-full-resource-btn" style="display: inline-block; background-color: #f97316; color: white; padding: 12px 24px; border-radius: 8px; text-decoration: none; font-weight: 600; transition: background-color 0.2s;">View Full Resource</a> </div>
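
If you want to check the "Lost in the Middle" failure mode listed above on your own runs, one simple approach is to bucket retrieval results by where the target sat in the context. The sketch below assumes you have already logged, for each query, the target's relative position (0.0 for the start of the context, 1.0 for the end) and whether the model retrieved it; the function name `recall_by_position` and this input format are hypothetical and not part of the Plaid repo.

```python
from collections import defaultdict


def recall_by_position(results, num_buckets=10):
    """Compute recall per position bucket.

    results: iterable of (relative_position, was_retrieved) tuples, where
    relative_position is in [0.0, 1.0]. Returns a list of length num_buckets
    holding recall for each bucket, or None where a bucket has no samples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for position, retrieved in results:
        if not 0.0 <= position <= 1.0:
            raise ValueError(f"position out of range: {position}")
        # A position of exactly 1.0 is folded into the last bucket.
        bucket = min(int(position * num_buckets), num_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += bool(retrieved)
    return [
        hits[b] / totals[b] if totals[b] else None
        for b in range(num_buckets)
    ]


if __name__ == "__main__":
    sample = [(0.05, True), (0.5, False), (0.55, True), (0.95, True)]
    for bucket, recall in enumerate(recall_by_position(sample)):
        print(bucket, recall)
```

With ten buckets, a lost-in-the-middle pattern shows up as high recall in buckets 0 and 9 and a dip in the buckets between them.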
