
# IR-Copilot — Incident Response AI Assistant

A complete agentic RAG system with multi-tenant chat interface, document ingestion pipeline, hybrid search, metadata filtering, LLM-based tool calling, reranking, and subagent delegation capabilities — with baked-in **CI** via [GitHub Actions](https://github.com/giladresisi/ir-copilot/actions) and out-of-the-box **LLM observability** via [LangSmith](https://smith.langchain.com) and **evaluation** via [RAGAS](https://docs.ragas.io/).
See it in action:

<p align="center"><a href="https://frontend-eosin-six-81.vercel.app/"><strong><big>See Live Demo >>></big></strong></a></p>
---
## Problem Statement
Traditional RAG often fails at complex, multi-hop queries. IR-Copilot solves this by introducing agentic delegation, allowing the system to decompose complex incidents into sub-tasks, query specialized indices, and synthesize reliable incident responses.
<details>
<summary><strong>When to use RAG (vector search):</strong></summary>
<br>
- Your data is unstructured (documents, PDFs, manuals, reports)
- Information is scattered across many files
- Questions require semantic understanding, not exact keyword matches
- Content changes frequently (new documents added regularly)
</details>
<details>
<summary><strong>When NOT to use RAG:</strong></summary>
<br>
For structured data sources (codebases, API documentation with organized folders), consider **agentic search** instead. Modern LLMs can efficiently navigate folder structures and table-of-contents files without the overhead of chunking, embedding, and vector search. This approach has less infrastructure complexity and works better when data is already well-organized.
</details>
---
## Architecture

---
## Features
- **Multi-tenant chat interface** - User auth, threaded streamed chats, model selection (OpenAI, OpenRouter, local via LM Studio)
- **Document ingestion pipeline** - Multi-format support, processing status tracking, content hashing + deduplication
- **Advanced RAG pipeline** - Chunking, embeddings, pgvector storage, hybrid search (vector + keyword), reranking
- **Agentic capabilities** - LLM tool selection with four tools (see below), subagent delegation for complex analysis
- **Built-in observability** - Every LLM call, tool invocation, and subagent trace captured in [LangSmith](https://smith.langchain.com) with zero instrumentation
- **RAG evaluation** - Reproducible quality benchmarking via [RAGAS](https://docs.ragas.io)
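
The content hashing and deduplication step in the ingestion pipeline can be illustrated with a minimal sketch. It assumes SHA-256 over the raw file bytes and a per-user scope; the `DedupIndex` class and its in-memory dictionary are hypothetical stand-ins for whatever table the real pipeline uses, not the project's actual code.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupIndex:
    """Hypothetical in-memory stand-in for a table of already-ingested hashes."""

    def __init__(self) -> None:
        # Maps (user_id, content hash) to the first filename seen with that content.
        self._seen: dict[tuple[str, str], str] = {}

    def register(self, user_id: str, filename: str, data: bytes) -> bool:
        """Record an upload; return False if this user already ingested identical bytes."""
        key = (user_id, content_hash(data))
        if key in self._seen:
            return False
        self._seen[key] = filename
        return True


index = DedupIndex()
first = index.register("user-1", "runbook.pdf", b"restart the service")
duplicate = index.register("user-1", "runbook-copy.pdf", b"restart the service")
other_tenant = index.register("user-2", "runbook.pdf", b"restart the service")
```

Keying on the user as well as the hash keeps tenants isolated: the same file uploaded by two users is ingested twice, while a renamed copy from the same user is skipped.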
### Agentic Thought Trace
```jsonc
// "Analyze the ZetaCorp outage and list recent deployments"
{ "turn_1": { "thought": "needs docs + DB",
"tools": ["retrieve_documents", "query_deployments_database"],
"answer": "Outage caused by DB exhaustion; failed rollback 2h prior..." } }
// "Extract all action items from that postmortem"
{ "turn_2": { "thought": "full doc needed — delegate",
"tools": ["retrieve_documents", "analyze_document_with_subagent"],
"subagent": "loads full document → extracts action items",
"answer": "7 action items found with owners and due dates..." } }
```
### Available Tools
| Tool | Purpose | Implementation |
| :--- | :--- | :--- |
| **Document Retrieval** | Answer questions from uploaded documents | Hybrid search (vector + keyword, RRF fusion) with optional cross-encoder reranking via pgvector |
| **Text-to-SQL** | Query structured deployment history | LLM-generated SQL executed via Supabase RPC with allowlist safety validation |
| **Web Search** | Real-time info not in documents | Tavily API; only invoked after document retrieval fails to answer |
| **Document Subagent** | Deep full-document analysis | Spawns an isolated sub-agent with full document context for summarization and extraction |
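
The RRF (reciprocal rank fusion) step named in the Document Retrieval row can be sketched as follows. This is an illustration under assumptions rather than the project's implementation: the function name `rrf_fuse`, the chunk IDs, and the constant `k = 60` (a commonly used default) are all chosen here for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to a chunk's score, where rank is the
    1-based position of the chunk in that list. Chunks absent from a list
    contribute nothing for it.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]   # hypothetical vector-search ranking
keyword_hits = ["c1", "c9", "c3"]  # hypothetical keyword-search ranking
fused = rrf_fuse([vector_hits, keyword_hits])
fused_ids = [chunk_id for chunk_id, _ in fused]  # ["c1", "c3", "c9", "c7"]
```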
---
## Tech Stack
| Layer | Technology |
|-------|-----------|
| Frontend | React, TypeScript, Vite, Tailwind CSS, shadcn/ui |
| Backend | Python, FastAPI |
| Database | Supabase (Postgres + pgvector + Auth + Storage + Realtime) |
| Document Processing | Docling |
| AI Models | OpenAI, OpenRouter, LM Studio (local) |
| Observability | LangSmith |
| Evaluation | RAGAS |
---
## Getting Started
**Prerequisites:** Python 3.10+, [uv](https://docs.astral.sh/uv/), Node.js 18+, Supabase account, OpenAI API key. Optional: LangSmith (observability), Cohere (reranking), Tavily (web search), OpenRouter/LM Studio.
### Option 1: 1-Click Setup (Recommended)
After cloning, fill in `backend/.env` and `frontend/.env` (copy from the `.env.example` files), then run:
```bash
bash setup.sh
```
The script installs all dependencies, pre-downloads Docling parsing models, links your Supabase project, and applies all database migrations — following a guided checklist. The whole process takes ~5 minutes plus model download time on first run.
### Option 2: Manual Setup
Follow **[SETUP.md](./SETUP.md)** step by step for full control, detailed explanations, and troubleshooting guidance.
---
## Usage
### Example Queries
The system automatically selects the appropriate tool based on your question:
**Document Retrieval Tool** - Searches your uploaded documents using hybrid search (vector + keyword):
```
"What is the training code for TechFlow training?"
"What is the qubit stability rate in the Zenith project?"
"What is the IT support extension number?"
```
**Text-to-SQL Tool** - Queries the structured incidents database:
```
"Show all P1 incidents from the last 30 days"
"Which service had the most outages this year?"
"Average resolution time for database-related incidents"
```
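
Because the SQL for these queries is generated by the LLM, it has to be validated before execution. The sketch below shows one conservative way an allowlist check could work. It is an assumption-laden illustration, not the project's validator: the table name `deployments`, the keyword list, and the `validate_sql` function are hypothetical, and a regex check like this is no substitute for a real SQL parser or database-level permissions.

```python
import re

ALLOWED_TABLES = {"deployments"}  # hypothetical allowlist
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|revoke|create|copy)\b",
    re.IGNORECASE,
)
# Captures the identifier that follows FROM or JOIN (subqueries start with "(" and are skipped).
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)", re.IGNORECASE)


def validate_sql(query: str) -> tuple[bool, str]:
    """Return (is_allowed, reason) for a single LLM-generated statement."""
    stripped = query.strip().rstrip(";").strip()
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if not re.match(r"(?i)select\b", stripped):
        return False, "only SELECT statements are allowed"
    if FORBIDDEN.search(stripped):
        # Conservative: also rejects forbidden words inside string literals.
        return False, "statement contains a forbidden keyword"
    tables = {name.lower() for name in TABLE_REF.findall(stripped)}
    if not tables:
        return False, "no table referenced"
    unknown = tables - ALLOWED_TABLES
    if unknown:
        return False, f"table(s) not in allowlist: {sorted(unknown)}"
    return True, "ok"


ok, reason = validate_sql("SELECT service, status FROM deployments WHERE status = 'failed';")
```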
**Web Search Tool** - Falls back to real-time web search when documents don't have the answer:
```
"What is the current weather in London right now today?"
"What are the latest technology news headlines today?"
"What happened in the tech industry this week?"
```
**Subagent Delegation** - Spawns an isolated subagent with its own context for deep document analysis:
```
"Please analyze the document zetacorp_annual_report.txt and extract the quarterly revenue breakdown with growth rates."
"Analyze project_alpha.txt and extract all the project details."
"Summarize the key findings from research_paper.pdf."
```
The LLM intelligently routes your question to the right tool(s) and can combine multiple tools in a single conversation.
---
## More Details
<br>
<details>
<summary><strong>📚 Documentation</strong></summary>
<br>
- **[PRD.md](./PRD.md)** - Product requirements and detailed module breakdown
- **[CLAUDE.md](./CLAUDE.md)** - Project context for Claude Code (development guidelines)
- **[PROGRESS.md](./PROGRESS.md)** - Build progress tracking, completion status, challenges and solutions
- **[SETUP.md](./SETUP.md)** - Installation and setup instructions
- **[.agents/plans/](./.agents/plans/)** - Detailed implementation plans for each module
- **[.agents/execution-reports/](./.agents/execution-reports/)** - Post-execution summaries and metrics
- **[.agents/claude-pr-reviews/](./.agents/claude-pr-reviews/)** - Code review feedback from Claude
- **[.agents/system-reviews/](./.agents/system-reviews/)** - Process improvement analysis
</details>
<details>
<summary><strong>⚙️ Dev Skills & Workflows</strong></summary>
<br>
This project was developed mostly using the skills in the [ai-dev-env](https://github.com/giladresisi/ai-dev-env) Claude Code plugin.
---
Automated code review and fix workflows for AI-assisted development (located in [`.github/workflows/`](./.github/workflows/)):
- `claude-code-review.yml` - Automatic Claude code review on every PR
- `integration-tests.yml` - Backend integration tests on every PR touching `backend/**` (optional, see [SETUP.md](./SETUP.md))
- `deploy.yml` - Auto-deploys backend to Cloud Run on merge to `main`
- `claude-review.yml` / `claude-fix.yml` - Claude Code integration for on-demand PR reviews and fixes
- `codex-review.yml` / `codex-fix.yml` - OpenAI Codex integration
- `cursor-review.yml` / `cursor-fix.yml` - Cursor IDE integration
- `release-notes.yml` - Automated release notes generation
Customize workflow prompts via [`.github/workflows/prompts/`](./.github/workflows/prompts/) to tailor the review and fix processes to your project's needs.
</details>
<details>
<summary><strong>🚀 Future Enhancements</strong></summary>
<br>
Beyond the current implementation, several advanced RAG techniques could further improve retrieval quality and answer accuracy:
### 1. Graph RAG
Graph-based retrieval augmentation creates knowledge graphs from documents to capture relationships and enable multi-hop reasoning. Instead of treating chunks as isolated text snippets, Graph RAG builds entity relationships and semantic connections that improve contextual understanding. Tools like **Microsoft's GraphRAG**, **LlamaIndex's Knowledge Graph Index**, and **Neo4j with LangChain** provide out-of-the-box implementations. This approach excels at answering questions requiring multi-document synthesis or relationship traversal (e.g., "How are these three research papers connected?").
### 2. Extra LLM Passes
Multiple LLM passes can enhance both retrieval precision and answer quality. Examples include:
- **Question generation per chunk**: Store questions each chunk answers well, enabling question-to-question matching during retrieval
- **Answer validation with retries**: Verify the LLM's response against retrieved context, retry with expanded context if confidence is low
- **Web search fallback validation**: Cross-reference answers with real-time web search to detect outdated or contradictory information
- **Multi-agent verification**: Use separate LLM instances to critique and refine answers before presenting to users
These techniques trade latency for accuracy, making them suitable for high-stakes use cases where correctness outweighs speed.
### 3. Advanced Chunking
Current fixed-size chunking (1000 chars with 200 char overlap) is simple but ignores document structure and semantic boundaries. Advanced approaches include:
- **Semantic chunking** (LangChain): Split documents at natural semantic boundaries using embedding similarity to detect topic shifts
- **Hybrid chunking** (Docling): Combine structural parsing (headings, sections) with content-aware splitting to preserve document hierarchy
- **Agentic chunking**: Use LLMs to dynamically determine optimal chunk boundaries based on content density and question patterns
- **Context-preserving chunking**: Prepend section headers or document metadata to each chunk for better standalone comprehension
These methods improve retrieval relevance by ensuring chunks represent coherent, self-contained units of meaning.
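
For reference, the fixed-size baseline described at the start of this section (1000 characters with a 200-character overlap) can be sketched in a few lines. The function name `chunk_text` and its signature are assumptions made for this example, not the project's actual chunker.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
sample_chunks = chunk_text(sample)
chunk_lengths = [len(chunk) for chunk in sample_chunks]  # [1000, 1000, 900]
```

The windows start at character offsets 0, 800, and 1600, so each chunk repeats the last 200 characters of the previous one regardless of sentence or section boundaries, which is the weakness the approaches above aim to address.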
### 4. Fully Containerized Setup
The current setup requires manual environment configuration across multiple services. A fully containerized approach would include:
- **Local Docker Compose stack**: Bundle backend, frontend, and a local Supabase instance (Postgres + pgvector + Storage + Auth) into a single `docker compose up` with no external accounts required
- **CI testing environment**: Backend integration tests already run on every PR via `docker exec` into the production image (no reinstall of heavy deps like torch/docling). Remaining enhancements: frontend E2E tests in CI, test doubles or stubs for external APIs (currently uses real OpenAI/Supabase in CI), and structured JSON log assertions
- **Automated 1-click setup**: Extend `setup.sh` to detect a Docker environment, skip manual Supabase steps, and wire credentials automatically — reducing new-user setup from ~15 manual steps to a single command
### 5. Hallucination Resistance Scoring
The current RAGAS golden dataset only covers in-distribution questions. Adding out-of-distribution queries (with ground truth "This information is not available") would give RAGAS a quantified hallucination resistance score alongside the existing retrieval quality metrics.
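
As a rough illustration of the idea, the sketch below scores refusals on out-of-distribution questions with plain string matching. It is a simplified proxy and not a RAGAS metric: the `OodCase` type, the refusal markers, and the `hallucination_resistance` function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical phrases that count as an explicit "I don't know" style refusal.
REFUSAL_MARKERS = ("not available", "no information", "cannot find", "could not find")


@dataclass
class OodCase:
    """One out-of-distribution question and the answer the system produced."""
    question: str
    answer: str


def is_refusal(answer: str) -> bool:
    """True if the answer contains any refusal marker (case-insensitive)."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def hallucination_resistance(cases: list[OodCase]) -> float:
    """Fraction of out-of-distribution questions that were correctly refused."""
    if not cases:
        raise ValueError("need at least one out-of-distribution case")
    return sum(is_refusal(case.answer) for case in cases) / len(cases)


cases = [
    OodCase("Who won the 1998 World Cup?", "This information is not available."),
    OodCase("What is the CEO's home address?", "I could not find that in the documents."),
    OodCase("What is the office dress code?", "Business casual on weekdays."),
    OodCase("How many moons does Mars have?", "Mars has two moons."),
]
score = hallucination_resistance(cases)  # 0.5
```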
### 6. Security Upgrades
Several known limitations should be addressed before real-user deployment (see [SECURITY.md](./SECURITY.md) for the full list): no prompt injection protection on user inputs passed to the LLM, no rate limiting on API endpoints, no antivirus scanning on uploaded documents, MFA not enforced, secrets passed as plain environment variables, and no continuous dependency auditing. Addressing these would require input sanitization middleware, a reverse proxy or WAF, Supabase MFA enforcement, migration to a secrets manager, and Dependabot/pip-audit integration.
</details>
<details>
<summary><strong>⚠️ Known Limitations</strong></summary>
<br>
While all 8 modules have been implemented and core functionality is working, several areas need attention before production deployment:
1. **Metadata-Enhanced Retrieval Not Implemented** - Module 4 extracts and stores document metadata (summary, document_type, key_topics) but the retrieval pipeline does not yet use this metadata for filtering or boosting. Documents are retrieved purely by vector/hybrid search score. Metadata-filtered retrieval (e.g. "search only within PDFs" or "find chunks from documents about finance") is a genuine unimplemented gap.
2. **Provider Settings Not Persisted Across Sessions** - The model provider configuration (chat model, embeddings model) is stored in React in-memory state only (`useModelConfig` hook, `useState`). Settings reset to backend defaults every time the browser is refreshed or a new session starts. There is no backend persistence or localStorage for user provider preferences.
3. **Agentic Flow Refinement** - The LLM's multi-step retrieval flow (triggering document retrieval → subagent analysis) needs further testing and system prompt refinement. Both tools work when invoked separately, but the orchestration pattern requires validation and prompt optimization.
4. **Frontend Enhancements** - Several UX improvements would upgrade the look & feel:
- Display tool calls in conversation history (collapsible boxes)
- Persist tool calls as messages in the database
- Show LLM "thinking" responses in the UI
- Improve visual feedback for multi-step agentic workflows
5. **Production Hardening** - Additional validation and security updates needed:
- Comprehensive input validation and sanitization
- Rate limiting and abuse prevention
- Error handling and recovery patterns
- Security audit of RLS policies
- Performance optimization and caching strategies
- Monitoring and alerting infrastructure
- Load testing and scalability validation
</details>
<details>
<summary><strong>🏆 Main Challenges Overcome</strong></summary>
<br>
- **LangSmith traces not closing** — Caught independently via dashboard inspection, diagnosed as an async generator cleanup bug, then converted the one-time finding into automated Playwright tests that poll the LangSmith API to verify trace closure on every run.
- **API lock-in spotted before it compounded** — Recognized mid-build that the OpenAI Responses API would block multi-provider support in the next module; drove the migration to stateless completions before the constraint became structural debt.
- **Bugs only real files could expose** — Synthetic PDFs passed; real-world uploads (including Hebrew filenames and complex multi-column layouts) surfaced 9 separate bugs across two sessions, all caught through hands-on validation.
- **Clean-slate QA pass** — After all modules shipped, set up the project from scratch as a first-time user, found 40 silently skipped tests and multiple broken selectors, and drove fixes to 86/86 backend and 39/39 E2E tests before closing.
- **RAG quality benchmarked from the ground up** — Initiated RAGAS evaluation for LLM qualitative validation, diagnosed near-zero first-run scores agents had missed, and drove the golden dataset and system prompt redesign that brought evaluation scores to where they are now, see [here](https://github.com/giladresisi/ir-copilot?tab=readme-ov-file#evaluation).
</details>
<details>
<summary><strong>💡 Learnings & Conclusions</strong></summary>
<br>
- AI-driven dev works great with clear and meaningful context and requirements; without them, it goes astray and doesn't fully cover what you wanted.
- The setup for AI-driven dev must always be improved. I built and improved my [ai-dev-env](https://github.com/giladresisi/ai-dev-env) plugin while building this project.
- Take your time when validating AI-driven dev. It's up to you to check whether the AI covered all relevant scenarios, and to help it complete the coverage if it didn't.
- Split the work you give AI into features / phases / fixes so it has a better chance of completing them well and you have stable versions to deploy and revert to.
</details>
<details>
<summary><strong>🌟 Inspiration</strong></summary>
<br>
This project was inspired by the **[Claude Code RAG Masterclass](https://www.youtube.com/watch?v=xgPWCuqLoek)**. The original masterclass covered 8 modules of a RAG architecture, as detailed in [PRD.md](./PRD.md). On top of that foundation, many extras were added that weren't part of the original course — including RAG evaluation (RAGAS), a 1-click setup script, Cloud Run deployment, automated frontend (Playwright) and backend (pytest) tests, AI-assisted code review workflows, and observability via LangSmith.
</details>
---
## Evaluation
The project ships with three [RAGAS](https://docs.ragas.io) eval pipelines covering the full quality stack — from simplified retrieval through tool routing to end-to-end chat quality — with results pushed to LangSmith.
**Scores**
| Pipeline | Faithfulness | Answer Relevancy | Context Precision | Context Recall |
|----------|-------------|-----------------|-------------------|----------------|
| `evaluate.py` — simplified RAG | **0.950** | **0.883** | 0.567 | **0.878** |
| `evaluate_chat_quality.py` — full agentic loop | **0.941** | **0.969** | 0.622 | **0.800** |

| Pipeline | Routing Accuracy | Multi-turn Sequence | Arg Quality |
|----------|-----------------|---------------------|-------------|
| `evaluate_tool_selection.py` | **1.000 (12/12)** | **1.000 (3/3)** | **0.917** |
👉 **See [backend/eval/README.md](./backend/eval/README.md)** for prerequisites, per-pipeline details, and known metric quirks.
---
## About This Project
This is a POC-level implementation of the full agentic RAG architecture, built in collaboration with Claude Code. It shows what AI coding tools can achieve when building complex AI applications.
Use it as a learning resource and reference implementation, but apply production-grade engineering practices before deploying to real users. See 'Known Limitations' above for details.
---
## License
MIT License