A modular multi-agent AI system that performs deep scientific research using a supervisor-worker architecture. It combines foundational and specialized language models to reason, plan, and execute tasks for document and chart analysis in scientific domains.
# Agentic Document Intelligence with GPT-4 & SmolDocling

This project is an agentic AI pipeline that uses GPT-4 as the primary agent and integrates SmolDocling as a tool to deeply analyze uploaded documents. It allows users to upload various document formats and extract structured content automatically, with an intelligent evaluation and feedback loop.

---

## Features

* Upload documents in multiple formats (PDF, Word, etc.)
* Automatic conversion to PDF if needed
* Dual extraction using GPT-4 and SmolDocling
* Evaluation of extracted content using BLEU, overlap, and Jaccard similarity
* Iterative feedback to SmolDocling to improve accuracy
* Final structured output in Word and PDF formats
* User prompt execution on final extracted document
* Streamlit UI for ease of use
* Dockerized for simple deployment

---

## Pipeline Overview

1. **Upload Document**: User uploads a file and provides a prompt.
2. **Preprocessing**: The document is converted to PDF (if not already).
3. **GPT-4 Extraction**: Extracts text, tables, images, and structural elements.
4. **SmolDocling Extraction**: Sends static prompt to SmolDocling backend for extraction.
5. **Evaluation**: Compares GPT-4 vs SmolDocling outputs using:
   * Textual overlap ratio
   * BLEU score
   * Jaccard similarity
6. **Consistency Check**:
   * If consistent: build final doc and apply user prompt.
   * If inconsistent: identify differences and retry SmolDocling with feedback.
7. **Final Output**: Assemble and export the cleaned, structured document.

---

## Folder Structure

```
.
├── app.py                  # Streamlit UI
├── Dockerfile              # Docker configuration
├── requirements.txt
│
├── graph/                  # LangGraph logic
│   ├── graph_builder.py
│   └── nodes/              # Nodes in the pipeline
│       ├── user_input.py
│       ├── preprocess_doc.py
│       ├── gpt_extract.py
│       ├── smoldocling_call.py
│       ├── evaluate.py
│       ├── retry_node.py
│       ├── fina
```
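The evaluation and consistency-check steps of the pipeline overview could be sketched roughly as follows. This is a minimal illustration, not the project's actual `evaluate.py`: the function names, the whitespace tokenization, and the `0.8` threshold are all assumptions made for the example. BLEU is left out because it would normally come from a library such as NLTK; only the overlap ratio and Jaccard similarity are shown, using the standard library.

```python
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization (an assumption for this sketch)."""
    return text.lower().split()


def overlap_ratio(a: str, b: str) -> float:
    """Order-aware overlap of the two token sequences, from 0.0 to 1.0."""
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty extractions are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def is_consistent(gpt_text: str, smol_text: str,
                  threshold: float = 0.8) -> tuple[bool, dict[str, float]]:
    """Compare two extractions; the threshold value is hypothetical."""
    scores = {
        "overlap": overlap_ratio(gpt_text, smol_text),
        "jaccard": jaccard_similarity(gpt_text, smol_text),
    }
    return all(score >= threshold for score in scores.values()), scores


if __name__ == "__main__":
    ok, scores = is_consistent("Table 1 shows results", "table 1 shows results")
    print(ok, scores)
```

When `is_consistent` returns `False`, a pipeline of this shape would pass the score dictionary (or a diff of the two texts) back to the retry step as feedback, and would otherwise move on to building the final document.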
Google's AI-powered research notebook that ingests your documents and becomes an expert on your content. Generates audio overviews, study guides, FAQs, and interactive discussions from uploaded sources.
Google DeepMind's experimental AI agent that can navigate websites, fill forms, and complete multi-step browser tasks autonomously. Uses Gemini's multimodal understanding to interact with web interfaces.
Google DeepMind's universal AI assistant prototype that can see, hear, and respond in real-time through your device camera and microphone. Demonstrates the future of multimodal AI interaction.
Google Cloud's enterprise platform for building, deploying, and managing AI agents powered by Gemini. Supports multi-agent orchestration, tool integration, and enterprise governance.
Gemini's agentic research capability that autonomously browses the web, synthesizes information from dozens of sources, and produces comprehensive research reports on any topic.
Interactive coding and content creation agent that generates, previews, and iterates on code, documents, and interactive applications in a side panel. Supports HTML/CSS/JS, Python, and more.