Schema-Driven Document Extraction Service (Gently Take-Home)

Building Option 1 from Gently's take-home challenge: a service that ingests documents, classifies them against user-defined schemas, extracts structured data, and makes it queryable. Framed for ERP domain (invoices, purchase orders, receipts, contracts).

benjgrad

May 2, 2026

0 upvotes

0 downloads

0 views

ai llm prompt openai

View source

# Schema-Driven Document Extraction Service (Gently Take-Home) ## Context Building Option 1 from Gently's take-home challenge: a service that ingests documents, classifies them against user-defined schemas, extracts structured data, and makes it queryable. Framed for ERP domain (invoices, purchase orders, receipts, contracts). **Core differentiator**: Document types are **data, not code** -- adding a new type requires zero code changes. POST a schema, and the system immediately classifies, extracts, and indexes against it. **Stack**: Express + TypeScript + Postgres (pgvector + tsvector) + OpenAI Key patterns adapted from the existing writing-ui project: - Async queue with `FOR UPDATE SKIP LOCKED` job claiming (from `0005_extraction_queue.sql`) - Content-hash deduplication (from `0009_improved_deduplication.sql`) - LLM extraction with structured JSON output (from `api/ai/extract/route.ts`) - Dynamic prompt building from runtime data (from `lib/ai/prompts/nvq-extraction.ts`) --- ## Project Structure ``` gently-extract/ README.md package.json / tsconfig.json docker-compose.yml # Postgres 16 + pgvector src/ app.ts # Express setup (exported for tests) index.ts # Server startup + worker (not imported by tests) config.ts # Typed env config db/ pool.ts # pg Pool singleton migrate.ts # Run .sql files in order seed.ts # Seed 3 ERP doc types + example docs migrations/ 001_extensions.sql # uuid-ossp, pgcrypto, vector 002_schema.sql # All tables routes/ schemas.ts # CRUD for document types + examples documents.ts # Document ingestion + status extractions.ts # Query extracted data + search queue.ts # Queue visibility health.ts services/ ingestion.ts # Hash, dedup, queue classifier.ts # Multi-signal document classification extractor.ts # Schema-driven field extraction chunker.ts # Structure-aware chunking embedder.ts # Embedding generation worker.ts # Polling loop, claims + processes jobs ai/ client.ts # OpenAI wrapper prompts/ classify.ts # Few-shot classification prompt builder extract.ts # Schema-driven extraction prompt builder middleware/ error-handler.ts request-id.ts types/ index.ts tests/ unit/ chunker.test.ts ingestion.test.ts classifier.test.ts # Mocked OpenAI extractor.test.ts # Mocked OpenAI integration/ pipeline.test.ts # Full submit -> extract -> query fixtures/ sample-invoice.txt sample-purchase-order.txt sample-receipt.txt messy-invoice.txt # Missing fields, OCR-quality helpers/ mock-openai.ts test-db.ts client/ # Bare-bones React (Vite) index.html src/ App.tsx components/ SchemaBuilder.tsx DocumentUploader.tsx ExtractionViewer.tsx ``` --- ## Database Schema ### `document_types` (Schema Registry) Replaces the weak `classification_hints` with a structured `classification_config` that provides multiple signals for the LLM classifier. The `extraction_schema` defines what fields to extract. ```sql CREATE TABLE document_types ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name TEXT NOT NULL UNIQUE, -- 'invoice', 'purchase_order' description TEXT NOT NULL, -- Human-readable description version INTEGER NOT NULL DEFAULT 1, -- Bumped on extraction_schema change is_active BOOLEAN NOT NULL DEFAULT true, -- Rich classification config (multiple signal types for the LLM) classification_config JSONB NOT NULL DEFAULT '{}', -- { -- "keywords": ["invoice", "bill to", "amount due", "payment terms"], -- "structural_patterns": [ -- "contains a table of line items with description, quantity, price columns", -- "has a header section with vendor/supplier info and invoice number", -- "includes payment terms and due date" -- ], -- "negative_indicators": ["purchase order", "PO number", "ship to"], -- "expected_sections": ["vendor_info", "line_items", "totals", "payment_info"] -- } -- The extraction schema: what fields to pull out extraction_schema JSONB NOT NULL, -- { -- "fields": [ -- {"name": "invoice_number", "type": "string", "required": true, "description": "Unique invoice ID"}, -- {"name": "vendor_name", "type": "string", "required": true, "description": "Vendor/supplier name"}, -- {"name": "total_amount", "type": "number", "required": true, "description": "Total amount due"}, -- {"name": "line_items", "type": "array", "required": true, "description": "Billed items", -- "items": { -- "fields": [ -- {"name": "description", "type": "string", "required": true}, -- {"name": "quantity", "type": "number", "required": true}, -- {"name": "unit_price", "type": "number", "required": true}, -- {"name": "amount", "type": "number", "required": true} -- ] -- } -- } -- ] -- } created_at TIMESTAMPTZ DEFAULT NOW(), updated_at TIMESTAMPTZ DEFAULT NOW() ); CREATE INDEX ON document_types(is_active) WHERE is_active = true; ``` ### `document_type_examples` (Few-Shot Classification) Store example documents per type so the classifier can use few-shot learning. This is what makes classification work beyond a proof of concept. ```sql CREATE TABLE document_type_examples ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), document_type_id UUID NOT NULL REFERENCES document_types(id) ON DELETE CASCADE, content TEXT NOT NULL, -- Example document text (or representative snippet) label TEXT NOT NULL DEFAULT 'positive' CHECK (label IN ('positive', 'negative')), -- positive = "this IS an invoice" notes TEXT, -- Why this is a good example created_at TIMESTAMPTZ DEFAULT NOW() ); CREATE INDEX ON document_type_examples(document_type_id); ``` ### `documents` ```sql CREATE TABLE documents ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), original_text TEXT NOT NULL, content_hash TEXT NOT NULL, source_filename TEXT, -- Classification result document_type_id UUID REFERENCES document_types(id), classification_confidence REAL, classification_reasoning TEXT, -- LLM explanation of why this type -- Processing status status TEXT NOT NULL DEFAULT 'pending' CHECK (status IN ('pending','processing','classified','extracted','failed')), error_message TEXT, -- Document profile (lightweight version of ChatGPT's "Document Model") word_count INTEGER NOT NULL, section_count INTEGER, has_tables BOOLEAN, language TEXT, -- Full-text search (BM25 via tsvector -- hybrid search alongside pgvector) search_vector tsvector GENERATED ALWAYS AS ( to_tsvector('english', coalesce(original_text, '')) ) STORED, created_at TIMESTAMPTZ DEFAULT NOW(), updated_at TIMESTAMPTZ DEFAULT NOW() ); CREATE UNIQUE INDEX ON documents(content_hash); CREATE INDEX ON documents(status); CREATE INDEX ON documents(document_type_id); CREATE INDEX ON documents USING GIN (search_vector); -- BM25 keyword search ``` ### `extraction_queue` ```sql CREATE TABLE extraction_queue ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE UNIQUE, status TEXT NOT NULL DEFAULT 'pending' CHECK (status IN ('pending','processing','completed','failed')), priority INTEGER NOT NULL DEFAULT 0, attempts INTEGER NOT NULL DEFAULT 0, max_attempts INTEGER NOT NULL DEFAULT 3, error_message TEXT, created_at TIMESTAMPTZ DEFAULT NOW(), started_at TIMESTAMPTZ, completed_at TIMESTAMPTZ ); CREATE INDEX ON extraction_queue(priority DESC, created_at ASC) WHERE status = 'pending'; ``` ### `extractions` ```sql CREATE TABLE extractions ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE, document_type_id UUID NOT NULL REFERENCES document_types(id), schema_version INTEGER NOT NULL, -- Extracted structured data extracted_data JSONB NOT NULL, -- {fields: {invoice_number: "INV-001", ...}} confidence REAL, -- Overall extraction confidence field_confidences JSONB, -- {"invoice_number": 0.98, "vendor_name": 0.85} extraction_notes TEXT, -- LLM-reported quality issues -- Provenance: where each field was found in the document field_provenance JSONB, -- {"invoice_number": {"section": "header", "paragraph_index": 0, "text_snippet": "Invoice #INV-001"}, -- "vendor_name": {"section": "header", "paragraph_index": 1, "text_snippet": "From: Acme Corp"}} -- Semantic search vector embedding vector(1536), created_at TIMESTAMPTZ DEFAULT NOW() ); CREATE INDEX ON extractions(document_id); CREATE INDEX ON extractions(document_type_id); CREATE INDEX ON extractions USING hnsw (embedding vector_cosine_ops); ``` ### `extracted_fields` (denormalized for fast structured queries) ```sql CREATE TABLE extracted_fields ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), extraction_id UUID NOT NULL REFERENCES extractions(id) ON DELETE CASCADE, document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE, field_name TEXT NOT NULL, field_value TEXT, field_type TEXT NOT NULL, confidence REAL, source_snippet TEXT, -- Provenance: text where this was found created_at TIMESTAMPTZ DEFAULT NOW() ); CREATE INDEX ON extracted_fields(field_name, field_value); CREATE INDEX ON extracted_fields(document_id); ``` --- ## API Routes ### Schemas - `POST /api/schemas` -- Create document type (classification_config + extraction_schema) - `GET /api/schemas` -- List all active schemas - `GET /api/schemas/:id` -- Get schema with examples - `PUT /api/schemas/:id` -- Update schema (bumps version on extraction_schema change) - `DELETE /api/schemas/:id` -- Soft delete - `POST /api/schemas/:id/examples` -- Add example document for few-shot classification - `GET /api/schemas/:id/examples` -- List examples ### Documents - `POST /api/documents` -- Submit text; returns 202 (new) or 200 (duplicate) - `GET /api/documents` -- List with filters (?status=, ?type=, ?page=, ?limit=) - `GET /api/documents/:id` -- Document with extraction results - `POST /api/documents/:id/reprocess` -- Re-queue after schema change ### Search (hybrid: keyword + semantic + structured) - `GET /api/search` -- Hybrid search: `?q=acme invoice` uses BM25 + vector - `GET /api/extractions` -- Structured query: `?type=invoice&field=vendor_name&value=Acme` - `GET /api/extractions/:id` -- Single extraction with provenance ### Queue - `GET /api/queue` -- Status counts + recent jobs - `POST /api/queue/retry-failed` -- Reset failed jobs to pending ### Health - `GET /api/health` -- DB + queue stats --- ## Processing Pipeline ### 1. Ingestion (`POST /api/documents`) 1. Compute SHA-256 content hash 2. Check for duplicate (UNIQUE index on content_hash) 3. If duplicate: return existing document (idempotent, 200) 4. If new: INSERT document, INSERT extraction_queue job, return 202 ### 2. Worker Claims Job ```sql UPDATE extraction_queue SET status='processing', started_at=NOW(), attempts=attempts+1 WHERE id = ( SELECT id FROM extraction_queue WHERE status='pending' ORDER BY priority DESC, created_at ASC LIMIT 1 FOR UPDATE SKIP LOCKED ) RETURNING * ``` ### 3. Document Profiling (lightweight Document Model) Before classification, analyze the raw text: - Split into sections (by double newlines, horizontal rules, header patterns) - Detect if tables are present (look for aligned columns, tab-separated data) - Count sections, estimate structure - Store profile on the `documents` row (section_count, has_tables, language) ### 4. Classification (multi-signal, few-shot) Build the classification prompt from: 1. **Type descriptions** from `document_types.description` 2. **Keywords + structural patterns** from `classification_config` 3. **Few-shot examples** from `document_type_examples` (positive + negative) 4. **Negative indicators** to avoid confusion between similar types (invoice vs PO) 5. **Document profile** (section count, has tables, etc.) The LLM returns: `{document_type_id, confidence, reasoning}` If confidence < threshold (e.g., 0.6), mark as "unclassified" for human review rather than guessing. ### 5. Chunking (if needed) - Estimate tokens (~text.length/4) - If >3000 tokens: structure-aware split (section boundaries > paragraph boundaries) - Overlap between chunks (200 tokens), but never overlap across sections - Extract from each chunk independently, then merge results - Merge: highest-confidence wins for scalars; concatenate + deduplicate for arrays ### 6. Extraction (schema-driven) - Load `document_type.extraction_schema` - Build prompt with field definitions injected from schema - Ask LLM to also return **provenance** per field (which section/paragraph the data came from) - LLM returns: `{fields: {...}, field_confidences: {...}, field_provenance: {...}, extraction_notes: "..."}` - Validate required fields present ### 7. Storage + Indexing - INSERT `extractions` row (with embedding from text-embedding-3-small) - INSERT denormalized `extracted_fields` rows (for structured queries) - tsvector auto-generated on documents (for BM25 keyword search) - Update document status to 'extracted' - Mark queue job 'completed' ### 8. Error Handling - On failure: if attempts < max_attempts, reset to 'pending' for retry - If exhausted: mark 'failed', set error_message on both queue and document --- ## LLM Prompts ### Classification prompt (few-shot, multi-signal) ``` You are a document classifier for an ERP system. ## Available Document Types ### Invoice Description: A bill from a vendor for goods or services rendered. Keywords: invoice, bill to, amount due, payment terms, net 30 Structural patterns: - Contains a table of line items with description, quantity, price columns - Has a header with vendor info and invoice number - Includes payment terms and due date Negative indicators: purchase order, PO number, ship to address ### Purchase Order Description: A request to purchase goods or services from a vendor. Keywords: purchase order, PO, ship to, deliver by, requisition ... [built dynamically from document_types + classification_config] ## Examples ### Example: Invoice (positive) """ INVOICE #INV-2024-001 From: Acme Corp | 123 Main St Bill To: Widget Inc Item Qty Price Total Widget A 10 $5.00 $50.00 Widget B 5 $10.00 $50.00 Subtotal: $100.00 Tax (8%): $8.00 Total Due: $108.00 Payment Terms: Net 30 """ ### Example: Purchase Order (positive) ... [built from document_type_examples table] ## Document Profile Sections: 4 | Has tables: yes | Word count: 150 ## Instructions Classify the document below. Respond with JSON: { "document_type_id": "uuid or null if unknown", "type_name": "name or 'unknown'", "confidence": 0.0 to 1.0, "reasoning": "Brief explanation" } Be conservative. If ambiguous, use confidence < 0.7. If no type matches, set type_name to "unknown". ``` ### Extraction prompt (schema-driven + provenance) ``` You are an ERP data extractor. Extract structured data from a [Invoice] document. ## Fields to Extract - invoice_number (string, REQUIRED): Unique invoice identifier - vendor_name (string, REQUIRED): Name of the vendor/supplier - total_amount (number, REQUIRED): Total amount due - line_items (array, REQUIRED): Individual line items - description (string, required) - quantity (number, required) - unit_price (number, required) - amount (number, required) - due_date (date, optional): Payment due date - currency (string, optional): Currency code [built dynamically from extraction_schema JSONB] ## Rules 1. Extract ONLY fields listed above. 2. For REQUIRED fields that are missing, set to null and note in extraction_notes. 3. Dates in ISO 8601 (YYYY-MM-DD). Numbers without currency symbols. 4. For each field, note WHERE in the document you found it (provenance). ## Response Format { "fields": {"invoice_number": "INV-2024-001", ...}, "field_confidences": {"invoice_number": 0.98, ...}, "field_provenance": { "invoice_number": {"section": "header", "text_snippet": "INVOICE #INV-2024-001"}, "vendor_name": {"section": "header", "text_snippet": "From: Acme Corp"} }, "extraction_notes": "Any quality issues, missing data, or ambiguities" } ``` --- ## Search Strategy (Hybrid) Three query paths from one pipeline, addressing different user needs: 1. **Keyword search** (BM25 via Postgres tsvector): `GET /api/search?q=INV-2024-001` -- exact invoice numbers, vendor names, specific terms Uses `ts_rank` + `to_tsquery` on `documents.search_vector` 2. **Semantic search** (pgvector): `GET /api/search?q=large invoices from technology vendors&mode=semantic` Embeds query, finds nearest extraction embeddings 3. **Structured query** (field-level): `GET /api/extractions?type=invoice&field=vendor_name&value=Acme&field=total_amount&gt=10000` Queries `extracted_fields` table directly Default `/api/search` uses hybrid: runs both BM25 and vector, merges results with configurable weighting. --- ## Seed Data Pre-seed 3 document types with full classification_config + extraction_schema + 2 examples each: 1. **Invoice** -- vendor_name, invoice_number, issue_date, due_date, line_items[], subtotal, tax, total_amount, currency, payment_terms 2. **Purchase Order** -- po_number, vendor, ship_to, items[], total, requested_delivery_date, approved_by 3. **Receipt** -- merchant_name, date, items[], subtotal, tax, total, payment_method Each type gets 2 example documents (positive) seeded in `document_type_examples`. Sample text fixtures in `tests/fixtures/` for testing. --- ## What to Simplify / Mock | Area | Simplification | Walkthrough talking point | |------|---------------|--------------------------| | Auth | Skip entirely | "Per instructions. Production: JWT + tenant_id + RLS." | | File upload | Text-only POST body | "Would add multipart + pdf-parse/Textract. Pipeline has pre-processing hook." | | OCR | Not implemented | "Would integrate Google Document AI as a pre-classification step." | | Worker | Single-process polling | "FOR UPDATE SKIP LOCKED scales to N workers with no code changes." | | Schema validation | Basic type checks | "Production: ajv for full JSON Schema validation." | | Reranking | Not implemented | "Would add cross-encoder reranking for hybrid search quality." | --- ## What Makes This Impressive 1. **Schema as data, not code** -- POST a new document type, system immediately classifies + extracts. Live-demo by creating a "Shipping Manifest" type during walkthrough. 2. **Few-shot classification** -- Example documents per type make classification actually work well, not just keyword matching. This is the kind of detail that separates "built a prototype" from "thought about production." 3. **Hybrid search** -- BM25 catches exact IDs/names, vector catches semantic meaning. Shows understanding that different users query differently (accountant by invoice#, executive by concept). 4. **Provenance tracking** -- Know exactly where in the document each field was found. Critical for ERP auditing, enables "click to see source" in UI. 5. **Natural idempotency** -- Content hash dedup makes the API safe to retry. 6. **Production-grade queue** -- FOR UPDATE SKIP LOCKED, retry tracking, observable status. 7. **Schema versioning** -- Extractions track schema version; re-process with updated schemas. 8. **Per-field confidence + extraction_notes** -- LLM reports quality issues, enabling human-in-the-loop review. --- ## Implementation Sequence 1. Project setup, Docker Compose, migrations, seed data 2. Express skeleton, health route, schema CRUD routes (including examples endpoint) 3. Document ingestion with dedup + queue 4. Worker skeleton with FOR UPDATE SKIP LOCKED claiming 5. Document profiling (section detection, table detection) 6. Classification service + few-shot prompt builder 7. Extraction service + schema-driven prompt builder (with provenance) 8. Chunking for large documents 9. Embedding generation + hybrid search route (BM25 + vector) 10. Structured extraction query routes (field-level) 11. Tests (unit: chunker, dedup, mocked LLM; integration: full pipeline) 12. Frontend (schema builder with example upload, document uploader, extraction viewer with provenance highlights) 13. README with architecture diagram + design decisions --- ## Verification - `docker-compose up` starts Postgres with pgvector - `npm run migrate && npm run seed` creates tables + 3 ERP doc types with examples - `npm run dev` starts Express + worker - Submit sample invoice via `POST /api/documents` -> 202 - Poll `GET /api/documents/:id` -> pending -> processing -> extracted - View extraction with provenance: `GET /api/extractions/:id` - Keyword search: `GET /api/search?q=INV-2024-001` finds by invoice number - Semantic search: `GET /api/search?q=large invoices from tech vendors&mode=semantic` - Structured: `GET /api/extractions?type=invoice&field=vendor_name&value=Acme` - Create new type: `POST /api/schemas` + add example + submit matching doc -> auto-classified - Resubmit same document -> 200 with is_duplicate=true - Reprocess after schema update: `POST /api/documents/:id/reprocess` --- ## Critical Source Files to Reference From existing writing-ui project (patterns to adapt, not copy): - `supabase/migrations/0005_extraction_queue.sql` -- Queue with FOR UPDATE SKIP LOCKED - `src/app/api/ai/extract/route.ts` -- LLM extraction + quality scoring + refinement loop - `src/lib/ai/providers/openai.ts` -- OpenAI client with `response_format: {type: 'json_object'}` - `src/lib/ai/prompts/nvq-extraction.ts` -- Dynamic prompt building from runtime data

Related Documents

cheap-RAG Development Roadmap

Semblance AI — Development Roadmap

Changelog

Toasty — AI Triage & Responsible Disclosure Assistant (2026 — 350 hours)