Loading...
Loading...
Building Option 1 from Gently's take-home challenge: a service that ingests documents, classifies them against user-defined schemas, extracts structured data, and makes it queryable. Framed for ERP domain (invoices, purchase orders, receipts, contracts).
# Schema-Driven Document Extraction Service (Gently Take-Home)
## Context
Building Option 1 from Gently's take-home challenge: a service that ingests documents, classifies them against user-defined schemas, extracts structured data, and makes it queryable. Framed for ERP domain (invoices, purchase orders, receipts, contracts).
**Core differentiator**: Document types are **data, not code** -- adding a new type requires zero code changes. POST a schema, and the system immediately classifies, extracts, and indexes against it.
**Stack**: Express + TypeScript + Postgres (pgvector + tsvector) + OpenAI
Key patterns adapted from the existing writing-ui project:
- Async queue with `FOR UPDATE SKIP LOCKED` job claiming (from `0005_extraction_queue.sql`)
- Content-hash deduplication (from `0009_improved_deduplication.sql`)
- LLM extraction with structured JSON output (from `api/ai/extract/route.ts`)
- Dynamic prompt building from runtime data (from `lib/ai/prompts/nvq-extraction.ts`)
---
## Project Structure
```
gently-extract/
README.md
package.json / tsconfig.json
docker-compose.yml # Postgres 16 + pgvector
src/
app.ts # Express setup (exported for tests)
index.ts # Server startup + worker (not imported by tests)
config.ts # Typed env config
db/
pool.ts # pg Pool singleton
migrate.ts # Run .sql files in order
seed.ts # Seed 3 ERP doc types + example docs
migrations/
001_extensions.sql # uuid-ossp, pgcrypto, vector
002_schema.sql # All tables
routes/
schemas.ts # CRUD for document types + examples
documents.ts # Document ingestion + status
extractions.ts # Query extracted data + search
queue.ts # Queue visibility
health.ts
services/
ingestion.ts # Hash, dedup, queue
classifier.ts # Multi-signal document classification
extractor.ts # Schema-driven field extraction
chunker.ts # Structure-aware chunking
embedder.ts # Embedding generation
worker.ts # Polling loop, claims + processes jobs
ai/
client.ts # OpenAI wrapper
prompts/
classify.ts # Few-shot classification prompt builder
extract.ts # Schema-driven extraction prompt builder
middleware/
error-handler.ts
request-id.ts
types/
index.ts
tests/
unit/
chunker.test.ts
ingestion.test.ts
classifier.test.ts # Mocked OpenAI
extractor.test.ts # Mocked OpenAI
integration/
pipeline.test.ts # Full submit -> extract -> query
fixtures/
sample-invoice.txt
sample-purchase-order.txt
sample-receipt.txt
messy-invoice.txt # Missing fields, OCR-quality
helpers/
mock-openai.ts
test-db.ts
client/ # Bare-bones React (Vite)
index.html
src/
App.tsx
components/
SchemaBuilder.tsx
DocumentUploader.tsx
ExtractionViewer.tsx
```
---
## Database Schema
### `document_types` (Schema Registry)
Replaces the weak `classification_hints` with a structured `classification_config` that provides multiple signals for the LLM classifier. The `extraction_schema` defines what fields to extract.
```sql
CREATE TABLE document_types (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL UNIQUE, -- 'invoice', 'purchase_order'
description TEXT NOT NULL, -- Human-readable description
version INTEGER NOT NULL DEFAULT 1, -- Bumped on extraction_schema change
is_active BOOLEAN NOT NULL DEFAULT true,
-- Rich classification config (multiple signal types for the LLM)
classification_config JSONB NOT NULL DEFAULT '{}',
-- {
-- "keywords": ["invoice", "bill to", "amount due", "payment terms"],
-- "structural_patterns": [
-- "contains a table of line items with description, quantity, price columns",
-- "has a header section with vendor/supplier info and invoice number",
-- "includes payment terms and due date"
-- ],
-- "negative_indicators": ["purchase order", "PO number", "ship to"],
-- "expected_sections": ["vendor_info", "line_items", "totals", "payment_info"]
-- }
-- The extraction schema: what fields to pull out
extraction_schema JSONB NOT NULL,
-- {
-- "fields": [
-- {"name": "invoice_number", "type": "string", "required": true, "description": "Unique invoice ID"},
-- {"name": "vendor_name", "type": "string", "required": true, "description": "Vendor/supplier name"},
-- {"name": "total_amount", "type": "number", "required": true, "description": "Total amount due"},
-- {"name": "line_items", "type": "array", "required": true, "description": "Billed items",
-- "items": {
-- "fields": [
-- {"name": "description", "type": "string", "required": true},
-- {"name": "quantity", "type": "number", "required": true},
-- {"name": "unit_price", "type": "number", "required": true},
-- {"name": "amount", "type": "number", "required": true}
-- ]
-- }
-- }
-- ]
-- }
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON document_types(is_active) WHERE is_active = true;
```
### `document_type_examples` (Few-Shot Classification)
Store example documents per type so the classifier can use few-shot learning. This is what makes classification work beyond a proof of concept.
```sql
CREATE TABLE document_type_examples (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_type_id UUID NOT NULL REFERENCES document_types(id) ON DELETE CASCADE,
content TEXT NOT NULL, -- Example document text (or representative snippet)
label TEXT NOT NULL DEFAULT 'positive'
CHECK (label IN ('positive', 'negative')), -- positive = "this IS an invoice"
notes TEXT, -- Why this is a good example
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON document_type_examples(document_type_id);
```
### `documents`
```sql
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
original_text TEXT NOT NULL,
content_hash TEXT NOT NULL,
source_filename TEXT,
-- Classification result
document_type_id UUID REFERENCES document_types(id),
classification_confidence REAL,
classification_reasoning TEXT, -- LLM explanation of why this type
-- Processing status
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','processing','classified','extracted','failed')),
error_message TEXT,
-- Document profile (lightweight version of ChatGPT's "Document Model")
word_count INTEGER NOT NULL,
section_count INTEGER,
has_tables BOOLEAN,
language TEXT,
-- Full-text search (BM25 via tsvector -- hybrid search alongside pgvector)
search_vector tsvector GENERATED ALWAYS AS (
to_tsvector('english', coalesce(original_text, ''))
) STORED,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE UNIQUE INDEX ON documents(content_hash);
CREATE INDEX ON documents(status);
CREATE INDEX ON documents(document_type_id);
CREATE INDEX ON documents USING GIN (search_vector); -- BM25 keyword search
```
### `extraction_queue`
```sql
CREATE TABLE extraction_queue (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE UNIQUE,
status TEXT NOT NULL DEFAULT 'pending'
CHECK (status IN ('pending','processing','completed','failed')),
priority INTEGER NOT NULL DEFAULT 0,
attempts INTEGER NOT NULL DEFAULT 0,
max_attempts INTEGER NOT NULL DEFAULT 3,
error_message TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
);
CREATE INDEX ON extraction_queue(priority DESC, created_at ASC) WHERE status = 'pending';
```
### `extractions`
```sql
CREATE TABLE extractions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
document_type_id UUID NOT NULL REFERENCES document_types(id),
schema_version INTEGER NOT NULL,
-- Extracted structured data
extracted_data JSONB NOT NULL, -- {fields: {invoice_number: "INV-001", ...}}
confidence REAL, -- Overall extraction confidence
field_confidences JSONB, -- {"invoice_number": 0.98, "vendor_name": 0.85}
extraction_notes TEXT, -- LLM-reported quality issues
-- Provenance: where each field was found in the document
field_provenance JSONB,
-- {"invoice_number": {"section": "header", "paragraph_index": 0, "text_snippet": "Invoice #INV-001"},
-- "vendor_name": {"section": "header", "paragraph_index": 1, "text_snippet": "From: Acme Corp"}}
-- Semantic search vector
embedding vector(1536),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON extractions(document_id);
CREATE INDEX ON extractions(document_type_id);
CREATE INDEX ON extractions USING hnsw (embedding vector_cosine_ops);
```
### `extracted_fields` (denormalized for fast structured queries)
```sql
CREATE TABLE extracted_fields (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
extraction_id UUID NOT NULL REFERENCES extractions(id) ON DELETE CASCADE,
document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
field_name TEXT NOT NULL,
field_value TEXT,
field_type TEXT NOT NULL,
confidence REAL,
source_snippet TEXT, -- Provenance: text where this was found
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ON extracted_fields(field_name, field_value);
CREATE INDEX ON extracted_fields(document_id);
```
---
## API Routes
### Schemas
- `POST /api/schemas` -- Create document type (classification_config + extraction_schema)
- `GET /api/schemas` -- List all active schemas
- `GET /api/schemas/:id` -- Get schema with examples
- `PUT /api/schemas/:id` -- Update schema (bumps version on extraction_schema change)
- `DELETE /api/schemas/:id` -- Soft delete
- `POST /api/schemas/:id/examples` -- Add example document for few-shot classification
- `GET /api/schemas/:id/examples` -- List examples
### Documents
- `POST /api/documents` -- Submit text; returns 202 (new) or 200 (duplicate)
- `GET /api/documents` -- List with filters (?status=, ?type=, ?page=, ?limit=)
- `GET /api/documents/:id` -- Document with extraction results
- `POST /api/documents/:id/reprocess` -- Re-queue after schema change
### Search (hybrid: keyword + semantic + structured)
- `GET /api/search` -- Hybrid search: `?q=acme invoice` uses BM25 + vector
- `GET /api/extractions` -- Structured query: `?type=invoice&field=vendor_name&value=Acme`
- `GET /api/extractions/:id` -- Single extraction with provenance
### Queue
- `GET /api/queue` -- Status counts + recent jobs
- `POST /api/queue/retry-failed` -- Reset failed jobs to pending
### Health
- `GET /api/health` -- DB + queue stats
---
## Processing Pipeline
### 1. Ingestion (`POST /api/documents`)
1. Compute SHA-256 content hash
2. Check for duplicate (UNIQUE index on content_hash)
3. If duplicate: return existing document (idempotent, 200)
4. If new: INSERT document, INSERT extraction_queue job, return 202
### 2. Worker Claims Job
```sql
UPDATE extraction_queue SET status='processing', started_at=NOW(), attempts=attempts+1
WHERE id = (
SELECT id FROM extraction_queue WHERE status='pending'
ORDER BY priority DESC, created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
) RETURNING *
```
### 3. Document Profiling (lightweight Document Model)
Before classification, analyze the raw text:
- Split into sections (by double newlines, horizontal rules, header patterns)
- Detect if tables are present (look for aligned columns, tab-separated data)
- Count sections, estimate structure
- Store profile on the `documents` row (section_count, has_tables, language)
### 4. Classification (multi-signal, few-shot)
Build the classification prompt from:
1. **Type descriptions** from `document_types.description`
2. **Keywords + structural patterns** from `classification_config`
3. **Few-shot examples** from `document_type_examples` (positive + negative)
4. **Negative indicators** to avoid confusion between similar types (invoice vs PO)
5. **Document profile** (section count, has tables, etc.)
The LLM returns: `{document_type_id, confidence, reasoning}`
If confidence < threshold (e.g., 0.6), mark as "unclassified" for human review rather than guessing.
### 5. Chunking (if needed)
- Estimate tokens (~text.length/4)
- If >3000 tokens: structure-aware split (section boundaries > paragraph boundaries)
- Overlap between chunks (200 tokens), but never overlap across sections
- Extract from each chunk independently, then merge results
- Merge: highest-confidence wins for scalars; concatenate + deduplicate for arrays
### 6. Extraction (schema-driven)
- Load `document_type.extraction_schema`
- Build prompt with field definitions injected from schema
- Ask LLM to also return **provenance** per field (which section/paragraph the data came from)
- LLM returns: `{fields: {...}, field_confidences: {...}, field_provenance: {...}, extraction_notes: "..."}`
- Validate required fields present
### 7. Storage + Indexing
- INSERT `extractions` row (with embedding from text-embedding-3-small)
- INSERT denormalized `extracted_fields` rows (for structured queries)
- tsvector auto-generated on documents (for BM25 keyword search)
- Update document status to 'extracted'
- Mark queue job 'completed'
### 8. Error Handling
- On failure: if attempts < max_attempts, reset to 'pending' for retry
- If exhausted: mark 'failed', set error_message on both queue and document
---
## LLM Prompts
### Classification prompt (few-shot, multi-signal)
```
You are a document classifier for an ERP system.
## Available Document Types
### Invoice
Description: A bill from a vendor for goods or services rendered.
Keywords: invoice, bill to, amount due, payment terms, net 30
Structural patterns:
- Contains a table of line items with description, quantity, price columns
- Has a header with vendor info and invoice number
- Includes payment terms and due date
Negative indicators: purchase order, PO number, ship to address
### Purchase Order
Description: A request to purchase goods or services from a vendor.
Keywords: purchase order, PO, ship to, deliver by, requisition
...
[built dynamically from document_types + classification_config]
## Examples
### Example: Invoice (positive)
"""
INVOICE #INV-2024-001
From: Acme Corp | 123 Main St
Bill To: Widget Inc
Item Qty Price Total
Widget A 10 $5.00 $50.00
Widget B 5 $10.00 $50.00
Subtotal: $100.00
Tax (8%): $8.00
Total Due: $108.00
Payment Terms: Net 30
"""
### Example: Purchase Order (positive)
...
[built from document_type_examples table]
## Document Profile
Sections: 4 | Has tables: yes | Word count: 150
## Instructions
Classify the document below. Respond with JSON:
{
"document_type_id": "uuid or null if unknown",
"type_name": "name or 'unknown'",
"confidence": 0.0 to 1.0,
"reasoning": "Brief explanation"
}
Be conservative. If ambiguous, use confidence < 0.7.
If no type matches, set type_name to "unknown".
```
### Extraction prompt (schema-driven + provenance)
```
You are an ERP data extractor. Extract structured data from a [Invoice] document.
## Fields to Extract
- invoice_number (string, REQUIRED): Unique invoice identifier
- vendor_name (string, REQUIRED): Name of the vendor/supplier
- total_amount (number, REQUIRED): Total amount due
- line_items (array, REQUIRED): Individual line items
- description (string, required)
- quantity (number, required)
- unit_price (number, required)
- amount (number, required)
- due_date (date, optional): Payment due date
- currency (string, optional): Currency code
[built dynamically from extraction_schema JSONB]
## Rules
1. Extract ONLY fields listed above.
2. For REQUIRED fields that are missing, set to null and note in extraction_notes.
3. Dates in ISO 8601 (YYYY-MM-DD). Numbers without currency symbols.
4. For each field, note WHERE in the document you found it (provenance).
## Response Format
{
"fields": {"invoice_number": "INV-2024-001", ...},
"field_confidences": {"invoice_number": 0.98, ...},
"field_provenance": {
"invoice_number": {"section": "header", "text_snippet": "INVOICE #INV-2024-001"},
"vendor_name": {"section": "header", "text_snippet": "From: Acme Corp"}
},
"extraction_notes": "Any quality issues, missing data, or ambiguities"
}
```
---
## Search Strategy (Hybrid)
Three query paths from one pipeline, addressing different user needs:
1. **Keyword search** (BM25 via Postgres tsvector):
`GET /api/search?q=INV-2024-001` -- exact invoice numbers, vendor names, specific terms
Uses `ts_rank` + `to_tsquery` on `documents.search_vector`
2. **Semantic search** (pgvector):
`GET /api/search?q=large invoices from technology vendors&mode=semantic`
Embeds query, finds nearest extraction embeddings
3. **Structured query** (field-level):
`GET /api/extractions?type=invoice&field=vendor_name&value=Acme&field=total_amount>=10000`
Queries `extracted_fields` table directly
Default `/api/search` uses hybrid: runs both BM25 and vector, merges results with configurable weighting.
---
## Seed Data
Pre-seed 3 document types with full classification_config + extraction_schema + 2 examples each:
1. **Invoice** -- vendor_name, invoice_number, issue_date, due_date, line_items[], subtotal, tax, total_amount, currency, payment_terms
2. **Purchase Order** -- po_number, vendor, ship_to, items[], total, requested_delivery_date, approved_by
3. **Receipt** -- merchant_name, date, items[], subtotal, tax, total, payment_method
Each type gets 2 example documents (positive) seeded in `document_type_examples`.
Sample text fixtures in `tests/fixtures/` for testing.
---
## What to Simplify / Mock
| Area | Simplification | Walkthrough talking point |
|------|---------------|--------------------------|
| Auth | Skip entirely | "Per instructions. Production: JWT + tenant_id + RLS." |
| File upload | Text-only POST body | "Would add multipart + pdf-parse/Textract. Pipeline has pre-processing hook." |
| OCR | Not implemented | "Would integrate Google Document AI as a pre-classification step." |
| Worker | Single-process polling | "FOR UPDATE SKIP LOCKED scales to N workers with no code changes." |
| Schema validation | Basic type checks | "Production: ajv for full JSON Schema validation." |
| Reranking | Not implemented | "Would add cross-encoder reranking for hybrid search quality." |
---
## What Makes This Impressive
1. **Schema as data, not code** -- POST a new document type, system immediately classifies + extracts. Live-demo by creating a "Shipping Manifest" type during walkthrough.
2. **Few-shot classification** -- Example documents per type make classification actually work well, not just keyword matching. This is the kind of detail that separates "built a prototype" from "thought about production."
3. **Hybrid search** -- BM25 catches exact IDs/names, vector catches semantic meaning. Shows understanding that different users query differently (accountant by invoice#, executive by concept).
4. **Provenance tracking** -- Know exactly where in the document each field was found. Critical for ERP auditing, enables "click to see source" in UI.
5. **Natural idempotency** -- Content hash dedup makes the API safe to retry.
6. **Production-grade queue** -- FOR UPDATE SKIP LOCKED, retry tracking, observable status.
7. **Schema versioning** -- Extractions track schema version; re-process with updated schemas.
8. **Per-field confidence + extraction_notes** -- LLM reports quality issues, enabling human-in-the-loop review.
---
## Implementation Sequence
1. Project setup, Docker Compose, migrations, seed data
2. Express skeleton, health route, schema CRUD routes (including examples endpoint)
3. Document ingestion with dedup + queue
4. Worker skeleton with FOR UPDATE SKIP LOCKED claiming
5. Document profiling (section detection, table detection)
6. Classification service + few-shot prompt builder
7. Extraction service + schema-driven prompt builder (with provenance)
8. Chunking for large documents
9. Embedding generation + hybrid search route (BM25 + vector)
10. Structured extraction query routes (field-level)
11. Tests (unit: chunker, dedup, mocked LLM; integration: full pipeline)
12. Frontend (schema builder with example upload, document uploader, extraction viewer with provenance highlights)
13. README with architecture diagram + design decisions
---
## Verification
- `docker-compose up` starts Postgres with pgvector
- `npm run migrate && npm run seed` creates tables + 3 ERP doc types with examples
- `npm run dev` starts Express + worker
- Submit sample invoice via `POST /api/documents` -> 202
- Poll `GET /api/documents/:id` -> pending -> processing -> extracted
- View extraction with provenance: `GET /api/extractions/:id`
- Keyword search: `GET /api/search?q=INV-2024-001` finds by invoice number
- Semantic search: `GET /api/search?q=large invoices from tech vendors&mode=semantic`
- Structured: `GET /api/extractions?type=invoice&field=vendor_name&value=Acme`
- Create new type: `POST /api/schemas` + add example + submit matching doc -> auto-classified
- Resubmit same document -> 200 with is_duplicate=true
- Reprocess after schema update: `POST /api/documents/:id/reprocess`
---
## Critical Source Files to Reference
From existing writing-ui project (patterns to adapt, not copy):
- `supabase/migrations/0005_extraction_queue.sql` -- Queue with FOR UPDATE SKIP LOCKED
- `src/app/api/ai/extract/route.ts` -- LLM extraction + quality scoring + refinement loop
- `src/lib/ai/providers/openai.ts` -- OpenAI client with `response_format: {type: 'json_object'}`
- `src/lib/ai/prompts/nvq-extraction.ts` -- Dynamic prompt building from runtime data
This roadmap outlines planned enhancements to transform cheap-RAG from a functional document retrieval system into a production-ready, state-of-the-art RAG framework. Priorities are based on impact vs. effort analysis and alignment with mainstream RAG best practices.
See `specs/Semblance-MVP-Plan-v2.md` for full technical specification.
All notable changes to AvocadoDB will be documented in this file.
**Goal:** Stand up Toasty as a reliable service wired to BLT/GitHub events; deliver safe, useful summaries early.