## Introduction
Claude AI's vision capabilities, introduced with the Claude 3 family and enhanced in Claude 3.5 Sonnet, allow multimodal processing of images alongside text. This means you can upload documents, screenshots, charts, or scanned PDFs and have Claude perform optical character recognition (OCR), extract structured data, or generate deep insights—all via carefully engineered prompts.
Unlike traditional OCR tools like Tesseract, Claude combines vision with reasoning, handling handwriting, tables, layouts, and context-aware analysis. It's ideal for developers automating workflows, business users processing invoices, or teams analyzing reports. In this guide, we'll cover step-by-step prompt engineering techniques, from basic text extraction to advanced multimodal workflows, with real-world examples for Claude.ai and the API.
**Key Benefits of Claude Vision for Documents:**
- High accuracy on complex layouts (e.g., tables, multi-column text)
- Contextual understanding (e.g., inferring totals from invoices)
- Multimodal chaining (analyze image + text prompt)
- Cost-effective: No separate OCR service needed
- Supports up to 200K tokens context for long docs
## Prerequisites
To follow along:
- **Claude.ai**: The free tier works, including Claude 3.5 Sonnet; Pro raises the usage limits.
- **API Access**: Anthropic API key (create one in the Anthropic Console at console.anthropic.com). Models: `claude-3-5-sonnet-20240620` or `claude-3-opus-20240229`.
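- **Images/PDFs**: Convert PDFs to images if needed (use the `pdf2image` Python library or a tool like SmallPDF). The Claude API accepts base64-encoded JPEG, PNG, GIF, and WebP images. This guide assumes up to 20 images per request and 5MB per image; check the current API documentation, as limits change.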
- **Tools**: Python with `anthropic` SDK for API examples.
Install SDK:
```bash
pip install anthropic pillow pdf2image
```
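The format and size constraints from the prerequisites can be checked locally before any upload. The sketch below is a minimal pre-flight check; `check_image`, `MAX_IMAGE_BYTES`, and `MEDIA_TYPES` are hypothetical names, the 5MB figure is the limit quoted in this guide, and the extension map is deliberately small, so extend it if your target accepts more formats.

```python
import os

# Assumed limit: 5 MB per image, as quoted in this guide. Verify against current docs.
MAX_IMAGE_BYTES = 5 * 1024 * 1024

# Illustrative extension -> media type map (extend as needed).
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}


def check_image(path):
    """Return the media type for an image file, or raise ValueError."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image extension: {extension!r}")
    size = os.path.getsize(path)
    if size > MAX_IMAGE_BYTES:
        raise ValueError(f"{path} is {size} bytes; limit is {MAX_IMAGE_BYTES}")
    return MEDIA_TYPES[extension]
```

The returned string can be passed as the `media_type` of an image content block, which avoids hardcoding `image/png` for every file.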
## Step 1: Basic OCR Extraction
Start simple: Extract raw text from a scanned document.
### On Claude.ai
1. Upload image via drag-and-drop.
2. Use this prompt:
```
<image>Your document image here</image>
Extract all visible text from this image accurately, preserving line breaks and formatting. Output only the extracted text, no additional commentary.
```
**Example Output** (for a sample invoice):
```
ACME Corp Invoice #1234
Date: 2023-10-15
Item Qty Price
Widget A 5 $10.00
Gadget B 2 $25.00
Total: $100.00
```
### Via API
```python
import base64

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Read the image and base64-encode it for the API
with open("invoice.png", "rb") as img_file:
    base64_image = base64.b64encode(img_file.read()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64_image,
                },
            },
            {
                "type": "text",
                "text": "Extract all visible text from this image accurately, preserving line breaks and formatting. Output only the extracted text.",
            },
        ],
    }],
)

# The reply text is in the first content block
print(message.content[0].text)
```
**Pro Tip**: Specify `preserving line breaks` to maintain structure for downstream parsing.
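Since every later step sends the same image-plus-text shape, it can help to wrap the payload in a helper. This is a sketch, not part of the SDK: `build_ocr_messages` and `response_text` are hypothetical names, and the dictionary layout simply mirrors the API example above.

```python
import base64


def build_ocr_messages(image_bytes, media_type, prompt):
    """Build the `messages` list for one image plus one text instruction."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": media_type, "data": encoded},
            },
            {"type": "text", "text": prompt},
        ],
    }]


def response_text(message):
    """Join every text block in a response object (blocks expose .type and .text)."""
    return "".join(block.text for block in message.content if block.type == "text")
```

With a client in hand, the call becomes `client.messages.create(model=..., max_tokens=1024, messages=build_ocr_messages(data, "image/png", prompt))`.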
## Step 2: Structured Data Extraction
Raw text is step one; structure it for JSON/CSV output using Claude's XML affinity.
**Prompt Template**:
```
<image>document.png</image>
You are an expert document parser. Analyze this image and extract key fields into structured XML.
Instructions:
1. Identify document type (e.g., invoice, receipt, contract).
2. Extract fields: [list specifics like date, total, items].
3. Use <document> XML wrapper with <field name="key">value</field>.
4. If uncertain, use <uncertain>note</uncertain>.
Output only valid XML.
```
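The bracketed field list in the template is the part that changes per document type, so it is convenient to generate the prompt from a list. A minimal sketch follows; `build_extraction_prompt` is a hypothetical helper, and the wording is copied from the template above rather than from any official source.

```python
def build_extraction_prompt(fields, doc_types=("invoice", "receipt", "contract")):
    """Fill the parser template with a concrete list of field names."""
    if not fields:
        raise ValueError("Provide at least one field name to extract")
    return "\n".join([
        "You are an expert document parser. Analyze this image and extract "
        "key fields into structured XML.",
        "Instructions:",
        f"1. Identify document type (e.g., {', '.join(doc_types)}).",
        f"2. Extract fields: {', '.join(fields)}.",
        '3. Use <document> XML wrapper with <field name="key">value</field>.',
        "4. If uncertain, use <uncertain>note</uncertain>.",
        "Output only valid XML.",
    ])
```

For example, `build_extraction_prompt(["date", "total", "items"])` yields the template with step 2 filled in, ready to send as the text block next to the image.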
**Example for Invoice**:
```
<image>invoice.png</image>
<document type="invoice">
Extract:
- invoice_number
- date
- vendor_name
- line_items (as array of {description, qty, price, total})
- subtotal
- tax
- grand_total
</document>
Respond only with XML: <extraction><invoice_number>...</extraction>
```
**Sample Output**:
```xml
<extraction>
  <invoice_number>1234</invoice_number>
  <date>2023-10-15</date>
  <vendor_name>ACME Corp</vendor_name>
  <line_items>
    <item>
      <description>Widget A</description>
      <qty>5</qty>
      <price>10.00</price>
      <total>50.00</total>
    </item>
    <!-- more -->
  </line_items>
  <grand_total>100.00</grand_total>
</extraction>
```
Parse in Python:
```python
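import xml.etree.ElementTree as ET

# Stand-in for Claude's reply; in practice use message.content[0].text
output_xml = "<extraction><invoice_number>1234</invoice_number></extraction>"
root = ET.fromstring(output_xml)  # fromstring returns the root element
invoice_num = root.findtext("invoice_number")  # None if the tag is missing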
```
This is more robust than running regexes over raw OCR text, because Claude maps layout variations and handwritten notes onto the same tags.
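In practice the reply may carry stray prose around the XML, and amounts are safer as `Decimal` than as floats. The sketch below assumes the `<extraction>` schema from the sample output; `parse_extraction` is a hypothetical helper, and `ET.fromstring` will still raise `ParseError` if the model emits malformed XML (for example an unescaped `&`).

```python
import re
import xml.etree.ElementTree as ET
from decimal import Decimal


def parse_extraction(reply):
    """Parse the first <extraction>...</extraction> element found in a reply."""
    match = re.search(r"<extraction>.*?</extraction>", reply, re.DOTALL)
    if match is None:
        raise ValueError("No <extraction> element found in reply")
    root = ET.fromstring(match.group(0))

    items = []
    for item in root.findall("./line_items/item"):
        items.append({
            "description": item.findtext("description", default=""),
            "qty": int(item.findtext("qty", default="0")),
            "price": Decimal(item.findtext("price", default="0")),
            "total": Decimal(item.findtext("total", default="0")),
        })

    return {
        "invoice_number": root.findtext("invoice_number"),
        "date": root.findtext("date"),
        "vendor_name": root.findtext("vendor_name"),
        "line_items": items,
        "grand_total": Decimal(root.findtext("grand_total", default="0")),
    }
```

Missing optional tags come back as `None` or zero rather than raising, so downstream code should decide which fields are mandatory.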
## Step 3: Handling Multi-Page PDFs
Claude accepts multiple images per request, so split the PDF into one image per page.
**Python Helper**:
```python
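import base64
from io import BytesIO

from pdf2image import convert_from_path  # requires Poppler to be installed


def to_base64_png(page):
    """Encode one PIL page image as a base64 PNG string."""
    buf = BytesIO()
    page.save(buf, format="PNG")  # write a real PNG file, not raw pixel bytes
    return base64.b64encode(buf.getvalue()).decode()


# convert_from_path returns one PIL image per PDF page
base64_pages = [to_base64_png(p) for p in convert_from_path("doc.pdf")]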
```
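Long PDFs can exceed the per-request image limit, so the encoded pages usually need batching. A minimal sketch, assuming the 20-image limit quoted in this guide and a list of base64 strings like `base64_pages`; `build_page_batches` is a hypothetical helper. Each image is preceded by a short text label so the model can refer to pages by number.

```python
MAX_IMAGES_PER_REQUEST = 20  # limit quoted in this guide; treat as an assumption


def build_page_batches(base64_pages, media_type="image/png",
                       batch_size=MAX_IMAGES_PER_REQUEST):
    """Return one list of content blocks per request, with labeled pages."""
    batches = []
    for start in range(0, len(base64_pages), batch_size):
        blocks = []
        for offset, data in enumerate(base64_pages[start:start + batch_size]):
            # Page numbers stay global across batches (1-based).
            blocks.append({"type": "text", "text": f"Page {start + offset + 1}:"})
            blocks.append({
                "type": "image",
                "source": {"type": "base64", "media_type": media_type, "data": data},
            })
        batches.append(blocks)
    return batches
```

Append the instruction as a final text block to each batch before sending it. Summaries that span batches need a follow-up request that combines the per-batch outputs.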
**Prompt for Multi-Page**:
```
<image>page1.png</image><image>page2.png</image><!-- up to 20 -->
This is a multi-page document. Extract text from all pages, concatenating with page breaks (--- PAGE X ---).
Then, summarize key insights across pages.
Structured output:
<full_text>...</full_text>
<summary>...</summary>
```
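The reply to this prompt can be split back into per-page text using the page markers. The sketch below assumes the model followed the `--- PAGE X ---` convention and the `<full_text>`/`<summary>` tags exactly; `extract_tag` and `split_pages` are hypothetical helpers.

```python
import re

PAGE_MARKER = re.compile(r"^--- PAGE (\d+) ---[ \t]*$", re.MULTILINE)


def extract_tag(reply, tag):
    """Return the stripped text inside the first <tag>...</tag>, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", reply, re.DOTALL)
    return match.group(1).strip() if match else None


def split_pages(full_text):
    """Map page number -> page text, using the '--- PAGE X ---' markers."""
    # With one capture group, split yields [preamble, "1", text1, "2", text2, ...]
    parts = PAGE_MARKER.split(full_text)
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages
```

If a page is missing from the resulting dictionary, that is a useful signal to re-run that page on its own.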
**Advanced**: Use for contracts—"Flag risks on any page."
## Step 4: Table and Chart Analysis
Claude handles tables whose row and column structure basic OCR tends to flatten.
**Table Extraction Prompt**:
```
<image>table.png</image>
Convert this table to markdown format. Infer headers and data types. If multi-page, note spans.
Output:
| Header1 | Header2 |
|---------|---------|
| data | data |
```
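Once Claude returns a markdown table, converting it to rows or CSV takes only a few lines. This is a sketch for simple pipe tables; `markdown_table_to_rows` and `rows_to_csv` are hypothetical helpers, and escaped pipes or empty edge cells are not handled.

```python
import csv
import io


def markdown_table_to_rows(markdown):
    """Parse a simple pipe table into a list of rows (header row first)."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # The separator row consists only of dashes, colons, and spaces.
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting cells where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Checking that every row has the same number of cells as the header is a cheap way to catch tables the model misread.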
**Chart Insights**:
```
<image>chart.png</image>
Describe this chart: type (bar/pie), key trends, approximate values for top 3 data points, and business implications.
Structure:
<chart type="bar">
<trends>...</trends>
<values>...</values>
<implications>...</implications>
</chart>
```
Example: Sales dashboard → "Q3 revenue up 20%, driven by Product X."
## Step 5: Error Handling and Best Practices
**Common Pitfalls**:
- Blurry images: Prompt "Enhance readability by ignoring artifacts."
- Handwriting: Add "Handle cursive/handwritten text carefully. Cross-reference context."
- Languages: "Extract in original language, then translate to English."
**Optimization Tips**:
- **Image Size**: The Anthropic API has no `detail` parameter (that option belongs to OpenAI's API). To save tokens, downscale images for broad overviews and keep full resolution for fine-grained OCR.
- **Chain Prompts**: First extract the text, then analyze it in a second request. Set the role in the `system` prompt (e.g., "You are a forensic document analyst").
- **Token Efficiency**: Request only the fields you need, and crop each image to the region of interest (ROI) before sending it.
- **Validation**: "Confidence score per field: high/medium/low."
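Image token cost grows with pixel area, so one way to act on the token tips above is to compute a target size before upload. The sketch assumes the rule of thumb from Anthropic's vision documentation (roughly width × height / 750 tokens per image, with images whose long edge exceeds about 1568 px being downscaled on the server). Both constants are assumptions to verify against the current docs, and the function names are hypothetical.

```python
TOKENS_PER_PIXEL_DIVISOR = 750  # assumed rule of thumb: tokens ~= width * height / 750
MAX_LONG_EDGE = 1568            # assumed long-edge threshold before server-side downscaling


def estimate_image_tokens(width, height):
    """Rough token estimate for one image of the given pixel size."""
    return round(width * height / TOKENS_PER_PIXEL_DIVISOR)


def fit_long_edge(width, height, max_edge=MAX_LONG_EDGE):
    """Scale (width, height) down so the long edge is at most max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting size can be passed to Pillow's `Image.resize((width, height))` before encoding. For dense small print, prefer cropping to the relevant region over shrinking the whole page.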
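**API with a System Prompt**:
```python
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system="You are a forensic document analyst.",  # role prompt from the tips above
    messages=messages,  # image + text content blocks, built as in Step 1
)
# No beta header is required: image input is part of the standard Messages API.
```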
## Real-World Workflows
### Invoice Automation (Finance)
1. Extract → JSON.
2. Prompt: "Validate totals: subtotal + tax == total? Flag discrepancies."
3. Integrate with Zapier: Claude webhook → Google Sheets.
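The totals check in step 2 is also worth repeating in code, since arithmetic is where extracted numbers most often go wrong. A minimal sketch using `Decimal` to avoid float rounding; `validate_invoice` is a hypothetical helper, and the field names follow the invoice field list used earlier in this guide.

```python
from decimal import Decimal


def validate_invoice(line_items, subtotal, tax, grand_total,
                     tolerance=Decimal("0.01")):
    """Return a list of human-readable discrepancies (empty if consistent)."""
    issues = []
    # Pass prices as strings or Decimals, not floats, to keep them exact.
    items_sum = sum(
        (Decimal(item["qty"]) * Decimal(item["price"]) for item in line_items),
        Decimal("0"),
    )
    if abs(items_sum - subtotal) > tolerance:
        issues.append(f"Line items sum to {items_sum}, but subtotal is {subtotal}")
    if abs(subtotal + tax - grand_total) > tolerance:
        issues.append(
            f"subtotal + tax = {subtotal + tax}, but grand total is {grand_total}"
        )
    return issues
```

An empty list means the numbers agree. Anything else can be routed to a human reviewer instead of being written to the spreadsheet.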
### Legal Contract Review
```
<image>contract.pdf pages</image>
Scan for: NDAs, termination clauses, liabilities. Highlight risks in <risk level="high">text</risk>.
Output summary + quotes.
```
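The `<risk>` tags in the reply can be collected with a small regex, which is enough here because the tags wrap free text rather than nested XML. In this sketch `extract_risks` is a hypothetical helper, and the `medium` and `low` levels are assumptions, since the prompt above only shows `high`.

```python
import re

# Levels other than "high" are assumed; adjust to match your prompt.
RISK_TAG = re.compile(r'<risk level="(high|medium|low)">(.*?)</risk>', re.DOTALL)


def extract_risks(reply):
    """Return [{'level': ..., 'text': ...}] for every <risk> tag in the reply."""
    return [
        {"level": level, "text": text.strip()}
        for level, text in RISK_TAG.findall(reply)
    ]
```

Filtering the result for `level == "high"` gives a short list that a reviewer can check against the quoted contract text.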
### Marketing Report Analysis
Extract KPIs from dashboards, generate executive summary.
**n8n Integration Example**:
- Trigger: New PDF in Drive.
- Node: Convert to images.
- Node: Claude API (structured prompt).
- Node: Slack notification with insights.
## Advanced Prompt Engineering
Leverage Claude's strengths:
- **Chain of Thought**: "Step 1: Describe layout. Step 2: Zone text/table. Step 3: Extract."
- **Few-Shot**: Provide 1-2 example extractions.
```
Example 1: [image desc] → [XML]
Now do this image.
```
- **XML Enforcement**: System: "ALWAYS respond in valid XML, no prose."
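With the API, few-shot examples can also be supplied as earlier conversation turns instead of being described inside a single prompt. The sketch below shows that alternative: each example is a user turn (image plus instruction) followed by an assistant turn holding the expected XML. `image_block` and `build_few_shot_messages` are hypothetical helpers.

```python
def image_block(b64_data, media_type="image/png"):
    """Wrap a base64 string in an image content block."""
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": b64_data},
    }


def build_few_shot_messages(examples, target_image_b64, instruction):
    """examples: list of (image_b64, expected_xml) pairs shown before the target."""
    messages = []
    for example_image, expected_xml in examples:
        messages.append({
            "role": "user",
            "content": [image_block(example_image),
                        {"type": "text", "text": instruction}],
        })
        # The assistant turn demonstrates the exact output format we want.
        messages.append({"role": "assistant", "content": expected_xml})
    messages.append({
        "role": "user",
        "content": [image_block(target_image_b64),
                    {"type": "text", "text": instruction}],
    })
    return messages
```

One or two examples are usually enough. Each example image adds to the input token count, so keep them small.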
**Ultimate Template**:
```xml
<task>OCR and analyze document</task>
<instructions>...</instructions>
<output_schema>...</output_schema>
<image>...</image>
```
## Comparisons and Limits
| Feature | Claude Vision | GPT-4V | Gemini 1.5 |
|---------|---------------|--------|-------------|
| OCR Accuracy | Excellent (contextual) | Good | Very Good |
| Multi-Image | 20+ | 10+ | Unlimited? |
| Context | 200K | 128K | 1M+ |
| Cost | $3/M input tokens | $10/M | Varies |
Claude shines in reasoning over extracted data.
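Limits: The models used in this guide take images, not PDFs, so convert pages first (newer Claude models also accept PDFs directly; check the current docs). Hallucinations are rare, but always validate extracted numbers.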
## Conclusion
Mastering Claude Vision prompts transforms document processing. Start with basics, iterate to structured workflows. Experiment on claude.ai, scale via API. Share your prompts in comments!
**Next Steps**:
- Build agent: Claude + MCP for auto-OCR pipelines.
- Check Anthropic docs: [Vision Guide](https://docs.anthropic.com).