AI Document Formatting: Enhancing OCR With Claude
Hey guys, let's dive into a cool feature request that seriously boosts the usefulness of Optical Character Recognition (OCR) output by harnessing the power of AI! The main goal? To use Claude's vision to analyze the original document image and reformat the extracted text, preserving the crucial visual indentation that OCR so often mangles. Basically, making digital documents look as good as the real thing.
Original Request
So, the user's big idea was to leverage Claude's vision capabilities to examine the original document image and then intelligently reformat the extracted text. The aim is to mirror the visual indentation structure, which is often lost during the OCR process. As the user aptly put it, Google Vision API is fantastic at grabbing text but not so great at preserving its original formatting and indentation. The core ask was:
"let's have agents look at the image and rearrange the text to the right indention given this text extraction and the image itself to see if the agents can make this more accurate. Google vision api is really good at extracting text but not so much formatting it in the same indention."
This feature aims to solve a common problem: while OCR technology excels at recognizing text, it often fails to maintain the document's original layout, including indentation, spacing, and other formatting nuances. This is particularly problematic for structured documents such as code, YAML, or other configuration files, where indentation is crucial for readability and even functionality. By adding AI-driven analysis of the document's visual structure, the goal is to bridge this gap and produce more accurate and usable text extractions.
User Prompts
Throughout the development process, the user provided valuable feedback and insights through a series of prompts:
- Initial validation: "the endpoint response was good. it was the following: success}"
- Feature request: "let's have agents look at the image and rearrange the text to the right indention given this text extraction and the image itself to see if the agents can make this more accurate. Google vision api is really good at extracting text but not so much formatting it in the same indention."
- Error reports: The user encountered several runtime errors during deployment, which were addressed and resolved iteratively.
- Success confirmation: "it works. lets create a github issue capturing the work effort."
These prompts show the user's engagement and collaboration in refining the feature, from the initial validation of the endpoint response to the confirmation of a successful implementation. The error reports in particular highlighted areas that needed attention, ultimately leading to a more robust and reliable solution.
Implementation Summary
So, here’s the deal: we've implemented a two-step OCR + AI formatting workflow that's all about getting the best of both worlds. It cleverly combines Google Cloud Vision API for text extraction with Claude's awesome vision skills to keep the document structure intact. Think of it as giving OCR a pair of glasses and a formatting guide!
How It Works
Let's break down how this magic happens:
Step 1: OCR Text Extraction
- First, you upload your document image. We're talking PNGs, JPGs, and the like.
- Then, Google Cloud Vision API jumps in to grab the raw text from the image (see the sketch after this list). It's super accurate at picking out the text itself.
- But, and here's the kicker, it loses all the formatting and indentation. It's like a super-efficient but slightly clumsy scribe.
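Here's a minimal sketch of what that extraction call could look like from the frontend, assuming a plain fetch against the POST /api/ocr/extract endpoint described later. The image_data field and the response shape come from the backend model and the Data Flow section below; the helper name and error handling are mine:

```typescript
// Illustrative sketch: send a base64-encoded image to the OCR endpoint.
// Request/response shapes follow the OcrExtractRequest model and the
// Data Flow section later in this document; the helper name is made up.
async function extractText(base64Image: string): Promise<string> {
  const response = await fetch('/api/ocr/extract', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image_data: base64Image }),
  });
  if (!response.ok) {
    throw new Error(`OCR extraction failed: ${response.status}`);
  }
  const data = await response.json();
  // Response shape: { success: true, result: { pages: [{ full_text: "..." }] } }
  return data.result?.pages?.[0]?.full_text ?? '';
}
```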
Step 2: AI-Powered Reformatting
- This is where Claude, the "Document Formatter" agent, steps in. It gets two key ingredients:
- The original document image, so it can actually see the visual layout.
- The extracted text, to make sure no content gets lost.
- The agent then gets to work, analyzing the image structure and reformatting the text to perfectly match the visual indentation.
- It’s a pro at handling YAML, code, configuration files, and all sorts of structured documents.
The core of this implementation lies in the intelligent integration of OCR technology with AI-powered formatting. By leveraging Google Cloud Vision API for text extraction and Claude's vision capabilities for structure analysis, the workflow effectively overcomes the limitations of traditional OCR systems. The result is a more accurate and visually faithful representation of the original document, preserving its formatting and indentation. This is particularly valuable for documents where structure is critical for understanding and usability, such as code, configuration files, and structured reports.
User Experience
- It’s as easy as dragging and dropping your document image.
- You get to watch the progress through four simple steps: Upload → OCR Extract → AI Format → Complete.
- And the best part? You get a side-by-side comparison:
- The original document image, so you know what it should look like.
- The AI-reformatted text, with all the correct indentation.
- The raw OCR extraction, just for reference. It's like a before-and-after, but with extra geekiness.
Technical Details
Alright, tech enthusiasts, let's get into the nitty-gritty of the changes under the hood.
Frontend Changes
File: claude-workflow-manager/frontend/src/components/ImageDemoPage.tsx
The key implementation happens between lines 127 and 183. Here’s a snippet of the code:
```typescript
// Step 2: Use agent to reformat text with correct indentation
const agentResponse = await orchestrationApi.executeSequential({
  task_content: [
    {
      type: 'image',
      source: {
        type: 'base64',
        media_type: state.imageFile.type,
        data: base64Image
      }
    },
    {
      type: 'text',
      text: `You are analyzing a document image and its OCR-extracted text.

The OCR extraction is good at recognizing text but loses formatting and indentation information.

**Your task:**
1. Look at the image carefully to understand the visual indentation and structure
2. Take the extracted text below and reformat it to match the exact indentation you see in the image
3. Pay attention to:
   - Nested YAML/configuration structure
   - Spaces vs tabs (use 2 spaces for indentation)
   - Alignment of keys and values
   - Line breaks and grouping

**OCR Extracted Text:**
\`\`\`
${extractedText}
\`\`\`

**Instructions:**
- Output ONLY the reformatted text in a markdown code block
- Match the indentation exactly as shown in the image
- Preserve all text content from the OCR
- Use proper spacing for readability`
    }
  ],
  agents: [
    {
      name: 'Document Formatter',
      role: 'specialist',
      system_prompt: 'You are an expert at analyzing document layouts and reformatting extracted text to match the original visual structure. You pay close attention to indentation, spacing, and formatting.'
    }
  ],
  agent_sequence: ['Document Formatter']
});

// Extract the final_result from the nested result object
const formattedText = agentResponse.result?.final_result || agentResponse.result || '';
```
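One detail worth calling out: the optional chaining on the last line handles both response shapes, preferring the nested result.final_result when the orchestration returns a structured result object and falling back to result itself (or an empty string) otherwise.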
Backend Changes
1. Multi-modal content support in sequential pipeline
File: claude-workflow-manager/backend/main.py
(lines 3416-3418)
We've added some smarts to handle both text and multi-modal content:
```python
# Use task_content if provided (multi-modal), otherwise fall back to task (legacy)
task_input = request.task_content if request.task_content else request.task
result = await orchestrator.sequential_pipeline(task_input, request.agent_sequence)
```
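For reference, the task_content blocks this code accepts mirror the shape the frontend sends in the snippet above. Here's a minimal TypeScript sketch of that shape; the type names are illustrative, not from the codebase:

```typescript
// Illustrative types for the multi-modal content blocks; the real
// codebase may name or structure these differently.
interface ImageBlock {
  type: 'image';
  source: {
    type: 'base64';
    media_type: string; // e.g. 'image/png'
    data: string;       // base64-encoded image bytes
  };
}

interface TextBlock {
  type: 'text';
  text: string;
}

// task_content is an ordered list of blocks; a plain string task
// remains supported as the legacy fallback.
type TaskContent = Array<ImageBlock | TextBlock>;
```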
2. Fixed uninitialized variable for multi-modal messages
File: claude-workflow-manager/backend/agent_orchestrator.py
(line 295)
Assigned placeholder for logging when processing multi-modal content:
```python
# Multi-modal content blocks
full_message = f"<multi-modal content with {len(message)} blocks>"
```
3. OCR endpoint model
File: claude-workflow-manager/backend/models.py
(lines 657-659)
Created request validation model:
```python
class OcrExtractRequest(BaseModel):
    """Request model for OCR text extraction from base64 image"""
    image_data: str
```
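A nice side effect: because this is a Pydantic model, the backend (assuming it's FastAPI, the usual pairing for Pydantic request models) gets validation for free, and a POST without an image_data string is rejected with a 422 before the OCR code ever runs.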
API Endpoints Used
- `POST /api/ocr/extract`: OCR text extraction via Google Cloud Vision
- `POST /api/orchestration/sequential`: Multi-agent orchestration with vision support
The integration of these two endpoints is what enables the full workflow. The `POST /api/ocr/extract` endpoint serves as the entry point for submitting document images for OCR processing, leveraging Google Cloud Vision API to extract the raw text content; it handles the initial step of converting visual information into machine-readable text. The `POST /api/orchestration/sequential` endpoint then orchestrates the subsequent AI-powered reformatting: it supports multi-agent orchestration with vision, letting Claude analyze the original document image and the extracted text together so the document structure is preserved.
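It's worth noting that keeping extraction and reformatting as two separate calls is also what makes the side-by-side comparison in the UI possible: the frontend holds on to the raw OCR output from the first call and displays it alongside the AI-reformatted text from the second.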
Data Flow
1. User uploads image → Base64 encoding
2. Frontend → `POST /api/ocr/extract` → Google Cloud Vision API
3. OCR returns: `{success: true, result: {pages: [{full_text: "..."}]}}`
4. Frontend → `POST /api/orchestration/sequential` with:
   - `task_content[0]`: base64 image
   - `task_content[1]`: prompt + extracted text
5. Agent SDK receives multi-modal content → Claude analyzes image + text
6. Agent returns reformatted text with correct indentation
7. Frontend displays: original image + reformatted text + raw OCR
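Step 1's base64 encoding is the only piece of this flow not shown elsewhere in the write-up, so here's a minimal browser-side sketch (the function name is mine; ImageDemoPage.tsx may do this differently):

```typescript
// Read an uploaded File into a base64 string (step 1 of the data flow).
// FileReader yields a data URL ("data:image/png;base64,...."), so we
// strip the prefix before sending the payload as image_data.
function fileToBase64(file: File): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve((reader.result as string).split(',')[1]);
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}
```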
Bugs Fixed During Implementation
No software project is complete without a few bugs! Here’s how we squashed them:
Bug #1: Missing task_content handling
- Error: `