motovaultpro/docs/ocr-pipeline-tech-stack.md
Eric Gullickson 47c5676498
chore: update OCR tests and documentation (refs #121)
Add engine abstraction tests and update docs to reflect PaddleOCR primary
architecture with optional Google Vision cloud fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:42:51 -06:00

Vehicle Owner's Manual & Receipt OCR Pipeline

Complete Tech Stack & Architecture


SYSTEM FLOW DIAGRAM

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              USER UPLOAD                                         │
│                   (PDF, JPG, PNG, HEIC, TIFF, WEBP)                              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         1. FORMAT DETECTION                                      │
│                           python-magic                                           │
│                                                                                  │
│   • Detect MIME type from file bytes (not extension)                            │
│   • Validate file is an accepted format                                          │
│   • Reject unsupported/malicious files early                                     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                      ┌───────────────┴───────────────┐
                      ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│         PDF DETECTED            │   │        IMAGE DETECTED           │
└─────────────────────────────────┘   │   (JPG, PNG, HEIC, TIFF, WEBP)  │
                │                     └─────────────────────────────────┘
                ▼                                     │
┌─────────────────────────────────┐                   │
│   2a. PDF TEXT LAYER CHECK      │                   │
│         PyMuPDF (fitz)          │                   │
│                                 │                   │
│  • Check if PDF has embedded    │                   │
│    searchable text              │                   │
│  • Count text characters        │                   │
└─────────────────────────────────┘                   │
                │                                     │
        ┌───────┴───────┐                             │
        ▼               ▼                             │
┌──────────────┐ ┌──────────────┐                     │
│  HAS TEXT    │ │  NO TEXT     │                     │
│  (Native)    │ │  (Scanned)   │                     │
└──────────────┘ └──────────────┘                     │
        │               │                             │
        │               ▼                             │
        │       ┌──────────────────────┐              │
        │       │ 2b. RENDER TO IMAGES │              │
        │       │    PyMuPDF @ 300 DPI │              │
        │       │                      │              │
        │       │ • Page-by-page render│              │
        │       │ • Maintain quality   │              │
        │       └──────────────────────┘              │
        │               │                             │
        │               └──────────────┬──────────────┘
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                3. IMAGE NORMALIZATION                    │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ HEIC Input  │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │ pillow-heif │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    │ ImageOps    │    │  Pillow    │ │
        │       │                      └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │   Pillow    │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                 4. PREPROCESSING                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4a. Resolution Normalization                   │   │
        │       │   │      Target: 300 DPI equivalent                 │   │
        │       │   │      Tool: Pillow resize with LANCZOS           │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4b. Deskew (Straighten)                        │   │
        │       │   │      Tool: OpenCV + deskew library              │   │
        │       │   │      Method: Hough transform / projection       │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4c. Denoise                                    │   │
        │       │   │      Tool: OpenCV fastNlMeansDenoising          │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌───────────────────────────────────────────────────┐ │
        │       │   │  4d. Document-Specific Enhancement                │ │
        │       │   │                                                   │ │
        │       │   │   RECEIPTS:              MANUALS:                 │ │
        │       │   │   • Adaptive threshold   • Contrast stretch       │ │
        │       │   │   • Perspective correct  • Sharpen                │ │
        │       │   │   • High contrast B&W    • Keep grayscale         │ │
        │       │   │                                                   │ │
        │       │   │   Tool: OpenCV adaptiveThreshold / CLAHE          │ │
        │       │   └───────────────────────────────────────────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                    5. OCR ENGINE                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5a. Engine Abstraction Layer                    │   │
        │       │   │                                                  │   │
        │       │   │  OcrEngine ABC -> PaddleOcrEngine (primary)      │   │
        │       │   │                -> CloudEngine (optional fallback) │   │
        │       │   │                -> TesseractEngine (backward compat)│  │
        │       │   │                -> HybridEngine (primary+fallback) │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5b. Primary OCR: PaddleOCR PP-OCRv4             │   │
        │       │   │                                                  │   │
        │       │   │  • Scene text detection + angle classification   │   │
        │       │   │  • CPU-only, models baked into Docker image      │   │
        │       │   │  • Normalized output: text, confidence, word boxes│  │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │                 ┌───────────────┐                       │
        │       │                 │  Confidence   │                       │
        │       │                 │   >= 60% ?    │                       │
        │       │                 └───────────────┘                       │
        │       │                    │         │                          │
        │       │              YES ──┘         └── NO (and cloud enabled) │
        │       │               │                   │                     │
        │       │               │                   ▼                     │
        │       │               │   ┌─────────────────────────────────┐   │
        │       │               │   │  5c. Optional Cloud Fallback     │   │
        │       │               │   │      (Google Vision API)         │   │
        │       │               │   │                                  │   │
        │       │               │   │  • Disabled by default           │   │
        │       │               │   │  • 5-second timeout guard        │   │
        │       │               │   │  • Returns higher-confidence     │   │
        │       │               │   │    result of primary vs fallback │   │
        │       │               │   └─────────────────────────────────┘   │
        │       │               │                   │                     │
        │       │               ▼                   ▼                     │
        │       │         ┌─────────────────────────────────┐             │
        │       │         │  5d. HybridEngine Result        │             │
        │       │         │  • Compare confidences          │             │
        │       │         │  • Keep highest confidence      │             │
        │       │         │  • Graceful fallback on error   │             │
        │       │         └─────────────────────────────────┘             │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        └──────────────────────────────┤
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        6. STRUCTURED EXTRACTION                                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6a. Layout Analysis                                                  │     │
│   │      Tool: PaddleOCR Layout / LayoutParser                            │     │
│   │      Detects: Headers, paragraphs, tables, lists, figures             │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                      │                                          │
│                    ┌─────────────────┴─────────────────┐                        │
│                    ▼                                   ▼                        │
│   ┌────────────────────────────────┐   ┌────────────────────────────────┐       │
│   │  6b. Table Extraction          │   │  6c. Text Block Processing     │       │
│   │                                │   │                                │       │
│   │  Tool: img2table or Camelot    │   │  • Group by proximity          │       │
│   │  Output: Pandas DataFrame      │   │  • Identify sections           │       │
│   │                                │   │  • Extract hierarchies         │       │
│   └────────────────────────────────┘   └────────────────────────────────┘       │
│                    │                                   │                        │
│                    └─────────────────┬─────────────────┘                        │
│                                      ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6d. Maintenance Schedule Pattern Matching                            │     │
│   │                                                                       │     │
│   │  Tool: regex + spaCy NER                                              │     │
│   │                                                                       │     │
│   │  Patterns:                                                            │     │
│   │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │     │
│   │  • Time intervals: "every 6 months", "annually"                       │     │
│   │  • Service types: "oil change", "tire rotation", "brake inspection"  │     │
│   │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            7. OUTPUT LAYER                                       │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  Structured JSON Response                                             │     │
│   │                                                                       │     │
│   │  {                                                                    │     │
│   │    "document_type": "owners_manual" | "receipt",                      │     │
│   │    "vehicle": {                                                       │     │
│   │      "make": "Toyota",                                                │     │
│   │      "model": "Camry",                                                │     │
│   │      "year": 2024                                                     │     │
│   │    },                                                                 │     │
│   │    "maintenance_schedule": [                                          │     │
│   │      {                                                                │     │
│   │        "service": "Oil Change",                                       │     │
│   │        "interval_miles": 5000,                                        │     │
│   │        "interval_months": 6,                                          │     │
│   │        "details": "Use 0W-20 synthetic"                               │     │
│   │      }                                                                │     │
│   │    ],                                                                 │     │
│   │    "raw_text": "...",                                                 │     │
│   │    "tables": [...],                                                   │     │
│   │    "confidence_score": 0.94,                                          │     │
│   │    "processing_time_ms": 1250                                         │     │
│   │  }                                                                    │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
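
Step 1 of the flow can be illustrated without libmagic. The sketch below is a stand-in, not the pipeline's detector: it checks a few well-known magic-byte prefixes by hand, whereas the pipeline uses python-magic (where the equivalent call would be `magic.from_buffer(data, mime=True)`). The names `sniff_mime`, `is_accepted`, and `ACCEPTED`, and the short HEIC brand list, are assumptions made for illustration.

```python
from typing import Optional

# Assumed allow-list matching the upload formats named in the diagram.
ACCEPTED = {
    "application/pdf", "image/jpeg", "image/png",
    "image/heic", "image/tiff", "image/webp",
}


def sniff_mime(data: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes; None means unrecognized."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "image/tiff"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    # ISO base media files carry a brand after "ftyp"; only two HEIC brands
    # are checked here, which is narrower than what libmagic recognizes.
    if data[4:8] == b"ftyp" and data[8:12] in (b"heic", b"heix"):
        return "image/heic"
    return None


def is_accepted(data: bytes) -> bool:
    """Reject anything whose content does not match an accepted format."""
    return sniff_mime(data) in ACCEPTED
```

Because the decision is made from content, a Windows executable renamed to `receipt.jpg` is rejected even though its extension looks valid.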

COMPLETE TECH STACK

Core Dependencies

| Component | Tool | Version | Purpose |
| --- | --- | --- | --- |
| Runtime | Python | 3.11+ | Primary language |
| API Framework | FastAPI | 0.100+ | REST API with async support |
| Task Queue | Celery + Redis | 5.3+ | Async processing for large docs |

File Handling

| Component | Tool | Purpose |
| --- | --- | --- |
| Format Detection | python-magic | MIME type detection from bytes |
| HEIC Support | pillow-heif | iPhone HEIC image conversion |
| Image Processing | Pillow | General image I/O and manipulation |
| PDF Processing | PyMuPDF (fitz) | PDF text extraction & rendering |
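
The PDF text-layer check (step 2a) reduces to counting extracted characters and comparing against a cutoff. The following is a rough sketch under stated assumptions: the 50-characters-per-page threshold and both function names are hypothetical, since the document does not say how many characters count as "has text". PyMuPDF is imported lazily so the decision function can be exercised without it.

```python
from typing import List

MIN_AVG_CHARS_PER_PAGE = 50  # assumed cutoff; not specified in this document


def page_char_counts(pdf_path: str) -> List[int]:
    """Count extractable text characters on each page using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so has_text_layer() has no dependency

    with fitz.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def has_text_layer(char_counts: List[int],
                   min_avg: int = MIN_AVG_CHARS_PER_PAGE) -> bool:
    """True when the PDF looks native; False sends it to render + OCR."""
    if not char_counts:
        return False
    return sum(char_counts) / len(char_counts) >= min_avg
```

Averaging across pages keeps a native manual with a few image-only pages on the fast path, while a scan with stray embedded characters still falls through to OCR.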

Image Preprocessing

| Component | Tool | Purpose |
| --- | --- | --- |
| Computer Vision | OpenCV (cv2) | Deskew, denoise, threshold |
| Deskew | deskew | Specialized document straightening |
| Enhancement | scikit-image | Additional image filters |
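
For resolution normalization (step 4a), the resize target is simple arithmetic once the source resolution is known. A minimal sketch, assuming the source DPI has already been read from image metadata or estimated when missing; `target_size` is a hypothetical helper, and its result is the size that would be handed to Pillow's `Image.resize` with the LANCZOS filter.

```python
from typing import Tuple

TARGET_DPI = 300  # target named in step 4a


def target_size(width_px: int, height_px: int, source_dpi: float,
                target_dpi: float = TARGET_DPI) -> Tuple[int, int]:
    """Scale pixel dimensions so the image matches the target DPI."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    new_w = max(1, round(width_px * target_dpi / source_dpi))
    new_h = max(1, round(height_px * target_dpi / source_dpi))
    return new_w, new_h


# A US Letter page scanned at 150 DPI is 1275 x 1650 px.
letter_at_300 = target_size(1275, 1650, source_dpi=150)
print(letter_at_300)  # (2550, 3300)
```

An image that is already at 300 DPI comes back with unchanged dimensions, so the resize call can be skipped in that case.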

OCR Engines

| Component | Tool | Purpose |
| --- | --- | --- |
| Primary OCR | PaddleOCR PP-OCRv4 | Highest accuracy scene text, CPU-only |
| Cloud Fallback | Google Vision API | Optional cloud fallback (disabled by default) |
| Backward Compat | Tesseract 5.x / pytesseract | Legacy engine, configurable via env var |
| Engine Abstraction | OcrEngine ABC | Pluggable engine interface in ocr/app/engines/ |
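
The primary-plus-fallback behaviour described for HybridEngine can be sketched with stub engines. This is an illustration under assumptions, not the code in ocr/app/engines/: the `recognize` method name, the `OcrResult` fields, and the `FixedEngine` stub are hypothetical, and the 5-second timeout guard is left out for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0


class OcrEngine(ABC):
    @abstractmethod
    def recognize(self, image_bytes: bytes) -> OcrResult:
        """Run OCR on one image and return normalized output."""


class FixedEngine(OcrEngine):
    """Stub engine that always returns the same result (demo only)."""

    def __init__(self, text: str, confidence: float):
        self._result = OcrResult(text, confidence)

    def recognize(self, image_bytes: bytes) -> OcrResult:
        return self._result


class HybridEngine(OcrEngine):
    def __init__(self, primary: OcrEngine,
                 fallback: Optional[OcrEngine] = None,
                 threshold: float = 0.6):
        self.primary = primary
        self.fallback = fallback
        self.threshold = threshold

    def recognize(self, image_bytes: bytes) -> OcrResult:
        primary_result = self.primary.recognize(image_bytes)
        # Skip the fallback when it is disabled or the primary is good enough.
        if self.fallback is None or primary_result.confidence >= self.threshold:
            return primary_result
        try:
            fallback_result = self.fallback.recognize(image_bytes)
        except Exception:
            # Graceful fallback on error: keep whatever the primary produced.
            return primary_result
        # Keep the higher-confidence result; ties go to the primary.
        return max(primary_result, fallback_result, key=lambda r: r.confidence)


engine = HybridEngine(FixedEngine("T0TAL 12.5O", 0.41),
                      FixedEngine("TOTAL 12.50", 0.93))
best = engine.recognize(b"")
```

With `fallback=None`, which mirrors the disabled-by-default cloud setting, the hybrid engine behaves exactly like the primary engine.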

Data Extraction

| Component | Tool | Purpose |
| --- | --- | --- |
| Table Extraction | img2table | Image-based table extraction |
| PDF Tables | Camelot | Native PDF table extraction |
| NLP | spaCy | Entity extraction, pattern matching |
| Data Handling | pandas | Table/dataframe manipulation |
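
The regex half of the maintenance-schedule pattern matching (step 6d) might look like the sketch below. The patterns, the `SERVICES` tuple, and `extract_schedule` are illustrative assumptions; the example handles only "every N miles" and "every N months" phrasing, leaving kilometres, "annually", and the spaCy NER pass out of scope.

```python
import re
from typing import Dict, List, Optional

# Hypothetical patterns for two of the interval phrasings listed in step 6d.
MILES_RE = re.compile(r"every\s+([\d,]+)\s*miles\b", re.IGNORECASE)
MONTHS_RE = re.compile(r"every\s+(\d+)\s+months?\b", re.IGNORECASE)
SERVICES = ("oil change", "tire rotation", "brake inspection")


def extract_schedule(text: str) -> List[Dict[str, Optional[object]]]:
    """Return one entry per line that names a known service type."""
    entries = []
    for line in text.splitlines():
        lowered = line.lower()
        service = next((s for s in SERVICES if s in lowered), None)
        if service is None:
            continue
        miles = MILES_RE.search(line)
        months = MONTHS_RE.search(line)
        entries.append({
            "service": service.title(),
            "interval_miles": int(miles.group(1).replace(",", "")) if miles else None,
            "interval_months": int(months.group(1)) if months else None,
        })
    return entries


sample = (
    "Oil change: every 5,000 miles or every 6 months\n"
    "Tire rotation every 7,500 miles\n"
    "See your dealer for details"
)
schedule = extract_schedule(sample)
```

The entry keys follow the `maintenance_schedule` items in the structured JSON response, so the list can be dropped into that field directly.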

Infrastructure

| Component | Tool | Purpose |
| --- | --- | --- |
| Object Storage | S3 / MinIO | Document and result storage |
| Database | PostgreSQL | Metadata and results |
| Cache | Redis | Result caching, queue backend |
| Containerization | Docker | Deployment |

SYSTEM REQUIREMENTS FILE

requirements.txt

# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6
pydantic>=2.0.0

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# Image Preprocessing
opencv-python-headless>=4.8.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.6.0
paddleocr>=2.8.0
google-cloud-vision>=3.7.0

# PDF Processing
PyMuPDF>=1.23.0

# Redis for job queue
redis>=5.0.0

# HTTP client for callbacks
httpx>=0.24.0

# Testing
pytest>=7.4.0
pytest-asyncio>=0.21.0

System Package Requirements (Ubuntu/Debian)

# Tesseract OCR (backward compatibility engine)
apt-get install tesseract-ocr tesseract-ocr-eng

# PaddlePaddle OpenMP runtime
apt-get install libgomp1

# HEIC Support
apt-get install libheif1 libheif-dev

# GLib (OpenCV dependency)
apt-get install libglib2.0-0

# File type detection
apt-get install libmagic1

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| OCR_PRIMARY_ENGINE | paddleocr | Primary OCR engine (paddleocr, tesseract) |
| OCR_CONFIDENCE_THRESHOLD | 0.6 | Minimum confidence threshold |
| OCR_FALLBACK_ENGINE | none | Fallback engine (google_vision, none) |
| OCR_FALLBACK_THRESHOLD | 0.6 | Confidence below this triggers fallback |
| GOOGLE_VISION_KEY_PATH | /run/secrets/google-vision-key.json | Path to Google Vision service account key |
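
One way to read these variables with their documented defaults is shown below. The `OcrSettings` dataclass and `load_settings` function are hypothetical names for this sketch; only the variable names, defaults, and allowed values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class OcrSettings:
    primary_engine: str
    confidence_threshold: float
    fallback_engine: str
    fallback_threshold: float
    google_vision_key_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Build settings from environment variables, applying table defaults."""
    primary = env.get("OCR_PRIMARY_ENGINE", "paddleocr")
    fallback = env.get("OCR_FALLBACK_ENGINE", "none")
    if primary not in {"paddleocr", "tesseract"}:
        raise ValueError(f"unsupported OCR_PRIMARY_ENGINE: {primary!r}")
    if fallback not in {"google_vision", "none"}:
        raise ValueError(f"unsupported OCR_FALLBACK_ENGINE: {fallback!r}")
    return OcrSettings(
        primary_engine=primary,
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.6")),
        fallback_engine=fallback,
        fallback_threshold=float(env.get("OCR_FALLBACK_THRESHOLD", "0.6")),
        google_vision_key_path=env.get(
            "GOOGLE_VISION_KEY_PATH", "/run/secrets/google-vision-key.json"
        ),
    )


defaults = load_settings({})
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test and makes the defaults visible in one place.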

DOCKERFILE

# Primary engine: PaddleOCR PP-OCRv4 (models baked into image)
# Backward compat: Tesseract 5.x (optional, via TesseractEngine)
# Cloud fallback: Google Vision (optional, requires API key at runtime)

FROM python:3.13-slim

# System dependencies
# - tesseract-ocr/eng: Backward-compatible OCR engine
# - libgomp1: OpenMP runtime required by PaddlePaddle
# - libheif1/libheif-dev: HEIF image support (iPhone photos)
# - libglib2.0-0: GLib shared library (OpenCV dependency)
# - libmagic1: File type detection
# - curl: HTTP client used by the container health check
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libgomp1 \
    libheif1 \
    libheif-dev \
    libglib2.0-0 \
    libmagic1 \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

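# Pre-download PaddleOCR PP-OCRv4 models during build (not at runtime).
# The keyword arguments below follow the PaddleOCR 2.x constructor; an unpinned
# paddleocr>=2.8.0 may resolve to a newer major version that rejects them.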
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False, show_log=False)" \
    && echo "PaddleOCR PP-OCRv4 models downloaded and verified"

COPY . .

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

PROCESSING TIME ESTIMATES

| Document Type | Size | Expected Time | Notes |
| --- | --- | --- | --- |
| Single receipt (HEIC) | 2-5 MB | 1-3 seconds | After preprocessing |
| Single receipt (JPG) | 500 KB-2 MB | 0.5-2 seconds | Direct processing |
| Owner's manual (PDF) | 10-50 MB | 30-120 seconds | 100-300 pages |
| Owner's manual (scanned) | 50-200 MB | 2-5 minutes | Requires full OCR |

SCALING CONSIDERATIONS

                    ┌─────────────────────────────────────────┐
                    │            Load Balancer                │
                    │              (nginx)                    │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │   API Server    │      │   API Server    │      │   API Server    │
   │   (FastAPI)     │      │   (FastAPI)     │      │   (FastAPI)     │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
              │                        │                        │
              └────────────────────────┼────────────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │            Redis Queue                  │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
   │  (OCR Heavy)    │      │  (OCR Heavy)    │      │  (OCR Heavy)    │
   │  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
   └─────────────────┘      └─────────────────┘      └─────────────────┘

Scaling Strategy

  1. Small files (receipts): Process synchronously in the API server
  2. Large files (manuals): Queue to Celery workers and return a job ID
  3. Horizontal scaling: Add more Celery workers for throughput
  4. GPU acceleration: PaddleOCR can run on GPU, for an estimated 5-10x speedup over CPU
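
The first two points of the strategy amount to a routing decision at upload time. A minimal sketch, assuming a 5 MiB cutoff for synchronous handling (the document gives no number); `choose_route`, `handle_upload`, and the `process` and `enqueue` callables are hypothetical stand-ins for the OCR call and the Celery enqueue.

```python
import uuid
from typing import Callable

SYNC_MAX_BYTES = 5 * 1024 * 1024  # assumed cutoff, not taken from this document


def choose_route(document_type: str, size_bytes: int) -> str:
    """Receipts under the cutoff run inline; everything else is queued."""
    if document_type == "receipt" and size_bytes <= SYNC_MAX_BYTES:
        return "sync"
    return "queue"


def handle_upload(document_type: str, data: bytes,
                  process: Callable[[bytes], dict],
                  enqueue: Callable[[str, bytes], None]) -> dict:
    """Return the result directly, or a job ID for the caller to poll."""
    if choose_route(document_type, len(data)) == "sync":
        return {"status": "completed", "result": process(data)}
    job_id = uuid.uuid4().hex
    enqueue(job_id, data)
    return {"status": "queued", "job_id": job_id}
```

Routing on both type and size means an unusually large receipt photo still goes to a worker instead of tying up an API server.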

ERROR HANDLING FLOW

┌─────────────────┐
│  File Upload    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ Format Valid?   │──NO─▶│ Return 400: Unsupported format         │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ YES
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ File Corrupt?   │─YES─▶│ Return 422: Unable to process file     │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ OCR Confidence  │─LOW─▶│ Return 200 with warning flag           │
│    < 50%?       │     │ { "warning": "low_confidence" }        │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ OK
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ No Schedule     │─YES─▶│ Return 200 with raw text only          │
│   Found?        │     │ { "maintenance_schedule": null }       │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐
│ Return Full     │
│ Structured Data │
└─────────────────┘
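
The flow above can be expressed as a single function that returns a status code and a response body. This is a sketch, not the service's handler: `build_response` and the `error` key are assumptions, as is the choice to include `raw_text` and `confidence_score` in every 200 response. The status codes, the 50% cutoff, and the `warning` and `maintenance_schedule` fields follow the diagram.

```python
from typing import Any, Dict, List, Optional, Tuple

LOW_CONFIDENCE = 0.5  # the "< 50%?" check in the flow


def build_response(format_valid: bool, corrupt: bool, confidence: float,
                   raw_text: str,
                   schedule: Optional[List[dict]]) -> Tuple[int, Dict[str, Any]]:
    """Walk the checks in diagram order and stop at the first that applies."""
    if not format_valid:
        return 400, {"error": "Unsupported format"}
    if corrupt:
        return 422, {"error": "Unable to process file"}
    base = {"raw_text": raw_text, "confidence_score": confidence}
    if confidence < LOW_CONFIDENCE:
        return 200, {**base, "warning": "low_confidence"}
    if not schedule:
        # No schedule found: return the raw text only.
        return 200, {**base, "maintenance_schedule": None}
    return 200, {**base, "maintenance_schedule": schedule}
```

Low confidence and a missing schedule both return 200 because the upload itself succeeded; the caller decides whether the partial result is usable.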

QUICK START

# 1. Clone and setup
git clone <repo>
cd ocr-pipeline

# 2. Build Docker image
docker build -t ocr-pipeline .

# 3. Start services
docker-compose up -d

# 4. Test endpoint
curl -X POST "http://localhost:8000/api/v1/extract" \
  -F "file=@owners_manual.pdf" \
  -F "document_type=manual"

API ENDPOINT DESIGN

POST /api/v1/extract
  - Accepts: multipart/form-data
  - Fields: file, document_type (optional)
  - Returns: Structured JSON or job_id for large files

GET /api/v1/jobs/{job_id}
  - Poll for async job status
  - Returns: status, progress, result when complete

POST /api/v1/extract/batch
  - Accepts: multiple files
  - Returns: array of job_ids
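
A client that receives a job_id would poll the jobs endpoint until the job finishes. The sketch below keeps the HTTP call out of the loop by taking a `fetch_status` callable; the status strings `"completed"` and `"failed"`, the default interval and timeout, and the function name are assumptions, since the response schema is not spelled out here.

```python
import time
from typing import Callable

TERMINAL_STATUSES = ("completed", "failed")  # assumed status values


def wait_for_job(fetch_status: Callable[[], dict],
                 interval_s: float = 1.0,
                 timeout_s: float = 300.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)


# Simulated sequence of responses from GET /api/v1/jobs/{job_id}.
responses = iter([
    {"status": "processing", "progress": 40},
    {"status": "completed", "progress": 100, "result": {"document_type": "receipt"}},
])
final = wait_for_job(lambda: next(responses), sleep=lambda _: None)
```

In a real client, `fetch_status` would wrap a GET request to `/api/v1/jobs/{job_id}` (for example with httpx, which is already in requirements.txt) and return the decoded JSON body.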