chore: update OCR tests and documentation (refs #121)

Add engine abstraction tests and update docs to reflect PaddleOCR primary architecture with optional Google Vision cloud fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:42:51 -06:00
parent 1e96baca6f
commit 47c5676498
7 changed files with 870 additions and 68 deletions
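
For reviewers, the primary-plus-fallback flow that the updated docs describe can be sketched roughly as follows. This is a minimal illustration, not the code in `ocr/app/engines/`: the `OcrResult` type, the `recognize` method name, the constructor signature, and the `FixedEngine`/`FailingEngine` stubs are assumptions made for the example. Only the `OcrEngine` and `HybridEngine` names, the 0.6 threshold, the "keep highest confidence" rule, and the fall-back-to-primary-on-error behaviour come from the documentation in this commit. The 5-second timeout guard is omitted.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OcrResult:
    """Hypothetical normalized result: text plus a 0.0-1.0 confidence."""
    text: str
    confidence: float


class OcrEngine(ABC):
    """Pluggable engine interface (the method name is an assumption)."""

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> OcrResult:
        ...


class FixedEngine(OcrEngine):
    """Stub engine that always returns the same result (illustration only)."""

    def __init__(self, text: str, confidence: float) -> None:
        self._result = OcrResult(text, confidence)

    def recognize(self, image_bytes: bytes) -> OcrResult:
        return self._result


class FailingEngine(OcrEngine):
    """Stub engine that simulates a cloud outage (illustration only)."""

    def recognize(self, image_bytes: bytes) -> OcrResult:
        raise RuntimeError("cloud engine unavailable")


class HybridEngine(OcrEngine):
    """Runs the primary engine, consulting the fallback only when needed."""

    def __init__(
        self,
        primary: OcrEngine,
        fallback: Optional[OcrEngine] = None,
        threshold: float = 0.6,
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.threshold = threshold

    def recognize(self, image_bytes: bytes) -> OcrResult:
        primary_result = self.primary.recognize(image_bytes)

        # Confident enough, or no fallback configured: keep the primary result.
        if self.fallback is None or primary_result.confidence >= self.threshold:
            return primary_result

        try:
            fallback_result = self.fallback.recognize(image_bytes)
        except Exception:
            # Graceful fallback on error: a failing fallback never loses
            # the primary result.
            return primary_result

        # Keep whichever result reports the higher confidence; ties go to
        # the primary engine.
        if fallback_result.confidence > primary_result.confidence:
            return fallback_result
        return primary_result


if __name__ == "__main__":
    low = FixedEngine("primary text", 0.4)
    cloud = FixedEngine("cloud text", 0.9)
    print(HybridEngine(low, cloud).recognize(b"").text)
    print(HybridEngine(low, FailingEngine()).recognize(b"").text)
```

Leaving `fallback` as `None` corresponds to the documented default of `OCR_FALLBACK_ENGINE=none`, in which case the primary result is returned regardless of its confidence.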
--- a/docs/ocr-pipeline-tech-stack.md
+++ b/docs/ocr-pipeline-tech-stack.md
@@ -118,35 +118,48 @@
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
-        │       │   │  5a. Primary OCR: Tesseract 5.x                 │   │
-        │       │   │                                                 │   │
-        │       │   │  • Engine: LSTM (--oem 1)                       │   │
-        │       │   │  • Page segmentation: Auto (--psm 3)            │   │
-        │       │   │  • Output: hOCR with word confidence            │   │
+        │       │   │  5a. Engine Abstraction Layer                    │   │
+        │       │   │                                                  │   │
+        │       │   │  OcrEngine ABC -> PaddleOcrEngine (primary)      │   │
+        │       │   │                -> CloudEngine (optional fallback) │   │
+        │       │   │                -> TesseractEngine (backward compat)│  │
+        │       │   │                -> HybridEngine (primary+fallback) │   │
+        │       │   └─────────────────────────────────────────────────┘   │
+        │       │                         │                               │
+        │       │                         ▼                               │
+        │       │   ┌─────────────────────────────────────────────────┐   │
+        │       │   │  5b. Primary OCR: PaddleOCR PP-OCRv4             │   │
+        │       │   │                                                  │   │
+        │       │   │  • Scene text detection + angle classification   │   │
+        │       │   │  • CPU-only, models baked into Docker image      │   │
+        │       │   │  • Normalized output: text, confidence, word boxes│  │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │                 ┌───────────────┐                       │
        │       │                 │  Confidence   │                       │
-        │       │                 │    > 80% ?    │                       │
+        │       │                 │   >= 60% ?    │                       │
        │       │                 └───────────────┘                       │
        │       │                    │         │                          │
-        │       │              YES ──┘         └── NO                     │
+        │       │              YES ──┘         └── NO (and cloud enabled) │
        │       │               │                   │                     │
        │       │               │                   ▼                     │
        │       │               │   ┌─────────────────────────────────┐   │
-        │       │               │   │  5b. Fallback: PaddleOCR        │   │
-        │       │               │   │                                 │   │
-        │       │               │   │  • Better for degraded images   │   │
-        │       │               │   │  • Better table detection       │   │
-        │       │               │   │  • Slower but more accurate     │   │
+        │       │               │   │  5c. Optional Cloud Fallback     │   │
+        │       │               │   │      (Google Vision API)         │   │
+        │       │               │   │                                  │   │
+        │       │               │   │  • Disabled by default           │   │
+        │       │               │   │  • 5-second timeout guard        │   │
+        │       │               │   │  • Returns higher-confidence     │   │
+        │       │               │   │    result of primary vs fallback │   │
        │       │               │   └─────────────────────────────────┘   │
        │       │               │                   │                     │
        │       │               ▼                   ▼                     │
        │       │         ┌─────────────────────────────────┐             │
-        │       │         │  5c. Result Merging             │             │
-        │       │         │  • Merge by bounding box        │             │
+        │       │         │  5d. HybridEngine Result        │             │
+        │       │         │  • Compare confidences          │             │
        │       │         │  • Keep highest confidence      │             │
+        │       │         │  • Graceful fallback on error   │             │
        │       │         └─────────────────────────────────┘             │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
@@ -257,10 +270,10 @@

 | Component              | Tool                  | Purpose                              |
 |------------------------|-----------------------|--------------------------------------|
-| **Primary OCR**        | Tesseract 5.x         | Fast, reliable text extraction       |
-| **Python Binding**     | pytesseract           | Tesseract Python wrapper             |
-| **Fallback OCR**       | PaddleOCR             | Higher accuracy, better tables       |
-| **Layout Analysis**    | PaddleOCR / LayoutParser | Document structure detection      |
+| **Primary OCR**        | PaddleOCR PP-OCRv4    | Highest accuracy scene text, CPU-only |
+| **Cloud Fallback**     | Google Vision API     | Optional cloud fallback (disabled by default) |
+| **Backward Compat**    | Tesseract 5.x / pytesseract | Legacy engine, configurable via env var |
+| **Engine Abstraction** | `OcrEngine` ABC       | Pluggable engine interface in `ocr/app/engines/` |

 ### Data Extraction

@@ -291,85 +304,93 @@
 fastapi>=0.100.0
 uvicorn[standard]>=0.23.0
 python-multipart>=0.0.6
-
-# Task Queue
-celery>=5.3.0
-redis>=4.6.0
+pydantic>=2.0.0

 # File Detection & Handling
 python-magic>=0.4.27
 pillow>=10.0.0
 pillow-heif>=0.13.0

-# PDF Processing
-pymupdf>=1.23.0
-
 # Image Preprocessing
 opencv-python-headless>=4.8.0
-deskew>=1.4.0
-scikit-image>=0.21.0
 numpy>=1.24.0

 # OCR Engines
 pytesseract>=0.3.10
-paddlepaddle>=2.5.0
-paddleocr>=2.7.0
+paddlepaddle>=2.6.0
+paddleocr>=2.8.0
+google-cloud-vision>=3.7.0

-# Table Extraction
-img2table>=1.2.0
-camelot-py[cv]>=0.11.0
+# PDF Processing
+PyMuPDF>=1.23.0

-# NLP & Data
-spacy>=3.6.0
-pandas>=2.0.0
+# Redis for job queue
+redis>=5.0.0

-# Storage & Database
-boto3>=1.28.0
-psycopg2-binary>=2.9.0
-sqlalchemy>=2.0.0
+# HTTP client for callbacks
+httpx>=0.24.0
+
+# Testing
+pytest>=7.4.0
+pytest-asyncio>=0.21.0
 ```

 ### System Package Requirements (Ubuntu/Debian)

 ```bash
-# Tesseract OCR
-apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev
+# Tesseract OCR (backward compatibility engine)
+apt-get install tesseract-ocr tesseract-ocr-eng
+
+# PaddlePaddle OpenMP runtime
+apt-get install libgomp1

 # HEIC Support
-apt-get install libheif-examples libheif-dev
+apt-get install libheif1 libheif-dev

-# OpenCV dependencies
-apt-get install libgl1-mesa-glx libglib2.0-0
+# GLib (OpenCV dependency)
+apt-get install libglib2.0-0

-# PDF rendering dependencies
-apt-get install libmupdf-dev mupdf-tools
-
-# Image processing
-apt-get install libmagic1 ghostscript
-
-# Camelot dependencies
-apt-get install ghostscript python3-tk
+# File type detection
+apt-get install libmagic1
 ```

+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `OCR_PRIMARY_ENGINE` | `paddleocr` | Primary OCR engine (`paddleocr`, `tesseract`) |
+| `OCR_CONFIDENCE_THRESHOLD` | `0.6` | Minimum confidence threshold |
+| `OCR_FALLBACK_ENGINE` | `none` | Fallback engine (`google_vision`, `none`) |
+| `OCR_FALLBACK_THRESHOLD` | `0.6` | Confidence below this triggers fallback |
+| `GOOGLE_VISION_KEY_PATH` | `/run/secrets/google-vision-key.json` | Path to Google Vision service account key |
+
 ---

 ## DOCKERFILE

 ```dockerfile
-FROM python:3.11-slim
+# Primary engine: PaddleOCR PP-OCRv4 (models baked into image)
+# Backward compat: Tesseract 5.x (optional, via TesseractEngine)
+# Cloud fallback: Google Vision (optional, requires API key at runtime)
+
+FROM python:3.13-slim

 # System dependencies
+# - tesseract-ocr/eng: Backward-compatible OCR engine
+# - libgomp1: OpenMP runtime required by PaddlePaddle
+# - libheif1/libheif-dev: HEIF image support (iPhone photos)
+# - libglib2.0-0: GLib shared library (OpenCV dependency)
+# - libmagic1: File type detection
+# - curl: Health check endpoint
 RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
-    libtesseract-dev \
-    libheif-examples \
+    libgomp1 \
+    libheif1 \
    libheif-dev \
-    libgl1-mesa-glx \
    libglib2.0-0 \
    libmagic1 \
-    ghostscript \
-    poppler-utils \
+    curl \
    && rm -rf /var/lib/apt/lists/*

 # Python dependencies
@@ -377,11 +398,9 @@ WORKDIR /app
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt

-# Download spaCy model
-RUN python -m spacy download en_core_web_sm
-
-# Download PaddleOCR models (cached in image)
-RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"
+# Pre-download PaddleOCR PP-OCRv4 models during build (not at runtime)
+RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False, show_log=False)" \
+    && echo "PaddleOCR PP-OCRv4 models downloaded and verified"

 COPY . .