chore: update OCR tests and documentation (refs #121)
Some checks failed
Deploy to Staging / Build Images (pull_request) Failing after 7m4s
Deploy to Staging / Deploy to Staging (pull_request) Has been skipped
Deploy to Staging / Verify Staging (pull_request) Has been skipped
Deploy to Staging / Notify Staging Ready (pull_request) Has been skipped
Deploy to Staging / Notify Staging Failure (pull_request) Successful in 7s
Add engine abstraction tests and update docs to reflect PaddleOCR primary architecture with optional Google Vision cloud fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -118,35 +118,48 @@
│ ├─────────────────────────────────────────────────────────┤
│ │ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 5a. Primary OCR: Tesseract 5.x │ │
│ │ │ │ │
│ │ │ • Engine: LSTM (--oem 1) │ │
│ │ │ • Page segmentation: Auto (--psm 3) │ │
│ │ │ • Output: hOCR with word confidence │ │
│ │ │ 5a. Engine Abstraction Layer │ │
│ │ │ │ │
│ │ │ OcrEngine ABC -> PaddleOcrEngine (primary) │ │
│ │ │ -> CloudEngine (optional fallback) │ │
│ │ │ -> TesseractEngine (backward compat)│ │
│ │ │ -> HybridEngine (primary+fallback) │ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 5b. Primary OCR: PaddleOCR PP-OCRv4 │ │
│ │ │ │ │
│ │ │ • Scene text detection + angle classification │ │
│ │ │ • CPU-only, models baked into Docker image │ │
│ │ │ • Normalized output: text, confidence, word boxes│ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌───────────────┐ │
│ │ │ Confidence │ │
│ │ │ > 80% ? │ │
│ │ │ >= 60% ? │ │
│ │ └───────────────┘ │
│ │ │ │ │
│ │ YES ──┘ └── NO │
│ │ YES ──┘ └── NO (and cloud enabled) │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌─────────────────────────────────┐ │
│ │ │ │ 5b. Fallback: PaddleOCR │ │
│ │ │ │ │ │
│ │ │ │ • Better for degraded images │ │
│ │ │ │ • Better table detection │ │
│ │ │ │ • Slower but more accurate │ │
│ │ │ │ 5c. Optional Cloud Fallback │ │
│ │ │ │ (Google Vision API) │ │
│ │ │ │ │ │
│ │ │ │ • Disabled by default │ │
│ │ │ │ • 5-second timeout guard │ │
│ │ │ │ • Returns higher-confidence │ │
│ │ │ │ result of primary vs fallback │ │
│ │ │ └─────────────────────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌─────────────────────────────────┐ │
│ │ │ 5c. Result Merging │ │
│ │ │ • Merge by bounding box │ │
│ │ │ 5d. HybridEngine Result │ │
│ │ │ • Compare confidences │ │
│ │ │ • Keep highest confidence │ │
│ │ │ • Graceful fallback on error │ │
│ │ └─────────────────────────────────┘ │
│ │ │
│ └─────────────────────────────────────────────────────────┘
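The updated flow in the diagram (primary engine, a 60% confidence gate, an optional cloud fallback, then keeping the higher-confidence result) could look roughly like the sketch below. Only the names `OcrEngine` and `HybridEngine` come from the diff; `OcrResult`, `StubEngine`, the `recognize` method, and the constructor parameters are assumptions made for illustration, and the 5-second timeout guard is left out for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    """Normalized engine output (hypothetical shape, not the project's model)."""
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str


class OcrEngine(ABC):
    """Common interface; the method name `recognize` is an assumption."""

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> OcrResult:
        ...


class StubEngine(OcrEngine):
    """Stand-in for a real engine so the sketch runs without OCR libraries."""

    def __init__(self, name: str, confidence: float, fail: bool = False):
        self.name = name
        self.confidence = confidence
        self.fail = fail

    def recognize(self, image_bytes: bytes) -> OcrResult:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return OcrResult(text="stub", confidence=self.confidence, engine=self.name)


class HybridEngine(OcrEngine):
    """Runs the primary engine, consulting the fallback only on low confidence."""

    def __init__(self, primary: OcrEngine, fallback: Optional[OcrEngine] = None,
                 threshold: float = 0.6):
        self.primary = primary
        self.fallback = fallback  # None models "cloud fallback disabled"
        self.threshold = threshold

    def recognize(self, image_bytes: bytes) -> OcrResult:
        primary_result = self.primary.recognize(image_bytes)
        # Confidence >= threshold, or no fallback configured: keep the primary result.
        if primary_result.confidence >= self.threshold or self.fallback is None:
            return primary_result
        try:
            fallback_result = self.fallback.recognize(image_bytes)
        except Exception:
            # Graceful fallback on error: a failing fallback never loses the primary result.
            return primary_result
        # Keep the higher-confidence result; ties go to the primary.
        return max(primary_result, fallback_result, key=lambda r: r.confidence)
```

With these stubs, `HybridEngine(StubEngine("paddleocr", 0.4), StubEngine("google_vision", 0.9))` returns the cloud result, while the same call with a failing or missing fallback returns the PaddleOCR result unchanged.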
@@ -257,10 +270,10 @@

| Component | Tool | Purpose |
|------------------------|-----------------------|--------------------------------------|
| **Primary OCR** | Tesseract 5.x | Fast, reliable text extraction |
| **Python Binding** | pytesseract | Tesseract Python wrapper |
| **Fallback OCR** | PaddleOCR | Higher accuracy, better tables |
| **Layout Analysis** | PaddleOCR / LayoutParser | Document structure detection |
| **Primary OCR** | PaddleOCR PP-OCRv4 | Highest accuracy scene text, CPU-only |
| **Cloud Fallback** | Google Vision API | Optional cloud fallback (disabled by default) |
| **Backward Compat** | Tesseract 5.x / pytesseract | Legacy engine, configurable via env var |
| **Engine Abstraction** | `OcrEngine` ABC | Pluggable engine interface in `ocr/app/engines/` |

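One way a "pluggable engine interface" can be wired up is a small name-to-class registry, so the engine is chosen by a string such as `paddleocr` or `tesseract`. The sketch below is an assumption about how that might be done, not the contents of `ocr/app/engines/`: `register_engine`, `create_engine`, and the empty placeholder classes are hypothetical.

```python
from typing import Callable, Dict, Type

# Hypothetical registry mapping engine names to engine classes.
_ENGINES: Dict[str, Type] = {}


def register_engine(name: str) -> Callable[[Type], Type]:
    """Class decorator that records an engine class under a lookup name."""
    def decorator(cls: Type) -> Type:
        _ENGINES[name] = cls
        return cls
    return decorator


@register_engine("paddleocr")
class PaddleOcrEngine:
    """Placeholder; a real engine would wrap PaddleOCR here."""


@register_engine("tesseract")
class TesseractEngine:
    """Placeholder; a real engine would wrap pytesseract here."""


def create_engine(name: str):
    """Instantiate the engine registered under `name` (case-insensitive)."""
    key = name.strip().lower()
    if key not in _ENGINES:
        known = ", ".join(sorted(_ENGINES))
        raise ValueError(f"Unknown OCR engine {name!r}; expected one of: {known}")
    return _ENGINES[key]()
```

Under these assumptions, adding a new engine means defining one decorated class; callers keep using `create_engine("paddleocr")` and never import engine classes directly.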
### Data Extraction
@@ -291,85 +304,93 @@
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6

# Task Queue
celery>=5.3.0
redis>=4.6.0
pydantic>=2.0.0

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# PDF Processing
pymupdf>=1.23.0

# Image Preprocessing
opencv-python-headless>=4.8.0
deskew>=1.4.0
scikit-image>=0.21.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.5.0
paddleocr>=2.7.0
paddlepaddle>=2.6.0
paddleocr>=2.8.0
google-cloud-vision>=3.7.0

# Table Extraction
img2table>=1.2.0
camelot-py[cv]>=0.11.0
# PDF Processing
PyMuPDF>=1.23.0

# NLP & Data
spacy>=3.6.0
pandas>=2.0.0
# Redis for job queue
redis>=5.0.0

# Storage & Database
boto3>=1.28.0
psycopg2-binary>=2.9.0
sqlalchemy>=2.0.0
# HTTP client for callbacks
httpx>=0.24.0

# Testing
pytest>=7.4.0
pytest-asyncio>=0.21.0
```

### System Package Requirements (Ubuntu/Debian)

```bash
# Tesseract OCR
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev
# Tesseract OCR (backward compatibility engine)
apt-get install tesseract-ocr tesseract-ocr-eng

# PaddlePaddle OpenMP runtime
apt-get install libgomp1

# HEIC Support
apt-get install libheif-examples libheif-dev
apt-get install libheif1 libheif-dev

# OpenCV dependencies
apt-get install libgl1-mesa-glx libglib2.0-0
# GLib (OpenCV dependency)
apt-get install libglib2.0-0

# PDF rendering dependencies
apt-get install libmupdf-dev mupdf-tools

# Image processing
apt-get install libmagic1 ghostscript

# Camelot dependencies
apt-get install ghostscript python3-tk
# File type detection
apt-get install libmagic1
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_PRIMARY_ENGINE` | `paddleocr` | Primary OCR engine (`paddleocr`, `tesseract`) |
| `OCR_CONFIDENCE_THRESHOLD` | `0.6` | Minimum confidence threshold |
| `OCR_FALLBACK_ENGINE` | `none` | Fallback engine (`google_vision`, `none`) |
| `OCR_FALLBACK_THRESHOLD` | `0.6` | Confidence below this triggers fallback |
| `GOOGLE_VISION_KEY_PATH` | `/run/secrets/google-vision-key.json` | Path to Google Vision service account key |
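As a rough illustration of how the variables in the table above might be read at startup, the sketch below loads them into a frozen dataclass and validates the documented values. The variable names and defaults are taken from the table; `OcrSettings`, `load_settings`, and the `cloud_enabled` property are hypothetical names, and the service may well load its configuration differently (for example through pydantic).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PRIMARY_ENGINES = {"paddleocr", "tesseract"}
FALLBACK_ENGINES = {"google_vision", "none"}


@dataclass(frozen=True)
class OcrSettings:
    """Defaults mirror the Environment Variables table."""
    primary_engine: str = "paddleocr"
    confidence_threshold: float = 0.6
    fallback_engine: str = "none"
    fallback_threshold: float = 0.6
    google_vision_key_path: str = "/run/secrets/google-vision-key.json"

    @property
    def cloud_enabled(self) -> bool:
        # The cloud fallback stays off unless an engine other than "none" is set.
        return self.fallback_engine != "none"


def load_settings(env: Optional[Mapping[str, str]] = None) -> OcrSettings:
    """Build settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    d = OcrSettings()
    settings = OcrSettings(
        primary_engine=env.get("OCR_PRIMARY_ENGINE", d.primary_engine).lower(),
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", d.confidence_threshold)),
        fallback_engine=env.get("OCR_FALLBACK_ENGINE", d.fallback_engine).lower(),
        fallback_threshold=float(env.get("OCR_FALLBACK_THRESHOLD", d.fallback_threshold)),
        google_vision_key_path=env.get("GOOGLE_VISION_KEY_PATH", d.google_vision_key_path),
    )
    if settings.primary_engine not in PRIMARY_ENGINES:
        raise ValueError(f"OCR_PRIMARY_ENGINE must be one of {sorted(PRIMARY_ENGINES)}")
    if settings.fallback_engine not in FALLBACK_ENGINES:
        raise ValueError(f"OCR_FALLBACK_ENGINE must be one of {sorted(FALLBACK_ENGINES)}")
    for name in ("confidence_threshold", "fallback_threshold"):
        if not 0.0 <= getattr(settings, name) <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return settings
```

Passing an explicit mapping, as in `load_settings({"OCR_FALLBACK_ENGINE": "google_vision"})`, keeps the function easy to test without touching the process environment.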

---

## DOCKERFILE

```dockerfile
FROM python:3.11-slim
# Primary engine: PaddleOCR PP-OCRv4 (models baked into image)
# Backward compat: Tesseract 5.x (optional, via TesseractEngine)
# Cloud fallback: Google Vision (optional, requires API key at runtime)

FROM python:3.13-slim

# System dependencies
# - tesseract-ocr/eng: Backward-compatible OCR engine
# - libgomp1: OpenMP runtime required by PaddlePaddle
# - libheif1/libheif-dev: HEIF image support (iPhone photos)
# - libglib2.0-0: GLib shared library (OpenCV dependency)
# - libmagic1: File type detection
# - curl: Health check endpoint
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libtesseract-dev \
    libheif-examples \
    libgomp1 \
    libheif1 \
    libheif-dev \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libmagic1 \
    ghostscript \
    poppler-utils \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
@@ -377,11 +398,9 @@ WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download spaCy model
RUN python -m spacy download en_core_web_sm

# Download PaddleOCR models (cached in image)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"
# Pre-download PaddleOCR PP-OCRv4 models during build (not at runtime)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False, show_log=False)" \
    && echo "PaddleOCR PP-OCRv4 models downloaded and verified"

COPY . .