motovaultpro/docs/ocr-pipeline-tech-stack.md
Eric Gullickson 47c5676498
chore: update OCR tests and documentation (refs #121)
Add engine abstraction tests and update docs to reflect PaddleOCR primary
architecture with optional Google Vision cloud fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:42:51 -06:00


# Vehicle Owner's Manual & Receipt OCR Pipeline
## Complete Tech Stack & Architecture
---
## SYSTEM FLOW DIAGRAM
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ USER UPLOAD │
│ (PDF, JPG, PNG, HEIC, TIFF, WEBP) │
└─────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 1. FORMAT DETECTION │
│ python-magic │
│ │
│ • Detect MIME type from file bytes (not extension) │
│ • Validate file is an accepted format │
│ • Reject unsupported/malicious files early │
└─────────────────────────────────────────────────────────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ PDF DETECTED │ │ IMAGE DETECTED │
└─────────────────────────────────┘ │ (JPG, PNG, HEIC, TIFF, WEBP) │
│ └─────────────────────────────────┘
▼ │
┌─────────────────────────────────┐ │
│ 2a. PDF TEXT LAYER CHECK │ │
│ PyMuPDF (fitz) │ │
│ │ │
│ • Check if PDF has embedded │ │
│ searchable text │ │
│ • Count text characters │ │
└─────────────────────────────────┘ │
│ │
┌───────┴───────┐ │
▼ ▼ │
┌──────────────┐ ┌──────────────┐ │
│ HAS TEXT │ │ NO TEXT │ │
│ (Native) │ │ (Scanned) │ │
└──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ 2b. RENDER TO IMAGES │ │
│ │ PyMuPDF @ 300 DPI │ │
│ │ │ │
│ │ • Page-by-page render│ │
│ │ • Maintain quality │ │
│ └──────────────────────┘ │
│ │ │
│ └──────────────┬──────────────┘
│ ▼
│ ┌─────────────────────────────────────────────────────────┐
│ │ 3. IMAGE NORMALIZATION │
│ ├─────────────────────────────────────────────────────────┤
│ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ │ HEIC Input │ │ EXIF Check │ │ Convert to │ │
│ │ │ │───▶│ │───▶│ RGB PNG │ │
│ │ │ pillow-heif │ │ Fix rotation│ │ │ │
│ │ └─────────────┘ │ ImageOps │ │ Pillow │ │
│ │ └─────────────┘ └────────────┘ │
│ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │
│ │ │ JPG/PNG/etc │ │ EXIF Check │ │ Convert to │ │
│ │ │ │───▶│ │───▶│ RGB PNG │ │
│ │ │ Pillow │ │ Fix rotation│ │ │ │
│ │ └─────────────┘ └─────────────┘ └────────────┘ │
│ │ │
│ └─────────────────────────────────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────────────┐
│ │ 4. PREPROCESSING │
│ ├─────────────────────────────────────────────────────────┤
│ │ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 4a. Resolution Normalization │ │
│ │ │ Target: 300 DPI equivalent │ │
│ │ │ Tool: Pillow resize with LANCZOS │ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 4b. Deskew (Straighten) │ │
│ │ │ Tool: OpenCV + deskew library │ │
│ │ │ Method: Hough transform / projection │ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 4c. Denoise │ │
│ │ │ Tool: OpenCV fastNlMeansDenoising │ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌───────────────────────────────────────────────────┐ │
│ │ │ 4d. Document-Specific Enhancement │ │
│ │ │ │ │
│ │ │ RECEIPTS: MANUALS: │ │
│ │ │ • Adaptive threshold • Contrast stretch │ │
│ │ │ • Perspective correct • Sharpen │ │
│ │ │ • High contrast B&W • Keep grayscale │ │
│ │ │ │ │
│ │ │ Tool: OpenCV adaptiveThreshold / CLAHE │ │
│ │ └───────────────────────────────────────────────────┘ │
│ │ │
│ └─────────────────────────────────────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────────────────────────────────────┐
│ │ 5. OCR ENGINE │
│ ├─────────────────────────────────────────────────────────┤
│ │ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 5a. Engine Abstraction Layer │ │
│ │ │ │ │
│ │ │ OcrEngine ABC -> PaddleOcrEngine (primary) │ │
│ │ │ -> CloudEngine (optional fallback) │ │
│ │ │ -> TesseractEngine (backward compat)│ │
│ │ │ -> HybridEngine (primary+fallback) │ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌─────────────────────────────────────────────────┐ │
│ │ │ 5b. Primary OCR: PaddleOCR PP-OCRv4 │ │
│ │ │ │ │
│ │ │ • Scene text detection + angle classification │ │
│ │ │ • CPU-only, models baked into Docker image │ │
│ │ │ • Normalized output: text, confidence, word boxes│ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ ▼ │
│ │ ┌───────────────┐ │
│ │ │ Confidence │ │
│ │ │ >= 60% ? │ │
│ │ └───────────────┘ │
│ │ │ │ │
│ │ YES ──┘ └── NO (and cloud enabled) │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌─────────────────────────────────┐ │
│ │ │ │ 5c. Optional Cloud Fallback │ │
│ │ │ │ (Google Vision API) │ │
│ │ │ │ │ │
│ │ │ │ • Disabled by default │ │
│ │ │ │ • 5-second timeout guard │ │
│ │ │ │ • Returns higher-confidence │ │
│ │ │ │ result of primary vs fallback │ │
│ │ │ └─────────────────────────────────┘ │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌─────────────────────────────────┐ │
│ │ │ 5d. HybridEngine Result │ │
│ │ │ • Compare confidences │ │
│ │ │ • Keep highest confidence │ │
│ │ │ • Graceful fallback on error │ │
│ │ └─────────────────────────────────┘ │
│ │ │
│ └─────────────────────────────────────────────────────────┘
│ │
└──────────────────────────────┤
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 6. STRUCTURED EXTRACTION │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ 6a. Layout Analysis │ │
│ │ Tool: PaddleOCR Layout / LayoutParser │ │
│ │ Detects: Headers, paragraphs, tables, lists, figures │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────┴─────────────────┐ │
│ ▼ ▼ │
│ ┌────────────────────────────────┐ ┌────────────────────────────────┐ │
│ │ 6b. Table Extraction │ │ 6c. Text Block Processing │ │
│ │ │ │ │ │
│ │ Tool: img2table or Camelot │ │ • Group by proximity │ │
│ │ Output: Pandas DataFrame │ │ • Identify sections │ │
│ │ │ │ • Extract hierarchies │ │
│ └────────────────────────────────┘ └────────────────────────────────┘ │
│ │ │ │
│ └─────────────────┬─────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ 6d. Maintenance Schedule Pattern Matching │ │
│ │ │ │
│ │ Tool: regex + spaCy NER │ │
│ │ │ │
│ │ Patterns: │ │
│ │ • Mileage intervals: "every 5,000 miles", "30,000 km" │ │
│ │ • Time intervals: "every 6 months", "annually" │ │
│ │ • Service types: "oil change", "tire rotation", "brake inspection" │ │
│ │ • Fluid types: "5W-30", "ATF", "DOT 4" │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ 7. OUTPUT LAYER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ Structured JSON Response │ │
│ │ │ │
│ │ { │ │
│ │ "document_type": "owners_manual" | "receipt", │ │
│ │ "vehicle": { │ │
│ │ "make": "Toyota", │ │
│ │ "model": "Camry", │ │
│ │ "year": 2024 │ │
│ │ }, │ │
│ │ "maintenance_schedule": [ │ │
│ │ { │ │
│ │ "service": "Oil Change", │ │
│ │ "interval_miles": 5000, │ │
│ │ "interval_months": 6, │ │
│ │ "details": "Use 0W-20 synthetic" │ │
│ │ } │ │
│ │ ], │ │
│ │ "raw_text": "...", │ │
│ │ "tables": [...], │ │
│ │ "confidence_score": 0.94, │ │
│ │ "processing_time_ms": 1250 │ │
│ │ } │ │
│ │ │ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
```
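The branching in the diagram (native PDF vs. scanned PDF vs. image) can be summarized as a small routing function. This is only a sketch: `plan_stages`, `ACCEPTED_IMAGE_TYPES`, and the stage names are hypothetical labels that mirror the numbered boxes, not identifiers from the service code.

```python
from typing import List

# Hypothetical names; the MIME types correspond to the formats listed under USER UPLOAD.
ACCEPTED_IMAGE_TYPES = {
    "image/jpeg", "image/png", "image/heic", "image/tiff", "image/webp",
}

# Stages 3-5 of the diagram, shared by scanned PDFs and image uploads.
OCR_STAGES = ["normalize", "preprocess", "ocr"]


def plan_stages(mime_type: str, has_text_layer: bool = False) -> List[str]:
    """Return the ordered pipeline stages for one upload."""
    if mime_type == "application/pdf":
        if has_text_layer:
            # Native PDF: embedded text goes straight to structured extraction.
            return ["pdf_text_check", "structured_extraction", "output"]
        # Scanned PDF: render pages to images, then follow the image path.
        return ["pdf_text_check", "render_300dpi", *OCR_STAGES,
                "structured_extraction", "output"]
    if mime_type in ACCEPTED_IMAGE_TYPES:
        return [*OCR_STAGES, "structured_extraction", "output"]
    # Step 1 rejects anything else before further work is done.
    raise ValueError(f"unsupported format: {mime_type}")
```

A native PDF skips stages 3-5 entirely, which is why the processing-time table further down shows it as much faster than a scanned manual.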
---
## COMPLETE TECH STACK
### Core Dependencies
| Component | Tool | Version | Purpose |
|------------------------|-----------------------|-----------|--------------------------------------|
| **Runtime** | Python | 3.11+ | Primary language |
| **API Framework** | FastAPI | 0.100+ | REST API with async support |
| **Task Queue** | Celery + Redis | Celery 5.3+ | Async processing for large docs |
### File Handling
| Component | Tool | Purpose |
|------------------------|-----------------------|--------------------------------------|
| **Format Detection** | python-magic | MIME type detection from bytes |
| **HEIC Support** | pillow-heif | iPhone HEIC image conversion |
| **Image Processing** | Pillow | General image I/O and manipulation |
| **PDF Processing** | PyMuPDF (fitz) | PDF text extraction & rendering |
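
Detecting the type from bytes rather than from the extension means reading the file's leading signature. With python-magic this is presumably a call such as `magic.from_buffer(data, mime=True)`. The sketch below is not that library: it hand-checks a few well-known signatures using only the standard library, to illustrate the idea. `sniff_mime` and `HEIC_BRANDS` are hypothetical names, and libmagic recognizes far more variants than this.

```python
from typing import Optional

# ISO-BMFF major brands treated as HEIC here (an illustrative subset only).
HEIC_BRANDS = {b"heic", b"heix"}


def sniff_mime(data: bytes) -> Optional[str]:
    """Guess a MIME type from leading bytes; None means 'reject the upload'."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:4] in (b"II*\x00", b"MM\x00*"):  # little- and big-endian TIFF
        return "image/tiff"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data[4:8] == b"ftyp" and data[8:12] in HEIC_BRANDS:
        return "image/heic"
    return None
```

Because the decision ignores the filename, a renamed executable such as `receipt.jpg` is rejected before any image library touches it.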
### Image Preprocessing
| Component | Tool | Purpose |
|------------------------|-----------------------|--------------------------------------|
| **Computer Vision** | OpenCV (cv2) | Deskew, denoise, threshold |
| **Deskew** | deskew | Specialized document straightening |
| **Enhancement** | scikit-image | Additional image filters |
### OCR Engines
| Component | Tool | Purpose |
|------------------------|-----------------------|--------------------------------------|
| **Primary OCR** | PaddleOCR PP-OCRv4 | Primary engine: scene text detection with angle classification, CPU-only |
| **Cloud Fallback** | Google Vision API | Optional cloud fallback (disabled by default) |
| **Backward Compat** | Tesseract 5.x / pytesseract | Legacy engine, configurable via env var |
| **Engine Abstraction** | `OcrEngine` ABC | Pluggable engine interface in `ocr/app/engines/` |
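
The primary-plus-fallback behavior described for `HybridEngine` can be sketched as follows. The class names `OcrEngine` and `HybridEngine` come from this document. Everything else (`OcrResult`, the `recognize` method, `StubEngine`, the constructor parameters) is hypothetical and stands in for whatever `ocr/app/engines/` actually defines.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0


class OcrEngine(ABC):
    """Pluggable engine interface; `recognize` is an assumed method name."""

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> OcrResult:
        ...


class StubEngine(OcrEngine):
    """Stand-in for PaddleOcrEngine / CloudEngine so the sketch runs offline."""

    def __init__(self, text: str, confidence: float) -> None:
        self._result = OcrResult(text, confidence)

    def recognize(self, image_bytes: bytes) -> OcrResult:
        return self._result


class HybridEngine(OcrEngine):
    def __init__(self, primary: OcrEngine, fallback: Optional[OcrEngine] = None,
                 threshold: float = 0.6, timeout_s: float = 5.0) -> None:
        self.primary = primary
        self.fallback = fallback  # None models "cloud fallback disabled"
        self.threshold = threshold
        self.timeout_s = timeout_s

    def recognize(self, image_bytes: bytes) -> OcrResult:
        primary = self.primary.recognize(image_bytes)
        if self.fallback is None or primary.confidence >= self.threshold:
            return primary
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            future = pool.submit(self.fallback.recognize, image_bytes)
            fallback = future.result(timeout=self.timeout_s)
        except Exception:
            # Timeout or cloud error: degrade gracefully to the primary result.
            return primary
        finally:
            pool.shutdown(wait=False)
        # Keep whichever result reports the higher confidence (ties favor primary).
        return max(primary, fallback, key=lambda r: r.confidence)
```

For example, `HybridEngine(StubEngine("blurry", 0.4), StubEngine("sharp", 0.9)).recognize(b"")` returns the fallback result, while omitting the second argument always returns the primary one.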
### Data Extraction
| Component | Tool | Purpose |
|------------------------|-----------------------|--------------------------------------|
| **Table Extraction** | img2table | Image-based table extraction |
| **PDF Tables** | Camelot | Native PDF table extraction |
| **NLP** | spaCy | Entity extraction, pattern matching |
| **Data Handling** | pandas | Table/dataframe manipulation |
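
The regex half of the maintenance-schedule pattern matching (step 6d in the flow diagram) might look like the sketch below. The patterns are illustrative guesses built from the example phrases in the diagram ("every 5,000 miles", "every 6 months", "5W-30", "DOT 4"). They are not the project's actual expressions, and the spaCy NER stage is left out. All names here are hypothetical.

```python
import re

# Mileage such as "every 5,000 miles" or "30,000 km".
MILEAGE_RE = re.compile(
    r"(?:every\s+)?(\d{1,3}(?:,\d{3})+|\d+)\s*(miles|mi|km)\b", re.IGNORECASE
)
# Time intervals such as "every 6 months".
MONTHS_RE = re.compile(r"every\s+(\d+)\s+months?\b", re.IGNORECASE)
# Fluid specs such as "5W-30", "ATF", "DOT 4".
FLUID_RE = re.compile(r"\b(\d{1,2}W-\d{2}|ATF|DOT\s?\d)\b")
SERVICES = ("oil change", "tire rotation", "brake inspection")


def extract_schedule_hints(text: str) -> dict:
    """Collect raw interval, service, and fluid mentions from OCR text."""
    lowered = text.lower()
    months = [int(m) for m in MONTHS_RE.findall(text)]
    if "annually" in lowered:
        months.append(12)  # treat "annually" as a 12-month interval
    return {
        "mileage": [
            (int(num.replace(",", "")), unit.lower())
            for num, unit in MILEAGE_RE.findall(text)
        ],
        "months": months,
        "services": [s for s in SERVICES if s in lowered],
        "fluids": FLUID_RE.findall(text),
    }
```

Run on `"Oil change every 5,000 miles or every 6 months. Use 0W-20 synthetic."`, this yields one mileage entry `(5000, "miles")`, a 6-month interval, the service `"oil change"`, and the fluid `"0W-20"`.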
### Infrastructure
| Component | Tool | Purpose |
|------------------------|-----------------------|--------------------------------------|
| **Object Storage** | S3 / MinIO | Document and result storage |
| **Database** | PostgreSQL | Metadata and results |
| **Cache** | Redis | Result caching, queue backend |
| **Containerization** | Docker | Deployment |
---
## SYSTEM REQUIREMENTS FILE
### requirements.txt
```
# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6
pydantic>=2.0.0
# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0
# Image Preprocessing
opencv-python-headless>=4.8.0
numpy>=1.24.0
# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.6.0
paddleocr>=2.8.0
google-cloud-vision>=3.7.0
# PDF Processing
PyMuPDF>=1.23.0
# Redis for job queue
redis>=5.0.0
# HTTP client for callbacks
httpx>=0.24.0
# Testing
pytest>=7.4.0
pytest-asyncio>=0.21.0
```
### System Package Requirements (Ubuntu/Debian)
```bash
# Tesseract OCR (backward compatibility engine)
apt-get install tesseract-ocr tesseract-ocr-eng
# PaddlePaddle OpenMP runtime
apt-get install libgomp1
# HEIC Support
apt-get install libheif1 libheif-dev
# GLib (OpenCV dependency)
apt-get install libglib2.0-0
# File type detection
apt-get install libmagic1
```
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_PRIMARY_ENGINE` | `paddleocr` | Primary OCR engine (`paddleocr`, `tesseract`) |
| `OCR_CONFIDENCE_THRESHOLD` | `0.6` | Minimum confidence threshold |
| `OCR_FALLBACK_ENGINE` | `none` | Fallback engine (`google_vision`, `none`) |
| `OCR_FALLBACK_THRESHOLD` | `0.6` | Confidence below this triggers fallback |
| `GOOGLE_VISION_KEY_PATH` | `/run/secrets/google-vision-key.json` | Path to Google Vision service account key |
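
A minimal sketch of reading these variables with their documented defaults follows. `OcrSettings` and `load_ocr_settings` are hypothetical names. The actual service may well use a pydantic settings class instead, since `pydantic` is in the requirements.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OcrSettings:
    primary_engine: str
    confidence_threshold: float
    fallback_engine: Optional[str]  # None when the fallback is disabled
    fallback_threshold: float
    google_vision_key_path: str


def load_ocr_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Build settings from the environment, applying the table's defaults."""
    primary = env.get("OCR_PRIMARY_ENGINE", "paddleocr")
    if primary not in {"paddleocr", "tesseract"}:
        raise ValueError(f"unknown OCR_PRIMARY_ENGINE: {primary}")
    fallback = env.get("OCR_FALLBACK_ENGINE", "none")
    if fallback not in {"google_vision", "none"}:
        raise ValueError(f"unknown OCR_FALLBACK_ENGINE: {fallback}")
    return OcrSettings(
        primary_engine=primary,
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.6")),
        fallback_engine=None if fallback == "none" else fallback,
        fallback_threshold=float(env.get("OCR_FALLBACK_THRESHOLD", "0.6")),
        google_vision_key_path=env.get(
            "GOOGLE_VISION_KEY_PATH", "/run/secrets/google-vision-key.json"
        ),
    )
```

Passing a plain dict, for example `load_ocr_settings({"OCR_FALLBACK_ENGINE": "google_vision"})`, makes the loader easy to exercise in tests without touching the real environment.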
---
## DOCKERFILE
```dockerfile
# Primary engine: PaddleOCR PP-OCRv4 (models baked into image)
# Backward compat: Tesseract 5.x (optional, via TesseractEngine)
# Cloud fallback: Google Vision (optional, requires API key at runtime)
FROM python:3.13-slim
# System dependencies
# - tesseract-ocr/eng: Backward-compatible OCR engine
# - libgomp1: OpenMP runtime required by PaddlePaddle
# - libheif1/libheif-dev: HEIF image support (iPhone photos)
# - libglib2.0-0: GLib shared library (OpenCV dependency)
# - libmagic1: File type detection
# - curl: Health check endpoint
RUN apt-get update && apt-get install -y --no-install-recommends \
tesseract-ocr \
tesseract-ocr-eng \
libgomp1 \
libheif1 \
libheif-dev \
libglib2.0-0 \
libmagic1 \
curl \
&& rm -rf /var/lib/apt/lists/*
# Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Pre-download PaddleOCR PP-OCRv4 models during build (not at runtime)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False, show_log=False)" \
&& echo "PaddleOCR PP-OCRv4 models downloaded and verified"
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
---
## PROCESSING TIME ESTIMATES
| Document Type | Size | Expected Time | Notes |
|--------------------------|---------------|----------------|--------------------------------|
| Single receipt (HEIC) | 2-5 MB | 1-3 seconds | After preprocessing |
| Single receipt (JPG) | 500 KB-2 MB | 0.5-2 seconds | Direct processing |
| Owner's manual (native PDF) | 10-50 MB | 30-120 seconds | 100-300 pages, embedded text layer |
| Owner's manual (scanned) | 50-200 MB | 2-5 minutes | Requires full OCR |
---
## SCALING CONSIDERATIONS
```
┌─────────────────────────────────────────┐
│ Load Balancer │
│ (nginx) │
└─────────────────────────────────────────┘
┌────────────────────────┼────────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ API Server │ │ API Server │ │ API Server │
│ (FastAPI) │ │ (FastAPI) │ │ (FastAPI) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└────────────────────────┼────────────────────────┘
┌─────────────────────────────────────────┐
│ Redis Queue │
└─────────────────────────────────────────┘
┌────────────────────────┼────────────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Celery Worker │ │ Celery Worker │ │ Celery Worker │
│ (OCR Heavy) │ │ (OCR Heavy) │ │ (OCR Heavy) │
│ CPU Optimized │ │ CPU Optimized │ │ GPU Optional │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
### Scaling Strategy
1. **Small files (receipts)**: Process synchronously in API server
2. **Large files (manuals)**: Queue to Celery workers, return job ID
3. **Horizontal scaling**: Add more Celery workers for throughput
4. **GPU acceleration**: PaddleOCR can run on GPU (estimated 5-10x speedup); the Dockerfile above builds a CPU-only image (`use_gpu=False`)
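
The sync-versus-queue split in the strategy above could be decided by a helper like the one below. This is a sketch under stated assumptions: `choose_processing_mode` is a hypothetical name, and the 5 MB cutoff is an assumed value (the upper end of the receipt sizes in the processing-time table), not a documented limit.

```python
# Assumed cutoff: receipts in the estimates table top out around 5 MB.
SYNC_SIZE_LIMIT_BYTES = 5 * 1024 * 1024


def choose_processing_mode(document_type: str, size_bytes: int) -> str:
    """Return "sync" for small receipts, "queued" for everything else."""
    if document_type == "receipt" and size_bytes <= SYNC_SIZE_LIMIT_BYTES:
        return "sync"
    # Manuals and oversized uploads go to a Celery worker; the API returns a job ID.
    return "queued"
```

Defaulting to `"queued"` for unknown document types keeps a slow document from tying up an API server.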
---
## ERROR HANDLING FLOW
```
┌─────────────────┐
│ File Upload │
└────────┬────────┘
┌─────────────────┐ ┌─────────────────────────────────────────┐
│ Format Valid? │──NO─▶│ Return 400: Unsupported format │
└────────┬────────┘ └─────────────────────────────────────────┘
│ YES
┌─────────────────┐ ┌─────────────────────────────────────────┐
│ File Corrupt? │─YES─▶│ Return 422: Unable to process file │
└────────┬────────┘ └─────────────────────────────────────────┘
│ NO
┌─────────────────┐ ┌─────────────────────────────────────────┐
│ OCR Confidence │─LOW─▶│ Return 200 with warning flag │
│ < 50%? │ │ { "warning": "low_confidence" } │
└────────┬────────┘ └─────────────────────────────────────────┘
│ OK
┌─────────────────┐ ┌─────────────────────────────────────────┐
│ No Schedule │─YES─▶│ Return 200 with raw text only │
│ Found? │ │ { "maintenance_schedule": null } │
└────────┬────────┘ └─────────────────────────────────────────┘
│ NO
┌─────────────────┐
│ Return Full │
│ Structured Data │
└─────────────────┘
```
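The decision chain above maps onto a single response-building function. The sketch below follows the diagram's order and status codes. `build_response`, its parameters, and the `"error"` key are hypothetical, while `"warning"`, `"maintenance_schedule"`, `"raw_text"`, and `"confidence_score"` come from this document.

```python
from typing import Any, Dict, List, Optional, Tuple

LOW_CONFIDENCE = 0.5  # the "< 50%" branch in the diagram


def build_response(
    format_valid: bool,
    corrupt: bool,
    confidence: float,
    raw_text: str,
    schedule: Optional[List[dict]],
) -> Tuple[int, Dict[str, Any]]:
    """Return (HTTP status, JSON body) following the error-handling flow."""
    if not format_valid:
        return 400, {"error": "Unsupported format"}
    if corrupt:
        return 422, {"error": "Unable to process file"}
    body: Dict[str, Any] = {"raw_text": raw_text, "confidence_score": confidence}
    if confidence < LOW_CONFIDENCE:
        # Still a 200: the caller gets the text plus a warning flag.
        body["warning"] = "low_confidence"
        return 200, body
    # An empty or missing schedule is reported as null alongside the raw text.
    body["maintenance_schedule"] = schedule or None
    return 200, body
```

Low confidence and a missing schedule both return 200 rather than an error, so clients need to inspect the body, not just the status code.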
---
## QUICK START
```bash
# 1. Clone and setup
git clone <repo>
cd ocr-pipeline
# 2. Build Docker image
docker build -t ocr-pipeline .
# 3. Start services
docker-compose up -d
# 4. Test endpoint
curl -X POST "http://localhost:8000/api/v1/extract" \
-F "file=@owners_manual.pdf" \
-F "document_type=manual"
```
---
## API ENDPOINT DESIGN
```
POST /api/v1/extract
- Accepts: multipart/form-data
- Fields: file, document_type (optional)
- Returns: Structured JSON or job_id for large files
GET /api/v1/jobs/{job_id}
- Poll for async job status
- Returns: status, progress, result when complete
POST /api/v1/extract/batch
- Accepts: multiple files
- Returns: array of job_ids
```
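
On the client side, the async path amounts to polling `GET /api/v1/jobs/{job_id}` until the job finishes. The helper below is a sketch only: `wait_for_job` is a hypothetical name, and the status strings `"completed"` and `"failed"` are assumptions, since this document does not list the actual status values. The HTTP call is injected as `fetch_status` so the loop can run without a server.

```python
import time
from typing import Callable, Dict

# Assumed terminal states; the real API may name them differently.
TERMINAL_STATES = ("completed", "failed")


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval_s: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal state or max_polls is exhausted."""
    for _ in range(max_polls):
        payload = fetch_status(job_id)  # e.g. a wrapper around GET /api/v1/jobs/{job_id}
        if payload.get("status") in TERMINAL_STATES:
            return payload
        sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

In practice `fetch_status` could be a thin wrapper around an `httpx` GET request that returns the parsed JSON body.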