motovaultpro/docs/ocr-pipeline-tech-stack.md

# Vehicle Owner's Manual & Receipt OCR Pipeline
## Complete Tech Stack & Architecture

---

## SYSTEM FLOW DIAGRAM

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              USER UPLOAD                                         │
│                   (PDF, JPG, PNG, HEIC, TIFF, WEBP)                              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         1. FORMAT DETECTION                                      │
│                           python-magic                                           │
│                                                                                  │
│   • Detect MIME type from file bytes (not extension)                            │
│   • Validate file is an accepted format                                          │
│   • Reject unsupported/malicious files early                                     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                      ┌───────────────┴───────────────┐
                      ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│         PDF DETECTED            │   │        IMAGE DETECTED           │
└─────────────────────────────────┘   │   (JPG, PNG, HEIC, TIFF, WEBP)  │
                │                     └─────────────────────────────────┘
                ▼                                     │
┌─────────────────────────────────┐                   │
│   2a. PDF TEXT LAYER CHECK      │                   │
│         PyMuPDF (fitz)          │                   │
│                                 │                   │
│  • Check if PDF has embedded    │                   │
│    searchable text              │                   │
│  • Count text characters        │                   │
└─────────────────────────────────┘                   │
                │                                     │
        ┌───────┴───────┐                             │
        ▼               ▼                             │
┌──────────────┐ ┌──────────────┐                     │
│  HAS TEXT    │ │  NO TEXT     │                     │
│  (Native)    │ │  (Scanned)   │                     │
└──────────────┘ └──────────────┘                     │
        │               │                             │
        │               ▼                             │
        │       ┌──────────────────────┐              │
        │       │ 2b. RENDER TO IMAGES │              │
        │       │    PyMuPDF @ 300 DPI │              │
        │       │                      │              │
        │       │ • Page-by-page render│              │
        │       │ • Maintain quality   │              │
        │       └──────────────────────┘              │
        │               │                             │
        │               └──────────────┬──────────────┘
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                3. IMAGE NORMALIZATION                    │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ HEIC Input  │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │ pillow-heif │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    │ ImageOps    │    │  Pillow    │ │
        │       │                      └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │   Pillow    │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                 4. PREPROCESSING                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4a. Resolution Normalization                   │   │
        │       │   │      Target: 300 DPI equivalent                 │   │
        │       │   │      Tool: Pillow resize with LANCZOS           │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4b. Deskew (Straighten)                        │   │
        │       │   │      Tool: OpenCV + deskew library              │   │
        │       │   │      Method: Hough transform / projection       │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4c. Denoise                                    │   │
        │       │   │      Tool: OpenCV fastNlMeansDenoising          │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌───────────────────────────────────────────────────┐ │
        │       │   │  4d. Document-Specific Enhancement                │ │
        │       │   │                                                   │ │
        │       │   │   RECEIPTS:              MANUALS:                 │ │
        │       │   │   • Adaptive threshold   • Contrast stretch       │ │
        │       │   │   • Perspective correct  • Sharpen                │ │
        │       │   │   • High contrast B&W    • Keep grayscale         │ │
        │       │   │                                                   │ │
        │       │   │   Tool: OpenCV adaptiveThreshold / CLAHE          │ │
        │       │   └───────────────────────────────────────────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                    5. OCR ENGINE                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5a. Engine Abstraction Layer                    │   │
        │       │   │                                                  │   │
        │       │   │  OcrEngine ABC -> PaddleOcrEngine (primary)      │   │
        │       │   │                -> CloudEngine (optional fallback) │   │
        │       │   │                -> TesseractEngine (backward compat)│  │
        │       │   │                -> HybridEngine (primary+fallback) │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5b. Primary OCR: PaddleOCR PP-OCRv4             │   │
        │       │   │                                                  │   │
        │       │   │  • Scene text detection + angle classification   │   │
        │       │   │  • CPU-only, models baked into Docker image      │   │
        │       │   │  • Normalized output: text, confidence, word boxes│  │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │                 ┌───────────────┐                       │
        │       │                 │  Confidence   │                       │
        │       │                 │   >= 60% ?    │                       │
        │       │                 └───────────────┘                       │
        │       │                    │         │                          │
        │       │              YES ──┘         └── NO (and cloud enabled) │
        │       │               │                   │                     │
        │       │               │                   ▼                     │
        │       │               │   ┌─────────────────────────────────┐   │
        │       │               │   │  5c. Optional Cloud Fallback     │   │
        │       │               │   │      (Google Vision API)         │   │
        │       │               │   │                                  │   │
        │       │               │   │  • Disabled by default           │   │
        │       │               │   │  • 5-second timeout guard        │   │
        │       │               │   │  • Returns higher-confidence     │   │
        │       │               │   │    result of primary vs fallback │   │
        │       │               │   └─────────────────────────────────┘   │
        │       │               │                   │                     │
        │       │               ▼                   ▼                     │
        │       │         ┌─────────────────────────────────┐             │
        │       │         │  5d. HybridEngine Result        │             │
        │       │         │  • Compare confidences          │             │
        │       │         │  • Keep highest confidence      │             │
        │       │         │  • Graceful fallback on error   │             │
        │       │         └─────────────────────────────────┘             │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        └──────────────────────────────┤
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        6. STRUCTURED EXTRACTION                                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6a. Layout Analysis                                                  │     │
│   │      Tool: PaddleOCR Layout / LayoutParser                            │     │
│   │      Detects: Headers, paragraphs, tables, lists, figures             │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                      │                                          │
│                    ┌─────────────────┴─────────────────┐                        │
│                    ▼                                   ▼                        │
│   ┌────────────────────────────────┐   ┌────────────────────────────────┐       │
│   │  6b. Table Extraction          │   │  6c. Text Block Processing     │       │
│   │                                │   │                                │       │
│   │  Tool: img2table or Camelot    │   │  • Group by proximity          │       │
│   │  Output: Pandas DataFrame      │   │  • Identify sections           │       │
│   │                                │   │  • Extract hierarchies         │       │
│   └────────────────────────────────┘   └────────────────────────────────┘       │
│                    │                                   │                        │
│                    └─────────────────┬─────────────────┘                        │
│                                      ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6d. Maintenance Schedule Pattern Matching                            │     │
│   │                                                                       │     │
│   │  Tool: regex + spaCy NER                                              │     │
│   │                                                                       │     │
│   │  Patterns:                                                            │     │
│   │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │     │
│   │  • Time intervals: "every 6 months", "annually"                       │     │
│   │  • Service types: "oil change", "tire rotation", "brake inspection"  │     │
│   │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            7. OUTPUT LAYER                                       │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  Structured JSON Response                                             │     │
│   │                                                                       │     │
│   │  {                                                                    │     │
│   │    "document_type": "owners_manual" | "receipt",                      │     │
│   │    "vehicle": {                                                       │     │
│   │      "make": "Toyota",                                                │     │
│   │      "model": "Camry",                                                │     │
│   │      "year": 2024                                                     │     │
│   │    },                                                                 │     │
│   │    "maintenance_schedule": [                                          │     │
│   │      {                                                                │     │
│   │        "service": "Oil Change",                                       │     │
│   │        "interval_miles": 5000,                                        │     │
│   │        "interval_months": 6,                                          │     │
│   │        "details": "Use 0W-20 synthetic"                               │     │
│   │      }                                                                │     │
│   │    ],                                                                 │     │
│   │    "raw_text": "...",                                                 │     │
│   │    "tables": [...],                                                   │     │
│   │    "confidence_score": 0.94,                                          │     │
│   │    "processing_time_ms": 1250                                         │     │
│   │  }                                                                    │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
```
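
Step 2a in the diagram decides whether a PDF can skip rendering and OCR. The sketch below shows one way to make that decision from per-page character counts. The threshold `MIN_CHARS_PER_PAGE` and both function names are assumptions for illustration; the diagram only says to check for embedded text and count characters.

```python
from typing import Iterable

# Hypothetical threshold: the diagram only says "count text characters".
MIN_CHARS_PER_PAGE = 50


def has_text_layer(page_texts: Iterable[str], min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when the average non-whitespace character count per page
    meets the threshold, i.e. the PDF can be treated as native text."""
    counts = [len("".join(text.split())) for text in page_texts]
    if not counts:
        return False
    return sum(counts) / len(counts) >= min_chars_per_page


def pdf_page_texts(pdf_bytes: bytes) -> list[str]:
    """Extract the embedded text of each page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so has_text_layer stays dependency-free

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return [page.get_text() for page in doc]
```

A caller would run `has_text_layer(pdf_page_texts(data))` and send the PDF down the "NO TEXT" branch when it returns `False`, rendering each page with PyMuPDF's `page.get_pixmap(dpi=300)` before image normalization.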

---

## COMPLETE TECH STACK

### Core Dependencies

| Component              | Tool                  | Version   | Purpose                              |
|------------------------|-----------------------|-----------|--------------------------------------|
| **Runtime**            | Python                | 3.11+     | Primary language                     |
| **API Framework**      | FastAPI               | 0.100+    | REST API with async support          |
| **Task Queue**         | Celery + Redis        | 5.3+      | Async processing for large docs      |

### File Handling

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Format Detection**   | python-magic          | MIME type detection from bytes       |
| **HEIC Support**       | pillow-heif           | iPhone HEIC image conversion         |
| **Image Processing**   | Pillow                | General image I/O and manipulation   |
| **PDF Processing**     | PyMuPDF (fitz)        | PDF text extraction & rendering      |
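
As a rough illustration of the format detection step, the sketch below checks the MIME type reported by python-magic against the accepted upload formats. `detect_format`, `UnsupportedFormatError`, and the exact MIME set are assumptions, not names from the codebase; the detector is injectable so the validation logic can be exercised without libmagic installed.

```python
from typing import Callable

# MIME types for the formats listed in the upload step. Which HEIC/HEIF label
# libmagic reports depends on its version, so both are accepted (assumption).
ACCEPTED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/heic",
    "image/heif",
    "image/tiff",
    "image/webp",
}


class UnsupportedFormatError(ValueError):
    """Raised when the detected MIME type is not accepted (hypothetical name)."""


def _detect_with_libmagic(data: bytes) -> str:
    import magic  # python-magic; needs the libmagic1 system package

    # Only the leading bytes are needed for signature matching.
    return magic.from_buffer(data[:2048], mime=True)


def detect_format(data: bytes, detect: Callable[[bytes], str] = _detect_with_libmagic) -> str:
    """Return the MIME type detected from the file bytes, or raise if unsupported."""
    mime = detect(data)
    if mime not in ACCEPTED_MIME_TYPES:
        raise UnsupportedFormatError(f"unsupported format: {mime}")
    return mime
```

Because detection works on the bytes rather than the filename, a renamed executable is rejected even if it carries a `.pdf` extension.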

### Image Preprocessing

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Computer Vision**    | OpenCV (cv2)          | Deskew, denoise, threshold           |
| **Deskew**             | deskew                | Specialized document straightening   |
| **Enhancement**        | scikit-image          | Additional image filters             |
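
Resolution normalization (step 4a) comes down to a scale factor between the source DPI and the 300 DPI target. The following is a minimal sketch under the assumption that the source DPI is already known; how it is estimated for phone photos, which often carry no meaningful DPI metadata, is outside its scope. The function names are illustrative.

```python
TARGET_DPI = 300  # from step 4a in the flow diagram


def scaled_size(width: int, height: int, source_dpi: float, target_dpi: float = TARGET_DPI) -> tuple[int, int]:
    """Pixel dimensions that keep the same physical page size at target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = target_dpi / source_dpi
    return max(1, round(width * scale)), max(1, round(height * scale))


def normalize_resolution(image, source_dpi: float):
    """Resize a Pillow image to the 300 DPI equivalent using LANCZOS resampling."""
    from PIL import Image  # Pillow, imported lazily so scaled_size has no dependencies

    new_size = scaled_size(image.width, image.height, source_dpi)
    if new_size == image.size:
        return image  # already at the target resolution
    return image.resize(new_size, Image.Resampling.LANCZOS)
```

For example, a letter-size page scanned at 100 DPI (850 x 1100 pixels) would be upscaled by a factor of 3 before deskew and denoise run.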

### OCR Engines

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Primary OCR**        | PaddleOCR PP-OCRv4    | Scene-text detection and recognition, CPU-only |
| **Cloud Fallback**     | Google Vision API     | Optional cloud fallback (disabled by default) |
| **Backward Compat**    | Tesseract 5.x / pytesseract | Legacy engine, configurable via env var |
| **Engine Abstraction** | `OcrEngine` ABC       | Pluggable engine interface in `ocr/app/engines/` |
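
The engine abstraction and the hybrid comparison from steps 5a to 5d could look roughly like this. It is a sketch, not the code in `ocr/app/engines/`: the `recognize` method, the `OcrResult` fields, and `StubEngine` are assumed for illustration, and the 5-second timeout guard is omitted for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0-1.0
    engine: str


class OcrEngine(ABC):
    """Pluggable engine interface; the method name is an assumption."""

    name = "base"

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> OcrResult: ...


class HybridEngine(OcrEngine):
    """Runs the primary engine and consults the fallback only on low confidence."""

    name = "hybrid"

    def __init__(self, primary: OcrEngine, fallback: Optional[OcrEngine] = None, threshold: float = 0.6):
        self.primary = primary
        self.fallback = fallback
        self.threshold = threshold

    def recognize(self, image_bytes: bytes) -> OcrResult:
        result = self.primary.recognize(image_bytes)
        if result.confidence >= self.threshold or self.fallback is None:
            return result
        try:
            candidate = self.fallback.recognize(image_bytes)
        except Exception:
            # Graceful fallback on error: keep the primary result.
            return result
        # Keep whichever result has the higher confidence.
        return candidate if candidate.confidence > result.confidence else result


class StubEngine(OcrEngine):
    """Test double standing in for PaddleOcrEngine or CloudEngine."""

    def __init__(self, name: str, text: str, confidence: float, fail: bool = False):
        self.name, self._text, self._confidence, self._fail = name, text, confidence, fail

    def recognize(self, image_bytes: bytes) -> OcrResult:
        if self._fail:
            raise RuntimeError(f"{self.name} unavailable")
        return OcrResult(self._text, self._confidence, self.name)
```

Wrapping the comparison in its own engine keeps callers unaware of whether a fallback is configured: they call `recognize` on whatever `OcrEngine` they are given.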

### Data Extraction

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Table Extraction**   | img2table             | Image-based table extraction         |
| **PDF Tables**         | Camelot               | Native PDF table extraction          |
| **NLP**                | spaCy                 | Entity extraction, pattern matching  |
| **Data Handling**      | pandas                | Table/dataframe manipulation         |
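
To make the maintenance schedule pattern matching concrete, here is a regex-only sketch covering the mileage and time interval examples from the diagram. It does not attempt the spaCy NER part. The function name and the choice to convert kilometers to miles (so that every record fills `interval_miles`) are assumptions.

```python
import re
from typing import Optional

# Distance intervals such as "every 5,000 miles" or "30,000 km".
DISTANCE_RE = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s*(miles?|mi|km|kilometers?)\b", re.IGNORECASE)
# Time intervals such as "every 6 months" or "every 2 years".
TIME_RE = re.compile(r"every\s+(\d+)\s+(months?|years?)\b", re.IGNORECASE)
ANNUAL_RE = re.compile(r"\bannually\b", re.IGNORECASE)
KM_PER_MILE = 1.609344


def parse_interval(text: str) -> dict[str, Optional[int]]:
    """Extract the first distance and time interval found in a line of text."""
    miles: Optional[int] = None
    months: Optional[int] = None

    distance = DISTANCE_RE.search(text)
    if distance:
        value = int(distance.group(1).replace(",", ""))
        unit = distance.group(2).lower()
        # Assumption: kilometers are converted so the output uses interval_miles.
        miles = round(value / KM_PER_MILE) if unit.startswith("k") else value

    period = TIME_RE.search(text)
    if period:
        count = int(period.group(1))
        months = count * 12 if period.group(2).lower().startswith("year") else count
    elif ANNUAL_RE.search(text):
        months = 12

    return {"interval_miles": miles, "interval_months": months}
```

Fluid grades such as "5W-30" are deliberately not matched by the distance pattern, since the number is not followed by a distance unit.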

### Infrastructure

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Object Storage**     | S3 / MinIO            | Document and result storage          |
| **Database**           | PostgreSQL            | Metadata and results                 |
| **Cache**              | Redis                 | Result caching, queue backend        |
| **Containerization**   | Docker                | Deployment                           |

---

## SYSTEM REQUIREMENTS FILE

### requirements.txt

```
# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6
pydantic>=2.0.0

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# Image Preprocessing
opencv-python-headless>=4.8.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.6.0
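# Pinned below 3.0: the Dockerfile's model pre-download passes 2.x-style arguments (use_gpu, show_log)
paddleocr>=2.8.0,<3.0.0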
google-cloud-vision>=3.7.0

# PDF Processing
PyMuPDF>=1.23.0

# Redis for job queue
redis>=5.0.0

# HTTP client for callbacks
httpx>=0.24.0

# Testing
pytest>=7.4.0
pytest-asyncio>=0.21.0
```

### System Package Requirements (Ubuntu/Debian)

```bash
# Refresh package lists first (run as root or prefix each command with sudo)
apt-get update

# Tesseract OCR (backward compatibility engine)
apt-get install -y tesseract-ocr tesseract-ocr-eng

# PaddlePaddle OpenMP runtime
apt-get install -y libgomp1

# HEIC Support
apt-get install -y libheif1 libheif-dev

# GLib (OpenCV dependency)
apt-get install -y libglib2.0-0

# File type detection
apt-get install -y libmagic1
```

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_PRIMARY_ENGINE` | `paddleocr` | Primary OCR engine (`paddleocr`, `tesseract`) |
| `OCR_CONFIDENCE_THRESHOLD` | `0.6` | Minimum confidence threshold |
| `OCR_FALLBACK_ENGINE` | `none` | Fallback engine (`google_vision`, `none`) |
| `OCR_FALLBACK_THRESHOLD` | `0.6` | Confidence below this triggers fallback |
| `GOOGLE_VISION_KEY_PATH` | `/run/secrets/google-vision-key.json` | Path to Google Vision service account key |
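
One way to read these variables into a typed settings object is sketched below. The variable names and defaults come from the table above; `OcrSettings`, `load_settings`, and the validation behavior are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OcrSettings:
    primary_engine: str
    confidence_threshold: float
    fallback_engine: Optional[str]  # None when the fallback is disabled
    fallback_threshold: float
    google_vision_key_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> OcrSettings:
    """Read the documented variables, applying the documented defaults."""
    primary = env.get("OCR_PRIMARY_ENGINE", "paddleocr")
    if primary not in {"paddleocr", "tesseract"}:
        raise ValueError(f"unknown OCR_PRIMARY_ENGINE: {primary}")

    fallback = env.get("OCR_FALLBACK_ENGINE", "none")
    if fallback not in {"google_vision", "none"}:
        raise ValueError(f"unknown OCR_FALLBACK_ENGINE: {fallback}")

    return OcrSettings(
        primary_engine=primary,
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.6")),
        fallback_engine=None if fallback == "none" else fallback,
        fallback_threshold=float(env.get("OCR_FALLBACK_THRESHOLD", "0.6")),
        google_vision_key_path=env.get("GOOGLE_VISION_KEY_PATH", "/run/secrets/google-vision-key.json"),
    )
```

Failing fast on an unknown engine name at startup is a design choice in this sketch; it surfaces a typo in the environment before the first document is processed.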

---

## DOCKERFILE

```dockerfile
# Primary engine: PaddleOCR PP-OCRv4 (models baked into image)
# Backward compat: Tesseract 5.x (optional, via TesseractEngine)
# Cloud fallback: Google Vision (optional, requires API key at runtime)

FROM python:3.13-slim

# System dependencies
# - tesseract-ocr/eng: Backward-compatible OCR engine
# - libgomp1: OpenMP runtime required by PaddlePaddle
# - libheif1/libheif-dev: HEIF image support (iPhone photos)
# - libglib2.0-0: GLib shared library (OpenCV dependency)
# - libmagic1: File type detection
# - curl: HTTP client used by container health checks
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libgomp1 \
    libheif1 \
    libheif-dev \
    libglib2.0-0 \
    libmagic1 \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download PaddleOCR PP-OCRv4 models during build (not at runtime)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en', use_gpu=False, show_log=False)" \
    && echo "PaddleOCR PP-OCRv4 models downloaded and verified"

COPY . .

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

---

## PROCESSING TIME ESTIMATES

| Document Type            | Size          | Expected Time  | Notes                          |
|--------------------------|---------------|----------------|--------------------------------|
| Single receipt (HEIC)    | 2-5 MB        | 1-3 seconds    | After preprocessing            |
| Single receipt (JPG)     | 500 KB-2 MB   | 0.5-2 seconds  | Direct processing              |
| Owner's manual (PDF)     | 10-50 MB      | 30-120 seconds | 100-300 pages, embedded text   |
| Owner's manual (scanned) | 50-200 MB     | 2-5 minutes    | Requires full OCR              |

---

## SCALING CONSIDERATIONS

```
                    ┌─────────────────────────────────────────┐
                    │            Load Balancer                │
                    │              (nginx)                    │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │   API Server    │      │   API Server    │      │   API Server    │
   │   (FastAPI)     │      │   (FastAPI)     │      │   (FastAPI)     │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
              │                        │                        │
              └────────────────────────┼────────────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │            Redis Queue                  │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
   │  (OCR Heavy)    │      │  (OCR Heavy)    │      │  (OCR Heavy)    │
   │  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
```

### Scaling Strategy

1. **Small files (receipts)**: Process synchronously in API server
2. **Large files (manuals)**: Queue to Celery workers, return job ID
3. **Horizontal scaling**: Add more Celery workers for throughput
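4. **GPU acceleration**: Optional; PaddleOCR supports GPU inference (estimated 5-10x speedup) but needs the `paddlepaddle-gpu` build instead of the CPU-only default image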
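
The first two rules of the strategy above amount to a routing decision made at upload time. A minimal sketch follows; the 5 MB cutoff, the rule that every PDF is queued, and the names `Route` and `choose_route` are assumptions, since the strategy does not say where "small" ends.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the strategy gives no size at which a file counts as "large".
SYNC_MAX_BYTES = 5 * 1024 * 1024


@dataclass(frozen=True)
class Route:
    mode: str    # "sync" or "queued"
    reason: str


def choose_route(mime_type: str, size_bytes: int, sync_max_bytes: int = SYNC_MAX_BYTES) -> Route:
    """Decide whether an upload is processed inline or handed to a worker."""
    if mime_type == "application/pdf":
        # Assumption: manuals arrive as PDFs and can run to hundreds of pages.
        return Route("queued", "pdf documents are processed by workers")
    if size_bytes > sync_max_bytes:
        return Route("queued", "image exceeds the synchronous size limit")
    return Route("sync", "small image, processed in the API server")
```

A queued route would enqueue the task and return a job ID to the client, while a sync route runs the pipeline within the request.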

---

## ERROR HANDLING FLOW

```
┌─────────────────┐
│  File Upload    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ Format Valid?   │──NO─▶│ Return 400: Unsupported format         │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ YES
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ File Corrupt?   │─YES─▶│ Return 422: Unable to process file     │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ OCR Confidence  │─LOW─▶│ Return 200 with warning flag           │
│    < 50%?       │     │ { "warning": "low_confidence" }        │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ OK
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ No Schedule     │─YES─▶│ Return 200 with raw text only          │
│   Found?        │     │ { "maintenance_schedule": null }       │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐
│ Return Full     │
│ Structured Data │
└─────────────────┘
```
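
The flow above can be sketched as a single function that maps pipeline outcomes to a status code and body, checking each condition in diagram order. The function name, its parameters, and the `error` key are assumptions; the status codes, the `warning` value, and the null `maintenance_schedule` come from the diagram.

```python
from typing import Any, Optional

LOW_CONFIDENCE = 0.5  # the "< 50%" branch in the flow above


def build_response(
    format_valid: bool,
    corrupt: bool,
    confidence: float = 0.0,
    raw_text: str = "",
    schedule: Optional[list[dict[str, Any]]] = None,
) -> tuple[int, dict[str, Any]]:
    """Map pipeline outcomes to (HTTP status, JSON body), checked in diagram order."""
    if not format_valid:
        return 400, {"error": "Unsupported format"}
    if corrupt:
        return 422, {"error": "Unable to process file"}

    body: dict[str, Any] = {"raw_text": raw_text, "confidence_score": confidence}
    if confidence < LOW_CONFIDENCE:
        # Low confidence is still a 200: the client gets the text plus a warning flag.
        body["warning"] = "low_confidence"
        return 200, body
    if not schedule:
        # No schedule found: return the raw text with an explicit null schedule.
        body["maintenance_schedule"] = None
        return 200, body
    body["maintenance_schedule"] = schedule
    return 200, body
```

Only the first two branches are client errors; low confidence and a missing schedule both return 200 so the caller can still use the raw text.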

---

## QUICK START

```bash
# 1. Clone and setup
git clone <repo>
cd ocr-pipeline

# 2. Build Docker image
docker build -t ocr-pipeline .

# 3. Start services
docker-compose up -d

# 4. Test endpoint
curl -X POST "http://localhost:8000/api/v1/extract" \
  -F "file=@owners_manual.pdf" \
  -F "document_type=owners_manual"
```

---

## API ENDPOINT DESIGN

```
POST /api/v1/extract
  - Accepts: multipart/form-data
  - Fields: file, document_type (optional)
  - Returns: Structured JSON or job_id for large files

GET /api/v1/jobs/{job_id}
  - Poll for async job status
  - Returns: status, progress, result when complete

POST /api/v1/extract/batch
  - Accepts: multiple files
  - Returns: array of job_ids
```