motovaultpro/docs/ocr-pipeline-tech-stack.md

# Vehicle Owner's Manual & Receipt OCR Pipeline
## Complete Tech Stack & Architecture

---

## SYSTEM FLOW DIAGRAM

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              USER UPLOAD                                         │
│                   (PDF, JPG, PNG, HEIC, TIFF, WEBP)                              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         1. FORMAT DETECTION                                      │
│                           python-magic                                           │
│                                                                                  │
│   • Detect MIME type from file bytes (not extension)                            │
│   • Validate file is an accepted format                                          │
│   • Reject unsupported/malicious files early                                     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                      ┌───────────────┴───────────────┐
                      ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│         PDF DETECTED            │   │        IMAGE DETECTED           │
└─────────────────────────────────┘   │   (JPG, PNG, HEIC, TIFF, WEBP)  │
                │                     └─────────────────────────────────┘
                ▼                                     │
┌─────────────────────────────────┐                   │
│   2a. PDF TEXT LAYER CHECK      │                   │
│         PyMuPDF (fitz)          │                   │
│                                 │                   │
│  • Check if PDF has embedded    │                   │
│    searchable text              │                   │
│  • Count text characters        │                   │
└─────────────────────────────────┘                   │
                │                                     │
        ┌───────┴───────┐                             │
        ▼               ▼                             │
┌──────────────┐ ┌──────────────┐                     │
│  HAS TEXT    │ │  NO TEXT     │                     │
│  (Native)    │ │  (Scanned)   │                     │
└──────────────┘ └──────────────┘                     │
        │               │                             │
        │               ▼                             │
        │       ┌──────────────────────┐              │
        │       │ 2b. RENDER TO IMAGES │              │
        │       │    PyMuPDF @ 300 DPI │              │
        │       │                      │              │
        │       │ • Page-by-page render│              │
        │       │ • Maintain quality   │              │
        │       └──────────────────────┘              │
        │               │                             │
        │               └──────────────┬──────────────┘
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                3. IMAGE NORMALIZATION                    │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ HEIC Input  │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │ pillow-heif │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    │ ImageOps    │    │  Pillow    │ │
        │       │                      └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │   Pillow    │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                 4. PREPROCESSING                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4a. Resolution Normalization                   │   │
        │       │   │      Target: 300 DPI equivalent                 │   │
        │       │   │      Tool: Pillow resize with LANCZOS           │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4b. Deskew (Straighten)                        │   │
        │       │   │      Tool: OpenCV + deskew library              │   │
        │       │   │      Method: Hough transform / projection       │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4c. Denoise                                    │   │
        │       │   │      Tool: OpenCV fastNlMeansDenoising          │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌───────────────────────────────────────────────────┐ │
        │       │   │  4d. Document-Specific Enhancement                │ │
        │       │   │                                                   │ │
        │       │   │   RECEIPTS:              MANUALS:                 │ │
        │       │   │   • Adaptive threshold   • Contrast stretch       │ │
        │       │   │   • Perspective correct  • Sharpen                │ │
        │       │   │   • High contrast B&W    • Keep grayscale         │ │
        │       │   │                                                   │ │
        │       │   │   Tool: OpenCV adaptiveThreshold / CLAHE          │ │
        │       │   └───────────────────────────────────────────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                    5. OCR ENGINE                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5a. Primary OCR: Tesseract 5.x                 │   │
        │       │   │                                                 │   │
        │       │   │  • Engine: LSTM (--oem 1)                       │   │
        │       │   │  • Page segmentation: Auto (--psm 3)            │   │
        │       │   │  • Output: hOCR with word confidence            │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │                 ┌───────────────┐                       │
        │       │                 │  Confidence   │                       │
        │       │                 │    > 80% ?    │                       │
        │       │                 └───────────────┘                       │
        │       │                    │         │                          │
        │       │              YES ──┘         └── NO                     │
        │       │               │                   │                     │
        │       │               │                   ▼                     │
        │       │               │   ┌─────────────────────────────────┐   │
        │       │               │   │  5b. Fallback: PaddleOCR        │   │
        │       │               │   │                                 │   │
        │       │               │   │  • Better for degraded images   │   │
        │       │               │   │  • Better table detection       │   │
        │       │               │   │  • Slower but more accurate     │   │
        │       │               │   └─────────────────────────────────┘   │
        │       │               │                   │                     │
        │       │               ▼                   ▼                     │
        │       │         ┌─────────────────────────────────┐             │
        │       │         │  5c. Result Merging             │             │
        │       │         │  • Merge by bounding box        │             │
        │       │         │  • Keep highest confidence      │             │
        │       │         └─────────────────────────────────┘             │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        └──────────────────────────────┤
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        6. STRUCTURED EXTRACTION                                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6a. Layout Analysis                                                  │     │
│   │      Tool: PaddleOCR Layout / LayoutParser                            │     │
│   │      Detects: Headers, paragraphs, tables, lists, figures             │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                      │                                          │
│                    ┌─────────────────┴─────────────────┐                        │
│                    ▼                                   ▼                        │
│   ┌────────────────────────────────┐   ┌────────────────────────────────┐       │
│   │  6b. Table Extraction          │   │  6c. Text Block Processing     │       │
│   │                                │   │                                │       │
│   │  Tool: img2table or Camelot    │   │  • Group by proximity          │       │
│   │  Output: Pandas DataFrame      │   │  • Identify sections           │       │
│   │                                │   │  • Extract hierarchies         │       │
│   └────────────────────────────────┘   └────────────────────────────────┘       │
│                    │                                   │                        │
│                    └─────────────────┬─────────────────┘                        │
│                                      ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6d. Maintenance Schedule Pattern Matching                            │     │
│   │                                                                       │     │
│   │  Tool: regex + spaCy NER                                              │     │
│   │                                                                       │     │
│   │  Patterns:                                                            │     │
│   │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │     │
│   │  • Time intervals: "every 6 months", "annually"                       │     │
│   │  • Service types: "oil change", "tire rotation", "brake inspection"  │     │
│   │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            7. OUTPUT LAYER                                       │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  Structured JSON Response                                             │     │
│   │                                                                       │     │
│   │  {                                                                    │     │
│   │    "document_type": "owners_manual" | "receipt",                      │     │
│   │    "vehicle": {                                                       │     │
│   │      "make": "Toyota",                                                │     │
│   │      "model": "Camry",                                                │     │
│   │      "year": 2024                                                     │     │
│   │    },                                                                 │     │
│   │    "maintenance_schedule": [                                          │     │
│   │      {                                                                │     │
│   │        "service": "Oil Change",                                       │     │
│   │        "interval_miles": 5000,                                        │     │
│   │        "interval_months": 6,                                          │     │
│   │        "details": "Use 0W-20 synthetic"                               │     │
│   │      }                                                                │     │
│   │    ],                                                                 │     │
│   │    "raw_text": "...",                                                 │     │
│   │    "tables": [...],                                                   │     │
│   │    "confidence_score": 0.94,                                          │     │
│   │    "processing_time_ms": 1250                                         │     │
│   │  }                                                                    │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
```
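
Step 1 of the diagram routes on the file's leading bytes rather than its extension. The pipeline itself uses python-magic for this; the sketch below is only a standard-library stand-in that checks a few well-known signatures to show the idea. The function names `sniff_mime` and `route` and the HEIF brand list are assumptions made for illustration.

```python
from typing import Optional

# Assumption: a small subset of HEIF/HEIC brands, not an exhaustive list.
_HEIF_BRANDS = {b"heic", b"heix", b"mif1", b"msf1"}


def sniff_mime(head: bytes) -> Optional[str]:
    """Return a MIME type for an accepted format, or None if unrecognized."""
    if head.startswith(b"%PDF-"):
        return "application/pdf"
    if head.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if head.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if head.startswith((b"II*\x00", b"MM\x00*")):
        return "image/tiff"
    # WEBP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if head[:4] == b"RIFF" and head[8:12] == b"WEBP":
        return "image/webp"
    # HEIC is an ISO BMFF file: 4 size bytes, "ftyp", then the major brand.
    if head[4:8] == b"ftyp" and head[8:12] in _HEIF_BRANDS:
        return "image/heic"
    return None


def route(head: bytes) -> str:
    """Pick the pipeline branch: 'pdf', 'image', or 'reject'."""
    mime = sniff_mime(head)
    if mime is None:
        return "reject"
    return "pdf" if mime == "application/pdf" else "image"
```

With python-magic installed, `magic.from_buffer(head, mime=True)` would take the place of `sniff_mime` and cover far more formats.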

---

## COMPLETE TECH STACK

### Core Dependencies

| Component              | Tool                  | Version   | Purpose                              |
|------------------------|-----------------------|-----------|--------------------------------------|
| **Runtime**            | Python                | 3.11+     | Primary language                     |
| **API Framework**      | FastAPI               | 0.100+    | REST API with async support          |
| **Task Queue**         | Celery (Redis broker) | 5.3+      | Async processing for large docs      |

### File Handling

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Format Detection**   | python-magic          | MIME type detection from bytes       |
| **HEIC Support**       | pillow-heif           | iPhone HEIC image conversion         |
| **Image Processing**   | Pillow                | General image I/O and manipulation   |
| **PDF Processing**     | PyMuPDF (fitz)        | PDF text extraction & rendering      |
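
Step 2a of the flow diagram decides between the native and scanned paths by counting embedded text characters. A minimal sketch of that decision follows, assuming the per-page text has already been extracted; the function name `has_text_layer` and both thresholds are hypothetical and would need tuning on real manuals.

```python
from typing import Iterable


def has_text_layer(page_texts: Iterable[str],
                   min_chars_per_page: int = 50,    # assumed threshold
                   min_page_ratio: float = 0.5) -> bool:  # assumed threshold
    """Treat a PDF as native when enough pages carry embedded text."""
    # Count non-whitespace characters per page.
    counts = [len("".join(text.split())) for text in page_texts]
    if not counts:
        return False
    pages_with_text = sum(1 for n in counts if n >= min_chars_per_page)
    return pages_with_text / len(counts) >= min_page_ratio
```

With PyMuPDF, the input could come from `[page.get_text() for page in fitz.open(path)]`. Using a page ratio rather than a document total guards against scanned manuals that carry text only on a digitally generated cover page.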

### Image Preprocessing

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Computer Vision**    | OpenCV (cv2)          | Deskew, denoise, threshold           |
| **Deskew**             | deskew                | Specialized document straightening   |
| **Enhancement**        | scikit-image          | Additional image filters             |
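
Step 4a of the flow diagram rescales each image to a 300 DPI equivalent before the OpenCV steps run. The arithmetic can be sketched as below; `size_for_target_dpi` is a hypothetical helper, and the source DPI is assumed to be known or estimated (phone photos often carry no reliable DPI metadata).

```python
from typing import Tuple


def size_for_target_dpi(width_px: int, height_px: int, source_dpi: float,
                        target_dpi: float = 300.0) -> Tuple[int, int]:
    """Return the pixel size that represents the same page at target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = target_dpi / source_dpi
    # Never return a zero-sized dimension for very small inputs.
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))
```

The resulting size could then be passed to Pillow, for example `img.resize(size, Image.Resampling.LANCZOS)`.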

### OCR Engines

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Primary OCR**        | Tesseract 5.x         | Fast, reliable text extraction       |
| **Python Binding**     | pytesseract           | Tesseract Python wrapper             |
| **Fallback OCR**       | PaddleOCR             | Higher accuracy, better tables       |
| **Layout Analysis**    | PaddleOCR / LayoutParser | Document structure detection      |
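
Steps 5a to 5c of the flow diagram gate the fallback engine on confidence and then merge results by bounding box, keeping the higher-confidence word. The sketch below shows one way to do that in plain Python. The `Word` type, the IoU overlap test, and the 0.5 overlap threshold are assumptions; confidences are assumed to be normalized to 0.0-1.0 (Tesseract reports 0-100).

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class Word:
    text: str
    conf: float  # assumed normalized to 0.0-1.0
    box: Box


def needs_fallback(words: List[Word], threshold: float = 0.80) -> bool:
    """True when mean word confidence is not above the 80% gate."""
    mean = sum(w.conf for w in words) / len(words) if words else 0.0
    return mean <= threshold


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def merge(primary: List[Word], fallback: List[Word],
          min_iou: float = 0.5) -> List[Word]:
    """Pair overlapping words and keep whichever has higher confidence."""
    merged: List[Word] = []
    used = set()
    for p in primary:
        best = p
        for i, f in enumerate(fallback):
            if i not in used and iou(p.box, f.box) >= min_iou:
                used.add(i)
                if f.conf > best.conf:
                    best = f
                break
        merged.append(best)
    # Words only the fallback engine found are appended at the end.
    merged.extend(f for i, f in enumerate(fallback) if i not in used)
    return merged
```

A production merge would also need to re-sort words into reading order, which this sketch leaves out.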

### Data Extraction

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Table Extraction**   | img2table             | Image-based table extraction         |
| **PDF Tables**         | Camelot               | Native PDF table extraction          |
| **NLP**                | spaCy                 | Entity extraction, pattern matching  |
| **Data Handling**      | pandas                | Table/dataframe manipulation         |
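
Step 6d of the flow diagram matches maintenance-schedule patterns such as "every 5,000 miles" or "annually". The sketch below covers only the regex half (no spaCy NER) and only mile-based intervals; the pattern set, the service list, and the function name `parse_schedule_line` are illustrative assumptions, not the pipeline's actual rules.

```python
import re
from typing import Dict, Optional

# Illustrative patterns; a real manual needs a much larger vocabulary.
_MILES = re.compile(r"\b(\d{1,3}(?:,\d{3})+|\d+)\s*(?:miles|mi)\b", re.IGNORECASE)
_MONTHS = re.compile(r"\bevery\s+(\d+)\s+months?\b", re.IGNORECASE)
_YEARLY = re.compile(r"\b(?:annually|yearly)\b", re.IGNORECASE)
_SERVICES = ("oil change", "tire rotation", "brake inspection")


def parse_schedule_line(line: str) -> Optional[Dict[str, object]]:
    """Extract one schedule entry from a line, or None if no service is named."""
    lowered = line.lower()
    service = next((s for s in _SERVICES if s in lowered), None)
    if service is None:
        return None

    miles = _MILES.search(line)
    months = _MONTHS.search(line)
    interval_months: Optional[int] = None
    if months:
        interval_months = int(months.group(1))
    elif _YEARLY.search(line):
        interval_months = 12

    return {
        "service": service.title(),
        "interval_miles": int(miles.group(1).replace(",", "")) if miles else None,
        "interval_months": interval_months,
    }
```

The returned keys mirror the `maintenance_schedule` entries in the structured JSON response; kilometre intervals would need a unit field or a conversion that this sketch does not attempt.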

### Infrastructure

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Object Storage**     | S3 / MinIO            | Document and result storage          |
| **Database**           | PostgreSQL            | Metadata and results                 |
| **Cache**              | Redis                 | Result caching, queue backend        |
| **Containerization**   | Docker                | Deployment                           |

---

## SYSTEM REQUIREMENTS FILE

### requirements.txt

```
# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6

# Task Queue
celery>=5.3.0
redis>=4.6.0

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# PDF Processing
pymupdf>=1.23.0

# Image Preprocessing
opencv-python-headless>=4.8.0
deskew>=1.4.0
scikit-image>=0.21.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.5.0
paddleocr>=2.7.0

# Table Extraction
img2table>=1.2.0
camelot-py[cv]>=0.11.0

# NLP & Data
spacy>=3.6.0
pandas>=2.0.0

# Storage & Database
boto3>=1.28.0
psycopg2-binary>=2.9.0
sqlalchemy>=2.0.0
```

### System Package Requirements (Ubuntu/Debian)

```bash
# Tesseract OCR
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev

# HEIC Support
apt-get install libheif-examples libheif-dev

# OpenCV dependencies
apt-get install libgl1 libglib2.0-0

# PDF rendering dependencies
apt-get install libmupdf-dev mupdf-tools

# File type detection (libmagic, used by python-magic)
apt-get install libmagic1

# Camelot dependencies
apt-get install ghostscript python3-tk
```

---

## DOCKERFILE

```dockerfile
FROM python:3.11-slim

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libtesseract-dev \
    libheif-examples \
    libheif-dev \
    libgl1 \
    libglib2.0-0 \
    libmagic1 \
    ghostscript \
    poppler-utils \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download spaCy model
RUN python -m spacy download en_core_web_sm

# Download PaddleOCR models (cached in image)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"

COPY . .

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

---

## PROCESSING TIME ESTIMATES

| Document Type                | Size          | Expected Time  | Notes                          |
|------------------------------|---------------|----------------|--------------------------------|
| Single receipt (HEIC)        | 2-5 MB        | 1-3 seconds    | After preprocessing            |
| Single receipt (JPG)         | 500 KB-2 MB   | 0.5-2 seconds  | Direct processing              |
| Owner's manual (native PDF)  | 10-50 MB      | 30-120 seconds | 100-300 pages, text layer used |
| Owner's manual (scanned PDF) | 50-200 MB     | 2-5 minutes    | Requires full OCR              |

---

## SCALING CONSIDERATIONS

```
                    ┌─────────────────────────────────────────┐
                    │            Load Balancer                │
                    │              (nginx)                    │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │   API Server    │      │   API Server    │      │   API Server    │
   │   (FastAPI)     │      │   (FastAPI)     │      │   (FastAPI)     │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
              │                        │                        │
              └────────────────────────┼────────────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │            Redis Queue                  │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
   │  (OCR Heavy)    │      │  (OCR Heavy)    │      │  (OCR Heavy)    │
   │  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
```

### Scaling Strategy

1. **Small files (receipts)**: Process synchronously in API server
2. **Large files (manuals)**: Queue to Celery workers, return job ID
3. **Horizontal scaling**: Add more Celery workers for throughput
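4. **GPU acceleration**: PaddleOCR can run on GPU workers (install `paddlepaddle-gpu` instead of `paddlepaddle`); expect roughly a 5-10x speedup, depending on hardware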
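
The first two rules of the strategy can be sketched as a small dispatch function. The 5 MiB and 3-page limits and the names `Mode` and `choose_mode` are assumptions for illustration; the actual cut-offs should come from measured latencies.

```python
from enum import Enum


class Mode(Enum):
    SYNC = "sync"    # process inside the API request
    ASYNC = "async"  # enqueue for a Celery worker and return a job ID


def choose_mode(size_bytes: int, page_count: int = 1,
                max_sync_bytes: int = 5 * 1024 * 1024,  # assumed limit
                max_sync_pages: int = 3) -> Mode:        # assumed limit
    """Keep small, short documents in-request; queue everything else."""
    if size_bytes <= max_sync_bytes and page_count <= max_sync_pages:
        return Mode.SYNC
    return Mode.ASYNC
```

For example, a 2 MB single-page receipt stays synchronous, while a 200-page manual is queued regardless of its file size.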

---

## ERROR HANDLING FLOW

```
┌─────────────────┐
│  File Upload    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌────────────────────────────────────────┐
│ Format Valid?   │──NO─▶│ Return 400: Unsupported format         │
└────────┬────────┘      └────────────────────────────────────────┘
         │ YES
         ▼
┌─────────────────┐      ┌────────────────────────────────────────┐
│ File Corrupt?   │─YES─▶│ Return 422: Unable to process file     │
└────────┬────────┘      └────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐      ┌────────────────────────────────────────┐
│ OCR Confidence  │─LOW─▶│ Return 200 with warning flag           │
│    < 50%?       │      │ { "warning": "low_confidence" }        │
└────────┬────────┘      └────────────────────────────────────────┘
         │ OK
         ▼
┌─────────────────┐      ┌────────────────────────────────────────┐
│ No Schedule     │─YES─▶│ Return 200 with raw text only          │
│   Found?        │      │ { "maintenance_schedule": null }       │
└────────┬────────┘      └────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐
│ Return Full     │
│ Structured Data │
└─────────────────┘
```
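
The decision chain above maps directly onto a single function that returns an HTTP status and a response body. This is a sketch only: `build_response`, its parameters, and the `"error"` key are hypothetical, while the status codes, the `low_confidence` warning, and the null schedule follow the diagram.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_response(format_valid: bool, corrupt: bool, confidence: float,
                   schedule: Optional[List[Dict[str, Any]]],
                   raw_text: str) -> Tuple[int, Dict[str, Any]]:
    """Walk the error-handling flow top to bottom; first match wins."""
    if not format_valid:
        return 400, {"error": "Unsupported format"}
    if corrupt:
        return 422, {"error": "Unable to process file"}

    body: Dict[str, Any] = {"raw_text": raw_text, "confidence_score": confidence}
    if confidence < 0.50:
        # Low-confidence results are still returned, but flagged.
        body["warning"] = "low_confidence"
        return 200, body
    if not schedule:
        # No schedule found: raw text only.
        body["maintenance_schedule"] = None
        return 200, body

    body["maintenance_schedule"] = schedule
    return 200, body
```

Returning 200 for the low-confidence and no-schedule cases keeps them distinct from client errors: the upload was valid, and the caller decides whether the partial result is usable.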

---

## QUICK START

```bash
# 1. Clone and setup
git clone <repo>
cd ocr-pipeline

# 2. Build Docker image
docker build -t ocr-pipeline .

# 3. Start services
docker-compose up -d

# 4. Test endpoint
curl -X POST "http://localhost:8000/api/v1/extract" \
  -F "file=@owners_manual.pdf" \
  -F "document_type=owners_manual"
```

---

## API ENDPOINT DESIGN

```
POST /api/v1/extract
  - Accepts: multipart/form-data
  - Fields: file, document_type (optional: owners_manual | receipt)
  - Returns: Structured JSON or job_id for large files

GET /api/v1/jobs/{job_id}
  - Poll for async job status
  - Returns: status, progress, result when complete

POST /api/v1/extract/batch
  - Accepts: multiple files
  - Returns: array of job_ids
```
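
A client that receives a `job_id` has to poll `GET /api/v1/jobs/{job_id}` until the job finishes. The loop below sketches that, with the HTTP call injected as a function so the logic stays testable. The status strings `"completed"` and `"failed"`, the default interval and timeout, and the name `wait_for_job` are assumptions; the endpoint design above only specifies that status, progress, and result are returned.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(fetch_status: Callable[[str], Dict[str, Any]], job_id: str,
                 interval_s: float = 1.0, timeout_s: float = 300.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status(job_id)
        # Assumed terminal states; adjust to the real status vocabulary.
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout_s}s")
        sleep(interval_s)
```

In practice `fetch_status` would wrap an HTTP GET against the jobs endpoint and return the decoded JSON body.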