From a396fc0f381f5360a93be45768da11ff421a7dd1 Mon Sep 17 00:00:00 2001
From: Eric Gullickson <16152721+ericgullickson@users.noreply.github.com>
Date: Sun, 4 Jan 2026 13:35:38 -0600
Subject: [PATCH] feat: OCR Pipeline tech stack file

---
 docs/ocr-pipeline-tech-stack.md | 519 ++++++++++++++++++++++++++++++++
 1 file changed, 519 insertions(+)
 create mode 100644 docs/ocr-pipeline-tech-stack.md

diff --git a/docs/ocr-pipeline-tech-stack.md b/docs/ocr-pipeline-tech-stack.md
new file mode 100644
index 0000000..5160a9a
--- /dev/null
+++ b/docs/ocr-pipeline-tech-stack.md
@@ -0,0 +1,519 @@
+# Vehicle Owner's Manual & Receipt OCR Pipeline
+## Complete Tech Stack & Architecture
+
+---
+
+## SYSTEM FLOW DIAGRAM
+
+```
+┌─────────────────────────────────────────────────────────────────────────────────┐
+│                              USER UPLOAD                                         │
+│                   (PDF, JPG, PNG, HEIC, TIFF, WEBP)                              │
+└─────────────────────────────────────────────────────────────────────────────────┘
+                                      │
+                                      ▼
+┌─────────────────────────────────────────────────────────────────────────────────┐
+│                         1. FORMAT DETECTION                                      │
+│                           python-magic                                           │
+│                                                                                  │
+│   • Detect MIME type from file bytes (not extension)                             │
+│   • Validate file is an accepted format                                          │
+│   • Reject unsupported/malicious files early                                     │
+└─────────────────────────────────────────────────────────────────────────────────┘
+                                      │
+                      ┌───────────────┴───────────────┐
+                      ▼                               ▼
+┌─────────────────────────────────┐   ┌─────────────────────────────────┐
+│         PDF DETECTED            │   │        IMAGE DETECTED           │
+└─────────────────────────────────┘   │   (JPG, PNG, HEIC, TIFF, WEBP)  │
+                │                     └─────────────────────────────────┘
+                ▼                                     │
+┌─────────────────────────────────┐                   │
+│   2a. PDF TEXT LAYER CHECK      │                   │
+│         PyMuPDF (fitz)          │                   │
+│                                 │                   │
+│  • Check if PDF has embedded    │                   │
+│    searchable text              │                   │
+│  • Count text characters        │                   │
+└─────────────────────────────────┘                   │
+                │                                     │
+        ┌───────┴───────┐                             │
+        ▼               ▼                             │
+┌──────────────┐ ┌──────────────┐                     │
+│  HAS TEXT    │ │  NO TEXT     │                     │
+│  (Native)    │ │  (Scanned)   │                     │
+└──────────────┘ └──────────────┘                     │
+        │               │                             │
+        │               ▼                             │
+        │       ┌──────────────────────┐              │
+        │       │ 2b. RENDER TO IMAGES │              │
+        │       │    PyMuPDF @ 300 DPI │              │
+        │       │                      │              │
+        │       │ • Page-by-page render│              │
+        │       │ • Maintain quality   │              │
+        │       └──────────────────────┘              │
+        │               │                             │
+        │               └──────────────┬──────────────┘
+        │                              ▼
+        │       ┌─────────────────────────────────────────────────────────┐
+        │       │                3. IMAGE NORMALIZATION                    │
+        │       ├─────────────────────────────────────────────────────────┤
+        │       │                                                         │
+        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
+        │       │   │ HEIC Input  │    │ EXIF Check  │    │ Convert to │ │
+        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
+        │       │   │ pillow-heif │    │ Fix rotation│    │            │ │
+        │       │   └─────────────┘    │ ImageOps    │    │  Pillow    │ │
+        │       │                      └─────────────┘    └────────────┘ │
+        │       │                                                         │
+        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
+        │       │   │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │ │
+        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
+        │       │   │   Pillow    │    │ Fix rotation│    │            │ │
+        │       │   └─────────────┘    └─────────────┘    └────────────┘ │
+        │       │                                                         │
+        │       └─────────────────────────────────────────────────────────┘
+        │                              │
+        │                              ▼
+        │       ┌─────────────────────────────────────────────────────────┐
+        │       │                 4. PREPROCESSING                         │
+        │       ├─────────────────────────────────────────────────────────┤
+        │       │                                                         │
+        │       │   ┌─────────────────────────────────────────────────┐   │
+        │       │   │  4a. Resolution Normalization                   │   │
+        │       │   │      Target: 300 DPI equivalent                 │   │
+        │       │   │      Tool: Pillow resize with LANCZOS           │   │
+        │       │   └─────────────────────────────────────────────────┘   │
+        │       │                         │                               │
+        │       │                         ▼                               │
+        │       │   ┌─────────────────────────────────────────────────┐   │
+        │       │   │  4b. Deskew (Straighten)                        │   │
+        │       │   │      Tool: OpenCV + deskew library              │   │
+        │       │   │      Method: Hough transform / projection       │   │
+        │       │   └─────────────────────────────────────────────────┘   │
+        │       │                         │                               │
+        │       │                         ▼                               │
+        │       │   ┌─────────────────────────────────────────────────┐   │
+        │       │   │  4c. Denoise                                    │   │
+        │       │   │      Tool: OpenCV fastNlMeansDenoising          │   │
+        │       │   └─────────────────────────────────────────────────┘   │
+        │       │                         │                               │
+        │       │                         ▼                               │
+        │       │   ┌───────────────────────────────────────────────────┐ │
+        │       │   │  4d. Document-Specific Enhancement                │ │
+        │       │   │                                                   │ │
+        │       │   │   RECEIPTS:              MANUALS:                 │ │
+        │       │   │   • Adaptive threshold   • Contrast stretch       │ │
+        │       │   │   • Perspective correct  • Sharpen                │ │
+        │       │   │   • High contrast B&W    • Keep grayscale         │ │
+        │       │   │                                                   │ │
+        │       │   │   Tool: OpenCV adaptiveThreshold / CLAHE          │ │
+        │       │   └───────────────────────────────────────────────────┘ │
+        │       │                                                         │
+        │       └─────────────────────────────────────────────────────────┘
+        │                              │
+        │                              ▼
+        │       ┌─────────────────────────────────────────────────────────┐
+        │       │                    5. OCR ENGINE                         │
+        │       ├─────────────────────────────────────────────────────────┤
+        │       │                                                         │
+        │       │   ┌─────────────────────────────────────────────────┐   │
+        │       │   │  5a. Primary OCR: Tesseract 5.x                 │   │
+        │       │   │                                                 │   │
+        │       │   │  • Engine: LSTM (--oem 1)                       │   │
+        │       │   │  • Page segmentation: Auto (--psm 3)            │   │
+        │       │   │  • Output: hOCR with word confidence            │   │
+        │       │   └─────────────────────────────────────────────────┘   │
+        │       │                         │                               │
+        │       │                         ▼                               │
+        │       │                 ┌───────────────┐                       │
+        │       │                 │  Confidence   │                       │
+        │       │                 │    > 80% ?    │                       │
+        │       │                 └───────────────┘                       │
+        │       │                    │         │                          │
+        │       │              YES ──┘         └── NO                     │
+        │       │               │                   │                     │
+        │       │               │                   ▼                     │
+        │       │               │   ┌─────────────────────────────────┐   │
+        │       │               │   │  5b. Fallback: PaddleOCR        │   │
+        │       │               │   │                                 │   │
+        │       │               │   │  • Better for degraded images   │   │
+        │       │               │   │  • Better table detection       │   │
+        │       │               │   │  • Slower but more accurate     │   │
+        │       │               │   └─────────────────────────────────┘   │
+        │       │               │                   │                     │
+        │       │               ▼                   ▼                     │
+        │       │         ┌─────────────────────────────────┐             │
+        │       │         │  5c. Result Merging             │             │
+        │       │         │  • Merge by bounding box        │             │
+        │       │         │  • Keep highest confidence      │             │
+        │       │         └─────────────────────────────────┘             │
+        │       │                                                         │
+        │       └─────────────────────────────────────────────────────────┘
+        │                              │
+        └──────────────────────────────┤
+                                       ▼
+┌─────────────────────────────────────────────────────────────────────────────────┐
+│                        6. STRUCTURED EXTRACTION                                  │
+├─────────────────────────────────────────────────────────────────────────────────┤
+│                                                                                  │
+│   ┌───────────────────────────────────────────────────────────────────────┐     │
+│   │  6a. Layout Analysis                                                  │     │
+│   │      Tool: PaddleOCR Layout / LayoutParser                            │     │
+│   │      Detects: Headers, paragraphs, tables, lists, figures             │     │
+│   └───────────────────────────────────────────────────────────────────────┘     │
+│                                      │                                          │
+│                    ┌─────────────────┴─────────────────┐                        │
+│                    ▼                                   ▼                        │
+│   ┌────────────────────────────────┐   ┌────────────────────────────────┐       │
+│   │  6b. Table Extraction          │   │  6c. Text Block Processing     │       │
+│   │                                │   │                                │       │
+│   │  Tool: img2table or Camelot    │   │  • Group by proximity          │       │
+│   │  Output: Pandas DataFrame      │   │  • Identify sections           │       │
+│   │                                │   │  • Extract hierarchies         │       │
+│   └────────────────────────────────┘   └────────────────────────────────┘       │
+│                    │                                   │                        │
+│                    └─────────────────┬─────────────────┘                        │
+│                                      ▼                                          │
+│   ┌───────────────────────────────────────────────────────────────────────┐     │
+│   │  6d. Maintenance Schedule Pattern Matching                            │     │
+│   │                                                                       │     │
+│   │  Tool: regex + spaCy NER                                              │     │
+│   │                                                                       │     │
+│   │  Patterns:                                                            │     │
+│   │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │     │
+│   │  • Time intervals: "every 6 months", "annually"                       │     │
+│   │  • Service types: "oil change", "tire rotation", "brake inspection"   │     │
+│   │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │     │
+│   │                                                                       │     │
+│   └───────────────────────────────────────────────────────────────────────┘     │
+│                                                                                  │
+└─────────────────────────────────────────────────────────────────────────────────┘
+                                       │
+                                       ▼
+┌─────────────────────────────────────────────────────────────────────────────────┐
+│                            7. OUTPUT LAYER                                       │
+├─────────────────────────────────────────────────────────────────────────────────┤
+│                                                                                  │
+│   ┌───────────────────────────────────────────────────────────────────────┐     │
+│   │  Structured JSON Response                                             │     │
+│   │                                                                       │     │
+│   │  {                                                                    │     │
+│   │    "document_type": "owners_manual" | "receipt",                      │     │
+│   │    "vehicle": {                                                       │     │
+│   │      "make": "Toyota",                                                │     │
+│   │      "model": "Camry",                                                │     │
+│   │      "year": 2024                                                     │     │
+│   │    },                                                                 │     │
+│   │    "maintenance_schedule": [                                          │     │
+│   │      {                                                                │     │
+│   │        "service": "Oil Change",                                       │     │
+│   │        "interval_miles": 5000,                                        │     │
+│   │        "interval_months": 6,                                          │     │
+│   │        "details": "Use 0W-20 synthetic"                               │     │
+│   │      }                                                                │     │
+│   │    ],                                                                 │     │
+│   │    "raw_text": "...",                                                 │     │
+│   │    "tables": [...],                                                   │     │
+│   │    "confidence_score": 0.94,                                          │     │
+│   │    "processing_time_ms": 1250                                         │     │
+│   │  }                                                                    │     │
+│   │                                                                       │     │
+│   └───────────────────────────────────────────────────────────────────────┘     │
+│                                                                                  │
+└─────────────────────────────────────────────────────────────────────────────────┘
+```
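+
+Steps 5a-5c in the diagram above can be sketched as follows. This is a minimal sketch, not the pipeline's implementation: `Word`, `ocr_with_fallback`, and `merge_by_box` are hypothetical names, the two engines are passed in as plain callables so that no Tesseract or PaddleOCR call signature is assumed, and confidences are assumed to be normalized to a 0-100 scale before merging. The 0.5 overlap cutoff is also an assumption.
+
+```python
+from dataclasses import dataclass
+from typing import Callable, List, Tuple
+
+Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
+
+
+@dataclass
+class Word:
+    text: str
+    confidence: float  # assumed normalized to 0-100 for both engines
+    box: Box
+
+
+OcrEngine = Callable[[bytes], List[Word]]  # hypothetical engine wrapper
+
+
+def mean_confidence(words: List[Word]) -> float:
+    return sum(w.confidence for w in words) / len(words) if words else 0.0
+
+
+def iou(a: Box, b: Box) -> float:
+    """Intersection-over-union of two boxes."""
+    width = max(0, min(a[2], b[2]) - max(a[0], b[0]))
+    height = max(0, min(a[3], b[3]) - max(a[1], b[1]))
+    inter = width * height
+    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
+    return inter / union if union else 0.0
+
+
+def merge_by_box(primary: List[Word], fallback: List[Word],
+                 min_overlap: float = 0.5) -> List[Word]:
+    """Pair overlapping words and keep the more confident one of each pair."""
+    merged, unmatched = [], list(fallback)
+    for word in primary:
+        match = max(unmatched, key=lambda other: iou(word.box, other.box), default=None)
+        if match is not None and iou(word.box, match.box) >= min_overlap:
+            unmatched.remove(match)
+            merged.append(max(word, match, key=lambda w: w.confidence))
+        else:
+            merged.append(word)
+    return merged + unmatched  # words only the fallback engine found
+
+
+def ocr_with_fallback(image: bytes, primary: OcrEngine, fallback: OcrEngine,
+                      threshold: float = 80.0) -> List[Word]:
+    words = primary(image)
+    if mean_confidence(words) > threshold:
+        return words  # 5a was good enough, skip the slower engine
+    return merge_by_box(words, fallback(image))
+```
+
+Using the page-level mean as the trigger is one possible reading of the "Confidence > 80%" check; a per-region trigger would follow the same shape.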
+
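+The regex half of step 6d might look like the sketch below. It covers only mileage and month intervals for the three service types named in the diagram; the spaCy NER pass, kilometre intervals, and fluid types are left out. `extract_schedule` and the pattern names are hypothetical, and the output keys mirror the JSON shown in step 7.
+
+```python
+import re
+
+# "every 5,000 miles" or "every 7500 miles"
+MILES = re.compile(r"every\s+(\d{1,3}(?:,\d{3})+|\d+)\s*miles\b", re.IGNORECASE)
+# "every 6 months"
+MONTHS = re.compile(r"every\s+(\d+)\s+months?\b", re.IGNORECASE)
+SERVICES = ("oil change", "tire rotation", "brake inspection")
+
+
+def extract_schedule(text: str) -> list:
+    """Return one entry per sentence that names a known service."""
+    entries = []
+    for sentence in re.split(r"(?<=[.!?])\s+|\n+", text):
+        lowered = sentence.lower()
+        service = next((s for s in SERVICES if s in lowered), None)
+        if service is None:
+            continue
+        miles = MILES.search(sentence)
+        months = MONTHS.search(sentence)
+        entries.append({
+            "service": service.title(),
+            "interval_miles": int(miles.group(1).replace(",", "")) if miles else None,
+            "interval_months": int(months.group(1)) if months else None,
+        })
+    return entries
+```
+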
+---
+
+## COMPLETE TECH STACK
+
+### Core Dependencies
+
+| Component              | Tool                  | Version   | Purpose                              |
+|------------------------|-----------------------|-----------|--------------------------------------|
+| **Runtime**            | Python                | 3.11+     | Primary language                     |
+| **API Framework**      | FastAPI               | 0.100+    | REST API with async support          |
+| **Task Queue**         | Celery (Redis broker) | 5.3+      | Async processing for large docs      |
+
+### File Handling
+
+| Component              | Tool                  | Purpose                              |
+|------------------------|-----------------------|--------------------------------------|
+| **Format Detection**   | python-magic          | MIME type detection from bytes       |
+| **HEIC Support**       | pillow-heif           | iPhone HEIC image conversion         |
+| **Image Processing**   | Pillow                | General image I/O and manipulation   |
+| **PDF Processing**     | PyMuPDF (fitz)        | PDF text extraction & rendering      |
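+
+The text-layer check in step 2a reduces to a per-page character count. Below is a minimal sketch of that decision, assuming the counts have already been collected (for example from the length of the string returned by PyMuPDF's `page.get_text()`). `classify_pdf` and both thresholds are hypothetical and would need tuning against real manuals.
+
+```python
+from typing import Sequence
+
+
+def classify_pdf(chars_per_page: Sequence[int], min_chars: int = 50,
+                 min_text_ratio: float = 0.8) -> str:
+    """Return "native" when most pages carry a usable text layer, else "scanned"."""
+    if not chars_per_page:
+        return "scanned"  # nothing to extract, so fall through to OCR
+    pages_with_text = sum(1 for count in chars_per_page if count >= min_chars)
+    ratio = pages_with_text / len(chars_per_page)
+    return "native" if ratio >= min_text_ratio else "scanned"
+```
+
+A ratio is used rather than a single total so that a scanned manual with one typed cover page is still routed to OCR.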
+
+### Image Preprocessing
+
+| Component              | Tool                  | Purpose                              |
+|------------------------|-----------------------|--------------------------------------|
+| **Computer Vision**    | OpenCV (cv2)          | Deskew, denoise, threshold           |
+| **Deskew**             | deskew                | Specialized document straightening   |
+| **Enhancement**        | scikit-image          | Additional image filters             |
+
+### OCR Engines
+
+| Component              | Tool                  | Purpose                              |
+|------------------------|-----------------------|--------------------------------------|
+| **Primary OCR**        | Tesseract 5.x         | Fast, reliable text extraction       |
+| **Python Binding**     | pytesseract           | Tesseract Python wrapper             |
+| **Fallback OCR**       | PaddleOCR             | Higher accuracy, better tables       |
+| **Layout Analysis**    | PaddleOCR / LayoutParser | Document structure detection      |
+
+### Data Extraction
+
+| Component              | Tool                  | Purpose                              |
+|------------------------|-----------------------|--------------------------------------|
+| **Table Extraction**   | img2table             | Image-based table extraction         |
+| **PDF Tables**         | Camelot               | Native PDF table extraction          |
+| **NLP**                | spaCy                 | Entity extraction, pattern matching  |
+| **Data Handling**      | pandas                | Table/dataframe manipulation         |
+
+### Infrastructure
+
+| Component              | Tool                  | Purpose                              |
+|------------------------|-----------------------|--------------------------------------|
+| **Object Storage**     | S3 / MinIO            | Document and result storage          |
+| **Database**           | PostgreSQL            | Metadata and results                 |
+| **Cache**              | Redis                 | Result caching, queue backend        |
+| **Containerization**   | Docker                | Deployment                           |
+
+---
+
+## SYSTEM REQUIREMENTS FILE
+
+### requirements.txt
+
+```
+# API Framework
+fastapi>=0.100.0
+uvicorn[standard]>=0.23.0
+python-multipart>=0.0.6
+
+# Task Queue
+celery>=5.3.0
+redis>=4.6.0
+
+# File Detection & Handling
+python-magic>=0.4.27
+pillow>=10.0.0
+pillow-heif>=0.13.0
+
+# PDF Processing
+pymupdf>=1.23.0
+
+# Image Preprocessing
+opencv-python-headless>=4.8.0
+deskew>=1.4.0
+scikit-image>=0.21.0
+numpy>=1.24.0
+
+# OCR Engines
+pytesseract>=0.3.10
+paddlepaddle>=2.5.0
+paddleocr>=2.7.0
+
+# Table Extraction
+img2table>=1.2.0
+camelot-py[cv]>=0.11.0
+
+# NLP & Data
+spacy>=3.6.0
+pandas>=2.0.0
+
+# Storage & Database
+boto3>=1.28.0
+psycopg2-binary>=2.9.0
+sqlalchemy>=2.0.0
+```
+
+### System Package Requirements (Ubuntu/Debian)
+
+```bash
+# Tesseract OCR
+apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev
+
+# HEIC Support
+apt-get install libheif-examples libheif-dev
+
+# OpenCV dependencies
+apt-get install libgl1 libglib2.0-0
+
+# PDF rendering dependencies
+apt-get install libmupdf-dev mupdf-tools
+
+# File type detection (libmagic, used by python-magic)
+apt-get install libmagic1
+
+# Camelot dependencies
+apt-get install ghostscript python3-tk
+```
+
+---
+
+## DOCKERFILE
+
+```dockerfile
+FROM python:3.11-slim
+
+# System dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    tesseract-ocr \
+    tesseract-ocr-eng \
+    libtesseract-dev \
+    libheif-examples \
+    libheif-dev \
+    libgl1 \
+    libglib2.0-0 \
+    libmagic1 \
+    ghostscript \
+    poppler-utils \
+    && rm -rf /var/lib/apt/lists/*
+
+# Python dependencies
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Download spaCy model
+RUN python -m spacy download en_core_web_sm
+
+# Download PaddleOCR models (cached in image)
+RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"
+
+COPY . .
+
+EXPOSE 8000
+CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
+```
+
+---
+
+## PROCESSING TIME ESTIMATES
+
+| Document Type            | Size          | Expected Time  | Notes                          |
+|--------------------------|---------------|----------------|--------------------------------|
+| Single receipt (HEIC)    | 2-5 MB        | 1-3 seconds    | After preprocessing            |
+| Single receipt (JPG)     | 500 KB-2 MB   | 0.5-2 seconds  | Direct processing              |
+| Owner's manual (PDF)     | 10-50 MB      | 30-120 seconds | 100-300 pages                  |
+| Owner's manual (scanned) | 50-200 MB     | 2-5 minutes    | Requires full OCR              |
+
+---
+
+## SCALING CONSIDERATIONS
+
+```
+                    ┌─────────────────────────────────────────┐
+                    │            Load Balancer                │
+                    │              (nginx)                    │
+                    └─────────────────────────────────────────┘
+                                       │
+              ┌────────────────────────┼────────────────────────┐
+              ▼                        ▼                        ▼
+   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
+   │   API Server    │      │   API Server    │      │   API Server    │
+   │   (FastAPI)     │      │   (FastAPI)     │      │   (FastAPI)     │
+   └─────────────────┘      └─────────────────┘      └─────────────────┘
+              │                        │                        │
+              └────────────────────────┼────────────────────────┘
+                                       │
+                                       ▼
+                    ┌─────────────────────────────────────────┐
+                    │            Redis Queue                  │
+                    └─────────────────────────────────────────┘
+                                       │
+              ┌────────────────────────┼────────────────────────┐
+              ▼                        ▼                        ▼
+   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
+   │  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
+   │  (OCR Heavy)    │      │  (OCR Heavy)    │      │  (OCR Heavy)    │
+   │  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
+   └─────────────────┘      └─────────────────┘      └─────────────────┘
+```
+
+### Scaling Strategy
+
+1. **Small files (receipts)**: Process synchronously in API server
+2. **Large files (manuals)**: Queue to Celery workers, return job ID
+3. **Horizontal scaling**: Add more Celery workers for throughput
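+4. **GPU acceleration**: PaddleOCR can run on a GPU (this needs the `paddlepaddle-gpu` package rather than the CPU-only `paddlepaddle` listed in requirements.txt); the 5-10x speedup is an estimate and should be benchmarked on real documents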
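+
+Items 1 and 2 amount to a routing decision at upload time. The sketch below shows one way to express it; `dispatch`, `Dispatch`, and both limits are hypothetical, and the job ID is generated locally here only to keep the example self-contained (in the real service it would come from the queued Celery task).
+
+```python
+import uuid
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class Dispatch:
+    mode: str  # "sync" or "queued"
+    job_id: Optional[str] = None
+
+
+def dispatch(size_bytes: int, page_count: int,
+             sync_max_bytes: int = 5 * 1024 * 1024,  # assumed limit
+             sync_max_pages: int = 2) -> Dispatch:   # assumed limit
+    """Process small uploads inline; hand everything else to a worker."""
+    if size_bytes <= sync_max_bytes and page_count <= sync_max_pages:
+        return Dispatch("sync")
+    return Dispatch("queued", job_id=uuid.uuid4().hex)
+```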
+
+---
+
+## ERROR HANDLING FLOW
+
+```
+┌─────────────────┐
+│  File Upload    │
+└────────┬────────┘
+         │
+         ▼
+┌─────────────────┐     ┌─────────────────────────────────────────┐
+│ Format Valid?   │──NO─▶│ Return 400: Unsupported format         │
+└────────┬────────┘     └─────────────────────────────────────────┘
+         │ YES
+         ▼
+┌─────────────────┐     ┌─────────────────────────────────────────┐
+│ File Corrupt?   │─YES─▶│ Return 422: Unable to process file     │
+└────────┬────────┘     └─────────────────────────────────────────┘
+         │ NO
+         ▼
+┌─────────────────┐     ┌─────────────────────────────────────────┐
+│ OCR Confidence  │─LOW─▶│ Return 200 with warning flag           │
+│    < 50%?       │     │ { "warning": "low_confidence" }        │
+└────────┬────────┘     └─────────────────────────────────────────┘
+         │ OK
+         ▼
+┌─────────────────┐     ┌─────────────────────────────────────────┐
+│ No Schedule     │─YES─▶│ Return 200 with raw text only          │
+│   Found?        │     │ { "maintenance_schedule": null }       │
+└────────┬────────┘     └─────────────────────────────────────────┘
+         │ NO
+         ▼
+┌─────────────────┐
+│ Return Full     │
+│ Structured Data │
+└─────────────────┘
+```
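+
+The same decision chain, written as a function that returns a status code and a response body. This is a sketch under assumptions: `build_response` is a hypothetical name, the `error` key is not specified anywhere in this document, and the 50% cutoff is expressed on the 0-1 scale used by `confidence_score`.
+
+```python
+from typing import Any, Dict, List, Optional, Tuple
+
+LOW_CONFIDENCE = 0.5  # the "< 50%" branch of the flow
+
+
+def build_response(format_ok: bool, corrupt: bool, confidence: float,
+                   schedule: Optional[List[dict]],
+                   raw_text: str) -> Tuple[int, Dict[str, Any]]:
+    """Walk the error-handling flow top to bottom and stop at the first match."""
+    if not format_ok:
+        return 400, {"error": "Unsupported format"}
+    if corrupt:
+        return 422, {"error": "Unable to process file"}
+    body = {"raw_text": raw_text, "confidence_score": confidence}
+    if confidence < LOW_CONFIDENCE:
+        return 200, {**body, "warning": "low_confidence"}
+    if not schedule:
+        return 200, {**body, "maintenance_schedule": None}
+    return 200, {**body, "maintenance_schedule": schedule}
+```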
+
+---
+
+## QUICK START
+
+```bash
+# 1. Clone and setup
+git clone <repo>
+cd ocr-pipeline
+
+# 2. Build Docker image
+docker build -t ocr-pipeline .
+
+# 3. Start services
+docker compose up -d
+
+# 4. Test endpoint
+curl -X POST "http://localhost:8000/api/v1/extract" \
+  -F "file=@owners_manual.pdf" \
+  -F "document_type=owners_manual"
+```
+
+---
+
+## API ENDPOINT DESIGN
+
+```
+POST /api/v1/extract
+  - Accepts: multipart/form-data
+  - Fields: file, document_type (optional)
+  - Returns: Structured JSON or job_id for large files
+
+GET /api/v1/jobs/{job_id}
+  - Poll for async job status
+  - Returns: status, progress, result when complete
+
+POST /api/v1/extract/batch
+  - Accepts: multiple files
+  - Returns: array of job_ids
+```