egullickson/motovaultpro


Eric Gullickson a396fc0f38 feat: OCR Pipeline tech stack file

2026-01-04 13:35:38 -06:00


Vehicle Owner's Manual & Receipt OCR Pipeline

Complete Tech Stack & Architecture

SYSTEM FLOW DIAGRAM

┌─────────────────────────────────────────────────────────────────────────────────┐
│                              USER UPLOAD                                         │
│                   (PDF, JPG, PNG, HEIC, TIFF, WEBP)                              │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         1. FORMAT DETECTION                                      │
│                           python-magic                                           │
│                                                                                  │
│   • Detect MIME type from file bytes (not extension)                            │
│   • Validate file is an accepted format                                          │
│   • Reject unsupported/malicious files early                                     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                      ┌───────────────┴───────────────┐
                      ▼                               ▼
┌─────────────────────────────────┐   ┌─────────────────────────────────┐
│         PDF DETECTED            │   │        IMAGE DETECTED           │
└─────────────────────────────────┘   │   (JPG, PNG, HEIC, TIFF, WEBP)  │
                │                     └─────────────────────────────────┘
                ▼                                     │
┌─────────────────────────────────┐                   │
│   2a. PDF TEXT LAYER CHECK      │                   │
│         PyMuPDF (fitz)          │                   │
│                                 │                   │
│  • Check if PDF has embedded    │                   │
│    searchable text              │                   │
│  • Count text characters        │                   │
└─────────────────────────────────┘                   │
                │                                     │
        ┌───────┴───────┐                             │
        ▼               ▼                             │
┌──────────────┐ ┌──────────────┐                     │
│  HAS TEXT    │ │  NO TEXT     │                     │
│  (Native)    │ │  (Scanned)   │                     │
└──────────────┘ └──────────────┘                     │
        │               │                             │
        │               ▼                             │
        │       ┌──────────────────────┐              │
        │       │ 2b. RENDER TO IMAGES │              │
        │       │    PyMuPDF @ 300 DPI │              │
        │       │                      │              │
        │       │ • Page-by-page render│              │
        │       │ • Maintain quality   │              │
        │       └──────────────────────┘              │
        │               │                             │
        │               └──────────────┬──────────────┘
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                3. IMAGE NORMALIZATION                    │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ HEIC Input  │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │ pillow-heif │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    │ ImageOps    │    │  Pillow    │ │
        │       │                      └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │   Pillow    │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                 4. PREPROCESSING                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4a. Resolution Normalization                   │   │
        │       │   │      Target: 300 DPI equivalent                 │   │
        │       │   │      Tool: Pillow resize with LANCZOS           │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4b. Deskew (Straighten)                        │   │
        │       │   │      Tool: OpenCV + deskew library              │   │
        │       │   │      Method: Hough transform / projection       │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4c. Denoise                                    │   │
        │       │   │      Tool: OpenCV fastNlMeansDenoising          │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌───────────────────────────────────────────────────┐ │
        │       │   │  4d. Document-Specific Enhancement                │ │
        │       │   │                                                   │ │
        │       │   │   RECEIPTS:              MANUALS:                 │ │
        │       │   │   • Adaptive threshold   • Contrast stretch       │ │
        │       │   │   • Perspective correct  • Sharpen                │ │
        │       │   │   • High contrast B&W    • Keep grayscale         │ │
        │       │   │                                                   │ │
        │       │   │   Tool: OpenCV adaptiveThreshold / CLAHE          │ │
        │       │   └───────────────────────────────────────────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                    5. OCR ENGINE                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5a. Primary OCR: Tesseract 5.x                 │   │
        │       │   │                                                 │   │
        │       │   │  • Engine: LSTM (--oem 1)                       │   │
        │       │   │  • Page segmentation: Auto (--psm 3)            │   │
        │       │   │  • Output: hOCR with word confidence            │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │                 ┌───────────────┐                       │
        │       │                 │  Confidence   │                       │
        │       │                 │    > 80% ?    │                       │
        │       │                 └───────────────┘                       │
        │       │                    │         │                          │
        │       │              YES ──┘         └── NO                     │
        │       │               │                   │                     │
        │       │               │                   ▼                     │
        │       │               │   ┌─────────────────────────────────┐   │
        │       │               │   │  5b. Fallback: PaddleOCR        │   │
        │       │               │   │                                 │   │
        │       │               │   │  • Better for degraded images   │   │
        │       │               │   │  • Better table detection       │   │
        │       │               │   │  • Slower but more accurate     │   │
        │       │               │   └─────────────────────────────────┘   │
        │       │               │                   │                     │
        │       │               ▼                   ▼                     │
        │       │         ┌─────────────────────────────────┐             │
        │       │         │  5c. Result Merging             │             │
        │       │         │  • Merge by bounding box        │             │
        │       │         │  • Keep highest confidence      │             │
        │       │         └─────────────────────────────────┘             │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        └──────────────────────────────┤
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                        6. STRUCTURED EXTRACTION                                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6a. Layout Analysis                                                  │     │
│   │      Tool: PaddleOCR Layout / LayoutParser                            │     │
│   │      Detects: Headers, paragraphs, tables, lists, figures             │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                      │                                          │
│                    ┌─────────────────┴─────────────────┐                        │
│                    ▼                                   ▼                        │
│   ┌────────────────────────────────┐   ┌────────────────────────────────┐       │
│   │  6b. Table Extraction          │   │  6c. Text Block Processing     │       │
│   │                                │   │                                │       │
│   │  Tool: img2table or Camelot    │   │  • Group by proximity          │       │
│   │  Output: Pandas DataFrame      │   │  • Identify sections           │       │
│   │                                │   │  • Extract hierarchies         │       │
│   └────────────────────────────────┘   └────────────────────────────────┘       │
│                    │                                   │                        │
│                    └─────────────────┬─────────────────┘                        │
│                                      ▼                                          │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  6d. Maintenance Schedule Pattern Matching                            │     │
│   │                                                                       │     │
│   │  Tool: regex + spaCy NER                                              │     │
│   │                                                                       │     │
│   │  Patterns:                                                            │     │
│   │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │     │
│   │  • Time intervals: "every 6 months", "annually"                       │     │
│   │  • Service types: "oil change", "tire rotation", "brake inspection"  │     │
│   │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            7. OUTPUT LAYER                                       │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│   ┌───────────────────────────────────────────────────────────────────────┐     │
│   │  Structured JSON Response                                             │     │
│   │                                                                       │     │
│   │  {                                                                    │     │
│   │    "document_type": "owners_manual" | "receipt",                      │     │
│   │    "vehicle": {                                                       │     │
│   │      "make": "Toyota",                                                │     │
│   │      "model": "Camry",                                                │     │
│   │      "year": 2024                                                     │     │
│   │    },                                                                 │     │
│   │    "maintenance_schedule": [                                          │     │
│   │      {                                                                │     │
│   │        "service": "Oil Change",                                       │     │
│   │        "interval_miles": 5000,                                        │     │
│   │        "interval_months": 6,                                          │     │
│   │        "details": "Use 0W-20 synthetic"                               │     │
│   │      }                                                                │     │
│   │    ],                                                                 │     │
│   │    "raw_text": "...",                                                 │     │
│   │    "tables": [...],                                                   │     │
│   │    "confidence_score": 0.94,                                          │     │
│   │    "processing_time_ms": 1250                                         │     │
│   │  }                                                                    │     │
│   │                                                                       │     │
│   └───────────────────────────────────────────────────────────────────────┘     │
│                                                                                  │
└─────────────────────────────────────────────────────────────────────────────────┘
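Step 5c in the diagram above ("merge by bounding box, keep highest confidence") could be sketched roughly as follows. This is only an illustration: the `Word` type, the `merge_words` helper, and the 0.5 IoU cutoff are hypothetical names and values, not part of the original design. It also assumes both engines' confidences have already been normalized to the same 0-100 scale (Tesseract reports 0-100, PaddleOCR reports 0-1).

```python
from dataclasses import dataclass

Box = tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class Word:
    text: str
    conf: float  # assumed normalized to 0-100 for both engines
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_words(primary: list[Word], fallback: list[Word],
                min_iou: float = 0.5) -> list[Word]:
    """Merge fallback words into the primary result.

    A fallback word that overlaps an existing word (IoU >= min_iou) replaces
    it only when its confidence is higher; a fallback word with no overlapping
    counterpart is appended as a newly found word.
    """
    merged = list(primary)
    for cand in fallback:
        best_idx, best_score = -1, 0.0
        for idx, word in enumerate(merged):
            score = iou(word.box, cand.box)
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx >= 0 and best_score >= min_iou:
            if cand.conf > merged[best_idx].conf:
                merged[best_idx] = cand
        else:
            merged.append(cand)
    return merged
```

The quadratic overlap search is acceptable for a single page of words; a spatial index would be worth considering only if pages with many thousands of boxes turn out to be common.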

COMPLETE TECH STACK

Core Dependencies

Component	Tool	Version	Purpose
Runtime	Python	3.11+	Primary language
API Framework	FastAPI	0.100+	REST API with async support
Task Queue	Celery + Redis	Celery 5.3+	Async processing for large docs

File Handling

Component	Tool	Purpose
Format Detection	python-magic	MIME type detection from bytes
HEIC Support	pillow-heif	iPhone HEIC image conversion
Image Processing	Pillow	General image I/O and manipulation
PDF Processing	PyMuPDF (fitz)	PDF text extraction & rendering
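As a rough sketch of how these tools might cover steps 1 and 2a of the flow diagram: `magic.from_buffer(..., mime=True)` and PyMuPDF's `page.get_text()` are real library calls, while the helper names, the accepted MIME set, and the 50-characters-per-page threshold are assumptions made for illustration. The third-party imports are done lazily so the routing logic can be exercised without the native libraries installed.

```python
# Assumed MIME strings; the exact value libmagic reports for HEIC can vary
# between versions, so both "heic" and "heif" are accepted here.
ACCEPTED_IMAGE_MIMES = {
    "image/jpeg", "image/png", "image/heic", "image/heif",
    "image/tiff", "image/webp",
}


def detect_mime(data: bytes) -> str:
    """Detect the MIME type from file bytes, ignoring the file extension."""
    import magic  # python-magic, imported lazily (needs libmagic1)
    return magic.from_buffer(data, mime=True)


def route_for_mime(mime: str) -> str:
    """Return 'pdf' or 'image', or raise ValueError for anything else."""
    if mime == "application/pdf":
        return "pdf"
    if mime in ACCEPTED_IMAGE_MIMES:
        return "image"
    raise ValueError(f"Unsupported format: {mime}")


def has_text_layer(char_counts: list[int], min_chars_per_page: int = 50) -> bool:
    """Treat a PDF as native when pages average enough extractable characters.

    The threshold is a hypothetical value; scanned PDFs often still contain a
    handful of stray characters, so a plain 'any text at all' test is fragile.
    """
    if not char_counts:
        return False
    return sum(char_counts) / len(char_counts) >= min_chars_per_page


def pdf_has_text_layer(data: bytes) -> bool:
    """Count extractable characters per page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily
    with fitz.open(stream=data, filetype="pdf") as doc:
        counts = [len(page.get_text().strip()) for page in doc]
    return has_text_layer(counts)
```

A caller would typically run `route_for_mime(detect_mime(data))` first and call `pdf_has_text_layer(data)` only on the PDF branch.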

Image Preprocessing

Component	Tool	Purpose
Computer Vision	OpenCV (cv2)	Deskew, denoise, threshold
Deskew	deskew	Specialized document straightening
Enhancement	scikit-image	Additional image filters
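Step 4a (resolution normalization to a 300 DPI equivalent with Pillow's LANCZOS filter) comes down to a scale factor. The sketch below assumes the source DPI is known, for example from image metadata; phone photos often lack reliable DPI information, so a real pipeline would need some estimate. The function names are hypothetical.

```python
def scaled_size(width_px: int, height_px: int,
                source_dpi: float, target_dpi: float = 300) -> tuple[int, int]:
    """Pixel dimensions that represent the same physical page at target_dpi."""
    scale = target_dpi / source_dpi
    return round(width_px * scale), round(height_px * scale)


def normalize_resolution(image, source_dpi: float, target_dpi: float = 300):
    """Resize a PIL image to the target DPI equivalent using LANCZOS."""
    from PIL import Image  # Pillow, imported lazily
    if source_dpi == target_dpi:
        return image
    new_size = scaled_size(image.width, image.height, source_dpi, target_dpi)
    return image.resize(new_size, Image.Resampling.LANCZOS)
```

For example, a US Letter page scanned at 100 DPI (850 x 1100 px) would be upscaled to 2550 x 3300 px.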

OCR Engines

Component	Tool	Purpose
Primary OCR	Tesseract 5.x	Fast, reliable text extraction
Python Binding	pytesseract	Tesseract Python wrapper
Fallback OCR	PaddleOCR	Higher accuracy, better tables
Layout Analysis	PaddleOCR / LayoutParser	Document structure detection
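The hand-off from Tesseract to PaddleOCR depends on a page-level confidence figure. The source does not say how that figure is aggregated, so the sketch below assumes a plain mean over word confidences taken from `pytesseract.image_to_data` (which marks non-word rows with a confidence of -1). The helper names and the aggregation rule are assumptions; the 80% cutoff and the `--oem 1 --psm 3` flags come from the flow diagram.

```python
def run_tesseract(image) -> dict:
    """Run Tesseract with the LSTM engine and automatic page segmentation."""
    import pytesseract  # imported lazily; requires the tesseract binary
    return pytesseract.image_to_data(
        image,
        config="--oem 1 --psm 3",
        output_type=pytesseract.Output.DICT,
    )


def mean_word_confidence(data: dict) -> float:
    """Mean confidence (0-100) over real words in image_to_data output.

    Rows with conf == -1 are layout rows (blocks, lines) rather than words,
    and empty strings carry no text, so both are skipped.
    """
    confs = [
        float(conf)
        for conf, text in zip(data["conf"], data["text"])
        if float(conf) >= 0 and str(text).strip()
    ]
    return sum(confs) / len(confs) if confs else 0.0


def needs_fallback(data: dict, threshold: float = 80.0) -> bool:
    """True when the page should also be sent to the fallback engine."""
    return mean_word_confidence(data) <= threshold
```

A mean can hide a few badly misread words on an otherwise clean page, so a per-region or per-line check may be a better fit for receipts, where a single wrong digit matters.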

Data Extraction

Component	Tool	Purpose
Table Extraction	img2table	Image-based table extraction
PDF Tables	Camelot	Native PDF table extraction
NLP	spaCy	Entity extraction, pattern matching
Data Handling	pandas	Table/dataframe manipulation
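The regex half of step 6d (maintenance schedule pattern matching) might look something like the sketch below, using the example phrases from the flow diagram. The patterns and the `parse_interval` helper are illustrative assumptions and far from exhaustive; the spaCy NER half is not shown.

```python
import re

# Distances such as "5,000 miles" or "30,000 km".
DISTANCE_RE = re.compile(
    r"(?P<value>\d{1,3}(?:,\d{3})+|\d+)\s*(?P<unit>miles?|mi|km|kilometers?)\b",
    re.IGNORECASE,
)
# Time intervals such as "every 6 months".
MONTHS_RE = re.compile(r"every\s+(?P<value>\d+)\s+months?\b", re.IGNORECASE)
ANNUAL_RE = re.compile(r"\bannually\b", re.IGNORECASE)
# Oil viscosity grades such as "5W-30" or "0W-20".
OIL_GRADE_RE = re.compile(r"\b\d{1,2}W-\d{2}\b")


def parse_interval(text: str) -> dict:
    """Extract the first distance and time interval plus any oil grades."""
    result = {
        "interval_miles": None,
        "interval_km": None,
        "interval_months": None,
        "oil_grades": OIL_GRADE_RE.findall(text),
    }
    for match in DISTANCE_RE.finditer(text):
        value = int(match.group("value").replace(",", ""))
        key = "interval_km" if match.group("unit").lower().startswith("k") else "interval_miles"
        if result[key] is None:  # keep only the first match per unit
            result[key] = value
    months = MONTHS_RE.search(text)
    if months:
        result["interval_months"] = int(months.group("value"))
    elif ANNUAL_RE.search(text):
        result["interval_months"] = 12
    return result
```

The keys mirror the `interval_miles` and `interval_months` fields of the structured JSON response; `interval_km` and `oil_grades` are extra fields added here only for illustration.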

Infrastructure

Component	Tool	Purpose
Object Storage	S3 / MinIO	Document and result storage
Database	PostgreSQL	Metadata and results
Cache	Redis	Result caching, queue backend
Containerization	Docker	Deployment

SYSTEM REQUIREMENTS FILE

requirements.txt

# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6

# Task Queue
celery>=5.3.0
redis>=4.6.0

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# PDF Processing
pymupdf>=1.23.0

# Image Preprocessing
opencv-python-headless>=4.8.0
deskew>=1.4.0
scikit-image>=0.21.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.5.0
paddleocr>=2.7.0

# Table Extraction
img2table>=1.2.0
camelot-py[cv]>=0.11.0

# NLP & Data
spacy>=3.6.0
pandas>=2.0.0

# Storage & Database
boto3>=1.28.0
psycopg2-binary>=2.9.0
sqlalchemy>=2.0.0

System Package Requirements (Ubuntu/Debian)

# Tesseract OCR
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev

# HEIC Support
apt-get install libheif-examples libheif-dev

# OpenCV dependencies
apt-get install libgl1 libglib2.0-0

# PDF rendering dependencies
apt-get install libmupdf-dev mupdf-tools

# File type detection (libmagic, used by python-magic)
apt-get install libmagic1

# Camelot dependencies
apt-get install ghostscript python3-tk

DOCKERFILE

FROM python:3.11-slim

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libtesseract-dev \
    libheif-examples \
    libheif-dev \
    libgl1 \
    libglib2.0-0 \
    libmagic1 \
    ghostscript \
    poppler-utils \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download spaCy model
RUN python -m spacy download en_core_web_sm

# Download PaddleOCR models (cached in image)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"

COPY . .

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

PROCESSING TIME ESTIMATES

Document Type	Size	Expected Time	Notes
Single receipt (HEIC)	2-5 MB	1-3 seconds	After preprocessing
Single receipt (JPG)	500 KB-2 MB	0.5-2 seconds	Direct processing
Owner's manual (PDF)	10-50 MB	30-120 seconds	100-300 pages
Owner's manual (scanned)	50-200 MB	2-5 minutes	Requires full OCR

SCALING CONSIDERATIONS

                    ┌─────────────────────────────────────────┐
                    │            Load Balancer                │
                    │              (nginx)                    │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │   API Server    │      │   API Server    │      │   API Server    │
   │   (FastAPI)     │      │   (FastAPI)     │      │   (FastAPI)     │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
              │                        │                        │
              └────────────────────────┼────────────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │            Redis Queue                  │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
   │  (OCR Heavy)    │      │  (OCR Heavy)    │      │  (OCR Heavy)    │
   │  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
   └─────────────────┘      └─────────────────┘      └─────────────────┘

Scaling Strategy

Small files (receipts): Process synchronously in API server
Large files (manuals): Queue to Celery workers, return job ID
Horizontal scaling: Add more Celery workers for throughput
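GPU acceleration: PaddleOCR can run on GPU for an estimated 5-10x speedup (requires the paddlepaddle-gpu package in place of the CPU-only paddlepaddle listed in requirements.txt)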

ERROR HANDLING FLOW

┌─────────────────┐
│  File Upload    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ Format Valid?   │──NO─▶│ Return 400: Unsupported format         │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ YES
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ File Corrupt?   │─YES─▶│ Return 422: Unable to process file     │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ OCR Confidence  │─LOW─▶│ Return 200 with warning flag           │
│    < 50%?       │     │ { "warning": "low_confidence" }        │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ OK
         ▼
┌─────────────────┐     ┌─────────────────────────────────────────┐
│ No Schedule     │─YES─▶│ Return 200 with raw text only          │
│   Found?        │     │ { "maintenance_schedule": null }       │
└────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐
│ Return Full     │
│ Structured Data │
└─────────────────┘
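The decision chain above can be written as one function that maps pipeline outcomes to an HTTP status and body. This is a sketch under assumptions: the function name and the `detail` key for error bodies are not from the source, and confidence is taken on the 0-1 scale used by `confidence_score` in the JSON example. It also assumes a low-confidence response still carries whatever was extracted, which the diagram does not specify.

```python
def build_response(format_ok: bool, corrupt: bool, confidence: float,
                   schedule: list[dict], raw_text: str) -> tuple[int, dict]:
    """Map pipeline outcomes to (status_code, body) per the error flow."""
    if not format_ok:
        return 400, {"detail": "Unsupported format"}
    if corrupt:
        return 422, {"detail": "Unable to process file"}

    body = {
        "raw_text": raw_text,
        "confidence_score": confidence,
        # An empty schedule is reported as null, leaving raw text only.
        "maintenance_schedule": schedule or None,
    }
    if confidence < 0.5:
        body["warning"] = "low_confidence"
    return 200, body
```

Keeping the low-confidence and no-schedule cases at status 200 lets clients decide for themselves whether partial output is usable.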

QUICK START

# 1. Clone and setup
git clone <repo>
cd ocr-pipeline

# 2. Build Docker image
docker build -t ocr-pipeline .

# 3. Start services
docker-compose up -d

# 4. Test endpoint
curl -X POST "http://localhost:8000/api/v1/extract" \
  -F "file=@owners_manual.pdf" \
  -F "document_type=manual"

API ENDPOINT DESIGN

POST /api/v1/extract
  - Accepts: multipart/form-data
  - Fields: file, document_type (optional)
  - Returns: Structured JSON or job_id for large files

GET /api/v1/jobs/{job_id}
  - Poll for async job status
  - Returns: status, progress, result when complete

POST /api/v1/extract/batch
  - Accepts: multiple files
  - Returns: array of job_ids
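On the client side, polling `GET /api/v1/jobs/{job_id}` might look like the sketch below. The `status`, `progress`, and `result` fields come from the endpoint description above, but the concrete status strings (`completed`, `failed`), the default base URL, and the helper names are assumptions. The fetch function is injected so the loop can be tested without a running server.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_status_http(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    """Fetch the job document from the API (the URL layout is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(fetch_status: Callable[[str], dict], job_id: str,
                 interval_s: float = 2.0, timeout_s: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job completes, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status(job_id)
        if job["status"] == "completed":  # assumed status value
            return job["result"]
        if job["status"] == "failed":  # assumed status value
            raise RuntimeError(f"Job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish in {timeout_s}s")
        sleep(interval_s)
```

Against a live service the call would be `wait_for_job(fetch_status_http, job_id)`, where `job_id` is the value returned by `POST /api/v1/extract` for a large file.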
