Merge branch 'main' of 172.30.1.72:egullickson/motovaultpro

feat: OCR Pipeline tech stack file
2026-01-04 13:35:43 -06:00 · 2026-01-04 13:35:38 -06:00
1 changed files with 519 additions and 0 deletions
--- a/docs/ocr-pipeline-tech-stack.md
+++ b/docs/ocr-pipeline-tech-stack.md
@@ -0,0 +1,519 @@
 # Vehicle Owner's Manual & Receipt OCR Pipeline
 ## Complete Tech Stack & Architecture
 ---
 ## SYSTEM FLOW DIAGRAM
 ```
 ┌─────────────────────────────────────────────────────────────────────────────────┐
 │                              USER UPLOAD                                         │
 │                   (PDF, JPG, PNG, HEIC, TIFF, WEBP)                              │
 └─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
 ┌─────────────────────────────────────────────────────────────────────────────────┐
 │                         1. FORMAT DETECTION                                      │
 │                           python-magic                                           │
 │                                                                                  │
 │   • Detect MIME type from file bytes (not extension)                            │
 │   • Validate file is an accepted format                                          │
 │   • Reject unsupported/malicious files early                                     │
 └─────────────────────────────────────────────────────────────────────────────────┘
                                      │
                      ┌───────────────┴───────────────┐
                      ▼                               ▼
 ┌─────────────────────────────────┐   ┌─────────────────────────────────┐
 │         PDF DETECTED            │   │        IMAGE DETECTED           │
 └─────────────────────────────────┘   │   (JPG, PNG, HEIC, TIFF, WEBP)  │
                │                     └─────────────────────────────────┘
                ▼                                     │
 ┌─────────────────────────────────┐                   │
 │   2a. PDF TEXT LAYER CHECK      │                   │
 │         PyMuPDF (fitz)          │                   │
 │                                 │                   │
 │  • Check if PDF has embedded    │                   │
 │    searchable text              │                   │
 │  • Count text characters        │                   │
 └─────────────────────────────────┘                   │
                │                                     │
        ┌───────┴───────┐                             │
        ▼               ▼                             │
 ┌──────────────┐ ┌──────────────┐                     │
 │  HAS TEXT    │ │  NO TEXT     │                     │
 │  (Native)    │ │  (Scanned)   │                     │
 └──────────────┘ └──────────────┘                     │
        │               │                             │
        │               ▼                             │
        │       ┌──────────────────────┐              │
        │       │ 2b. RENDER TO IMAGES │              │
        │       │    PyMuPDF @ 300 DPI │              │
        │       │                      │              │
        │       │ • Page-by-page render│              │
        │       │ • Maintain quality   │              │
        │       └──────────────────────┘              │
        │               │                             │
        │               └──────────────┬──────────────┘
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                3. IMAGE NORMALIZATION                    │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ HEIC Input  │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │ pillow-heif │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    │ ImageOps    │    │  Pillow    │ │
        │       │                      └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       │   ┌─────────────┐    ┌─────────────┐    ┌────────────┐ │
        │       │   │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │ │
        │       │   │             │───▶│             │───▶│  RGB PNG   │ │
        │       │   │   Pillow    │    │ Fix rotation│    │            │ │
        │       │   └─────────────┘    └─────────────┘    └────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                 4. PREPROCESSING                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4a. Resolution Normalization                   │   │
        │       │   │      Target: 300 DPI equivalent                 │   │
        │       │   │      Tool: Pillow resize with LANCZOS           │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4b. Deskew (Straighten)                        │   │
        │       │   │      Tool: OpenCV + deskew library              │   │
        │       │   │      Method: Hough transform / projection       │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  4c. Denoise                                    │   │
        │       │   │      Tool: OpenCV fastNlMeansDenoising          │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │   ┌───────────────────────────────────────────────────┐ │
        │       │   │  4d. Document-Specific Enhancement                │ │
        │       │   │                                                   │ │
        │       │   │   RECEIPTS:              MANUALS:                 │ │
        │       │   │   • Adaptive threshold   • Contrast stretch       │ │
        │       │   │   • Perspective correct  • Sharpen                │ │
        │       │   │   • High contrast B&W    • Keep grayscale         │ │
        │       │   │                                                   │ │
        │       │   │   Tool: OpenCV adaptiveThreshold / CLAHE          │ │
        │       │   └───────────────────────────────────────────────────┘ │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        │                              ▼
        │       ┌─────────────────────────────────────────────────────────┐
        │       │                    5. OCR ENGINE                         │
        │       ├─────────────────────────────────────────────────────────┤
        │       │                                                         │
        │       │   ┌─────────────────────────────────────────────────┐   │
        │       │   │  5a. Primary OCR: Tesseract 5.x                 │   │
        │       │   │                                                 │   │
        │       │   │  • Engine: LSTM (--oem 1)                       │   │
        │       │   │  • Page segmentation: Auto (--psm 3)            │   │
        │       │   │  • Output: hOCR with word confidence            │   │
        │       │   └─────────────────────────────────────────────────┘   │
        │       │                         │                               │
        │       │                         ▼                               │
        │       │                 ┌───────────────┐                       │
        │       │                 │  Confidence   │                       │
        │       │                 │    > 80% ?    │                       │
        │       │                 └───────────────┘                       │
        │       │                    │         │                          │
        │       │              YES ──┘         └── NO                     │
        │       │               │                   │                     │
        │       │               │                   ▼                     │
        │       │               │   ┌─────────────────────────────────┐   │
        │       │               │   │  5b. Fallback: PaddleOCR        │   │
        │       │               │   │                                 │   │
        │       │               │   │  • Better for degraded images   │   │
        │       │               │   │  • Better table detection       │   │
        │       │               │   │  • Slower but more accurate     │   │
        │       │               │   └─────────────────────────────────┘   │
        │       │               │                   │                     │
        │       │               ▼                   ▼                     │
        │       │         ┌─────────────────────────────────┐             │
        │       │         │  5c. Result Merging             │             │
        │       │         │  • Merge by bounding box        │             │
        │       │         │  • Keep highest confidence      │             │
        │       │         └─────────────────────────────────┘             │
        │       │                                                         │
        │       └─────────────────────────────────────────────────────────┘
        │                              │
        └──────────────────────────────┤
                                       ▼
 ┌─────────────────────────────────────────────────────────────────────────────────┐
 │                        6. STRUCTURED EXTRACTION                                  │
 ├─────────────────────────────────────────────────────────────────────────────────┤
 │                                                                                  │
 │   ┌───────────────────────────────────────────────────────────────────────┐     │
 │   │  6a. Layout Analysis                                                  │     │
 │   │      Tool: PaddleOCR Layout / LayoutParser                            │     │
 │   │      Detects: Headers, paragraphs, tables, lists, figures             │     │
 │   └───────────────────────────────────────────────────────────────────────┘     │
 │                                      │                                          │
 │                    ┌─────────────────┴─────────────────┐                        │
 │                    ▼                                   ▼                        │
 │   ┌────────────────────────────────┐   ┌────────────────────────────────┐       │
 │   │  6b. Table Extraction          │   │  6c. Text Block Processing     │       │
 │   │                                │   │                                │       │
 │   │  Tool: img2table or Camelot    │   │  • Group by proximity          │       │
 │   │  Output: Pandas DataFrame      │   │  • Identify sections           │       │
 │   │                                │   │  • Extract hierarchies         │       │
 │   └────────────────────────────────┘   └────────────────────────────────┘       │
 │                    │                                   │                        │
 │                    └─────────────────┬─────────────────┘                        │
 │                                      ▼                                          │
 │   ┌───────────────────────────────────────────────────────────────────────┐     │
 │   │  6d. Maintenance Schedule Pattern Matching                            │     │
 │   │                                                                       │     │
 │   │  Tool: regex + spaCy NER                                              │     │
 │   │                                                                       │     │
 │   │  Patterns:                                                            │     │
 │   │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │     │
 │   │  • Time intervals: "every 6 months", "annually"                       │     │
 │   │  • Service types: "oil change", "tire rotation", "brake inspection"  │     │
 │   │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │     │
 │   │                                                                       │     │
 │   └───────────────────────────────────────────────────────────────────────┘     │
 │                                                                                  │
 └─────────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
 ┌─────────────────────────────────────────────────────────────────────────────────┐
 │                            7. OUTPUT LAYER                                       │
 ├─────────────────────────────────────────────────────────────────────────────────┤
 │                                                                                  │
 │   ┌───────────────────────────────────────────────────────────────────────┐     │
 │   │  Structured JSON Response                                             │     │
 │   │                                                                       │     │
 │   │  {                                                                    │     │
 │   │    "document_type": "owners_manual" | "receipt",                      │     │
 │   │    "vehicle": {                                                       │     │
 │   │      "make": "Toyota",                                                │     │
 │   │      "model": "Camry",                                                │     │
 │   │      "year": 2024                                                     │     │
 │   │    },                                                                 │     │
 │   │    "maintenance_schedule": [                                          │     │
 │   │      {                                                                │     │
 │   │        "service": "Oil Change",                                       │     │
 │   │        "interval_miles": 5000,                                        │     │
 │   │        "interval_months": 6,                                          │     │
 │   │        "details": "Use 0W-20 synthetic"                               │     │
 │   │      }                                                                │     │
 │   │    ],                                                                 │     │
 │   │    "raw_text": "...",                                                 │     │
 │   │    "tables": [...],                                                   │     │
 │   │    "confidence_score": 0.94,                                          │     │
 │   │    "processing_time_ms": 1250                                         │     │
 │   │  }                                                                    │     │
 │   │                                                                       │     │
 │   └───────────────────────────────────────────────────────────────────────┘     │
 │                                                                                  │
 └─────────────────────────────────────────────────────────────────────────────────┘
 ```
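Step 2a in the diagram decides between the native and scanned branches by counting embedded text characters. A minimal sketch of that decision follows. The threshold `MIN_CHARS_PER_PAGE` and the function names are hypothetical; the source says only that characters are counted.

```python
from typing import Iterable

# Hypothetical threshold: average non-whitespace characters per page below
# which a PDF is treated as scanned. Not specified in the source document.
MIN_CHARS_PER_PAGE = 50


def has_text_layer(page_texts: Iterable[str],
                   min_chars_per_page: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when the pages carry enough embedded text to skip OCR."""
    pages = list(page_texts)
    if not pages:
        return False
    # Ignore whitespace so pages of blank lines do not count as text.
    total_chars = sum(len("".join(text.split())) for text in pages)
    return total_chars / len(pages) >= min_chars_per_page


def route_pdf(page_texts: Iterable[str]) -> str:
    """Pick the diagram branch: 'native' text or 'scanned' (render, then OCR)."""
    return "native" if has_text_layer(page_texts) else "scanned"
```

With PyMuPDF, the page texts could be collected as `[page.get_text() for page in fitz.open(path)]` and passed to `route_pdf`. A per-page check may suit manuals that mix native and scanned pages, which this sketch does not attempt.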
 ---
 ## COMPLETE TECH STACK
 ### Core Dependencies
 | Component              | Tool                  | Version   | Purpose                              |
 |------------------------|-----------------------|-----------|--------------------------------------|
 | **Runtime**            | Python                | 3.11+     | Primary language                     |
 | **API Framework**      | FastAPI               | 0.100+    | REST API with async support          |
 | **Task Queue**         | Celery + Redis        | Celery 5.3+ | Async processing for large docs    |
 ### File Handling
 | Component              | Tool                  | Purpose                              |
 |------------------------|-----------------------|--------------------------------------|
 | **Format Detection**   | python-magic          | MIME type detection from bytes       |
 | **HEIC Support**       | pillow-heif           | iPhone HEIC image conversion         |
 | **Image Processing**   | Pillow                | General image I/O and manipulation   |
 | **PDF Processing**     | PyMuPDF (fitz)        | PDF text extraction & rendering      |
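To illustrate what "MIME type detection from bytes" means, here is a standard-library stand-in that inspects file signatures for the accepted formats. It is a simplified sketch, not the python-magic implementation; in the stack above, `magic.from_buffer(data, mime=True)` would take its place. The function names and the HEIF brand list are assumptions.

```python
from typing import Optional

# Major brands found in the "ftyp" box of HEIC files (an illustrative subset).
_HEIC_BRANDS = {b"heic", b"heix", b"hevc", b"hevx"}


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a MIME type for the accepted formats, or None if unrecognised."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:4] in (b"II*\x00", b"MM\x00*"):  # little- and big-endian TIFF
        return "image/tiff"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    if data[4:8] == b"ftyp" and data[8:12] in _HEIC_BRANDS:
        return "image/heic"
    return None


def is_accepted(data: bytes) -> bool:
    """Reject anything whose leading bytes do not match an accepted format."""
    return sniff_mime(data) is not None
```

Because the check reads the bytes and ignores the filename, a renamed executable such as `invoice.pdf` that starts with `MZ` is rejected.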
 ### Image Preprocessing
 | Component              | Tool                  | Purpose                              |
 |------------------------|-----------------------|--------------------------------------|
 | **Computer Vision**    | OpenCV (cv2)          | Deskew, denoise, threshold           |
 | **Deskew**             | deskew                | Specialized document straightening   |
 | **Enhancement**        | scikit-image          | Additional image filters             |
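The "300 DPI equivalent" target from step 4a comes down to a scale factor. The sketch below computes the target pixel size, assuming the source DPI is known; many phone photos carry no reliable DPI metadata, so a real pipeline would need a fallback heuristic. `target_size` is a hypothetical helper name.

```python
from typing import Tuple

TARGET_DPI = 300  # target stated in step 4a of the flow diagram


def target_size(width: int, height: int, source_dpi: float,
                target_dpi: int = TARGET_DPI) -> Tuple[int, int]:
    """Compute the pixel size that makes an image 'target_dpi equivalent'."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = target_dpi / source_dpi
    # Round to the nearest pixel and never collapse a dimension to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result could be applied as `img.resize(target_size(*img.size, dpi), Image.Resampling.LANCZOS)`, where `dpi` might be read from `img.info.get("dpi")` when present.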
 ### OCR Engines
 | Component              | Tool                  | Purpose                              |
 |------------------------|-----------------------|--------------------------------------|
 | **Primary OCR**        | Tesseract 5.x         | Fast, reliable text extraction       |
 | **Python Binding**     | pytesseract           | Tesseract Python wrapper             |
 | **Fallback OCR**       | PaddleOCR             | Higher accuracy, better tables       |
 | **Layout Analysis**    | PaddleOCR / LayoutParser | Document structure detection      |
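The primary/fallback arrangement in steps 5a to 5c can be sketched in pure Python. The code below assumes both engines' output has already been normalised to words with a pixel bounding box and a confidence in the range 0.0 to 1.0 (pytesseract reports 0 to 100, so it would need scaling). The `Word` type, the mean-confidence gate, and the IoU threshold are illustrative assumptions, not details from the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class Word:
    text: str
    box: Box
    conf: float  # 0.0 to 1.0


def mean_confidence(words: List[Word]) -> float:
    return sum(w.conf for w in words) / len(words) if words else 0.0


def needs_fallback(words: List[Word], threshold: float = 0.80) -> bool:
    """Step 5a gate: run the fallback unless mean confidence exceeds 80%."""
    return mean_confidence(words) <= threshold


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge_results(primary: List[Word], fallback: List[Word],
                  min_iou: float = 0.5) -> List[Word]:
    """Step 5c: pair words by overlapping box and keep the more confident one."""
    merged: List[Word] = []
    unmatched = list(fallback)
    for word in primary:
        match = max(unmatched, key=lambda f: iou(word.box, f.box), default=None)
        if match is not None and iou(word.box, match.box) >= min_iou:
            unmatched.remove(match)
            merged.append(match if match.conf > word.conf else word)
        else:
            merged.append(word)
    # Words that only the fallback engine found are kept as well.
    merged.extend(unmatched)
    return merged
```

Fallback-only words are appended at the end, so a real pipeline would re-sort the merged list into reading order before extraction.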
 ### Data Extraction
 | Component              | Tool                  | Purpose                              |
 |------------------------|-----------------------|--------------------------------------|
 | **Table Extraction**   | img2table             | Image-based table extraction         |
 | **PDF Tables**         | Camelot               | Native PDF table extraction          |
 | **NLP**                | spaCy                 | Entity extraction, pattern matching  |
 | **Data Handling**      | pandas                | Table/dataframe manipulation         |
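The regex half of the maintenance-schedule matching (step 6d) might look like the sketch below. The patterns cover only the example phrases listed in the flow diagram and are illustrative assumptions; real manuals need a broader set, and the spaCy NER stage is omitted here.

```python
import re
from typing import Dict, Optional

# Illustrative patterns only, built from the examples in the flow diagram.
MILEAGE_RE = re.compile(
    r"(?:every\s+)?(\d{1,3}(?:,\d{3})+|\d+)\s*(miles|mi|km)\b", re.IGNORECASE
)
MONTHS_RE = re.compile(r"every\s+(\d+)\s+months?\b", re.IGNORECASE)
VISCOSITY_RE = re.compile(r"\b\d{1,2}W-\d{2}\b", re.IGNORECASE)
SERVICES = ("oil change", "tire rotation", "brake inspection")


def parse_schedule_line(line: str) -> Optional[Dict[str, object]]:
    """Extract one schedule entry from a line, or None if no service is named."""
    lowered = line.lower()
    service = next((s for s in SERVICES if s in lowered), None)
    if service is None:
        return None
    entry: Dict[str, object] = {
        "service": service.title(),
        "interval_miles": None,
        "interval_km": None,
        "interval_months": None,
        "fluids": VISCOSITY_RE.findall(line),
    }
    mileage = MILEAGE_RE.search(line)
    if mileage:
        value = int(mileage.group(1).replace(",", ""))
        unit = mileage.group(2).lower()
        entry["interval_km" if unit == "km" else "interval_miles"] = value
    months = MONTHS_RE.search(line)
    if months:
        entry["interval_months"] = int(months.group(1))
    elif re.search(r"\bannually\b", lowered):
        entry["interval_months"] = 12
    return entry
```

For example, `parse_schedule_line("Oil change: every 5,000 miles or every 6 months, use 5W-30")` yields an entry with `interval_miles` 5000, `interval_months` 6, and `fluids` `["5W-30"]`.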
 ### Infrastructure
 | Component              | Tool                  | Purpose                              |
 |------------------------|-----------------------|--------------------------------------|
 | **Object Storage**     | S3 / MinIO            | Document and result storage          |
 | **Database**           | PostgreSQL            | Metadata and results                 |
 | **Cache**              | Redis                 | Result caching, queue backend        |
 | **Containerization**   | Docker                | Deployment                           |
 ---
 ## SYSTEM REQUIREMENTS FILE
 ### requirements.txt
 ```
 # API Framework
 fastapi>=0.100.0
 uvicorn[standard]>=0.23.0
 python-multipart>=0.0.6
 # Task Queue
 celery>=5.3.0
 redis>=4.6.0
 # File Detection & Handling
 python-magic>=0.4.27
 pillow>=10.0.0
 pillow-heif>=0.13.0
 # PDF Processing
 pymupdf>=1.23.0
 # Image Preprocessing
 opencv-python-headless>=4.8.0
 deskew>=1.4.0
 scikit-image>=0.21.0
 numpy>=1.24.0
 # OCR Engines
 pytesseract>=0.3.10
 paddlepaddle>=2.5.0
 paddleocr>=2.7.0
 # Table Extraction
 img2table>=1.2.0
 camelot-py[cv]>=0.11.0
 # NLP & Data
 spacy>=3.6.0
 pandas>=2.0.0
 # Storage & Database
 boto3>=1.28.0
 psycopg2-binary>=2.9.0
 sqlalchemy>=2.0.0
 ```
 ### System Package Requirements (Ubuntu/Debian)
 ```bash
 # Tesseract OCR
 apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev
 # HEIC Support
 apt-get install libheif-examples libheif-dev
 # OpenCV dependencies (libgl1 replaces libgl1-mesa-glx, which newer releases such as Debian 12 no longer ship)
 apt-get install libgl1 libglib2.0-0
 # PDF rendering dependencies
 apt-get install libmupdf-dev mupdf-tools
 # File type detection (libmagic, used by python-magic)
 apt-get install libmagic1
 # Camelot dependencies
 apt-get install ghostscript python3-tk
 ```
 ---
 ## DOCKERFILE
 ```dockerfile
 FROM python:3.11-slim
 # System dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libtesseract-dev \
    libheif-examples \
    libheif-dev \
    libgl1 \
    libglib2.0-0 \
    libmagic1 \
    ghostscript \
    poppler-utils \
    && rm -rf /var/lib/apt/lists/*
 # Python dependencies
 WORKDIR /app
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
 # Download spaCy model
 RUN python -m spacy download en_core_web_sm
 # Download PaddleOCR models (cached in image)
 RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"
 COPY . .
 EXPOSE 8000
 CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
 ```
 ---
 ## PROCESSING TIME ESTIMATES
 | Document Type            | Size          | Expected Time  | Notes                          |
 |--------------------------|---------------|----------------|--------------------------------|
 | Single receipt (HEIC)    | 2-5 MB        | 1-3 seconds    | After preprocessing            |
 | Single receipt (JPG)     | 500 KB-2 MB   | 0.5-2 seconds  | Direct processing              |
 | Owner's manual (PDF)     | 10-50 MB      | 30-120 seconds | 100-300 pages                  |
 | Owner's manual (scanned) | 50-200 MB     | 2-5 minutes    | Requires full OCR              |
 ---
 ## SCALING CONSIDERATIONS
 ```
                    ┌─────────────────────────────────────────┐
                    │            Load Balancer                │
                    │              (nginx)                    │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │   API Server    │      │   API Server    │      │   API Server    │
   │   (FastAPI)     │      │   (FastAPI)     │      │   (FastAPI)     │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
              │                        │                        │
              └────────────────────────┼────────────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────────┐
                    │            Redis Queue                  │
                    └─────────────────────────────────────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
   ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
   │  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
   │  (OCR Heavy)    │      │  (OCR Heavy)    │      │  (OCR Heavy)    │
   │  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
   └─────────────────┘      └─────────────────┘      └─────────────────┘
 ```
 ### Scaling Strategy
 1. **Small files (receipts)**: Process synchronously in API server
 2. **Large files (manuals)**: Queue to Celery workers, return job ID
 3. **Horizontal scaling**: Add more Celery workers for throughput
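 4. **GPU acceleration**: PaddleOCR can run on a GPU (this needs the `paddlepaddle-gpu` build instead of `paddlepaddle`); the estimated speedup over CPU is 5-10x, depending on hardware and model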
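The first two points of the strategy amount to a routing decision at upload time. A minimal sketch follows, assuming a hypothetical 5 MB cut-off; the source distinguishes only "small" and "large" files, so `SYNC_MAX_BYTES` and the `Route` type are illustrative.

```python
from dataclasses import dataclass

# Hypothetical cut-off; the source does not define "small" numerically.
SYNC_MAX_BYTES = 5 * 1024 * 1024


@dataclass(frozen=True)
class Route:
    mode: str    # "sync" or "queued"
    reason: str


def choose_route(size_bytes: int, document_type: str,
                 sync_max_bytes: int = SYNC_MAX_BYTES) -> Route:
    """Decide whether to process in the API server or hand off to Celery."""
    if document_type == "receipt" and size_bytes <= sync_max_bytes:
        return Route("sync", "small receipt, processed in the API server")
    return Route("queued", "sent to a Celery worker; client receives a job ID")
```

An oversized receipt is still queued, so one large upload cannot hold an API worker for long.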
 ---
 ## ERROR HANDLING FLOW
 ```
 ┌─────────────────┐
 │  File Upload    │
 └────────┬────────┘
         │
         ▼
 ┌─────────────────┐     ┌─────────────────────────────────────────┐
 │ Format Valid?   │──NO─▶│ Return 400: Unsupported format         │
 └────────┬────────┘     └─────────────────────────────────────────┘
         │ YES
         ▼
 ┌─────────────────┐     ┌─────────────────────────────────────────┐
 │ File Corrupt?   │─YES─▶│ Return 422: Unable to process file     │
 └────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
 ┌─────────────────┐     ┌─────────────────────────────────────────┐
 │ OCR Confidence  │─LOW─▶│ Return 200 with warning flag           │
 │    < 50%?       │     │ { "warning": "low_confidence" }        │
 └────────┬────────┘     └─────────────────────────────────────────┘
         │ OK
         ▼
 ┌─────────────────┐     ┌─────────────────────────────────────────┐
 │ No Schedule     │─YES─▶│ Return 200 with raw text only          │
 │   Found?        │     │ { "maintenance_schedule": null }       │
 └────────┬────────┘     └─────────────────────────────────────────┘
         │ NO
         ▼
 ┌─────────────────┐
 │ Return Full     │
 │ Structured Data │
 └─────────────────┘
 ```
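The flow above can be read as one function that maps pipeline outcomes to an HTTP status and body. The sketch below follows the diagram's status codes and the `warning` and `maintenance_schedule` fields; the `error` key and the function signature are assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_response(format_valid: bool, corrupt: bool, confidence: float,
                   raw_text: str,
                   schedule: Optional[List[Dict[str, Any]]]
                   ) -> Tuple[int, Dict[str, Any]]:
    """Return (status_code, body) following the error-handling flow."""
    if not format_valid:
        return 400, {"error": "Unsupported format"}
    if corrupt:
        return 422, {"error": "Unable to process file"}
    body: Dict[str, Any] = {
        "raw_text": raw_text,
        "confidence_score": confidence,
        # An empty or missing schedule is reported as null, raw text only.
        "maintenance_schedule": schedule or None,
    }
    if confidence < 0.50:
        # Low confidence is still a 200, flagged so the client can re-upload.
        body["warning"] = "low_confidence"
    return 200, body
```

Low confidence and a missing schedule both return 200 because the request itself succeeded; the body tells the client how much to trust the result.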
 ---
 ## QUICK START
 ```bash
 # 1. Clone and setup
 git clone <repo>
 cd ocr-pipeline
 # 2. Build Docker image
 docker build -t ocr-pipeline .
 # 3. Start services
 docker-compose up -d
 # 4. Test endpoint
 curl -X POST "http://localhost:8000/api/v1/extract" \
  -F "file=@owners_manual.pdf" \
  -F "document_type=manual"
 ```
 ---
 ## API ENDPOINT DESIGN
 ```
 POST /api/v1/extract
  - Accepts: multipart/form-data
  - Fields: file, document_type (optional)
  - Returns: Structured JSON or job_id for large files
 GET /api/v1/jobs/{job_id}
  - Poll for async job status
  - Returns: status, progress, result when complete
 POST /api/v1/extract/batch
  - Accepts: multiple files
  - Returns: array of job_ids
 ```
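To make the polling contract for `GET /api/v1/jobs/{job_id}` concrete, here is an in-memory sketch of the job state behind it. The state names (`pending`, `processing`, `complete`) and the `JobStore` class are hypothetical; in the stack described above this state would live in Redis or PostgreSQL.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Job:
    status: str = "pending"   # hypothetical states: pending, processing, complete
    progress: float = 0.0     # 0.0 to 1.0
    result: Optional[Dict[str, Any]] = None


class JobStore:
    """In-memory stand-in for persistent job state."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = Job()
        return job_id

    def update(self, job_id: str, progress: float) -> None:
        job = self._jobs[job_id]
        job.status = "processing"
        job.progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]

    def complete(self, job_id: str, result: Dict[str, Any]) -> None:
        job = self._jobs[job_id]
        job.status, job.progress, job.result = "complete", 1.0, result

    def poll(self, job_id: str) -> Optional[Dict[str, Any]]:
        """Body for the job-status endpoint; None would map to a 404."""
        job = self._jobs.get(job_id)
        if job is None:
            return None
        body: Dict[str, Any] = {"status": job.status, "progress": job.progress}
        if job.status == "complete":
            body["result"] = job.result
        return body
```

The `result` key appears only once the job is complete, matching the "result when complete" wording of the endpoint description.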
Author	SHA1	Message	Date
Eric Gullickson	453083b7db	Merge branch 'main' of 172.30.1.72:egullickson/motovaultpro	2026-01-04 13:35:43 -06:00
Eric Gullickson	a396fc0f38	feat: OCR Pipeline tech stack file	2026-01-04 13:35:38 -06:00