# Vehicle Owner's Manual & Receipt OCR Pipeline
## Complete Tech Stack & Architecture

---

## SYSTEM FLOW DIAGRAM

```
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                   USER UPLOAD                                   │
│                        (PDF, JPG, PNG, HEIC, TIFF, WEBP)                        │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                               1. FORMAT DETECTION                               │
│                                  python-magic                                   │
│                                                                                 │
│  • Detect MIME type from file bytes (not extension)                             │
│  • Validate file is an accepted format                                          │
│  • Reject unsupported/malicious files early                                     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
                   ┌─────────────────────┴─────────────────────┐
                   ▼                                           ▼
  ┌─────────────────────────────────┐         ┌─────────────────────────────────┐
  │          PDF DETECTED           │         │         IMAGE DETECTED          │
  └─────────────────────────────────┘         │  (JPG, PNG, HEIC, TIFF, WEBP)   │
                   │                          └─────────────────────────────────┘
                   ▼                                           │
  ┌─────────────────────────────────┐                          │
  │    2a. PDF TEXT LAYER CHECK     │                          │
  │         PyMuPDF (fitz)          │                          │
  │                                 │                          │
  │  • Check if PDF has embedded    │                          │
  │    searchable text              │                          │
  │  • Count text characters        │                          │
  └─────────────────────────────────┘                          │
                   │                                           │
           ┌───────┴───────┐                                   │
           ▼               ▼                                   │
   ┌──────────────┐ ┌──────────────┐                           │
   │   HAS TEXT   │ │   NO TEXT    │                           │
   │   (Native)   │ │  (Scanned)   │                           │
   └──────────────┘ └──────────────┘                           │
           │               │                                   │
           │               ▼                                   │
           │   ┌──────────────────────┐                        │
           │   │ 2b. RENDER TO IMAGES │                        │
           │   │  PyMuPDF @ 300 DPI   │                        │
           │   │                      │                        │
           │   │ • Page-by-page render│                        │
           │   │ • Maintain quality   │                        │
           │   └──────────────────────┘                        │
           │               │                                   │
           │               └─────────────────┬─────────────────┘
           │                                 ▼
           │    ┌─────────────────────────────────────────────────────────┐
           │    │                 3. IMAGE NORMALIZATION                  │
           │    ├─────────────────────────────────────────────────────────┤
           │    │                                                         │
           │    │  ┌─────────────┐    ┌─────────────┐    ┌────────────┐   │
           │    │  │ HEIC Input  │    │ EXIF Check  │    │ Convert to │   │
           │    │  │             │───▶│             │───▶│ RGB PNG    │   │
           │    │  │ pillow-heif │    │ Fix rotation│    │            │   │
           │    │  └─────────────┘    │ ImageOps    │    │ Pillow     │   │
           │    │                     └─────────────┘    └────────────┘   │
           │    │                                                         │
           │    │  ┌─────────────┐    ┌─────────────┐    ┌────────────┐   │
           │    │  │ JPG/PNG/etc │    │ EXIF Check  │    │ Convert to │   │
           │    │  │             │───▶│             │───▶│ RGB PNG    │   │
           │    │  │ Pillow      │    │ Fix rotation│    │            │   │
           │    │  └─────────────┘    └─────────────┘    └────────────┘   │
           │    │                                                         │
           │    └─────────────────────────────────────────────────────────┘
           │                                 │
           │                                 ▼
           │    ┌─────────────────────────────────────────────────────────┐
           │    │                    4. PREPROCESSING                     │
           │    ├─────────────────────────────────────────────────────────┤
           │    │                                                         │
           │    │   ┌─────────────────────────────────────────────────┐   │
           │    │   │  4a. Resolution Normalization                   │   │
           │    │   │  Target: 300 DPI equivalent                     │   │
           │    │   │  Tool: Pillow resize with LANCZOS               │   │
           │    │   └─────────────────────────────────────────────────┘   │
           │    │                            │                            │
           │    │                            ▼                            │
           │    │   ┌─────────────────────────────────────────────────┐   │
           │    │   │  4b. Deskew (Straighten)                        │   │
           │    │   │  Tool: OpenCV + deskew library                  │   │
           │    │   │  Method: Hough transform / projection           │   │
           │    │   └─────────────────────────────────────────────────┘   │
           │    │                            │                            │
           │    │                            ▼                            │
           │    │   ┌─────────────────────────────────────────────────┐   │
           │    │   │  4c. Denoise                                    │   │
           │    │   │  Tool: OpenCV fastNlMeansDenoising              │   │
           │    │   └─────────────────────────────────────────────────┘   │
           │    │                            │                            │
           │    │                            ▼                            │
           │    │  ┌───────────────────────────────────────────────────┐  │
           │    │  │  4d. Document-Specific Enhancement                │  │
           │    │  │                                                   │  │
           │    │  │  RECEIPTS:                MANUALS:                │  │
           │    │  │  • Adaptive threshold     • Contrast stretch      │  │
           │    │  │  • Perspective correct    • Sharpen               │  │
           │    │  │  • High contrast B&W      • Keep grayscale        │  │
           │    │  │                                                   │  │
           │    │  │  Tool: OpenCV adaptiveThreshold / CLAHE           │  │
           │    │  └───────────────────────────────────────────────────┘  │
           │    │                                                         │
           │    └─────────────────────────────────────────────────────────┘
           │                                 │
           │                                 ▼
           │    ┌─────────────────────────────────────────────────────────┐
           │    │                      5. OCR ENGINE                      │
           │    ├─────────────────────────────────────────────────────────┤
           │    │                                                         │
           │    │   ┌─────────────────────────────────────────────────┐   │
           │    │   │  5a. Primary OCR: Tesseract 5.x                 │   │
           │    │   │                                                 │   │
           │    │   │  • Engine: LSTM (--oem 1)                       │   │
           │    │   │  • Page segmentation: Auto (--psm 3)            │   │
           │    │   │  • Output: hOCR with word confidence            │   │
           │    │   └─────────────────────────────────────────────────┘   │
           │    │                            │                            │
           │    │                            ▼                            │
           │    │                    ┌───────────────┐                    │
           │    │                    │  Confidence   │                    │
           │    │                    │    > 80% ?    │                    │
           │    │                    └───────────────┘                    │
           │    │                        │       │                        │
           │    │                  YES ──┘       └── NO                   │
           │    │                   │                │                    │
           │    │                   │                ▼                    │
           │    │                   │ ┌─────────────────────────────────┐ │
           │    │                   │ │  5b. Fallback: PaddleOCR        │ │
           │    │                   │ │                                 │ │
           │    │                   │ │  • Better for degraded images   │ │
           │    │                   │ │  • Better table detection       │ │
           │    │                   │ │  • Slower but more accurate     │ │
           │    │                   │ └─────────────────────────────────┘ │
           │    │                   │                │                    │
           │    │                   ▼                ▼                    │
           │    │           ┌─────────────────────────────────┐           │
           │    │           │  5c. Result Merging             │           │
           │    │           │  • Merge by bounding box        │           │
           │    │           │  • Keep highest confidence      │           │
           │    │           └─────────────────────────────────┘           │
           │    │                                                         │
           │    └─────────────────────────────────────────────────────────┘
           │                                 │
           └─────────────────────────────────┤
                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                            6. STRUCTURED EXTRACTION                             │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│    ┌───────────────────────────────────────────────────────────────────────┐    │
│    │  6a. Layout Analysis                                                  │    │
│    │  Tool: PaddleOCR Layout / LayoutParser                                │    │
│    │  Detects: Headers, paragraphs, tables, lists, figures                 │    │
│    └───────────────────────────────────────────────────────────────────────┘    │
│                                        │                                        │
│                      ┌─────────────────┴─────────────────┐                      │
│                      ▼                                   ▼                      │
│     ┌────────────────────────────────┐  ┌────────────────────────────────┐      │
│     │  6b. Table Extraction          │  │  6c. Text Block Processing     │      │
│     │                                │  │                                │      │
│     │  Tool: img2table or Camelot    │  │  • Group by proximity          │      │
│     │  Output: Pandas DataFrame      │  │  • Identify sections           │      │
│     │                                │  │  • Extract hierarchies         │      │
│     └────────────────────────────────┘  └────────────────────────────────┘      │
│                      │                                   │                      │
│                      └─────────────────┬─────────────────┘                      │
│                                        ▼                                        │
│    ┌───────────────────────────────────────────────────────────────────────┐    │
│    │  6d. Maintenance Schedule Pattern Matching                            │    │
│    │                                                                       │    │
│    │  Tool: regex + spaCy NER                                              │    │
│    │                                                                       │    │
│    │  Patterns:                                                            │    │
│    │  • Mileage intervals: "every 5,000 miles", "30,000 km"                │    │
│    │  • Time intervals: "every 6 months", "annually"                       │    │
│    │  • Service types: "oil change", "tire rotation", "brake inspection"   │    │
│    │  • Fluid types: "5W-30", "ATF", "DOT 4"                               │    │
│    │                                                                       │    │
│    └───────────────────────────────────────────────────────────────────────┘    │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                 7. OUTPUT LAYER                                 │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│    ┌───────────────────────────────────────────────────────────────────────┐    │
│    │  Structured JSON Response                                             │    │
│    │                                                                       │    │
│    │  {                                                                    │    │
│    │    "document_type": "owners_manual" | "receipt",                      │    │
│    │    "vehicle": {                                                       │    │
│    │      "make": "Toyota",                                                │    │
│    │      "model": "Camry",                                                │    │
│    │      "year": 2024                                                     │    │
│    │    },                                                                 │    │
│    │    "maintenance_schedule": [                                          │    │
│    │      {                                                                │    │
│    │        "service": "Oil Change",                                       │    │
│    │        "interval_miles": 5000,                                        │    │
│    │        "interval_months": 6,                                          │    │
│    │        "details": "Use 0W-20 synthetic"                               │    │
│    │      }                                                                │    │
│    │    ],                                                                 │    │
│    │    "raw_text": "...",                                                 │    │
│    │    "tables": [...],                                                   │    │
│    │    "confidence_score": 0.94,                                          │    │
│    │    "processing_time_ms": 1250                                         │    │
│    │  }                                                                    │    │
│    │                                                                       │    │
│    └───────────────────────────────────────────────────────────────────────┘    │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
```
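
Stage 6d of the diagram can be illustrated with a small regex pass over OCR'd text lines. This is only a sketch under stated assumptions: the function name `parse_schedule_line`, the pattern names, and the fixed service list are hypothetical, the spaCy NER step is left out, and kilometre intervals are matched but not stored because the JSON response in the diagram only shows `interval_miles`.

```python
import re
from typing import Optional

# Hypothetical patterns for stage 6d; a real pipeline would likely need more variants.
DISTANCE_RE = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s*(miles|mi|km)\b", re.IGNORECASE)
MONTHS_RE = re.compile(r"every\s+(\d+)\s+months?\b", re.IGNORECASE)
SERVICES = ("oil change", "tire rotation", "brake inspection")


def parse_schedule_line(line: str) -> Optional[dict]:
    """Return one maintenance_schedule entry, or None if no known service is named."""
    lowered = line.lower()
    service = next((s for s in SERVICES if s in lowered), None)
    if service is None:
        return None

    entry = {"service": service.title(), "interval_miles": None, "interval_months": None}

    # Take the first distance expressed in miles; km values are skipped in this sketch.
    for match in DISTANCE_RE.finditer(line):
        if match.group(2).lower() in ("miles", "mi"):
            entry["interval_miles"] = int(match.group(1).replace(",", ""))
            break

    months = MONTHS_RE.search(line)
    if months:
        entry["interval_months"] = int(months.group(1))
    elif "annually" in lowered:
        entry["interval_months"] = 12
    return entry
```

For example, `parse_schedule_line("Oil change every 5,000 miles or every 6 months")` produces an entry with `interval_miles` set to 5000 and `interval_months` set to 6, the same shape as the `maintenance_schedule` items in the response.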

---

## COMPLETE TECH STACK

### Core Dependencies

| Component              | Tool                  | Version   | Purpose                              |
|------------------------|-----------------------|-----------|--------------------------------------|
| **Runtime**            | Python                | 3.11+     | Primary language                     |
| **API Framework**      | FastAPI               | 0.100+    | REST API with async support          |
| **Task Queue**         | Celery + Redis        | 5.3+      | Async processing for large docs      |

### File Handling

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Format Detection**   | python-magic          | MIME type detection from bytes       |
| **HEIC Support**       | pillow-heif           | iPhone HEIC image conversion         |
| **Image Processing**   | Pillow                | General image I/O and manipulation   |
| **PDF Processing**     | PyMuPDF (fitz)        | PDF text extraction & rendering      |
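
As a rough illustration of the format-detection step, the sketch below checks a MIME type against an allowlist and routes the upload to the PDF or image branch. The names `classify_upload` and `ACCEPTED_MIME_TYPES` are hypothetical, and the exact MIME string reported for HEIC files depends on the installed libmagic version, so both `image/heic` and `image/heif` are accepted here as an assumption. The detector is injectable so the logic can be exercised without libmagic installed.

```python
from typing import Callable, Optional

# Assumed allowlist for the accepted formats (PDF, JPG, PNG, HEIC, TIFF, WEBP).
ACCEPTED_MIME_TYPES = {
    "application/pdf",
    "image/jpeg",
    "image/png",
    "image/heic",
    "image/heif",
    "image/tiff",
    "image/webp",
}


def _detect_with_libmagic(data: bytes) -> str:
    # python-magic is imported lazily so this module loads even without libmagic.
    import magic

    return magic.from_buffer(data, mime=True)


def classify_upload(data: bytes, detect: Optional[Callable[[bytes], str]] = None) -> str:
    """Return "pdf" or "image" based on file bytes; raise ValueError otherwise."""
    detector = detect or _detect_with_libmagic
    mime = detector(data[:4096])  # the file header is enough for type sniffing
    if mime not in ACCEPTED_MIME_TYPES:
        raise ValueError(f"unsupported format: {mime}")
    return "pdf" if mime == "application/pdf" else "image"
```

Because the decision is based on the bytes rather than the filename, a renamed executable is rejected before any image or PDF library touches it.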

### Image Preprocessing

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Computer Vision**    | OpenCV (cv2)          | Deskew, denoise, threshold           |
| **Deskew**             | deskew                | Specialized document straightening   |
| **Enhancement**        | scikit-image          | Additional image filters             |

### OCR Engines

| Component              | Tool                     | Purpose                              |
|------------------------|--------------------------|--------------------------------------|
| **Primary OCR**        | Tesseract 5.x            | Fast, reliable text extraction       |
| **Python Binding**     | pytesseract              | Tesseract Python wrapper             |
| **Fallback OCR**       | PaddleOCR                | Higher accuracy, better tables       |
| **Layout Analysis**    | PaddleOCR / LayoutParser | Document structure detection         |
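
The hand-off from the primary to the fallback engine depends on a page-level confidence figure. One possible way to compute it is sketched below, assuming the word data comes from `pytesseract.image_to_data(image, config="--oem 1 --psm 3", output_type=pytesseract.Output.DICT)` rather than from hOCR output; that is a swapped-in format chosen because it already exposes per-word `conf` and `text` lists. The helper names and the use of a simple mean are assumptions, not the pipeline's confirmed method.

```python
def mean_word_confidence(data: dict) -> float:
    """Average confidence (0-100) over real words in Tesseract word-level data.

    Tesseract reports -1 for boxes that are not words, so those are skipped,
    along with entries whose text is empty or whitespace.
    """
    confidences = [
        float(conf)
        for conf, text in zip(data["conf"], data["text"])
        if str(text).strip() and float(conf) >= 0
    ]
    return sum(confidences) / len(confidences) if confidences else 0.0


def needs_fallback(data: dict, threshold: float = 80.0) -> bool:
    """True when the page should also be sent to the fallback engine."""
    return mean_word_confidence(data) <= threshold
```

A page with no recognised words scores 0.0 and is therefore always sent to the fallback engine, which seems the safer default for blank or badly degraded scans.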

### Data Extraction

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Table Extraction**   | img2table             | Image-based table extraction         |
| **PDF Tables**         | Camelot               | Native PDF table extraction          |
| **NLP**                | spaCy                 | Entity extraction, pattern matching  |
| **Data Handling**      | pandas                | Table/dataframe manipulation         |

### Infrastructure

| Component              | Tool                  | Purpose                              |
|------------------------|-----------------------|--------------------------------------|
| **Object Storage**     | S3 / MinIO            | Document and result storage          |
| **Database**           | PostgreSQL            | Metadata and results                 |
| **Cache**              | Redis                 | Result caching, queue backend        |
| **Containerization**   | Docker                | Deployment                           |

---

## SYSTEM REQUIREMENTS FILE

### requirements.txt

```
# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6

# Task Queue
celery>=5.3.0
redis>=4.6.0

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# PDF Processing
pymupdf>=1.23.0

# Image Preprocessing
opencv-python-headless>=4.8.0
deskew>=1.4.0
scikit-image>=0.21.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10
paddlepaddle>=2.5.0
paddleocr>=2.7.0

# Table Extraction
img2table>=1.2.0
camelot-py[cv]>=0.11.0

# NLP & Data
spacy>=3.6.0
pandas>=2.0.0

# Storage & Database
boto3>=1.28.0
psycopg2-binary>=2.9.0
sqlalchemy>=2.0.0
```

### System Package Requirements (Ubuntu/Debian)

```bash
# Tesseract OCR
apt-get install tesseract-ocr tesseract-ocr-eng libtesseract-dev

# HEIC Support
apt-get install libheif-examples libheif-dev

# OpenCV dependencies
# (libgl1 replaces libgl1-mesa-glx, which newer Debian/Ubuntu releases no longer ship)
apt-get install libgl1 libglib2.0-0

# PDF rendering dependencies
apt-get install libmupdf-dev mupdf-tools

# File type detection (libmagic, used by python-magic)
apt-get install libmagic1

# Camelot dependencies
apt-get install ghostscript python3-tk
```

---

## DOCKERFILE

```dockerfile
FROM python:3.11-slim

# System dependencies
# libgl1 is used instead of libgl1-mesa-glx, which is not available on
# the current Debian release that python:3.11-slim is based on.
RUN apt-get update && apt-get install -y --no-install-recommends \
    tesseract-ocr \
    tesseract-ocr-eng \
    libtesseract-dev \
    libheif-examples \
    libheif-dev \
    libgl1 \
    libglib2.0-0 \
    libmagic1 \
    ghostscript \
    poppler-utils \
    && rm -rf /var/lib/apt/lists/*

# Python dependencies
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download spaCy model
RUN python -m spacy download en_core_web_sm

# Download PaddleOCR models (cached in image)
RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"

COPY . .

EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

---

## PROCESSING TIME ESTIMATES

| Document Type            | Size          | Expected Time  | Notes                          |
|--------------------------|---------------|----------------|--------------------------------|
| Single receipt (HEIC)    | 2-5 MB        | 1-3 seconds    | After preprocessing            |
| Single receipt (JPG)     | 500 KB-2 MB   | 0.5-2 seconds  | Direct processing              |
| Owner's manual (PDF)     | 10-50 MB      | 30-120 seconds | 100-300 pages                  |
| Owner's manual (scanned) | 50-200 MB     | 2-5 minutes    | Requires full OCR              |

---

## SCALING CONSIDERATIONS

```
             ┌─────────────────────────────────────────┐
             │              Load Balancer              │
             │                 (nginx)                 │
             └─────────────────────────────────────────┘
                                  │
         ┌────────────────────────┼────────────────────────┐
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   API Server    │      │   API Server    │      │   API Server    │
│    (FastAPI)    │      │    (FastAPI)    │      │    (FastAPI)    │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
         └────────────────────────┼────────────────────────┘
                                  │
                                  ▼
             ┌─────────────────────────────────────────┐
             │               Redis Queue               │
             └─────────────────────────────────────────┘
                                  │
         ┌────────────────────────┼────────────────────────┐
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  Celery Worker  │      │  Celery Worker  │      │  Celery Worker  │
│   (OCR Heavy)   │      │   (OCR Heavy)   │      │   (OCR Heavy)   │
│  CPU Optimized  │      │  CPU Optimized  │      │  GPU Optional   │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```

### Scaling Strategy

1. **Small files (receipts)**: Process synchronously in the API server.
2. **Large files (manuals)**: Queue to Celery workers and return a job ID.
3. **Horizontal scaling**: Add more Celery workers to increase throughput.
4. **GPU acceleration**: PaddleOCR can run on a GPU, for an estimated 5-10x speedup.
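
The first two points of the strategy amount to a routing decision made at upload time. A minimal sketch follows, assuming a hypothetical `choose_route` helper and a 10 MiB cutoff; neither the name nor the threshold comes from the design itself, and the cutoff would need tuning against real latency measurements.

```python
# Assumed cutoff: receipts are listed at up to about 5 MB, manuals at 10 MB and above.
SYNC_MAX_BYTES = 10 * 1024 * 1024


def choose_route(size_bytes: int, document_type: str) -> str:
    """Return "sync" to process in the API server, or "queue" to hand off to Celery."""
    if document_type == "receipt" and size_bytes <= SYNC_MAX_BYTES:
        return "sync"
    # Manuals, unknown types, and oversized receipts go to the worker pool.
    return "queue"
```

Routing on both type and size keeps an unusually large receipt scan from blocking an API worker, while every manual gets a job ID regardless of its size.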

---

## ERROR HANDLING FLOW

```
┌─────────────────┐
│   File Upload   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌─────────────────────────────────────────┐
│  Format Valid?  │──NO─▶│  Return 400: Unsupported format         │
└────────┬────────┘      └─────────────────────────────────────────┘
         │ YES
         ▼
┌─────────────────┐      ┌─────────────────────────────────────────┐
│  File Corrupt?  │─YES─▶│  Return 422: Unable to process file     │
└────────┬────────┘      └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐      ┌─────────────────────────────────────────┐
│ OCR Confidence  │─LOW─▶│  Return 200 with warning flag           │
│     < 50%?      │      │  { "warning": "low_confidence" }        │
└────────┬────────┘      └─────────────────────────────────────────┘
         │ OK
         ▼
┌─────────────────┐      ┌─────────────────────────────────────────┐
│   No Schedule   │─YES─▶│  Return 200 with raw text only          │
│     Found?      │      │  { "maintenance_schedule": null }       │
└────────┬────────┘      └─────────────────────────────────────────┘
         │ NO
         ▼
┌─────────────────┐
│   Return Full   │
│ Structured Data │
└─────────────────┘
```
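
The two HTTP 200 branches of this flow can be sketched as a single response builder. The function name `build_response` is hypothetical, and the sketch assumes that a low-confidence result still carries whatever text and schedule were extracted, which the diagram does not state explicitly.

```python
from typing import Optional

LOW_CONFIDENCE = 0.50  # the "< 50%" threshold from the flow, on a 0-1 scale


def build_response(raw_text: str, confidence: float, schedule: Optional[list]) -> dict:
    """Assemble the body of a 200 response for a file that was readable."""
    body = {
        "raw_text": raw_text,
        "confidence_score": confidence,
        # An empty or missing schedule is reported as null, leaving only the raw text.
        "maintenance_schedule": schedule if schedule else None,
    }
    if confidence < LOW_CONFIDENCE:
        body["warning"] = "low_confidence"
    return body
```

Clients can then treat the presence of `warning` as a cue to ask the user for a better scan, without having to handle a separate error status.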

---

## QUICK START

```bash
# 1. Clone and setup
git clone <repo>
cd ocr-pipeline

# 2. Build Docker image
docker build -t ocr-pipeline .

# 3. Start services
docker-compose up -d

# 4. Test endpoint
curl -X POST "http://localhost:8000/api/v1/extract" \
  -F "file=@owners_manual.pdf" \
  -F "document_type=manual"
```

---

## API ENDPOINT DESIGN

```
POST /api/v1/extract
  - Accepts: multipart/form-data
  - Fields: file, document_type (optional)
  - Returns: Structured JSON or job_id for large files

GET /api/v1/jobs/{job_id}
  - Poll for async job status
  - Returns: status, progress, result when complete

POST /api/v1/extract/batch
  - Accepts: multiple files
  - Returns: array of job_ids
```
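
On the client side, the job endpoint implies a polling loop. The sketch below shows one way to write it, with the HTTP call passed in as `fetch_status` so the loop is independent of any particular HTTP library. The function name `wait_for_job` and the status values `"completed"` and `"failed"` are assumptions; only the `status`, `progress`, and `result` fields come from the endpoint description.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll GET /api/v1/jobs/{job_id} via fetch_status until the job finishes."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_status(job_id)
        if job["status"] == "completed":  # assumed terminal status value
            return job["result"]
        if job["status"] == "failed":  # assumed terminal status value
            raise RuntimeError(f"job {job_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s} seconds")
        sleep(interval_s)
```

In practice `fetch_status` would wrap a GET request to `/api/v1/jobs/{job_id}` and return the decoded JSON body. The default ten-minute timeout is chosen to sit above the 2-5 minute estimate for scanned manuals.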