feat: Owner's Manual OCR Pipeline #71

egullickson · 2026-02-01T18:48:26Z

Overview

Implement async PDF processing for owner's manuals, extracting maintenance schedule tables and creating maintenance_schedules entries.

Parent Issue: #12 (OCR-powered smart capture)
Priority: P3 - Owner's Manual OCR
Dependencies: OCR Service Container Setup, Core OCR API Integration

Scope

Async Processing Flow

1. User uploads PDF (10-200MB)
2. Backend creates async job, returns jobId
3. OCR service processes in background (Celery worker)
4. Frontend polls job status with progress
5. On completion, extracted schedules returned
6. User reviews and confirms schedule creation
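
The polling step in the flow above could be sketched as follows. This is a minimal illustration, not the planned client: `poll_job`, the injected `fetch_status` callable, the `"failed"` status name, and the interval and timeout defaults are all assumptions (the issue only shows `pending` and `completed`).

```python
import time
from typing import Any, Callable, Dict

# "failed" is an assumed terminal status; the issue only shows "pending" and "completed".
TERMINAL_STATES = {"completed", "failed"}


def poll_job(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(job_id) until the job is terminal or the timeout elapses."""
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not finished after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any HTTP library and easy to exercise in tests; in practice `fetch_status` would wrap `GET /jobs/{jobId}`.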

Manual Extraction Endpoint

POST /extract/manual
Content-Type: multipart/form-data

Request:
  - file: PDF document
  - vehicle_id: UUID (for context)

Response (immediate):
{
  "jobId": "abc123",
  "status": "pending",
  "estimatedSeconds": 120
}

GET /jobs/{jobId}
Response (completed):
{
  "jobId": "abc123",
  "status": "completed",
  "progress": 100,
  "result": {
    "success": true,
    "vehicleInfo": {
      "make": "Honda",
      "model": "Civic",
      "year": 2024
    },
    "maintenanceSchedules": [
      {
        "service": "Engine Oil",
        "intervalMiles": 5000,
        "intervalMonths": 6,
        "details": "Use 0W-20 synthetic",
        "confidence": 0.92
      },
      {
        "service": "Engine Air Filter",
        "intervalMiles": 30000,
        "intervalMonths": 24,
        "details": "Replace element",
        "confidence": 0.88
      }
    ],
    "rawTables": [...],
    "processingTimeMs": 95000
  }
}
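
To make the response shape concrete, here is a hedged sketch of how a consumer might parse the completed job into typed records. The `ExtractedSchedule` class and `parse_schedules` function are hypothetical names; only the JSON field names come from the example above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ExtractedSchedule:
    service: str
    interval_miles: Optional[int]
    interval_months: Optional[int]
    details: str
    confidence: float


def parse_schedules(job: Dict[str, Any]) -> List[ExtractedSchedule]:
    """Return the schedules of a completed, successful job; otherwise an empty list."""
    if job.get("status") != "completed":
        return []
    result = job.get("result") or {}
    if not result.get("success"):
        return []
    return [
        ExtractedSchedule(
            service=item["service"],
            interval_miles=item.get("intervalMiles"),
            interval_months=item.get("intervalMonths"),
            details=item.get("details", ""),
            confidence=float(item.get("confidence", 0.0)),
        )
        for item in result.get("maintenanceSchedules", [])
    ]
```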

PDF Processing Pipeline

PDF Upload
    ↓
Check for text layer (PyMuPDF)
    ↓
├── Has text: Extract directly
└── Scanned: Render to images @ 300 DPI
    ↓
Table detection (img2table / PaddleOCR Layout)
    ↓
OCR on table regions
    ↓
Pattern matching for maintenance data
    ↓
Structured schedule extraction
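
The first branch of the pipeline (text layer vs. scanned) might be decided roughly like this. The sketch assumes the per-page strings have already been extracted (for example with PyMuPDF's `page.get_text()`); the function names and both thresholds are illustrative assumptions, not values from this issue.

```python
from typing import List

# Assumed thresholds: a page counts as "text" with at least 50 non-blank
# characters, and the PDF counts as native if 80% of its pages do.
MIN_CHARS_PER_PAGE = 50
MIN_TEXT_PAGE_RATIO = 0.8


def has_text_layer(page_texts: List[str]) -> bool:
    """Heuristic: True if most pages carry a usable amount of extracted text."""
    if not page_texts:
        return False
    text_pages = sum(1 for t in page_texts if len(t.strip()) >= MIN_CHARS_PER_PAGE)
    return text_pages / len(page_texts) >= MIN_TEXT_PAGE_RATIO


def choose_route(page_texts: List[str]) -> str:
    """Pick the pipeline branch: direct extraction or rendering for OCR."""
    return "extract_text" if has_text_layer(page_texts) else "render_300dpi"
```

A ratio rather than an all-or-nothing check is used because manuals with a text layer often still contain a few image-only pages such as covers and diagrams.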

Pattern Matching

Mileage Intervals:

MILEAGE_PATTERNS = [
    # \b keeps "mi" from matching inside words such as "minutes"
    r'every\s+([\d,]+)\s*(?:miles?|mi)\b',
    r'([\d,]+)\s*(?:miles?|mi)\b\s*(?:or|/)',
    r'at\s+([\d,]+)\s*(?:miles?|mi)\b',
]
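
As an illustration of how the mileage patterns might be applied, the sketch below tries each pattern in order and converts the captured group to an integer. The function name `extract_interval_miles` is hypothetical, and the patterns here carry a trailing `\b` so that `mi` does not match inside words such as "minutes".

```python
import re
from typing import Optional

MILEAGE_PATTERNS = [
    r'every\s+([\d,]+)\s*(?:miles?|mi)\b',
    r'([\d,]+)\s*(?:miles?|mi)\b\s*(?:or|/)',
    r'at\s+([\d,]+)\s*(?:miles?|mi)\b',
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in MILEAGE_PATTERNS]


def extract_interval_miles(text: str) -> Optional[int]:
    """Return the first mileage interval found in text, or None."""
    for pattern in _COMPILED:
        match = pattern.search(text)
        if match:
            digits = match.group(1).replace(",", "")
            if digits:  # guard against a capture that was only commas
                return int(digits)
    return None
```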

Time Intervals:

TIME_PATTERNS = [
    r'every\s+(\d+)\s*months?',
    r'(\d+)\s*months?\s*(?:or|/)',
    r'semi-?annual',  # → 6 months; checked before 'annually', which also matches "semi-annually"
    r'annually',  # → 12 months
]
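
Because two of the time patterns have no capture group, they need a fixed month value attached. One possible way to handle that, with the hypothetical name `extract_interval_months`, is to keep numeric and keyword patterns in separate lists and to test "semi-annual" before "annually" so that "semi-annually" is not read as 12 months.

```python
import re
from typing import Optional

NUMERIC_PATTERNS = [
    re.compile(r'every\s+(\d+)\s*months?', re.IGNORECASE),
    re.compile(r'(\d+)\s*months?\s*(?:or|/)', re.IGNORECASE),
]

# Order matters: "semi-annually" contains "annually".
KEYWORD_PATTERNS = [
    (re.compile(r'semi-?annual', re.IGNORECASE), 6),
    (re.compile(r'annually', re.IGNORECASE), 12),
]


def extract_interval_months(text: str) -> Optional[int]:
    """Return the interval in months found in text, or None."""
    for pattern in NUMERIC_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    for pattern, months in KEYWORD_PATTERNS:
        if pattern.search(text):
            return months
    return None
```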

Service Types (map to maintenance subtypes):

SERVICE_MAPPING = {
    'oil change': ['Engine Oil', 'Oil Filter'],
    'engine oil': ['Engine Oil'],
    'oil filter': ['Oil Filter'],
    'air filter': ['Engine Air Filter'],
    'cabin filter': ['Cabin Air Filter'],
    'tire rotation': ['Tire Rotation'],
    'brake inspection': ['Brake Inspection'],
    'coolant': ['Engine Coolant'],
    'transmission fluid': ['Transmission Fluid'],
    'spark plug': ['Spark Plugs'],
    # ... etc
}
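
A lookup over this mapping could be sketched as below. It is a simple substring match over normalized text and only one of several possible designs; `map_service_to_subtypes` is a hypothetical Python counterpart to the frontend helper, and the mapping is limited to the entries listed above.

```python
from typing import Dict, List

SERVICE_MAPPING: Dict[str, List[str]] = {
    'oil change': ['Engine Oil', 'Oil Filter'],
    'engine oil': ['Engine Oil'],
    'oil filter': ['Oil Filter'],
    'air filter': ['Engine Air Filter'],
    'cabin filter': ['Cabin Air Filter'],
    'tire rotation': ['Tire Rotation'],
    'brake inspection': ['Brake Inspection'],
    'coolant': ['Engine Coolant'],
    'transmission fluid': ['Transmission Fluid'],
    'spark plug': ['Spark Plugs'],
}


def map_service_to_subtypes(raw: str) -> List[str]:
    """Collect the subtypes of every mapping key found in the service text."""
    text = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    subtypes: List[str] = []
    for key, mapped in SERVICE_MAPPING.items():
        if key in text:
            for subtype in mapped:
                if subtype not in subtypes:
                    subtypes.append(subtype)
    return subtypes
```

Plain substring matching has a known weakness with these keys: "cabin air filter" contains 'air filter' but not 'cabin filter', so it would map to Engine Air Filter. A real implementation would likely need extra keys or token-based matching.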

Fluid Specifications:

FLUID_PATTERNS = [
    r'(\d+W-\d+)',           # Oil viscosity: 0W-20, 5W-30
    r'(ATF[- ]?\w+)',        # Transmission: ATF-Z1
    r'(DOT\s*\d)',           # Brake fluid: DOT 4
]
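
For the `details` field, the fluid patterns could be applied along these lines. The function name `extract_fluid_specs` is an assumption; the patterns are used exactly as listed, which means matching is case-sensitive.

```python
import re
from typing import List

FLUID_PATTERNS = [
    r'(\d+W-\d+)',     # Oil viscosity: 0W-20, 5W-30
    r'(ATF[- ]?\w+)',  # Transmission: ATF-Z1
    r'(DOT\s*\d)',     # Brake fluid: DOT 4
]


def extract_fluid_specs(text: str) -> List[str]:
    """Return each distinct fluid specification found, grouped by pattern order."""
    specs: List[str] = []
    for pattern in FLUID_PATTERNS:
        for match in re.finditer(pattern, text):
            spec = match.group(1)
            if spec not in specs:
                specs.append(spec)
    return specs
```

Note that the ATF pattern as written also accepts a space followed by any word, so a phrase like "ATF fluid" would be captured whole; tightening it is left open here.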

Table Extraction

Use img2table for image-based tables
Use Camelot for native PDF tables
Identify maintenance schedule tables by:
- Headers containing "mileage", "interval", "service"
- Row structure with numeric intervals
- Multi-column format typical of schedules
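
The identification criteria above might translate into a heuristic like the following. It assumes each detected table has already been converted to rows of cell strings with the header in the first row; the function name and both thresholds are illustrative assumptions.

```python
import re
from typing import Sequence

HEADER_KEYWORDS = ("mileage", "interval", "service")


def looks_like_schedule_table(
    rows: Sequence[Sequence[str]],
    min_keyword_hits: int = 2,
    min_numeric_row_ratio: float = 0.5,
) -> bool:
    """Heuristic check: keyword headers, multiple columns, numeric body rows."""
    if len(rows) < 2 or len(rows[0]) < 2:
        return False  # needs a header plus data, and more than one column
    header = " ".join(cell.lower() for cell in rows[0])
    hits = sum(1 for keyword in HEADER_KEYWORDS if keyword in header)
    body = rows[1:]
    numeric_rows = sum(
        1 for row in body if any(re.search(r'\d', cell) for cell in row)
    )
    return (
        hits >= min_keyword_hits
        and numeric_rows / len(body) >= min_numeric_row_ratio
    )
```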

Directory Structure

ocr/app/
├── extractors/
│   └── manual_extractor.py    # Manual-specific logic
├── preprocessors/
│   └── pdf_preprocessor.py    # PDF handling
├── patterns/
│   ├── maintenance_patterns.py
│   └── service_mapping.py     # Service name normalization
├── table_extraction/
│   ├── __init__.py
│   ├── detector.py            # Table detection
│   └── parser.py              # Table content parsing
└── workers/
    └── manual_worker.py       # Celery task

Integration with Maintenance Feature

// After user confirms extracted schedules:
for (const schedule of extractedSchedules) {
  await maintenanceApi.createSchedule({
    vehicleId,
    category: 'routine_maintenance',
    subtypes: mapServiceToSubtypes(schedule.service),
    intervalMonths: schedule.intervalMonths,
    intervalMiles: schedule.intervalMiles,
    scheduleType: 'interval',
    emailNotifications: false,
  });
}

Acceptance Criteria

PDF upload creates async job
Job polling returns progress updates
Native PDF text extraction works
Scanned PDF OCR works (slower)
Table detection identifies maintenance schedules
Mileage intervals extracted correctly
Time intervals extracted correctly
Service types mapped to maintenance subtypes
Fluid specifications captured in details
Confidence scoring for each schedule
Processing completes within 5 minutes for typical manuals
Large file handling (up to 200MB)
Error handling for corrupt/invalid PDFs
Progress feedback during processing

Technical Notes

Celery worker with dedicated queue for CPU-intensive tasks
Store intermediate results in Redis for progress tracking
Consider chunking large PDFs (process in batches of pages)
Cache extracted schedules by document hash
Log table detection results for pattern improvement
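
Three of these notes (document-hash caching, page batching, and progress tracking) can be illustrated with small stdlib-only helpers. Everything here is a sketch under assumptions: SHA-256 as the hash, a batch size of 25 pages, and the function names are not specified anywhere in this issue.

```python
import hashlib
from typing import BinaryIO, Iterator


def document_hash(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a 200MB PDF is never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def page_batches(page_count: int, batch_size: int = 25) -> Iterator[range]:
    """Yield consecutive ranges of zero-based page indices."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def progress_percent(pages_done: int, page_count: int) -> int:
    """Integer progress value suitable for the job's progress field."""
    if page_count <= 0:
        return 0
    return min(100, pages_done * 100 // page_count)
```

The hex digest could serve as the cache key for extracted schedules, and the worker could write `progress_percent` to Redis after each batch.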

Out of Scope

Frontend upload UI (enhance existing documents feature)
Real-time processing (always async for manuals)
Non-English manual support (English only for MVP)
Insurance/registration document extraction (Phase 2)


egullickson added labels · 2026-02-01 18:48:39 +00:00
egullickson referenced this issue · 2026-02-01 18:49:00 +00:00 · feat: OCR-powered smart capture for VIN, receipts, and owner's manuals #12
egullickson added status: in-progress