feat: Owner's Manual OCR Pipeline #71

Closed
opened 2026-02-01 18:48:26 +00:00 by egullickson · 0 comments
Owner

Overview

Implement async PDF processing for owner's manuals, extracting maintenance schedule tables and creating maintenance_schedules entries.

Parent Issue: #12 (OCR-powered smart capture)
Priority: P3 - Owner's Manual OCR
Dependencies: OCR Service Container Setup, Core OCR API Integration

Scope

Async Processing Flow

1. User uploads PDF (10-200MB)
2. Backend creates async job, returns jobId
3. OCR service processes in background (Celery worker)
4. Frontend polls job status with progress
5. On completion, extracted schedules returned
6. User reviews and confirms schedule creation
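
The flow above implies a small job lifecycle. A minimal in-memory sketch follows, assuming two statuses beyond the "pending" and "completed" shown in this issue ("processing" and "failed"). `Job` and `JobStore` are hypothetical names; a real deployment would presumably back this with Redis rather than a dict.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Optional

# "pending" and "completed" come from the issue; "processing" and "failed"
# are assumed intermediate/terminal states.
VALID_TRANSITIONS = {
    "pending": {"processing", "failed"},
    "processing": {"processing", "completed", "failed"},
    "completed": set(),
    "failed": set(),
}


@dataclass
class Job:
    job_id: str
    status: str = "pending"
    progress: int = 0
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None


class JobStore:
    """Hypothetical in-memory stand-in for a Redis-backed job store."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create(self) -> Job:
        job = Job(job_id=uuid.uuid4().hex)
        self._jobs[job.job_id] = job
        return job

    def get(self, job_id: str) -> Job:
        return self._jobs[job_id]

    def update(self, job_id: str, status: str, progress: Optional[int] = None,
               result: Optional[Dict[str, Any]] = None,
               error: Optional[str] = None) -> Job:
        job = self._jobs[job_id]
        if status not in VALID_TRANSITIONS[job.status]:
            raise ValueError(f"illegal transition {job.status} -> {status}")
        job.status = status
        if progress is not None:
            # Progress never moves backwards and is clamped to at most 100.
            job.progress = max(job.progress, min(100, progress))
        if status == "completed":
            job.progress = 100
            job.result = result
        if status == "failed":
            job.error = error
        return job
```

Rejecting transitions out of a terminal state keeps a late progress report from a worker from reopening a finished job.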

Manual Extraction Endpoint

POST /extract/manual
Content-Type: multipart/form-data

Request:
  - file: PDF document
  - vehicle_id: UUID (for context)

Response (immediate):
{
  "jobId": "abc123",
  "status": "pending",
  "estimatedSeconds": 120
}

GET /jobs/{jobId}
Response (completed):
{
  "jobId": "abc123",
  "status": "completed",
  "progress": 100,
  "result": {
    "success": true,
    "vehicleInfo": {
      "make": "Honda",
      "model": "Civic",
      "year": 2024
    },
    "maintenanceSchedules": [
      {
        "service": "Engine Oil",
        "intervalMiles": 5000,
        "intervalMonths": 6,
        "details": "Use 0W-20 synthetic",
        "confidence": 0.92
      },
      {
        "service": "Engine Air Filter",
        "intervalMiles": 30000,
        "intervalMonths": 24,
        "details": "Replace element",
        "confidence": 0.88
      }
    ],
    "rawTables": [...],
    "processingTimeMs": 95000
  }
}
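
A client for the two endpoints above could poll roughly as follows. This is a sketch only: `poll_job`, `accepted_schedules`, the "failed" status, and the `min_confidence` filter are assumptions, and the HTTP call is injected as `fetch_status` so the loop does not depend on any particular HTTP library. The 300-second default mirrors the 5-minute target in the acceptance criteria.

```python
import time
from typing import Any, Callable, Dict, List

TERMINAL_STATUSES = {"completed", "failed"}  # "failed" is an assumed status


def poll_job(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(job_id) until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {payload.get('status')!r}")
        sleep(interval_s)


def accepted_schedules(payload: Dict[str, Any],
                       min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Return extracted schedules at or above a confidence threshold."""
    result = payload.get("result") or {}
    return [
        schedule
        for schedule in result.get("maintenanceSchedules", [])
        if schedule.get("confidence", 0.0) >= min_confidence
    ]
```

With the sample response above, `accepted_schedules(payload, 0.9)` would keep only the "Engine Oil" entry (confidence 0.92).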

PDF Processing Pipeline

PDF Upload
    ↓
Check for text layer (PyMuPDF)
    ↓
├── Has text: Extract directly
└── Scanned: Render to images @ 300 DPI
    ↓
Table detection (img2table / PaddleOCR Layout)
    ↓
OCR on table regions
    ↓
Pattern matching for maintenance data
    ↓
Structured schedule extraction
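
The text-layer branch in the pipeline above can be decided per page rather than per document, since some manuals mix native and scanned pages. A minimal sketch, assuming the per-page text has already been extracted (with PyMuPDF this would typically be `page.get_text()` for each page) and assuming a 50-character threshold that would need tuning against real manuals. `classify_pages` is a hypothetical helper.

```python
from typing import Dict, List

MIN_TEXT_CHARS = 50  # assumed threshold for "has a usable text layer"


def classify_pages(page_texts: List[str],
                   min_chars: int = MIN_TEXT_CHARS) -> Dict[str, List[int]]:
    """Split page indexes into native-text pages and pages that need OCR."""
    routes: Dict[str, List[int]] = {"native": [], "scanned": []}
    for index, text in enumerate(page_texts):
        # Ignore whitespace so pages holding only stray newlines count as empty.
        visible = "".join(text.split())
        key = "native" if len(visible) >= min_chars else "scanned"
        routes[key].append(index)
    return routes
```

Pages routed to "scanned" would then be rendered at 300 DPI and passed to table detection, while "native" pages skip OCR entirely.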

Pattern Matching

Mileage Intervals:

MILEAGE_PATTERNS = [
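    # Compile with re.IGNORECASE; \b keeps "mi" from matching inside "minutes"
    # and "at" from matching inside words such as "repeat".
    r'every\s+([\d,]+)\s*(?:miles?|mi)\b',
    r'([\d,]+)\s*(?:miles?|mi)\s*(?:or|/)',
    r'\bat\s+([\d,]+)\s*(?:miles?|mi)\b',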
]
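
As a rough illustration of how the mileage patterns might be applied, the sketch below returns the first interval found in a cell or sentence. The patterns are repeated so the block is self-contained; the word boundaries, the case-insensitive compilation, and the helper name `extract_interval_miles` are assumptions.

```python
import re
from typing import Optional

MILEAGE_PATTERNS = [
    r'every\s+([\d,]+)\s*(?:miles?|mi)\b',
    r'([\d,]+)\s*(?:miles?|mi)\s*(?:or|/)',
    r'\bat\s+([\d,]+)\s*(?:miles?|mi)\b',
]
_COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in MILEAGE_PATTERNS]


def extract_interval_miles(text: str) -> Optional[int]:
    """Return the first mileage interval found in text, or None."""
    for pattern in _COMPILED:
        match = pattern.search(text)
        if match:
            digits = match.group(1).replace(",", "")
            # [\d,]+ can match a bare comma, so confirm digits remain.
            if digits.isdigit():
                return int(digits)
    return None
```

For example, `extract_interval_miles("Replace every 5,000 miles or 6 months")` returns `5000`, while "every 30 minutes" yields `None` because of the word boundary.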

Time Intervals:

TIME_PATTERNS = [
    r'every\s+(\d+)\s*months?',
    r'(\d+)\s*months?\s*(?:or|/)',
    r'semi-?annual',  # → 6 months
    r'(?<!semi)(?<!semi-)annually',  # → 12 months; must not fire on "semi-annually"
]
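
The keyword patterns carry no capture group, so they need a fixed month value attached. A minimal sketch of one way to apply them, checking "semi-annual" before "annually" so that "semi-annually" is not read as 12 months. `extract_interval_months` and the case-insensitive compilation are assumptions.

```python
import re
from typing import Optional

_NUMERIC_PATTERNS = [
    re.compile(r'every\s+(\d+)\s*months?', re.IGNORECASE),
    re.compile(r'(\d+)\s*months?\s*(?:or|/)', re.IGNORECASE),
]
# Order matters: "semi-annually" contains "annually".
_KEYWORD_PATTERNS = [
    (re.compile(r'semi-?annual', re.IGNORECASE), 6),
    (re.compile(r'annually', re.IGNORECASE), 12),
]


def extract_interval_months(text: str) -> Optional[int]:
    """Return the first time interval in months found in text, or None."""
    for pattern in _NUMERIC_PATTERNS:
        match = pattern.search(text)
        if match:
            return int(match.group(1))
    for pattern, months in _KEYWORD_PATTERNS:
        if pattern.search(text):
            return months
    return None
```

Explicit numeric intervals win over keywords here, which is a design choice rather than something this issue specifies.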

Service Types (map to maintenance subtypes):

SERVICE_MAPPING = {
    'oil change': ['Engine Oil', 'Oil Filter'],
    'engine oil': ['Engine Oil'],
    'oil filter': ['Oil Filter'],
    'air filter': ['Engine Air Filter'],
    'cabin filter': ['Cabin Air Filter'],
    'tire rotation': ['Tire Rotation'],
    'brake inspection': ['Brake Inspection'],
    'coolant': ['Engine Coolant'],
    'transmission fluid': ['Transmission Fluid'],
    'spark plug': ['Spark Plugs'],
    # ... etc
}
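
One way to apply the mapping is substring matching with the longest keys tried first, removing each matched phrase so shorter keys cannot match the same words again. The sketch below assumes that approach; `map_service_to_subtypes` is a hypothetical Python counterpart to the `mapServiceToSubtypes` call used in the integration snippet, and the mapping is repeated without the elided entries.

```python
from typing import Dict, List

SERVICE_MAPPING: Dict[str, List[str]] = {
    'oil change': ['Engine Oil', 'Oil Filter'],
    'engine oil': ['Engine Oil'],
    'oil filter': ['Oil Filter'],
    'air filter': ['Engine Air Filter'],
    'cabin filter': ['Cabin Air Filter'],
    'tire rotation': ['Tire Rotation'],
    'brake inspection': ['Brake Inspection'],
    'coolant': ['Engine Coolant'],
    'transmission fluid': ['Transmission Fluid'],
    'spark plug': ['Spark Plugs'],
}


def map_service_to_subtypes(raw_service: str) -> List[str]:
    """Map free-text service names to maintenance subtypes, without duplicates."""
    normalized = " ".join(raw_service.lower().split())
    subtypes: List[str] = []
    # Longest keys first so more specific phrases are consumed before shorter ones.
    for key in sorted(SERVICE_MAPPING, key=len, reverse=True):
        if key in normalized:
            for subtype in SERVICE_MAPPING[key]:
                if subtype not in subtypes:
                    subtypes.append(subtype)
            normalized = normalized.replace(key, " ")
    return subtypes
```

Note that with the keys as listed, the phrase "cabin air filter" matches 'air filter' rather than 'cabin filter', so it would resolve to 'Engine Air Filter'. An explicit key for that phrase is probably needed.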

Fluid Specifications:

FLUID_PATTERNS = [
    r'(\d+W-\d+)',           # Oil viscosity: 0W-20, 5W-30
    r'(ATF[- ]?\w+)',        # Transmission: ATF-Z1
    r'(DOT\s*\d(?:\.\d)?)',  # Brake fluid: DOT 4, DOT 5.1 (not truncated to DOT 5)
]
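
A sketch of collecting fluid specifications for the schedule's details field. The patterns are repeated so the block runs on its own; the optional decimal on the DOT pattern (so "DOT 5.1" is not cut to "DOT 5") and the helper name `extract_fluid_specs` are assumptions. Matching is left case-sensitive because the specifications are conventionally written in upper case.

```python
import re
from typing import List

FLUID_PATTERNS = [
    r'(\d+W-\d+)',            # Oil viscosity: 0W-20, 5W-30
    r'(ATF[- ]?\w+)',         # Transmission: ATF-Z1
    r'(DOT\s*\d(?:\.\d)?)',   # Brake fluid: DOT 4, DOT 5.1
]


def extract_fluid_specs(text: str) -> List[str]:
    """Return unique fluid specifications in pattern order."""
    specs: List[str] = []
    for pattern in FLUID_PATTERNS:
        for match in re.finditer(pattern, text):
            # Collapse internal whitespace so "DOT  4" and "DOT 4" compare equal.
            spec = " ".join(match.group(1).split())
            if spec not in specs:
                specs.append(spec)
    return specs
```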

Table Extraction

  • Use img2table for image-based tables
  • Use Camelot for native PDF tables
  • Identify maintenance schedule tables by:
    • Headers containing "mileage", "interval", "service"
    • Row structure with numeric intervals
    • Multi-column format typical of schedules
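
The three identification heuristics above could be combined into a simple filter over already-extracted tables. This is a sketch under assumptions: tables arrive as rows of cell strings with the header in the first row, and the `min_columns` and `min_numeric_ratio` thresholds are placeholders to tune against logged detection results. `looks_like_schedule_table` is a hypothetical helper.

```python
import re
from typing import List

HEADER_KEYWORDS = ("mileage", "interval", "service")
_NUMERIC_CELL = re.compile(r'\d[\d,]*')


def looks_like_schedule_table(rows: List[List[str]],
                              min_columns: int = 2,
                              min_numeric_ratio: float = 0.5) -> bool:
    """Heuristically decide whether a table is a maintenance schedule."""
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    # Multi-column format typical of schedules.
    if len(header) < min_columns:
        return False
    # Headers containing "mileage", "interval", or "service".
    header_text = " ".join(header).lower()
    if not any(keyword in header_text for keyword in HEADER_KEYWORDS):
        return False
    # Row structure with numeric intervals in most body rows.
    numeric_rows = sum(
        1 for row in body if any(_NUMERIC_CELL.search(cell) for cell in row)
    )
    return numeric_rows / len(body) >= min_numeric_ratio
```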

Directory Structure

ocr/app/
├── extractors/
│   └── manual_extractor.py    # Manual-specific logic
├── preprocessors/
│   └── pdf_preprocessor.py    # PDF handling
├── patterns/
│   ├── maintenance_patterns.py
│   └── service_mapping.py     # Service name normalization
├── table_extraction/
│   ├── __init__.py
│   ├── detector.py            # Table detection
│   └── parser.py              # Table content parsing
└── workers/
    └── manual_worker.py       # Celery task

Integration with Maintenance Feature

// After user confirms extracted schedules:
for (const schedule of extractedSchedules) {
  await maintenanceApi.createSchedule({
    vehicleId,
    category: 'routine_maintenance',
    subtypes: mapServiceToSubtypes(schedule.service),
    intervalMonths: schedule.intervalMonths,
    intervalMiles: schedule.intervalMiles,
    scheduleType: 'interval',
    emailNotifications: false,
  });
}

Acceptance Criteria

  • PDF upload creates async job
  • Job polling returns progress updates
  • Native PDF text extraction works
  • Scanned PDF OCR works (slower)
  • Table detection identifies maintenance schedules
  • Mileage intervals extracted correctly
  • Time intervals extracted correctly
  • Service types mapped to maintenance subtypes
  • Fluid specifications captured in details
  • Confidence scoring for each schedule
  • Processing completes within 5 minutes for typical manuals
  • Large file handling (up to 200MB)
  • Error handling for corrupt/invalid PDFs
  • Progress feedback during processing
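
For the size and invalid-PDF criteria, a cheap pre-check before queuing the job might look like the sketch below. It only inspects the upload size and the `%PDF-` header marker, so it catches obviously wrong files, not every corrupt PDF; deeper failures would still need handling in the worker. `validate_pdf_upload`, `InvalidPdfError`, and reading "200MB" as 200 MiB are assumptions.

```python
MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # "200MB" interpreted as 200 MiB (assumption)
PDF_MAGIC = b"%PDF-"


class InvalidPdfError(ValueError):
    """Raised when an upload fails the cheap pre-checks."""


def validate_pdf_upload(head: bytes, size_bytes: int) -> None:
    """Validate an upload from its first bytes and its total size."""
    if size_bytes <= 0:
        raise InvalidPdfError("empty upload")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise InvalidPdfError(f"file exceeds {MAX_UPLOAD_BYTES} bytes")
    # Some producers emit bytes before the header, so search the first
    # 1024 bytes instead of requiring the marker at offset 0.
    if PDF_MAGIC not in head[:1024]:
        raise InvalidPdfError("missing %PDF- header")
```

Failing here lets the API reject the request immediately instead of creating a job that is bound to fail.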

Technical Notes

  • Celery worker with dedicated queue for CPU-intensive tasks
  • Store intermediate results in Redis for progress tracking
  • Consider chunking large PDFs (process in batches of pages)
  • Cache extracted schedules by document hash
  • Log table detection results for pattern improvement
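
Two of the notes above, caching by document hash and processing pages in batches, can be sketched with the standard library alone. SHA-256 as the hash, the batch size of 25, and the helper names are assumptions, not decisions recorded in this issue.

```python
import hashlib
from typing import BinaryIO, Iterator


def document_hash(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a 200MB upload is never held in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def page_batches(page_count: int, batch_size: int = 25) -> Iterator[range]:
    """Yield consecutive page-index ranges of at most batch_size pages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def batch_progress(done_batches: int, total_batches: int) -> int:
    """Convert completed batches into a 0-100 progress value."""
    if total_batches <= 0:
        return 100
    return min(100, (done_batches * 100) // total_batches)
```

The hash can serve as the cache key for extracted schedules, and reporting `batch_progress` after each batch gives the polling endpoint something meaningful to return.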

Out of Scope

  • Frontend upload UI (enhance existing documents feature)
  • Real-time processing (always async for manuals)
  • Non-English manual support (English only for MVP)
  • Insurance/registration document extraction (Phase 2)
egullickson added the status/backlog and type/feature labels 2026-02-01 18:48:39 +00:00
egullickson added status/in-progress and removed status/backlog labels 2026-02-02 03:20:38 +00:00
egullickson added status/review and removed status/in-progress labels 2026-02-02 03:30:44 +00:00
Reference: egullickson/motovaultpro#71