feat: Owner's Manual OCR Pipeline (#71) #79

egullickson · 2026-02-02T03:30:37Z

Summary

- Implement async PDF processing pipeline for owner's manuals
- Add maintenance pattern matching for mileage/time intervals and fluid specs
- Add service name mapping to maintenance subtypes
- Add table detection and parsing for schedule extraction
- Add `POST /extract/manual` endpoint returning `job_id` for async processing
- Add Redis job queue support for large PDF processing with progress tracking
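
As a rough illustration of the interval pattern matching mentioned above, here is a minimal sketch using only the standard library. The regular expressions and the function names `parse_mileage` and `parse_months` are assumptions made for this example; the actual patterns in `maintenance_patterns.py` may look quite different.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not taken from the PR.
# Matches "7,500 miles", "5000 mi", optionally preceded by "every".
MILEAGE_RE = re.compile(
    r"(?:every\s+)?(\d{1,3}(?:,\d{3})+|\d+)\s*(?:miles?|mi)\b",
    re.IGNORECASE,
)
# Matches "6 months", "2 years", optionally preceded by "every".
TIME_RE = re.compile(r"(?:every\s+)?(\d+)\s*(months?|years?)\b", re.IGNORECASE)


def parse_mileage(text: str) -> Optional[int]:
    """Return the first mileage interval in text as an integer, or None."""
    match = MILEAGE_RE.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return int(match.group(1).replace(",", ""))


def parse_months(text: str) -> Optional[int]:
    """Return the first time interval in text, normalized to months, or None."""
    match = TIME_RE.search(text)
    if match is None:
        return None
    value = int(match.group(1))
    unit = match.group(2).lower()
    return value * 12 if unit.startswith("year") else value
```

Normalizing time intervals to months keeps "2 years" and "24 months" comparable downstream; whether the real pipeline does this is not stated in the PR.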

Files Changed

New Files

- `ocr/app/patterns/maintenance_patterns.py` - Mileage, time, fluid spec patterns
- `ocr/app/patterns/service_mapping.py` - Service name to subtype mapping
- `ocr/app/preprocessors/pdf_preprocessor.py` - PDF text/image extraction
- `ocr/app/table_extraction/detector.py` - Table detection in images/text
- `ocr/app/table_extraction/parser.py` - Table content parsing
- `ocr/app/extractors/manual_extractor.py` - Main extraction orchestrator
- `ocr/tests/test_maintenance_patterns.py` - Pattern matching tests
- `ocr/tests/test_service_mapping.py` - Service mapping tests
- `ocr/tests/test_table_parser.py` - Table parsing tests

Modified Files

- `ocr/app/models/schemas.py` - Manual extraction response models
- `ocr/app/routers/extract.py` - `POST /extract/manual` endpoint
- `ocr/app/services/job_queue.py` - Manual job queue methods
- `ocr/requirements.txt` - Added PyMuPDF dependency
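
For readers unfamiliar with the service-name mapping listed among the new files, the sketch below shows one plausible shape for it: normalize the extracted name, try an exact alias lookup, then fall back to a substring match. The alias table, the subtype strings, and the function names are all hypothetical and are not taken from `service_mapping.py`.

```python
from typing import Optional

# Hypothetical alias table; the subtype names are illustrative only.
SERVICE_ALIASES = {
    "engine oil": "oil_change",
    "oil and filter": "oil_change",
    "tire rotation": "tire_rotation",
    "rotate tires": "tire_rotation",
    "brake fluid": "brake_fluid",
    "engine coolant": "coolant",
}


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def map_service(name: str) -> Optional[str]:
    """Map a raw service name to a subtype, or None if nothing matches."""
    cleaned = normalize(name)
    if cleaned in SERVICE_ALIASES:
        return SERVICE_ALIASES[cleaned]
    # Substring fallback, longest alias first so specific aliases win.
    for alias in sorted(SERVICE_ALIASES, key=len, reverse=True):
        if alias in cleaned:
            return SERVICE_ALIASES[alias]
    return None
```

Returning `None` for unknown names, rather than guessing, would let the caller lower the confidence score or flag the row for review.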

Test Plan

- [ ] Verify pattern matching extracts mileage intervals correctly
- [ ] Verify service names map to correct maintenance subtypes
- [ ] Test PDF with text layer extracts schedules directly
- [ ] Test scanned PDF triggers OCR pipeline
- [ ] Verify async job returns progress updates
- [ ] Test with sample owner's manual PDFs
- [ ] Verify processing completes within 5 minutes for typical manuals

Closes #71

egullickson added 1 commit 2026-02-02 03:30:38 +00:00

feat: add owner's manual OCR pipeline (refs #71)

Deploy to Staging / Build Images (pull_request) Successful in 3m1s

Deploy to Staging / Deploy to Staging (pull_request) Successful in 31s

Deploy to Staging / Verify Staging (pull_request) Successful in 2m19s

Deploy to Staging / Notify Staging Ready (pull_request) Successful in 7s

Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped

3eb54211cb

Implement async PDF processing for owner's manuals with maintenance
schedule extraction:

- Add PDF preprocessor with PyMuPDF for text/scanned PDF handling
- Add maintenance pattern matching (mileage, time, fluid specs)
- Add service name mapping to maintenance subtypes
- Add table detection and parsing for schedule tables
- Add manual extractor orchestrating the complete pipeline
- Add POST /extract/manual endpoint for async job submission
- Add Redis job queue support for manual extraction jobs
- Add progress tracking during processing
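
The job submission and progress tracking described in the list above can be sketched with an in-memory store standing in for Redis. The class `InMemoryJobQueue`, its method names, and the status strings are assumptions for illustration; the methods actually added to `job_queue.py` are not shown in this PR description.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class JobState:
    status: str = "pending"
    progress: int = 0
    result: Optional[dict] = None


class InMemoryJobQueue:
    """Dict-backed stand-in for a Redis job queue (illustrative only)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobState] = {}

    def submit(self) -> str:
        """Register a new job and return its id, as the endpoint would."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = JobState()
        return job_id

    def update_progress(self, job_id: str, percent: int) -> None:
        """Record progress, clamped to the range 0..100."""
        job = self._jobs[job_id]
        job.status = "processing"
        job.progress = max(0, min(100, percent))

    def complete(self, job_id: str, result: dict) -> None:
        """Mark the job finished and attach its result."""
        job = self._jobs[job_id]
        job.status = "completed"
        job.progress = 100
        job.result = result

    def get(self, job_id: str) -> Optional[JobState]:
        """Return the job state, or None for an unknown id."""
        return self._jobs.get(job_id)
```

A client would call `submit()` once, then poll `get(job_id)` until the status reads `"completed"`. Swapping the dict for Redis keys would leave the calling code unchanged.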

Processing pipeline:
1. Analyze PDF structure (text layer vs scanned)
2. Find maintenance schedule sections
3. Extract text or OCR scanned pages at 300 DPI
4. Detect and parse maintenance tables
5. Normalize service names and extract intervals
6. Return structured maintenance schedules with confidence scores
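
Steps 1 to 3 of the pipeline above can be sketched as a page-planning pass over text already extracted per page. The character threshold, the keyword list, and the names `PagePlan` and `plan_pages` are assumptions for this example; the real preprocessor uses PyMuPDF, and its heuristics are not described in the commit message.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold: pages with less extracted text are treated as scanned.
MIN_TEXT_CHARS = 50
# Assumed keywords for locating maintenance schedule sections.
SECTION_KEYWORDS = (
    "maintenance schedule",
    "scheduled maintenance",
    "service interval",
)


@dataclass
class PagePlan:
    page_number: int  # 1-based page index
    needs_ocr: bool   # True when the page has no usable text layer


def has_text_layer(page_text: str) -> bool:
    """Heuristic: a page has a text layer if enough characters were extracted."""
    return len(page_text.strip()) >= MIN_TEXT_CHARS


def plan_pages(page_texts: List[str]) -> List[PagePlan]:
    """Pick the pages worth processing and say whether each needs OCR."""
    plans: List[PagePlan] = []
    for number, text in enumerate(page_texts, start=1):
        if not has_text_layer(text):
            # Scanned page: keywords cannot be searched before OCR, so keep it.
            plans.append(PagePlan(number, needs_ocr=True))
        elif any(keyword in text.lower() for keyword in SECTION_KEYWORDS):
            plans.append(PagePlan(number, needs_ocr=False))
    return plans
```

Under these assumptions, text pages without schedule keywords are skipped, which keeps the OCR and table-parsing steps confined to a small subset of a long manual.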

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

egullickson merged commit 93594ca4d8 into main

2026-02-02 03:37:34 +00:00

egullickson deleted branch issue-71-manual-ocr-pipeline

2026-02-02 03:37:34 +00:00

egullickson referenced this issue from a commit

2026-02-02 03:37:35 +00:00

Merge pull request 'feat: Owner's Manual OCR Pipeline (#71)' (#79) from issue-71-manual-ocr-pipeline into main

Reference: egullickson/motovaultpro#79