feat: Manual extractor Gemini rewrite (#129) #134

Closed
opened 2026-02-11 03:04:54 +00:00 by egullickson · 1 comment
Owner

Relates to #129

Milestone 5: Manual Extractor Gemini Rewrite

Rewrite the manual extractor to use Gemini instead of the traditional OCR pipeline.

Files

  • ocr/app/extractors/manual_extractor.py (rewrite)
  • ocr/app/routers/extract.py (update manual endpoint)
  • ocr/app/models/schemas.py (update if needed)

Requirements

  • Rewrite ManualExtractor.extract() to use GeminiEngine instead of the traditional OCR pipeline
  • Keep existing data structures: ExtractedSchedule, ManualExtractionResult, VehicleInfo
  • Map Gemini response serviceName values to the existing 27 maintenance subtypes via fuzzy matching
  • Preserve the progress callback pattern for job queue integration
  • Remove unused imports and dependencies on table_extraction and patterns after the rewrite
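
The fuzzy-matching requirement could be sketched roughly as follows. This is a minimal illustration using the standard library's difflib, not the project's actual mapper: the SUBTYPES list is a hypothetical subset of the 27 subtypes, and map_service_name and the 0.6 cutoff are assumptions made for the example.

```python
from difflib import get_close_matches
from typing import Optional, Sequence

# Hypothetical subset of the 27 maintenance subtypes (assumption for this sketch).
SUBTYPES = [
    "Engine Oil",
    "Oil Filter",
    "Air Filter",
    "Brake Fluid",
    "Coolant",
    "Spark Plugs",
    "Tire Rotation",
]


def map_service_name(
    service_name: str,
    subtypes: Sequence[str] = SUBTYPES,
    cutoff: float = 0.6,  # assumed similarity threshold
) -> Optional[str]:
    """Return the closest known subtype, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the canonical spelling.
    canonical = {s.lower(): s for s in subtypes}
    matches = get_close_matches(
        service_name.strip().lower(), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[matches[0]] if matches else None
```

With this sketch, a near-miss such as "Spark Plug" resolves to "Spark Plugs", while a name with no similar subtype returns None so the caller can decide on a fallback.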

Acceptance Criteria

  • [ ] ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
  • [ ] ExtractedSchedule items include matched subtypes from the 27 routine_maintenance categories
  • [ ] Progress callbacks fire at appropriate intervals during processing
  • [ ] Existing job queue flow (submit -> poll -> complete) works with new extractor
  • [ ] Traditional OCR pipeline code (table_detector, maintenance_patterns) no longer called

Tests

  • Test files: ocr/tests/test_manual_extractor.py (rewrite existing)
  • Test type: unit (mock GeminiEngine)
  • Scenarios:
    • Normal: PDF with maintenance schedule returns extracted items with subtypes
    • Edge: PDF with unusual service names still maps to closest subtype
    • Edge: Empty Gemini response returns empty schedules list
    • Error: Gemini call failure returns ManualExtractionResult with error
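
As a rough illustration of the "mock GeminiEngine" test type, the empty-response and failure scenarios might look like the sketch below. SketchResult and SketchExtractor are hypothetical stand-ins so the example runs on its own; they are not the real ManualExtractionResult or ManualExtractor, and treating an empty response as a successful result is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from unittest.mock import Mock


@dataclass
class SketchResult:
    """Hypothetical stand-in for ManualExtractionResult."""
    success: bool
    schedules: List[dict] = field(default_factory=list)
    error: Optional[str] = None


class SketchExtractor:
    """Hypothetical stand-in for ManualExtractor, with the engine injected."""

    def __init__(self, engine):
        self.engine = engine

    def extract(self, pdf_bytes: bytes) -> SketchResult:
        try:
            items = self.engine.extract_maintenance(pdf_bytes)
        except Exception as exc:
            # Engine failures become an error result instead of propagating.
            return SketchResult(success=False, error=str(exc))
        return SketchResult(success=True, schedules=list(items))


def test_empty_response_returns_empty_schedules():
    engine = Mock()
    engine.extract_maintenance.return_value = []
    result = SketchExtractor(engine).extract(b"%PDF-")
    assert result.success
    assert result.schedules == []
    engine.extract_maintenance.assert_called_once_with(b"%PDF-")


def test_engine_failure_returns_error():
    engine = Mock()
    engine.extract_maintenance.side_effect = RuntimeError("Gemini unavailable")
    result = SketchExtractor(engine).extract(b"%PDF-")
    assert not result.success
    assert result.error == "Gemini unavailable"
```

Injecting the engine keeps the tests free of network calls; the real tests may patch GeminiEngine differently.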
egullickson added the status/backlog and type/feature labels 2026-02-11 03:12:40 +00:00
egullickson added status/in-progress and removed status/backlog labels 2026-02-11 16:19:18 +00:00
Author
Owner

Milestone: Manual Extractor Gemini Rewrite

Phase: Execution | Agent: Developer | Status: PASS

Changes Made

ocr/app/extractors/manual_extractor.py (rewrite)

  • Replaced traditional OCR pipeline (table_detector, table_parser, maintenance_patterns, pdf_preprocessor) with GeminiEngine
  • ManualExtractor.extract(pdf_bytes) now calls GeminiEngine.extract_maintenance(pdf_bytes) for semantic PDF understanding
  • Maps each Gemini serviceName to system maintenance subtypes via ServiceMapper.map_service_fuzzy()
  • Preserved progress callback pattern with 4 stages: 5% (sending), 50% (mapping), 90% (finalizing), 100% (complete)
  • Kept all existing data structures: ExtractedSchedule, VehicleInfo, ManualExtractionResult
  • Removed imports: PIL, create_engine, OcrConfig, pdf_preprocessor, table_detector, table_parser, maintenance_matcher
  • Removed private methods: _process_text_page, _process_scanned_page, _normalize_schedules, _extract_vehicle_info, _parse_vehicle_from_title, _parse_vehicle_from_text
  • Net: 58 insertions, 310 deletions
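
The four-stage progress pattern described above can be sketched like this. It is only an illustration under assumptions: the (percent, message) callback signature, the run_with_progress name, and the dict result are invented for the example, and extract_fn and map_fn stand in for the Gemini call and the subtype mapping.

```python
from typing import Any, Callable, Dict, Iterable, Optional

# Assumed callback shape: (percent, message). The real signature may differ.
ProgressCallback = Callable[[int, str], None]


def run_with_progress(
    extract_fn: Callable[[bytes], Iterable[Any]],
    map_fn: Callable[[Any], Any],
    pdf_bytes: bytes,
    progress_callback: Optional[ProgressCallback] = None,
) -> Dict[str, Any]:
    """Run extract -> map -> finalize, reporting 5/50/90/100% along the way."""

    def report(percent: int, message: str) -> None:
        # The callback is optional so the function also works outside the job queue.
        if progress_callback is not None:
            progress_callback(percent, message)

    report(5, "sending")
    items = extract_fn(pdf_bytes)

    report(50, "mapping")
    mapped = [map_fn(item) for item in items]

    report(90, "finalizing")
    result = {"success": True, "schedules": mapped}

    report(100, "complete")
    return result
```

A job-queue worker could pass a callback that writes the percentage to the job record, which is what lets the poll step show progress.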

ocr/app/routers/extract.py -- no changes needed

  • Verified process_manual_job works unchanged with rewritten extractor (same interface and return type)

ocr/tests/test_manual_extractor.py (new)

  • 8 tests covering all acceptance criteria scenarios:
    • Normal: PDF with maintenance schedule returns extracted items with mapped subtypes
    • Normal: Progress callbacks fire at 5/50/90/100%
    • Edge: Unusual service names fuzzy match to subtypes
    • Edge: Unmapped service names use Gemini name with default confidence
    • Edge: Empty Gemini response returns empty schedules list
    • Error: Gemini failure returns ManualExtractionResult with error
    • Error: Unexpected exception caught and returned
    • Integration: Result contains all fields needed by job queue flow
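
The "unmapped service names" case can be illustrated with a small fallback sketch. Everything here is hypothetical: SketchSchedule is not the real ExtractedSchedule, the mapper is passed in as a plain function returning (subtype, confidence) or None, and the DEFAULT_CONFIDENCE value of 0.5 is an assumed placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

DEFAULT_CONFIDENCE = 0.5  # assumed placeholder; the real default is not stated here


@dataclass
class SketchSchedule:
    """Hypothetical stand-in for ExtractedSchedule."""
    service: str
    subtypes: List[str]
    confidence: float


def to_schedule(
    service_name: str,
    mapper: Callable[[str], Optional[Tuple[str, float]]],
) -> SketchSchedule:
    """Use the mapped subtype when available, else keep the Gemini name."""
    mapped = mapper(service_name)
    if mapped is None:
        # No subtype matched: keep the original name with a default confidence.
        return SketchSchedule(service_name, [], DEFAULT_CONFIDENCE)
    subtype, confidence = mapped
    return SketchSchedule(subtype, [subtype], confidence)
```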

Acceptance Criteria Status

  • [x] ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
  • [x] ExtractedSchedule items include matched subtypes from the 27 routine_maintenance categories
  • [x] Progress callbacks fire at appropriate intervals during processing
  • [x] Existing job queue flow (submit -> poll -> complete) works with new extractor
  • [x] Traditional OCR pipeline code (table_detector, maintenance_patterns) no longer called

Test Results

All 8 new tests pass. No existing tests were broken; the 8 pre-existing failures are unrelated to this change.

Verdict: PASS | Next: QR post-implementation review

Reference: egullickson/motovaultpro#134