feat: Manual extractor Gemini rewrite (#129) #143

Closed
opened 2026-02-11 03:49:57 +00:00 by egullickson · 1 comment
Owner

Relates to #129

Milestone 5: Manual Extractor Gemini Rewrite

Files

  • ocr/app/extractors/manual_extractor.py (rewrite)
  • ocr/app/routers/extract.py (update manual endpoint)
  • ocr/app/models/schemas.py (update if needed)

Requirements

  • ManualExtractor.extract() delegates to GeminiEngine for PDF processing and structured maintenance data extraction
  • Keep existing data structures: ExtractedSchedule, ManualExtractionResult, VehicleInfo
  • Map Gemini response serviceName values to existing 27 maintenance subtypes via fuzzy matching
  • manual_extractor.py has no dependencies on table_extraction, patterns, or layout analysis modules
  • Simplified 3-step progress callback pattern:
    • 10% "Preparing extraction" (before Gemini call)
    • 50% "Processing with Gemini" (after submitting to Gemini, during wait)
    • 95% "Mapping results" (after Gemini returns, during subtype mapping)
    • 100% "Complete"
    • Note: No sub-progress during Gemini API call (single blocking request, 10-60s)
  • Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor
  • ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis

Acceptance Criteria

  • ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
  • ExtractedSchedule items include matched subtypes from the 27 routine_maintenance categories
  • Progress callbacks fire at 10%, 50%, 95%, 100% intervals
  • Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor
  • ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis
  • PDF with unusual service names maps to closest subtype via fuzzy matching

Tests

  • Test files: ocr/tests/test_manual_extractor.py (rewrite existing)
  • Test type: unit (mock GeminiEngine)
  • Scenarios:
    • Normal: PDF with maintenance schedule returns extracted items with subtypes
    • Edge: PDF with unusual service names maps to closest subtype
    • Edge: Empty Gemini response returns empty schedules list
    • Normal: Progress callbacks fire at expected intervals (10%, 50%, 95%, 100%)
    • Error: Gemini call failure returns ManualExtractionResult with error
Relates to #129 ## Milestone 5: Manual Extractor Gemini Rewrite ### Files - `ocr/app/extractors/manual_extractor.py` (rewrite) - `ocr/app/routers/extract.py` (update manual endpoint) - `ocr/app/models/schemas.py` (update if needed) ### Requirements - `ManualExtractor.extract()` delegates to GeminiEngine for PDF processing and structured maintenance data extraction - Keep existing data structures: `ExtractedSchedule`, `ManualExtractionResult`, `VehicleInfo` - Map Gemini response `serviceName` values to existing 27 maintenance subtypes via fuzzy matching - `manual_extractor.py` has no dependencies on `table_extraction`, `patterns`, or layout analysis modules - Simplified 3-step progress callback pattern: - 10% "Preparing extraction" (before Gemini call) - 50% "Processing with Gemini" (after submitting to Gemini, during wait) - 95% "Mapping results" (after Gemini returns, during subtype mapping) - 100% "Complete" - Note: No sub-progress during Gemini API call (single blocking request, 10-60s) - Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor - ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis ### Acceptance Criteria - `ManualExtractor.extract(pdf_bytes)` calls Gemini and returns `ManualExtractionResult` - `ExtractedSchedule` items include matched subtypes from the 27 routine_maintenance categories - Progress callbacks fire at 10%, 50%, 95%, 100% intervals - Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor - ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis - PDF with unusual service names maps to closest subtype via fuzzy matching ### Tests - **Test files**: `ocr/tests/test_manual_extractor.py` (rewrite existing) - **Test type**: unit (mock GeminiEngine) - **Scenarios**: - Normal: PDF with maintenance schedule returns extracted items with subtypes - Edge: PDF with unusual service names maps to closest subtype - Edge: Empty Gemini response returns empty schedules list - Normal: Progress callbacks fire at expected intervals (10%, 50%, 95%, 100%) - Error: Gemini call failure returns ManualExtractionResult with error
egullickson added the
status
backlog
type
feature
labels 2026-02-11 03:51:15 +00:00
egullickson added
status
in-progress
and removed
status
backlog
labels 2026-02-11 20:36:07 +00:00
Author
Owner

Milestone: Manual Extractor Gemini Rewrite

Phase: Execution | Agent: Developer | Status: PASS

Changes Made

ocr/app/extractors/manual_extractor.py (rewrite)

  • Progress callbacks updated from 5%/50%/90%/100% to spec-aligned 10%/50%/95%/100%
  • Messages updated: "Preparing extraction" (10%), "Processing with Gemini" (50%), "Mapping results" (95%), "Complete" (100%)
  • 50% now fires before the blocking Gemini call (shows during wait)
  • Removed redundant "Finalizing results" step; single 100% "Complete" callback
  • No dependencies on table_extraction, patterns (except service_mapping), or layout analysis

ocr/app/routers/extract.py (update)

  • Updated POST /extract/manual docstring to reflect Gemini-based pipeline (was referencing old table detection flow)

ocr/tests/test_manual_extractor.py (rewrite)

  • Updated progress assertions from 5/50/90/100 to 10/50/95/100

Test Results

  • All 8 manual extractor tests pass (normal extraction, fuzzy matching, empty response, error handling, job queue integration, progress callbacks)
  • 8 pre-existing failures in unrelated test files (currency, date, engine, maintenance patterns, receipt, service mapping)

Acceptance Criteria Status

  • ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
  • ExtractedSchedule items include matched subtypes from 27 routine_maintenance categories
  • Progress callbacks fire at 10%, 50%, 95%, 100% intervals
  • Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor
  • ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis
  • PDF with unusual service names maps to closest subtype via fuzzy matching

Verdict: PASS | Next: QR post-implementation review

## Milestone: Manual Extractor Gemini Rewrite **Phase**: Execution | **Agent**: Developer | **Status**: PASS ### Changes Made **`ocr/app/extractors/manual_extractor.py`** (rewrite) - Progress callbacks updated from 5%/50%/90%/100% to spec-aligned 10%/50%/95%/100% - Messages updated: "Preparing extraction" (10%), "Processing with Gemini" (50%), "Mapping results" (95%), "Complete" (100%) - 50% now fires before the blocking Gemini call (shows during wait) - Removed redundant "Finalizing results" step; single 100% "Complete" callback - No dependencies on table_extraction, patterns (except service_mapping), or layout analysis **`ocr/app/routers/extract.py`** (update) - Updated POST /extract/manual docstring to reflect Gemini-based pipeline (was referencing old table detection flow) **`ocr/tests/test_manual_extractor.py`** (rewrite) - Updated progress assertions from 5/50/90/100 to 10/50/95/100 ### Test Results - All 8 manual extractor tests pass (normal extraction, fuzzy matching, empty response, error handling, job queue integration, progress callbacks) - 8 pre-existing failures in unrelated test files (currency, date, engine, maintenance patterns, receipt, service mapping) ### Acceptance Criteria Status - [x] `ManualExtractor.extract(pdf_bytes)` calls Gemini and returns `ManualExtractionResult` - [x] `ExtractedSchedule` items include matched subtypes from 27 routine_maintenance categories - [x] Progress callbacks fire at 10%, 50%, 95%, 100% intervals - [x] Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor - [x] ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis - [x] PDF with unusual service names maps to closest subtype via fuzzy matching *Verdict*: PASS | *Next*: QR post-implementation review
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: egullickson/motovaultpro#143