feat: Manual extractor Gemini rewrite (#129) #143

egullickson · 2026-02-11T03:49:57Z

Relates to #129

Milestone 5: Manual Extractor Gemini Rewrite

Files

ocr/app/extractors/manual_extractor.py (rewrite)
ocr/app/routers/extract.py (update manual endpoint)
ocr/app/models/schemas.py (update if needed)

Requirements

ManualExtractor.extract() delegates to GeminiEngine for PDF processing and structured maintenance data extraction
Keep existing data structures: ExtractedSchedule, ManualExtractionResult, VehicleInfo
Map Gemini response serviceName values to existing 27 maintenance subtypes via fuzzy matching
manual_extractor.py has no dependencies on table_extraction, patterns, or layout analysis modules
Simplified progress callback pattern (three steps, then a completion callback):
- 10% "Preparing extraction" (before Gemini call)
- 50% "Processing with Gemini" (after submitting to Gemini, during wait)
- 95% "Mapping results" (after Gemini returns, during subtype mapping)
- 100% "Complete"
- Note: No sub-progress during Gemini API call (single blocking request, 10-60s)
Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor
ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis
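
The callback ordering described above can be sketched as follows. This is a minimal illustration and not the project's code: `run_extraction`, `call_gemini`, `map_subtypes`, and the `(percent, message)` callback signature are all assumptions made for the example. The 50% callback is reported just before the blocking call so that a polling client sees it during the wait.

```python
from typing import Callable, Optional

# Assumed callback shape: (percent, message). The real signature may differ.
ProgressCallback = Callable[[int, str], None]


def run_extraction(
    pdf_bytes: bytes,
    call_gemini: Callable[[bytes], list],
    map_subtypes: Callable[[list], list],
    on_progress: Optional[ProgressCallback] = None,
) -> list:
    """Hypothetical stand-in for ManualExtractor.extract(); shows callback order only."""

    def report(percent: int, message: str) -> None:
        # Progress reporting is optional, so tolerate a missing callback.
        if on_progress is not None:
            on_progress(percent, message)

    report(10, "Preparing extraction")

    # Reported before the single blocking request, so the job shows 50%
    # for the whole time the Gemini call is in flight.
    report(50, "Processing with Gemini")
    raw_items = call_gemini(pdf_bytes)

    report(95, "Mapping results")
    mapped = map_subtypes(raw_items)

    report(100, "Complete")
    return mapped
```

Passing the Gemini call and the mapping step in as plain callables keeps the sketch independent of any engine class, which also makes the ordering easy to assert in a unit test.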

Acceptance Criteria

ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
ExtractedSchedule items include matched subtypes from the 27 routine_maintenance categories
Progress callbacks fire at 10%, 50%, 95%, 100% intervals
Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor
ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis
PDF with unusual service names maps to closest subtype via fuzzy matching
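
One way the fuzzy-matching criterion could be met is with the standard library's `difflib.get_close_matches`. The sketch below is an assumption about approach, not the actual `service_mapping` implementation. The `SUBTYPES` list is a placeholder (the real 27 subtypes are not listed in this issue), and `match_subtype` and its `cutoff` default are hypothetical.

```python
from difflib import get_close_matches
from typing import Optional

# Placeholder values only; the real list of 27 subtypes is not shown in this issue.
SUBTYPES = [
    "Engine Oil",
    "Oil Filter",
    "Air Filter",
    "Brake Fluid",
    "Tire Rotation",
    "Spark Plugs",
]


def match_subtype(service_name: str, cutoff: float = 0.5) -> Optional[str]:
    """Return the closest subtype for a free-form service name, or None."""
    # Compare case-insensitively, but return the canonical spelling.
    by_lowercase = {subtype.lower(): subtype for subtype in SUBTYPES}
    hits = get_close_matches(
        service_name.strip().lower(),
        list(by_lowercase),
        n=1,            # only the single best candidate is needed
        cutoff=cutoff,  # similarity ratio in [0, 1]; lower is more permissive
    )
    return by_lowercase[hits[0]] if hits else None
```

The cutoff trades recall for precision: a low value maps more unusual names but risks wrong matches, so returning `None` below the threshold leaves the caller free to decide how to handle unmapped items.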

Tests

Test files: ocr/tests/test_manual_extractor.py (rewrite existing)
Test type: unit (mock GeminiEngine)
Scenarios:
- Normal: PDF with maintenance schedule returns extracted items with subtypes
- Edge: PDF with unusual service names maps to closest subtype
- Edge: Empty Gemini response returns empty schedules list
- Normal: Progress callbacks fire at expected intervals (10%, 50%, 95%, 100%)
- Error: Gemini call failure returns ManualExtractionResult with error
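
The empty-response and error scenarios might be tested along these lines with `unittest.mock`. Everything named here is a stand-in: `FakeResult` and `FakeExtractor` only imitate the shape of `ManualExtractionResult` and `ManualExtractor`, and the engine method name `extract_maintenance` is an assumption, since the real `GeminiEngine` interface is not shown in this issue.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from unittest.mock import Mock


@dataclass
class FakeResult:
    """Stand-in for ManualExtractionResult; the real fields may differ."""
    success: bool
    schedules: List[dict] = field(default_factory=list)
    error: Optional[str] = None


class FakeExtractor:
    """Stand-in for ManualExtractor that wraps an injected engine."""

    def __init__(self, engine) -> None:
        self.engine = engine

    def extract(self, pdf_bytes: bytes) -> FakeResult:
        try:
            # Hypothetical engine method name.
            items = self.engine.extract_maintenance(pdf_bytes)
        except Exception as exc:
            # Failures are returned in the result rather than raised.
            return FakeResult(success=False, error=str(exc))
        return FakeResult(success=True, schedules=list(items))


def test_empty_response_returns_empty_schedules() -> None:
    engine = Mock()
    engine.extract_maintenance.return_value = []
    result = FakeExtractor(engine).extract(b"%PDF-1.4")
    assert result.success
    assert result.schedules == []
    engine.extract_maintenance.assert_called_once_with(b"%PDF-1.4")


def test_engine_failure_is_reported_in_result() -> None:
    engine = Mock()
    engine.extract_maintenance.side_effect = RuntimeError("Gemini unavailable")
    result = FakeExtractor(engine).extract(b"%PDF-1.4")
    assert not result.success
    assert result.error == "Gemini unavailable"
```

Injecting the engine through the constructor is what lets a `Mock` replace it without patching module globals; if the real extractor builds its engine internally, `unittest.mock.patch` would be the closer fit.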

egullickson added labels 2026-02-11 03:51:15 +00:00

egullickson referenced this issue

2026-02-11 03:53:01 +00:00

feat: Expand OCR with fuel receipt scanning and owners manual maintenance extraction #129

egullickson added and removed labels 2026-02-11 20:36:07 +00:00

egullickson commented

2026-02-11 20:40:27 +00:00

Milestone: Manual Extractor Gemini Rewrite

Phase: Execution | Agent: Developer | Status: PASS

Changes Made

ocr/app/extractors/manual_extractor.py (rewrite)

Progress callbacks updated from 5%/50%/90%/100% to spec-aligned 10%/50%/95%/100%
Messages updated: "Preparing extraction" (10%), "Processing with Gemini" (50%), "Mapping results" (95%), "Complete" (100%)
50% now fires before the blocking Gemini call (shows during wait)
Removed redundant "Finalizing results" step; single 100% "Complete" callback
No dependencies on table_extraction, patterns (except service_mapping), or layout analysis

ocr/app/routers/extract.py (update)

Updated POST /extract/manual docstring to reflect Gemini-based pipeline (was referencing old table detection flow)

ocr/tests/test_manual_extractor.py (rewrite)

Updated progress assertions from 5/50/90/100 to 10/50/95/100

Test Results

All 8 manual extractor tests pass (normal extraction, fuzzy matching, empty response, error handling, job queue integration, progress callbacks)
8 pre-existing failures in unrelated test files (currency, date, engine, maintenance patterns, receipt, service mapping)

Acceptance Criteria Status

[x] ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
[x] ExtractedSchedule items include matched subtypes from 27 routine_maintenance categories
[x] Progress callbacks fire at 10%, 50%, 95%, 100% intervals
[x] Job queue flow (submit -> poll -> complete) functions correctly with ManualExtractor
[x] ManualExtractor does not call table_detector, maintenance_patterns, or layout analysis
[x] PDF with unusual service names maps to closest subtype via fuzzy matching

Verdict: PASS | Next: QR post-implementation review

egullickson referenced this issue from a commit

2026-02-11 21:27:47 +00:00

feat: rewrite ManualExtractor progress to spec-aligned 10/50/95/100 pattern (refs #143)

egullickson referenced a pull request that will close this issue

2026-02-11 21:28:11 +00:00

feat: Expand OCR with fuel receipt scanning and maintenance extraction (#129) #147

egullickson closed this issue

2026-02-13 02:25:57 +00:00
