feat: Manual extractor Gemini rewrite (#129) #134

egullickson · 2026-02-11T03:04:54Z

egullickson commented

2026-02-11 03:04:54 +00:00

Relates to #129

Milestone 5: Manual Extractor Gemini Rewrite

Rewrite the manual extractor to use Gemini instead of the traditional OCR pipeline.

Files

ocr/app/extractors/manual_extractor.py (rewrite)
ocr/app/routers/extract.py (update manual endpoint)
ocr/app/models/schemas.py (update if needed)

Requirements

Rewrite ManualExtractor.extract() to use GeminiEngine instead of traditional OCR pipeline
Keep existing data structures: ExtractedSchedule, ManualExtractionResult, VehicleInfo
Map Gemini response serviceName values to existing 27 maintenance subtypes via fuzzy matching
Preserve progress callback pattern for job queue integration
After the rewrite, remove unused imports and dependencies on table_extraction and patterns
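
The fuzzy-matching requirement could be sketched roughly as follows. This is a minimal illustration using the standard library's `difflib`, not the project's mapper: the `SUBTYPES` list is a hypothetical subset (the real 27 subtypes are not listed in this issue), and `closest_subtype` and its `cutoff` default are assumed names and values.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical subset for illustration; the real 27 subtypes live in the project.
SUBTYPES = [
    "Engine Oil",
    "Oil Filter",
    "Air Filter",
    "Brake Fluid",
    "Tire Rotation",
    "Coolant",
    "Spark Plugs",
]


def closest_subtype(service_name: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the subtype whose name is most similar to service_name, or None."""
    # Compare case-insensitively, then map the hit back to its canonical spelling.
    canonical = {name.lower(): name for name in SUBTYPES}
    hits = get_close_matches(
        service_name.strip().lower(), list(canonical), n=1, cutoff=cutoff
    )
    return canonical[hits[0]] if hits else None
```

With this list, `closest_subtype("Tyre Rotation")` resolves to `"Tire Rotation"`, while a name with no close candidate returns `None`, leaving the caller to decide on a fallback.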

Acceptance Criteria

- [ ] ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
- [ ] ExtractedSchedule items include matched subtypes from the 27 routine_maintenance categories
- [ ] Progress callbacks fire at appropriate intervals during processing
- [ ] Existing job queue flow (submit -> poll -> complete) works with new extractor
- [ ] Traditional OCR pipeline code (table_detector, maintenance_patterns) no longer called

Tests

Test files: ocr/tests/test_manual_extractor.py (rewrite existing)
Test type: unit (mock GeminiEngine)
Scenarios:
- Normal: PDF with maintenance schedule returns extracted items with subtypes
- Edge: PDF with unusual service names still maps to closest subtype
- Edge: Empty Gemini response returns empty schedules list
- Error: Gemini call failure returns ManualExtractionResult with error
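
A mocked-engine unit test for the empty and error scenarios might look something like the sketch below. It is self-contained for illustration only: `extract_with` is a hypothetical stand-in for the extractor, and the dict-shaped result is an assumption, not the real `ManualExtractionResult`. Only the `extract_maintenance` method name comes from this thread.

```python
from unittest.mock import Mock


def extract_with(engine, pdf_bytes):
    """Hypothetical stand-in for the extractor: call the engine, wrap failures."""
    try:
        items = engine.extract_maintenance(pdf_bytes)
    except Exception as exc:
        # Failures are reported in the result instead of propagating.
        return {"success": False, "schedules": [], "error": str(exc)}
    return {"success": True, "schedules": list(items), "error": None}


def test_empty_response_returns_empty_schedules():
    engine = Mock()
    engine.extract_maintenance.return_value = []
    result = extract_with(engine, b"%PDF-1.7")
    engine.extract_maintenance.assert_called_once_with(b"%PDF-1.7")
    assert result["schedules"] == []


def test_engine_failure_is_reported_not_raised():
    engine = Mock()
    engine.extract_maintenance.side_effect = RuntimeError("quota exceeded")
    result = extract_with(engine, b"%PDF-1.7")
    assert result["success"] is False
    assert result["error"] == "quota exceeded"
```

Using `return_value` and `side_effect` on a `Mock` keeps the tests offline and deterministic, which fits the "unit (mock GeminiEngine)" test type.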

egullickson added labels 2026-02-11 03:12:40 +00:00

egullickson added and removed labels 2026-02-11 16:19:18 +00:00

egullickson commented

2026-02-11 16:24:32 +00:00

Milestone: Manual Extractor Gemini Rewrite

Phase: Execution | Agent: Developer | Status: PASS

Changes Made

ocr/app/extractors/manual_extractor.py (rewrite)

Replaced traditional OCR pipeline (table_detector, table_parser, maintenance_patterns, pdf_preprocessor) with GeminiEngine
ManualExtractor.extract(pdf_bytes) now calls GeminiEngine.extract_maintenance(pdf_bytes) for semantic PDF understanding
Maps each Gemini serviceName to system maintenance subtypes via ServiceMapper.map_service_fuzzy()
Preserved progress callback pattern with 4 stages: 5% (sending), 50% (mapping), 90% (finalizing), 100% (complete)
Kept all existing data structures: ExtractedSchedule, VehicleInfo, ManualExtractionResult
Removed imports: PIL, create_engine, OcrConfig, pdf_preprocessor, table_detector, table_parser, maintenance_matcher
Removed private methods: _process_text_page, _process_scanned_page, _normalize_schedules, _extract_vehicle_info, _parse_vehicle_from_title, _parse_vehicle_from_text
Net: 58 insertions, 310 deletions
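
The flow described above could be sketched along these lines. It is a simplified illustration and not the committed code: the dataclasses are reduced stand-ins for the real schemas, the engine is assumed to return a list of dicts with a `serviceName` key, the mapper is assumed to be a plain callable, and the `(percent, message)` callback signature and message strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

ProgressCallback = Callable[[int, str], None]  # assumed signature


@dataclass
class ExtractedSchedule:  # reduced stand-in for the real schema
    service: str
    subtypes: List[str] = field(default_factory=list)


@dataclass
class ManualExtractionResult:  # reduced stand-in for the real schema
    success: bool
    schedules: List[ExtractedSchedule] = field(default_factory=list)
    error: Optional[str] = None


class ManualExtractor:
    def __init__(self, engine, mapper: Callable[[str], Optional[str]]):
        self._engine = engine  # anything exposing extract_maintenance(pdf_bytes)
        self._mapper = mapper  # service name -> subtype, or None if unmapped

    def extract(
        self,
        pdf_bytes: bytes,
        progress_callback: Optional[ProgressCallback] = None,
    ) -> ManualExtractionResult:
        def report(percent: int, message: str) -> None:
            if progress_callback is not None:
                progress_callback(percent, message)

        try:
            report(5, "Sending PDF to Gemini")
            items = self._engine.extract_maintenance(pdf_bytes)

            report(50, "Mapping service names")
            schedules = []
            for item in items:
                name = item["serviceName"]
                subtype = self._mapper(name)
                # Unmapped names keep the Gemini name and carry no subtype.
                schedules.append(
                    ExtractedSchedule(service=name, subtypes=[subtype] if subtype else [])
                )

            report(90, "Finalizing")
            result = ManualExtractionResult(success=True, schedules=schedules)
            report(100, "Complete")
            return result
        except Exception as exc:
            # Any failure is returned in the result rather than raised.
            return ManualExtractionResult(success=False, error=str(exc))
```

Injecting the engine and mapper through the constructor is a choice made for this sketch so that both can be replaced in tests; the real class may construct them internally.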

ocr/app/routers/extract.py -- no changes needed

Verified process_manual_job works unchanged with rewritten extractor (same interface and return type)

ocr/tests/test_manual_extractor.py (new)

8 tests covering all acceptance criteria scenarios:
- Normal: PDF with maintenance schedule returns extracted items with mapped subtypes
- Normal: Progress callbacks fire at 5/50/90/100%
- Edge: Unusual service names fuzzy match to subtypes
- Edge: Unmapped service names use Gemini name with default confidence
- Edge: Empty Gemini response returns empty schedules list
- Error: Gemini failure returns ManualExtractionResult with error
- Error: Unexpected exception caught and returned
- Integration: Result contains all fields needed by job queue flow
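
The "unmapped name" case could be handled roughly as sketched here. Everything in it is illustrative: `DEFAULT_CONFIDENCE` is a made-up value (the actual default is not stated in this thread), and `MappedService` and `map_or_fallback` are hypothetical names. The mapper is assumed to return either a `(subtype, confidence)` pair or `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

DEFAULT_CONFIDENCE = 0.5  # hypothetical value, not taken from the project


@dataclass
class MappedService:
    name: str
    subtype: Optional[str]
    confidence: float


def map_or_fallback(
    service_name: str,
    mapper: Callable[[str], Optional[Tuple[str, float]]],
) -> MappedService:
    """Use the mapper's match when there is one, else keep the Gemini name."""
    hit = mapper(service_name)
    if hit is None:
        # No subtype matched: keep the original name with the default confidence.
        return MappedService(service_name, None, DEFAULT_CONFIDENCE)
    subtype, confidence = hit
    return MappedService(service_name, subtype, confidence)
```

Keeping the original name in both branches means downstream consumers always have something to display, even when no subtype matched.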

Acceptance Criteria Status

- [x] ManualExtractor.extract(pdf_bytes) calls Gemini and returns ManualExtractionResult
- [x] ExtractedSchedule items include matched subtypes from the 27 routine_maintenance categories
- [x] Progress callbacks fire at appropriate intervals during processing
- [x] Existing job queue flow (submit -> poll -> complete) works with new extractor
- [x] Traditional OCR pipeline code (table_detector, maintenance_patterns) no longer called

Test Results

All 8 new tests pass. No existing tests were broken; the 8 pre-existing failures are unrelated to this change.

Verdict: PASS | Next: QR post-implementation review

egullickson referenced this issue from a commit

2026-02-11 21:27:47 +00:00

feat: rewrite ManualExtractor to use Gemini engine (refs #134)

egullickson referenced a pull request that will close this issue

2026-02-11 21:28:11 +00:00

feat: Expand OCR with fuel receipt scanning and maintenance extraction (#129) #147

egullickson closed this issue

2026-02-13 02:25:56 +00:00
