docs: update CLAUDE.md indexes and README for OCR expansion (refs #137)

Add/update documentation across backend, Python OCR service, and frontend for receipt scanning, manual extraction, and Gemini integration. Create new CLAUDE.md files for engines/, fuel-logs/, documents/, and maintenance/ features.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 11:04:19 -06:00
parent 40df5e5b58
commit ab0d8463be
11 changed files with 385 additions and 45 deletions
--- a/ocr/app/CLAUDE.md
+++ b/ocr/app/CLAUDE.md
@@ -1,23 +1,25 @@
 # ocr/app/

+Python OCR microservice (FastAPI). The primary engine is PaddleOCR PP-OCRv4, with an optional Google Vision cloud fallback. Gemini 2.5 Flash handles maintenance manual PDF extraction as a standalone module, not an OcrEngine subclass.
+
 ## Files

 | File | What | When to read |
 | ---- | ---- | ------------ |
 | `main.py` | FastAPI application entry point | Route registration, app setup |
-| `config.py` | Configuration settings | Environment variables, settings |
+| `config.py` | Configuration settings (OCR engines, Vertex AI, Redis, Vision API limits) | Environment variables, settings |
 | `__init__.py` | Package init | Package structure |

 ## Subdirectories

 | Directory | What | When to read |
 | --------- | ---- | ------------ |
-| `engines/` | OCR engine abstraction (PaddleOCR primary, Google Vision fallback) | Engine changes, adding new engines |
-| `extractors/` | Data extraction logic | Adding new extraction types |
+| `engines/` | OCR engine abstraction (PaddleOCR, Google Vision, Hybrid) and Gemini module | Engine changes, adding new engines |
+| `extractors/` | Domain-specific data extraction (receipts, fuel receipts, maintenance manuals) | Adding new extraction types, modifying extraction logic |
 | `models/` | Data models and schemas | Request/response types |
-| `patterns/` | Regex and parsing patterns | Pattern matching rules |
+| `patterns/` | Regex patterns and service name mapping (27 maintenance subtypes) | Pattern matching rules, service categorization |
 | `preprocessors/` | Image preprocessing pipeline | Image preparation before OCR |
-| `routers/` | FastAPI route handlers | API endpoint changes |
-| `services/` | Business logic services | Core OCR processing |
-| `table_extraction/` | Table detection and parsing | Structured data extraction |
+| `routers/` | FastAPI route handlers (/extract, /extract/receipt, /extract/manual, /jobs) | API endpoint changes |
+| `services/` | Business logic services (job queue with Redis) | Core OCR processing, async job management |
+| `table_extraction/` | Table detection and parsing | Structured data extraction from images |
 | `validators/` | Input validation | Validation rules |
--- a/ocr/app/engines/CLAUDE.md
+++ b/ocr/app/engines/CLAUDE.md
@@ -0,0 +1,33 @@
+# ocr/app/engines/
+
+OCR engine abstraction layer. Two categories of engines:
+
+1. **OcrEngine subclasses** (image-to-text): PaddleOCR, Google Vision, Hybrid. Accept image bytes, return text + confidence + word boxes.
+2. **GeminiEngine** (PDF-to-structured-data): Standalone module for maintenance schedule extraction via Vertex AI. Accepts PDF bytes, returns structured JSON. Not an OcrEngine subclass because the interface signatures differ.
+
+## Files
+
+| File | What | When to read |
+| ---- | ---- | ------------ |
+| `__init__.py` | Public engine API exports (OcrEngine, create_engine, exceptions) | Importing engine interfaces |
+| `base_engine.py` | OcrEngine ABC, OcrConfig, OcrEngineResult, WordBox, exception hierarchy | Engine interface contract, adding new engines |
+| `paddle_engine.py` | PaddleOCR PP-OCRv4 primary engine | Local OCR debugging, accuracy tuning |
+| `cloud_engine.py` | Google Vision TEXT_DETECTION fallback engine (WIF authentication) | Cloud OCR configuration, API quota |
+| `hybrid_engine.py` | Combines primary + fallback engine with confidence threshold switching | Engine selection logic, fallback behavior |
+| `engine_factory.py` | Factory function and engine registry for instantiation | Adding new engine types |
+| `gemini_engine.py` | Gemini 2.5 Flash integration for maintenance schedule extraction (Vertex AI SDK, 20MB PDF limit, structured JSON output) | Manual extraction debugging, Gemini configuration |
+
+## Engine Selection
+
+```
+create_engine(config)
+    |
+    +-- Primary: PaddleOCR (local, fast, no API limits)
+    |
+    +-- Fallback: Google Vision (cloud, 1000/month limit)
+    |
+    v
+HybridEngine (tries primary, falls back if confidence < threshold)
+```
+
+GeminiEngine is created independently by ManualExtractor, not through the engine factory.
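
Note on the Engine Selection diagram in `ocr/app/engines/CLAUDE.md`: the fallback rule ("tries primary, falls back if confidence < threshold") could be sketched roughly as below. This is an illustration only, not the code in `hybrid_engine.py`. The class names `OcrEngine`, `OcrEngineResult`, and `HybridEngine` come from the diff; the `recognize` method, the result fields, the `StubEngine` helper, and the `0.6` default threshold are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class OcrEngineResult:
    # Field names are assumed; the real result also carries word boxes.
    text: str
    confidence: float  # assumed to be in the range 0.0-1.0
    engine_name: str


class OcrEngine(ABC):
    @abstractmethod
    def recognize(self, image_bytes: bytes) -> OcrEngineResult:
        """Hypothetical method name for the image-to-text call."""


class StubEngine(OcrEngine):
    """Hypothetical stand-in that returns a canned result, so the sketch runs offline."""

    def __init__(self, name: str, text: str, confidence: float) -> None:
        self.name = name
        self.text = text
        self.confidence = confidence

    def recognize(self, image_bytes: bytes) -> OcrEngineResult:
        return OcrEngineResult(self.text, self.confidence, self.name)


class HybridEngine(OcrEngine):
    """Tries the primary engine first; uses the fallback only on low confidence."""

    def __init__(
        self,
        primary: OcrEngine,
        fallback: OcrEngine,
        confidence_threshold: float = 0.6,  # assumed default, not from the source
    ) -> None:
        self.primary = primary
        self.fallback = fallback
        self.confidence_threshold = confidence_threshold

    def recognize(self, image_bytes: bytes) -> OcrEngineResult:
        primary_result = self.primary.recognize(image_bytes)
        if primary_result.confidence >= self.confidence_threshold:
            # Confident enough: no cloud call, so no quota is consumed.
            return primary_result
        # Below the threshold: hand the same image to the fallback engine.
        return self.fallback.recognize(image_bytes)
```

With this shape the cloud engine is called only when the local result is weak, which fits the diagram's note that the Google Vision fallback has a monthly limit. The real implementation may also handle errors from the primary engine or compare the two results; the diff does not say.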