feat: Improve OCR process - replace Tesseract with PaddleOCR and add cloud fallback for VIN scanning #115
Problem / User Need
The current OCR pipeline (Tesseract 5.x primary engine) fails on even simple phone camera images. VIN scanning from the "Add Vehicle" screen has never worked reliably in production. The recent fix attempt (PR #114, refs #113) was improperly approved and merged -- it addressed VIN fragment concatenation but did not solve the fundamental Tesseract accuracy problem. Additionally, the free-form crop tool is currently non-functional after that merge.
Evidence
Prior Art
docs/ocr-pipeline-tech-stack.md
Scope
VIN scanning only -- get VIN photo capture working reliably as proof-of-concept for the new OCR engine. Fuel receipts, maintenance receipts, and owner's manual parsing will follow in separate issues once the engine is validated.
Proposed Solution: Hybrid OCR Architecture
Primary Engine: PaddleOCR (self-hosted)
Fallback Engine: Cloud API (Google Vision or AWS Textract)
Engine Evaluation Criteria
During planning, evaluate and select based on:
Changes Required
1. OCR Engine Replacement (mvp-ocr container)
2. Fix Broken Free-Form Crop Tool
3. Fix PR #114 Regression
4. VIN Pipeline Integration Testing
Acceptance Criteria
Technical Reference
OCR Benchmark Sources
Affected Components
- ocr/
- ocr/app/extractors/
- ocr/app/preprocessors/
- ocr/app/validators/
- frontend/src/features/vehicles/
- backend/src/features/ocr/
- ocr/app/config.py
- secrets/
- docs/ocr-pipeline-tech-stack.md
Related Issues
Research Note: docTR as Additional Engine Candidate
During research, docTR (by Mindee) emerged as a strong candidate that should be evaluated alongside PaddleOCR during planning:
docTR Highlights
Recommended Evaluation During Planning
Recommendation: Since this issue is VIN-only, docTR may be the better primary engine for this scope. PaddleOCR becomes important when expanding to receipts and manuals (where table extraction matters). The existing extraction pipeline in ocr/app/extractors/ and ocr/app/patterns/ already handles structured data extraction, which compensates for docTR's lack of native structure support.
Full Research Sources
Plan: Replace Tesseract with PaddleOCR + Optional Cloud Fallback
Phase: Planning | Agent: Orchestrator | Status: AWAITING_REVIEW
Pre-Planning Summary
Codebase Analysis completed on all affected areas (16 files examined). Key findings:
- Tesseract is coupled into vin_extractor.py and ocr_service.py via direct pytesseract calls
- Preprocessors (vin_preprocessor.py) and validators (vin_validator.py) are engine-independent
- Backend client (ocr-client.ts) is a thin HTTP proxy -- engine-independent
Decision Critic evaluated cloud API selection. Verdict: REVISE
Architecture: OCR Engine Abstraction
Unchanged layers (engine-independent):
Sub-Issues (milestones map 1:1)
Milestone 1: Engine Abstraction Layer (refs #116)
New files:
- ocr/app/engines/__init__.py
- ocr/app/engines/base_engine.py -- OcrEngine ABC
- ocr/app/engines/paddle_engine.py -- PaddleOCR PP-OCRv4 wrapper
- ocr/app/engines/tesseract_engine.py -- pytesseract wrapper (backward compat)
- ocr/app/engines/engine_factory.py -- Factory from config
Engine interface:
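The interface listing itself is not present in this copy of the thread. The sketch below is a hedged reconstruction from names that appear elsewhere in the discussion (OcrEngine, OcrConfig, OcrEngineResult, WordBox, recognize(), char_whitelist, single_line, single_word, hints, and the three exception classes). The WordBox geometry fields, the word_boxes attribute name, and the EchoEngine stub are assumptions made for illustration, not the committed code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


class EngineError(Exception):
    """Base class for all OCR engine failures."""


class EngineUnavailableError(EngineError):
    """The engine cannot be used (missing dependency, model, or secret)."""


class EngineProcessingError(EngineError):
    """The engine was available but failed on this image."""


@dataclass
class WordBox:
    # Geometry field names are illustrative assumptions.
    text: str
    confidence: float  # normalized to 0.0-1.0
    x: int = 0
    y: int = 0
    width: int = 0
    height: int = 0


@dataclass
class OcrConfig:
    char_whitelist: Optional[str] = None
    single_line: bool = False
    single_word: bool = False
    # Engine-specific parameters live here instead of on the common fields.
    hints: dict[str, Any] = field(default_factory=dict)


@dataclass
class OcrEngineResult:
    text: str
    confidence: float  # normalized to 0.0-1.0
    word_boxes: list[WordBox] = field(default_factory=list)


class OcrEngine(ABC):
    """Common contract that every engine wrapper implements."""

    @abstractmethod
    def recognize(self, image_bytes: bytes, config: OcrConfig) -> OcrEngineResult:
        """Run OCR on encoded image bytes and return a normalized result."""


class EchoEngine(OcrEngine):
    """Hypothetical stub: treats the input bytes as already-recognized text."""

    def recognize(self, image_bytes: bytes, config: OcrConfig) -> OcrEngineResult:
        text = image_bytes.decode("ascii", errors="ignore")
        return OcrEngineResult(text=text, confidence=1.0)
```

Callers such as the VIN extractor would depend only on OcrEngine.recognize() and OcrEngineResult, which is what lets the preprocessors, validators, and routers stay unchanged.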
Config updates (config.py):
- OCR_PRIMARY_ENGINE: "paddleocr" (default) | "tesseract"
- OCR_CONFIDENCE_THRESHOLD: 0.6 (for fallback trigger)
Dependencies (requirements.txt):
- paddlepaddle>=2.6.0 (CPU), paddleocr>=2.8.0
- pytesseract>=0.3.10 (backward compat)
Milestone 2: VIN Extractor Migration (refs #117)
Modified files:
- ocr/app/extractors/vin_extractor.py
  - Replace import pytesseract with engine factory import
  - Replace _perform_ocr() internals: pytesseract.image_to_data() -> engine.recognize()
  - Replace _try_alternate_ocr() PSM fallbacks with PaddleOCR angle detection
- ocr/app/services/ocr_service.py
  - Replace pytesseract.image_to_data() with engine interface
  - Remove pytesseract.pytesseract.tesseract_cmd initialization
- ocr/app/extractors/receipt_extractor.py (if it uses Tesseract directly)
Preserved (unchanged):
- ocr/app/preprocessors/vin_preprocessor.py -- produces image bytes (engine-agnostic)
- ocr/app/validators/vin_validator.py -- operates on text strings (engine-agnostic)
- ocr/app/routers/extract.py -- calls extractor.extract() (engine-agnostic)
- backend/src/features/ocr/ -- HTTP proxy (engine-agnostic)
- frontend/ -- camera/crop/display (engine-agnostic)
Key adaptation: PaddleOCR returns [[[box], (text, confidence)]] format vs Tesseract's dict format. The engine abstraction normalizes this.
Milestone 3: Optional Cloud Fallback (refs #118)
New files:
- ocr/app/engines/cloud_engine.py -- Google Vision TEXT_DETECTION
- ocr/app/engines/hybrid_engine.py -- Primary + fallback logic
HybridEngine logic:
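The logic listing is missing from this copy of the thread. The following is a minimal sketch of the behavior described later in the discussion: call the primary engine first, consult the fallback only when confidence is below the threshold, guard the cloud call with a timeout, keep the higher-confidence result, and degrade to the primary result on any fallback failure. The function name hybrid_recognize, the Result type, and the use of a thread pool for the timeout are assumptions, not the committed HybridEngine.

```python
import concurrent.futures
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    text: str
    confidence: float  # 0.0-1.0


Recognizer = Callable[[bytes], Result]


def hybrid_recognize(
    image: bytes,
    primary: Recognizer,
    fallback: Optional[Recognizer] = None,
    threshold: float = 0.6,
    timeout_s: float = 5.0,
) -> Result:
    """Primary first; fallback only when primary confidence is below threshold."""
    primary_result = primary(image)
    if fallback is None or primary_result.confidence >= threshold:
        return primary_result

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fallback, image)
        fallback_result = future.result(timeout=timeout_s)
    except Exception:
        # Timeout, missing credentials, network error: keep the primary result.
        return primary_result
    finally:
        # Do not block the request on a fallback call that is still running.
        pool.shutdown(wait=False)

    if fallback_result.confidence > primary_result.confidence:
        return fallback_result
    return primary_result
```

With fallback=None, which corresponds to OCR_FALLBACK_ENGINE "none", the function reduces to a plain call to the primary engine, so the default configuration never touches the network.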
Config:
- OCR_FALLBACK_ENGINE: "google_vision" | "none" (default: "none")
- OCR_FALLBACK_THRESHOLD: 0.6 (trigger cloud when primary < this)
- GOOGLE_VISION_KEY_PATH: "/run/secrets/google-vision-key.json" (optional)
Design notes:
Milestone 4: Docker/Infrastructure (refs #119)
Modified files:
- ocr/Dockerfile
  - tesseract-ocr apt package (optional backward compat)
- ocr/requirements.txt -- add paddlepaddle, paddleocr, google-cloud-vision
- docker-compose.yml
  - OCR_PRIMARY_ENGINE, OCR_FALLBACK_ENGINE, OCR_FALLBACK_THRESHOLD
  - ./secrets/app/google-vision-key.json:/run/secrets/google-vision-key.json:ro
  - GOOGLE_APPLICATION_CREDENTIALS: /run/secrets/google-vision-key.json
- docker-compose.staging.yml -- same env var overrides
Image size consideration: PaddleOCR with models adds ~500MB-1GB to the container image. Acceptable for a self-hosted single-tenant deployment.
Milestone 5: Crop Tool Fix (refs #120)
Investigation approach:
- data-crop-container selector works for dimension calculation
Files to investigate/modify:
- frontend/src/shared/components/CameraCapture/CropTool.tsx
- frontend/src/shared/components/CameraCapture/useImageCrop.ts
- frontend/src/shared/components/CameraCapture/CameraCapture.tsx
Note: This milestone runs in parallel with M1-M4 (no OCR engine dependency).
Milestone 6: Tests and Documentation (refs #121)
Test updates:
- ocr/tests/test_vin_extraction.py -- adapt for PaddleOCR output
- ocr/tests/test_engine_abstraction.py (new) -- engine interface, factory, hybrid
- npm test, npm run lint, npm run type-check
Documentation updates:
- docs/ocr-pipeline-tech-stack.md -- update architecture to PaddleOCR primary
- ocr/CLAUDE.md -- add engines/ directory reference
Execution Order
Branch: issue-115-improve-ocr-paddleocr
PR title: feat: Improve OCR process - replace Tesseract with PaddleOCR (#115)
PR body: Fixes #115, Fixes #116, Fixes #117, Fixes #118, Fixes #119, Fixes #120, Fixes #121
Risk Assessment
Verdict: AWAITING_REVIEW | Next: QR plan-completeness review
QR Review: plan-completeness
Phase: Plan-Review | Agent: Quality Reviewer | Status: PASS (with amendments)
Findings
RESOLVED - Sub-issues already exist: Sub-issues #116-#121 were created with labels (status/backlog + type/*) and milestone (Sprint 2026-02-02) before the plan was posted. QR agent could not verify via Gitea MCP but they exist.
HIGH - Missing coverage for two acceptance criteria (ADDRESSED below):
- VinOcrReviewModal.tsx (high/medium/low indicators). No changes needed, but plan should note this as already satisfied.
HIGH - Cloud fallback acceptance criterion ambiguity (ADDRESSED below):
SHOULD_FIX - OcrConfig abstraction: Accept feedback. Will use a hints: dict[str, Any] pattern for engine-specific config alongside common fields.
SHOULD_FIX - Test parallelization: Noted. M6 stays as final gate for simplicity, but engine abstraction tests will be written alongside M1 implementation.
Plan Amendments
- VinOcrReviewModal.tsx
- hints: dict pattern for engine-specific params
Verdict: PASS | Next: TW plan-scrub
TW Review: plan-scrub
Phase: Plan-Review | Agent: Technical Writer | Status: PASS
Summary
All 6 categories passed: Clarity, Terminology, File Paths, Naming, Ambiguity, AI Context Efficiency.
Should-Fix Clarifications (non-blocking)
- Reference ocr-client.ts explicitly, not generic backend/src/features/ocr/
- ocr_service.py listed as both modified AND preserved -- it IS modified (remove pytesseract)
- single_line/single_word config hints are PaddleOCR interpretations, not PSM modes
- /extract/receipt, /extract/manual
All file paths verified. Naming conventions follow project patterns. Each milestone is executable with just plan + sub-issue description.
Verdict: PASS | Next: QR plan-code
QR Review: plan-code
Phase: Plan-Review | Agent: Quality Reviewer | Status: PASS (with amendments)
RULE 0 Findings (addressed)
Cloud engine lazy initialization: CloudEngine must use lazy loading -- don't crash the container if the secret is missing/invalid at startup. Initialize _client = None, load on first recognize() call.
- EngineUnavailableError
Cloud API timeout handling: HybridEngine must wrap cloud calls with explicit timeout (5s). On timeout, return primary result.
- timeout=5.0 in cloud recognize, catch Timeout, return primary
PaddleOCR model download verification: Dockerfile must verify models downloaded successfully during build, not fail silently at runtime.
- RUN python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')" verification step
Engine exception hierarchy: Define EngineError, EngineUnavailableError, EngineProcessingError in base_engine.py. All engines must raise these, not raw library exceptions.
RULE 1 Findings (addressed)
Python naming: Follows PEP 8 and matches existing codebase (PascalCase classes, snake_case fields). Explicit note added.
Android testing: Added Android Chrome to M5 (#120) test matrix alongside iOS Safari and desktop Chrome.
Receipt/manual integration tests: Added pytest test_receipt_extraction.py to M6 (#121).
RULE 2 Findings (noted)
Factory simplicity: Keep as factory function, not class. Single responsibility: instantiation only. Fallback logic in HybridEngine, config validation in OcrConfig.
Dead code removal: M2 explicitly removes the _try_alternate_ocr() method and PSM constants. VIN_WHITELIST moves to engine config.
Verdict: PASS | Next: QR plan-docs
QR Review: plan-docs
Phase: Plan-Review | Agent: Quality Reviewer | Status: PASS (with amendments)
HIGH Findings (addressed)
Missing ocr/app/CLAUDE.md update: New engines/ directory must be added to the ocr/app/CLAUDE.md subdirectories table.
- ocr/app/CLAUDE.md with engines/ entry
Incomplete docs/ocr-pipeline-tech-stack.md spec: Must specify concrete changes -- swap primary/fallback in flowchart, update tech stack table, add cloud fallback row, update Dockerfile section.
Missing .ai/context.json update: mvp-ocr description must reflect PaddleOCR primary + optional cloud.
SHOULD_FIX Findings (noted)
docs/ARCHITECTURE-OVERVIEW.md: Optional update to mention pluggable engine architecture.
Verdict: PASS | Next: Plan APPROVED -- ready for execution
Plan Status: APPROVED
Phase: Planning | Agent: Orchestrator | Status: APPROVED
Review Cycle Complete
Amendments Incorporated from Reviews
From QR plan-completeness:
From TW plan-scrub:
- ocr-client.ts explicitly in backend references
- ocr_service.py is modified in M2 (not preserved)
From QR plan-code:
From QR plan-docs:
- ocr/app/CLAUDE.md with engines/ entry
- docs/ocr-pipeline-tech-stack.md
- .ai/context.json OCR description
Sub-Issues
Next Steps
- issue-115-improve-ocr-paddleocr from main
Decision Critic: docTR vs PaddleOCR for VIN-Only OCR
Phase: Planning | Agent: Decision Critic | Status: STAND (PaddleOCR confirmed)
Context
Research note (comment #1298) suggested docTR may be better for VIN-only scope based on 10/10 pure OCR accuracy vs PaddleOCR 9/10. Decision Critic evaluated whether the plan should switch to docTR.
Verdict: STAND -- PaddleOCR remains the correct choice
The research note contained factual errors that, when corrected, strengthen PaddleOCR:
Why PaddleOCR wins for VIN scene text
VIN scanning photographs car door jambs, dashboard plates, and registration cards under varied lighting and angles. This is a scene text problem, not a document text problem:
No plan changes needed
The engine abstraction layer (OcrEngine ABC) means docTR can be added as an alternative engine in the future if VIN-specific fine-tuning is performed.
Sources
Milestone 1: Engine Abstraction Layer (refs #116)
Phase: Execution | Agent: Developer | Status: PASS
Changes Committed
Branch: issue-115-improve-ocr-paddleocr
Commit: ebc633f - feat: add OCR engine abstraction layer (refs #116)
New Files
- ocr/app/engines/__init__.py
- ocr/app/engines/base_engine.py -- OcrEngine ABC, OcrConfig, OcrEngineResult, WordBox dataclasses, exception hierarchy (EngineError, EngineUnavailableError, EngineProcessingError)
- ocr/app/engines/paddle_engine.py -- PaddleOcrEngine: PP-OCRv4 wrapper with lazy init, angle classification, CPU-only, char whitelist filtering
- ocr/app/engines/tesseract_engine.py -- TesseractEngine: pytesseract wrapper mapping OcrConfig to PSM modes and whitelist config
- ocr/app/engines/engine_factory.py -- create_engine() factory function with dynamic import from engine registry
Modified Files
- ocr/app/config.py -- OCR_PRIMARY_ENGINE (default: "paddleocr") and OCR_CONFIDENCE_THRESHOLD (default: 0.6) env vars
- ocr/requirements.txt -- paddlepaddle>=2.6.0, paddleocr>=2.8.0
Plan Compliance
- recognize() returning structured OcrEngineResult (text, confidence, word boxes)
- OcrConfig.char_whitelist
- OcrConfig.single_line / single_word map to PaddleOCR angle detection and Tesseract PSM 7/8
- OcrConfig.hints dict for engine-specific parameters (per QR plan-completeness amendment)
- TesseractEngine wrapper
Verdict: PASS | Next: M2 (#117) - Migrate VIN extractor to engine abstraction
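As a rough illustration of the normalization this milestone's PaddleOCR wrapper has to perform, the sketch below flattens the nested [[[box], (text, confidence)]] shape quoted in the plan into a single text/confidence result. The Word and Normalized types, the function name, and the use of a plain mean for overall confidence are assumptions; the committed engine may aggregate differently.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    confidence: float
    box: list  # four [x, y] corner points as returned by the detector


@dataclass
class Normalized:
    text: str
    confidence: float
    words: list = field(default_factory=list)


def normalize_paddle_output(raw) -> Normalized:
    """Flatten [[[box], (text, confidence)], ...] pages into one result.

    Pages (or the whole result) may be None or empty when no text is found.
    """
    words = []
    for page in raw or []:
        for box, (text, score) in page or []:
            words.append(Word(text=text, confidence=float(score), box=box))

    if not words:
        return Normalized(text="", confidence=0.0)

    joined = " ".join(w.text for w in words)
    # Assumption: overall confidence is the unweighted mean of word scores.
    mean_conf = sum(w.confidence for w in words) / len(words)
    return Normalized(text=joined, confidence=mean_conf, words=words)
```

Because every engine is expected to report confidence in the 0.0-1.0 range, downstream code such as the fallback threshold check can compare results without knowing which engine produced them.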
Milestone 2: VIN Extractor Migration (refs #117)
Phase: Execution | Agent: Developer | Status: PASS
Changes Committed
Branch: issue-115-improve-ocr-paddleocr
Commit: 013fb0c - feat: migrate VIN/receipt extractors and OCR service to engine abstraction (refs #117)
Modified Files
- ocr/app/extractors/vin_extractor.py -- replaced pytesseract.image_to_data() with engine.recognize() via OcrConfig; replaced PSM mode fallbacks (7, 8, 11, 13) with engine-agnostic single-line/single-word configs; VIN char whitelist passed via OcrConfig for post-OCR filtering; updated debug logs from Tesseract-specific "PSM 6" to engine-agnostic "Primary OCR"
- ocr/app/services/ocr_service.py -- replaced pytesseract.image_to_data() with engine.recognize(); removed dead _process_ocr_data() method (Tesseract dict processing now handled by engine abstraction); updated module docstring
- ocr/app/extractors/receipt_extractor.py -- replaced pytesseract.image_to_string() with engine.recognize(); removed PSM parameter from _perform_ocr()
Removed Imports (across all 3 files)
- import pytesseract
- from PIL import Image (where no longer needed)
- import io (where no longer needed)
- from app.config import settings (where only used for tesseract_cmd)
Added Imports (across all 3 files)
- from app.engines import OcrConfig, create_engine
Plan Compliance
- engine.recognize() instead of pytesseract directly
- OcrConfig.char_whitelist (PaddleOCR does post-filter, Tesseract uses config flag)
- OcrEngineResult (0.0-1.0 range from all engines)
- _process_ocr_data() from ocr_service.py
Note
ocr/app/extractors/manual_extractor.py still uses pytesseract directly. Not in scope for #117 (not listed in plan). Can be migrated in M6 or a follow-up issue.
Verdict: PASS | Next: M3 (#118) - Optional Google Vision cloud fallback
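The whitelist post-filter mentioned in this milestone (PaddleOCR filters recognized text after OCR, whereas Tesseract can restrict characters during recognition through its tessedit_char_whitelist setting) can be sketched as follows. The constant's value is an assumption based on the VIN alphabet (digits and capital letters excluding I, O, and Q), and apply_whitelist is a hypothetical helper name, not the project's function.

```python
import string
from typing import Optional

# Assumed value: VINs use digits and capital letters except I, O and Q.
VIN_WHITELIST = "".join(
    ch for ch in string.ascii_uppercase + string.digits if ch not in "IOQ"
)


def apply_whitelist(text: str, whitelist: Optional[str]) -> str:
    """Drop every character that is not in the whitelist.

    With no whitelist configured the text passes through unchanged.
    """
    if not whitelist:
        return text
    allowed = set(whitelist)
    return "".join(ch for ch in text if ch in allowed)
```

Note that this filter only removes characters. It does not upper-case input or correct look-alike substitutions, so those steps would still belong to the VIN validator.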
Milestone 3: Optional Google Vision Cloud Fallback (refs #118)
Phase: Execution | Agent: Developer | Status: PASS
Changes Committed
Branch: issue-115-improve-ocr-paddleocr
Commit: 4ef942c - feat: add optional Google Vision cloud fallback engine (refs #118)
New Files
- ocr/app/engines/cloud_engine.py -- CloudEngine: Google Vision TEXT_DETECTION wrapper with lazy client initialization. Raises EngineUnavailableError when the secret is missing (not at startup, only on first recognize() call). Applies char whitelist filtering to both word-level and full-text results. Uses 0.95 default confidence (Vision API does not return per-word confidence in TEXT_DETECTION).
- ocr/app/engines/hybrid_engine.py -- HybridEngine: primary + fallback engine with confidence threshold. Calls primary first; if confidence < threshold and fallback is configured, calls fallback. Returns higher-confidence result. 5-second timeout guard on cloud calls. Graceful degradation: returns primary result on any fallback failure.
Modified Files
- ocr/app/config.py -- OCR_FALLBACK_ENGINE (default: "none"), OCR_FALLBACK_THRESHOLD (default: 0.6), GOOGLE_VISION_KEY_PATH (default: "/run/secrets/google-vision-key.json")
- ocr/app/engines/engine_factory.py -- _create_single_engine() + create_engine(). Factory now auto-wraps primary in HybridEngine when OCR_FALLBACK_ENGINE != "none". Fallback creation failure is non-fatal (logs warning, returns primary only). Added google_vision to engine registry.
- ocr/app/engines/__init__.py
- ocr/requirements.txt -- google-cloud-vision>=3.7.0
Plan Compliance
- Cloud fallback off by default (OCR_FALLBACK_ENGINE=none) per Decision Critic verdict
- OCR_FALLBACK_THRESHOLD
Acceptance Criteria Status
Verdict: PASS | Next: M4 (#119) - Docker/infrastructure updates
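The lazy client initialization described in this milestone can be sketched as below: construction never touches the secret, and a missing or unusable key surfaces as EngineUnavailableError only on the first recognize() call. The class name LazyCloudEngine and the injected client_factory are assumptions that keep the example free of the Google client library; they are not the committed CloudEngine.

```python
import os
from typing import Callable, Optional


class EngineUnavailableError(Exception):
    """Raised when the engine cannot be initialized (e.g. missing secret)."""


class LazyCloudEngine:
    def __init__(self, key_path: str, client_factory: Callable[[str], Callable]):
        # Nothing is loaded here, so a missing secret cannot crash startup.
        self._key_path = key_path
        self._client_factory = client_factory
        self._client: Optional[Callable] = None

    def _get_client(self) -> Callable:
        if self._client is None:
            if not os.path.isfile(self._key_path):
                raise EngineUnavailableError(
                    f"credentials not found at {self._key_path}"
                )
            try:
                self._client = self._client_factory(self._key_path)
            except Exception as exc:
                # Wrap library errors so callers see one exception hierarchy.
                raise EngineUnavailableError(str(exc)) from exc
        return self._client

    def recognize(self, image_bytes: bytes):
        client = self._get_client()  # first call pays the initialization cost
        return client(image_bytes)
```

Combined with a hybrid wrapper that treats any fallback exception as "use the primary result", this keeps a misconfigured secret from affecting requests that the primary engine can already handle.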
Milestone 4: Docker/Infrastructure Updates (refs #119)
Phase: Execution | Agent: Developer | Status: PASS
Changes Committed
Branch: issue-115-improve-ocr-paddleocr
Commit: 9b64173 - chore: update Docker and compose files for PaddleOCR engine (refs #119)
Modified Files
- ocr/Dockerfile -- replaced libtesseract-dev with libgomp1 (OpenMP for PaddlePaddle); added PP-OCRv4 model pre-download and verification during build; added engine documentation header; kept tesseract-ocr + tesseract-ocr-eng for backward compat
- docker-compose.yml -- added OCR_PRIMARY_ENGINE, OCR_FALLBACK_ENGINE, OCR_FALLBACK_THRESHOLD, GOOGLE_VISION_KEY_PATH env vars to mvp-ocr; added commented Google Vision volume mount with enable instructions
- docker-compose.staging.yml
- docker-compose.prod.yml
New Files
- secrets/app/google-vision-key.json.example
Plan Compliance
- libtesseract-dev removed (unused; pytesseract uses binary, not C library)
- libgomp1 added for PaddlePaddle OpenMP requirement
- Tesseract packages kept for backward compat (tesseract-ocr + tesseract-ocr-eng)
- Cloud fallback off by default (OCR_FALLBACK_ENGINE=none)
Acceptance Criteria Status
Note
Container image size will increase ~500MB-1GB due to PaddleOCR models. Acceptable for single-tenant deployment. Final image size to be documented after first build in M6.
Verdict: PASS | Next: M5 (#120) - Fix crop tool regression
Milestone 5: Fix Crop Tool Regression (refs #120)
Phase: Execution | Agent: Developer | Status: PASS
Root Cause Analysis
The free-form crop tool introduced by PR #114 had three bugs that made it non-functional:
Bug 1 (Critical): Stale cropAreaRef in handleDragEnd
- cropAreaRef was synced via useEffect (passive effect, runs after browser paint)
- When touchend/mouseup fired, the ref still held the value from before the last handleMove call
- cropAreaRef was still { width: 0, height: 0 } from handleDrawStart
- The minSize check always failed, so cropDrawn never became true and the confirm button stayed disabled
Bug 2 (High): minSize check incompatible with aspect ratio
- aspectRatio = 6, constraining height = width / 6
- handleDragEnd required BOTH width >= 10% AND height >= 10%
- height >= 10% required width >= 60% (drawing across 60% of the image!)
Bug 3 (Minor): Drawing mode bounds overflow
- y + height could exceed 100%
Changes Committed
Branch: issue-115-improve-ocr-paddleocr
Commit: 3c1a090 - fix: resolve crop tool regression with stale ref and aspect ratio minSize (refs #120)
Modified Files
- frontend/src/shared/components/CameraCapture/useImageCrop.ts -- removed useEffect-based cropAreaRef sync; added direct synchronous ref updates in handleDrawStart, handleMove (drawing mode), and handleMove (handle-drag mode); fixed handleDragEnd minSize check to only verify width when aspect ratio constrains height; added bounds clamping for aspect-ratio-forced height
Verification
- npm run lint --quiet
- npx tsc --noEmit
Acceptance Criteria Status
Verdict: PASS | Next: M6 (#121) - Tests and documentation
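The arithmetic behind Bug 2 can be checked with a small worked example. The hook itself is TypeScript; the Python below only mirrors the size check described in the root cause analysis, with all values expressed as percentages of the image. The constant and function names are hypothetical.

```python
ASPECT_RATIO = 6.0   # width / height, as in the aspect-ratio-constrained crop
MIN_SIZE = 10.0      # minimum size in percent


def passes_old_check(width_pct: float) -> bool:
    """Old behavior: both dimensions must reach the minimum size."""
    height_pct = width_pct / ASPECT_RATIO
    return width_pct >= MIN_SIZE and height_pct >= MIN_SIZE


def passes_new_check(width_pct: float) -> bool:
    """Fixed behavior: only width is checked when height is derived from it."""
    return width_pct >= MIN_SIZE
```

Under the old check, a crop had to span at least 60% of the image width before its derived height reached 10%, so a reasonable 30%-wide selection was rejected. Under the fixed check, the same selection is accepted.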
Milestone 6: Tests and Documentation (refs #121)
Phase: Execution | Agent: Developer | Status: PASS
Changes Committed
Branch: issue-115-improve-ocr-paddleocr
Commit: 47c5676 - chore: update OCR tests and documentation (refs #121)
New Files
- ocr/tests/test_engine_abstraction.py
Modified Files
- ocr/tests/test_vin_extraction.py -- TestVinExtractorEngineIntegration class: tests verifying VinExtractor calls engine.recognize() with correct OcrConfig (VIN whitelist, angle_cls, single_line/word modes); tests for _calculate_base_confidence (empty, weighted blend, single value)
- docs/ocr-pipeline-tech-stack.md
- docs/CLAUDE.md
- ocr/CLAUDE.md -- app/engines/ subdirectory entry
- ocr/app/CLAUDE.md -- engines/ directory to subdirectories table
- .ai/context.json
Plan Compliance
- .ai/context.json updated per QR plan-docs amendment
- ocr/app/CLAUDE.md includes engines/ directory per QR plan-docs amendment
Verification
Test Coverage Summary
- test_engine_abstraction.py
- test_vin_extraction.py
Verdict: PASS | Next: All 6 milestones complete. Ready for PR.
QR Review: Final Review
Phase: Review | Agent: Quality Agent | Status: PASS
RULE 0 (CRITICAL) - Production Reliability: PASS
No violations found:
- .json.example placeholder committed (not real key)
RULE 1 (HIGH) - Project Conformance: PASS
RULE 2 (SHOULD_FIX) - Structural Quality: NOTED (non-blocking)
TesseractEngine retained for backward compatibility: 115 lines of code, not used in default config but available via OCR_PRIMARY_ENGINE=tesseract. Justified by plan decision to keep as configurable fallback.
manual_extractor.py still uses pytesseract directly: Not in scope for this PR (noted in M2 milestone comment). Recommend follow-up issue to migrate to engine abstraction.
Test Coverage
PR #122 Verdict
APPROVED FOR MERGE
All RULE 0 and RULE 1 gates pass. RULE 2 items are non-blocking and documented.
Verdict: PASS | Next: Merge PR, move to status/done