feat: Improve OCR process - replace Tesseract with PaddleOCR (#115) #122

Merged
egullickson merged 16 commits from issue-115-improve-ocr-paddleocr into main 2026-02-08 01:13:35 +00:00

16 Commits

Author SHA1 Message Date
Eric Gullickson
9a2b12c5dc fix: No matches
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 37s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 22s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 16:35:28 -06:00
Eric Gullickson
3adbb10ff6 fix: OCR Timout still
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 3m23s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 7s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 16:26:10 -06:00
Eric Gullickson
fcffb0bb43 fix: PaddleOCR timeout
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 3m20s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 22s
Deploy to Staging / Verify Staging (pull_request) Successful in 9s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 16:18:14 -06:00
Eric Gullickson
9d2d4e57b7 fix: PaddleOCR error
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 36s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 52s
Deploy to Staging / Verify Staging (pull_request) Successful in 9s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 7s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 16:12:07 -06:00
Eric Gullickson
0499c902a8 fix: Crop box broken
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 3m22s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 22s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 7s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 16:00:23 -06:00
Eric Gullickson
dab4a3bdf3 fix: PaddleOCR error
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 3m46s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 7s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 15:51:04 -06:00
Eric Gullickson
639ca117f1 fix: Update PaddleOCR API
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 5m6s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
2026-02-07 14:44:06 -06:00
Eric Gullickson
b9fe222f12 fix: Build errors and tesseract removal
Some checks failed
Deploy to Staging / Build Images (pull_request) Failing after 4m14s
Deploy to Staging / Deploy to Staging (pull_request) Has been skipped
Deploy to Staging / Verify Staging (pull_request) Has been skipped
Deploy to Staging / Notify Staging Ready (pull_request) Has been skipped
Deploy to Staging / Notify Staging Failure (pull_request) Successful in 8s
2026-02-07 12:12:04 -06:00
Eric Gullickson
cf114fad3c fix: build errors for OpenCV
Some checks failed
Deploy to Staging / Build Images (pull_request) Failing after 3m16s
Deploy to Staging / Deploy to Staging (pull_request) Has been skipped
Deploy to Staging / Verify Staging (pull_request) Has been skipped
Deploy to Staging / Notify Staging Ready (pull_request) Has been skipped
Deploy to Staging / Notify Staging Failure (pull_request) Successful in 8s
2026-02-07 11:58:00 -06:00
Eric Gullickson
47c5676498 chore: update OCR tests and documentation (refs #121)
Some checks failed
Deploy to Staging / Build Images (pull_request) Failing after 7m4s
Deploy to Staging / Deploy to Staging (pull_request) Has been skipped
Deploy to Staging / Verify Staging (pull_request) Has been skipped
Deploy to Staging / Notify Staging Ready (pull_request) Has been skipped
Deploy to Staging / Notify Staging Failure (pull_request) Successful in 7s
Add engine abstraction tests and update docs to reflect PaddleOCR primary
architecture with optional Google Vision cloud fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:42:51 -06:00
Eric Gullickson
1e96baca6f fix: workflow contract 2026-02-07 11:32:36 -06:00
Eric Gullickson
3c1a090ae3 fix: resolve crop tool regression with stale ref and aspect ratio minSize (refs #120)
Three bugs fixed in the draw-first crop tool introduced by PR #114:

1. Stale cropAreaRef: replaced useEffect-based ref sync with direct
   synchronous updates in handleMove and handleDrawStart. The useEffect
   ran after browser paint, so handleDragEnd read stale values (often
   {width:0, height:0}), preventing cropDrawn from being set.

2. Aspect ratio minSize: when aspectRatio=6 (VIN mode), height=width/6
   required width>=60% to pass the height>=10% check. Now only checks
   width>=minSize when aspect ratio constrains height.

3. Bounds clamping: aspect-ratio-forced height could push crop area
   past 100% of container. Now clamps y position to keep within bounds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:29:16 -06:00
Eric Gullickson
9b6417379b chore: update Docker and compose files for PaddleOCR engine (refs #119)
- Replace libtesseract-dev with libgomp1 (OpenMP for PaddlePaddle)
- Pre-download PP-OCRv4 models during Docker build
- Add OCR engine env vars to all compose files (base, staging, prod)
- Add optional Google Vision secret mount (commented, enable on demand)
- Create google-vision-key.json.example placeholder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:17:44 -06:00
Eric Gullickson
4ef942cb9d feat: add optional Google Vision cloud fallback engine (refs #118)
CloudEngine wraps Google Vision TEXT_DETECTION with lazy init.
HybridEngine runs primary engine, falls back to cloud when confidence
is below threshold. Disabled by default (OCR_FALLBACK_ENGINE=none).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 11:12:08 -06:00
Eric Gullickson
013fb0c67a feat: migrate VIN/receipt extractors and OCR service to engine abstraction (refs #117)
Replace direct pytesseract calls with OcrEngine interface in vin_extractor.py,
receipt_extractor.py, and ocr_service.py. PSM mode fallbacks replaced with
engine-agnostic single-line/single-word configs. Dead _process_ocr_data removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 10:56:27 -06:00
Eric Gullickson
ebc633fb36 feat: add OCR engine abstraction layer (refs #116)
Introduce pluggable OcrEngine ABC with PaddleOCR PP-OCRv4 as primary
engine and Tesseract wrapper for backward compatibility. Engine factory
reads OCR_PRIMARY_ENGINE config to instantiate the correct engine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 10:47:40 -06:00