The min-channel correctly extracts contrast (white text=255 vs green
sticker bg=130), but Tesseract expects dark text on light background.
Without inversion, the grayscale-only path returned empty text for
every PSM mode because Tesseract couldn't see bright-on-dark text.
Invert via bitwise_not: text becomes 0 (black), sticker bg becomes
125 (gray). Fixes all three OCR paths (adaptive, grayscale, Otsu).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two fixes:
1. Always use the min-channel for color images instead of the gated
comparison, which was falling back to standard grayscale (only ~23%
contrast for white-on-green VIN stickers).
2. Add grayscale-only OCR path (CLAHE + denoise, no thresholding)
between the adaptive and Otsu attempts. Tesseract's LSTM engine is
designed to consume grayscale input directly and often performs better
on it than on binarized input, where thresholding introduces artifacts.
Pipeline order: adaptive threshold → grayscale-only → Otsu threshold
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace std-based channel selection (which incorrectly picked green for
green-tinted VIN stickers) with per-pixel min(B,G,R). White text stays
255 in all channels while colored backgrounds drop to their weakest
channel value, giving a ~2x contrast improvement. Add a morphological
opening after thresholding to remove noise speckles from the car body
surface that were confusing Tesseract's page segmentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
White text on green VIN stickers has only ~12% contrast in standard
grayscale conversion because the green channel dominates luminance.
The new _best_contrast_channel method computes each RGB channel's
standard deviation (as a proxy for contrast) and selects the channel
with the highest value, giving ~2x improvement for green-tinted VIN
stickers. Falls back to standard grayscale for neutral-colored images.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement async PDF processing for owner's manuals with maintenance
schedule extraction:
- Add PDF preprocessor with PyMuPDF for text/scanned PDF handling
- Add maintenance pattern matching (mileage, time, fluid specs)
- Add service name mapping to maintenance subtypes
- Add table detection and parsing for schedule tables
- Add manual extractor orchestrating the complete pipeline
- Add POST /extract/manual endpoint for async job submission
- Add Redis job queue support for manual extraction jobs
- Add progress tracking during processing
Processing pipeline:
1. Analyze PDF structure (text layer vs scanned)
2. Find maintenance schedule sections
3. Extract text or OCR scanned pages at 300 DPI
4. Detect and parse maintenance tables
5. Normalize service names and extract intervals
6. Return structured maintenance schedules with confidence scores
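Step 5's interval extraction could look like this sketch; the patterns and the function name are illustrative assumptions, not the module's actual rules:

```python
import re

# Hypothetical interval patterns; real schedules need far more variants.
MILEAGE_RE = re.compile(r"(?:every\s+)?([\d,]+)\s*(?:miles|mi|km)", re.I)
MONTHS_RE = re.compile(r"(?:every\s+)?(\d+)\s*months?", re.I)

def parse_interval(text: str) -> dict:
    """Extract mileage and month intervals from one schedule cell."""
    result = {}
    if m := MILEAGE_RE.search(text):
        result["mileage"] = int(m.group(1).replace(",", ""))
    if m := MONTHS_RE.search(text):
        result["months"] = int(m.group(1))
    return result
```

A cell like "Replace engine oil every 7,500 miles or 12 months" yields both keys; a cell with no interval yields an empty dict, which downstream confidence scoring can treat as a miss.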
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement receipt-specific OCR extraction for fuel receipts:
- Pattern matching modules for date, currency, and fuel data extraction
- Receipt-optimized image preprocessing for thermal receipts
- POST /extract/receipt endpoint with field extraction
- Confidence scoring per extracted field
- Cross-validation of fuel receipt data
- Unit tests for all pattern matchers
Extracted fields: merchantName, transactionDate, totalAmount,
fuelQuantity, pricePerUnit, fuelGrade
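The cross-validation step can be sketched as checking quantity x unit price against the total; the regexes, 2% tolerance, and function name are illustrative assumptions (real receipts need locale-aware number handling):

```python
import re

# Hypothetical field patterns for US-style gallon receipts.
GALLONS_RE = re.compile(r"([\d.]+)\s*gal", re.I)
PRICE_RE = re.compile(r"\$?\s*([\d.]+)\s*/\s*gal", re.I)
TOTAL_RE = re.compile(r"total\s*\$?\s*([\d.]+)", re.I)

def cross_validate(text: str, tolerance: float = 0.02) -> dict:
    """Extract fuel fields; flag them consistent when quantity x unit
    price matches the total within the given tolerance."""
    fields = {}
    if m := GALLONS_RE.search(text):
        fields["fuelQuantity"] = float(m.group(1))
    if m := PRICE_RE.search(text):
        fields["pricePerUnit"] = float(m.group(1))
    if m := TOTAL_RE.search(text):
        fields["totalAmount"] = float(m.group(1))
    if {"fuelQuantity", "pricePerUnit", "totalAmount"} <= fields.keys():
        expected = fields["fuelQuantity"] * fields["pricePerUnit"]
        diff = abs(expected - fields["totalAmount"])
        fields["consistent"] = diff <= tolerance * fields["totalAmount"]
    return fields
```

Agreement between the arithmetic and the printed total is a cheap signal for boosting per-field confidence scores when OCR is noisy.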
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>