fix: remove char whitelist incompatible with Tesseract LSTM (refs #113)
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 36s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped

tessedit_char_whitelist does not work with OEM 1 (LSTM engine) and
causes empty/erratic output. This was the root cause of Tesseract
returning empty text despite clear, well-preprocessed images.
Character filtering is already handled post-OCR by the VIN validator's
correct_ocr_errors() method (I->1, O->0, Q->0, etc).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Eric Gullickson
2026-02-06 21:52:08 -06:00
parent ae5221c759
commit 432b3bda36

View File

@@ -299,11 +299,12 @@ class VinExtractor(BaseExtractor):
# Configure Tesseract for VIN extraction
# OEM 1 = LSTM neural network engine (best accuracy)
# Disable dictionaries since VINs are not dictionary words
# NOTE: tessedit_char_whitelist does NOT work with OEM 1 (LSTM).
# Using it causes empty/erratic output. Character filtering is
# handled post-OCR by vin_validator.correct_ocr_errors() instead.
config = (
f"--psm {psm} "
f"--oem 1 "
f"-c tessedit_char_whitelist={self.VIN_WHITELIST} "
f"-c load_system_dawg=false "
f"-c load_freq_dawg=false"
)