fix: remove char whitelist incompatible with Tesseract LSTM (refs #113)
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 36s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
tessedit_char_whitelist does not work with OEM 1 (the LSTM engine) and causes empty or erratic output. This was the root cause of Tesseract returning empty text despite clear, well-preprocessed images.

Character filtering is already handled post-OCR by the VIN validator's correct_ocr_errors() method (I->1, O->0, Q->0, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
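For readers unfamiliar with the post-OCR step mentioned above, here is a minimal sketch of what such a correction could look like. It is not the project's actual correct_ocr_errors() implementation: the function body, the OCR_SUBSTITUTIONS table, and the whitespace handling are assumptions for illustration, and only the three substitutions named in the commit message (I->1, O->0, Q->0) are covered.

```python
# Hypothetical sketch of a post-OCR correction step. The real
# correct_ocr_errors() lives in the project's VIN validator and may differ.

# VINs never contain the letters I, O, or Q, so an OCR result containing
# them is mapped to the look-alike digit (the mappings named in the commit).
OCR_SUBSTITUTIONS = str.maketrans({"I": "1", "O": "0", "Q": "0"})


def correct_ocr_errors(raw_text: str) -> str:
    """Normalize raw OCR text and replace letters that cannot occur in a VIN."""
    # Drop all whitespace (OCR often inserts spaces or newlines), then uppercase.
    normalized = "".join(raw_text.split()).upper()
    return normalized.translate(OCR_SUBSTITUTIONS)


if __name__ == "__main__":
    print(correct_ocr_errors("1hgcm82633ao04352"))
```

Doing the substitution after OCR, rather than constraining the engine's character set, keeps the recognizer unrestricted and moves the VIN-specific knowledge into plain string handling that is easy to test.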
@@ -299,11 +299,12 @@ class VinExtractor(BaseExtractor):
  # Configure Tesseract for VIN extraction
  # OEM 1 = LSTM neural network engine (best accuracy)
  # Disable dictionaries since VINs are not dictionary words
+ # NOTE: tessedit_char_whitelist does NOT work with OEM 1 (LSTM).
+ # Using it causes empty/erratic output. Character filtering is
+ # handled post-OCR by vin_validator.correct_ocr_errors() instead.
  config = (
      f"--psm {psm} "
      f"--oem 1 "
-     f"-c tessedit_char_whitelist={self.VIN_WHITELIST} "
      f"-c load_system_dawg=false "
      f"-c load_freq_dawg=false"
  )
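The resulting configuration can be sketched as a standalone helper. This is an illustrative reconstruction, not code from the repository: build_tesseract_config() and strip_non_alphanumeric() are hypothetical names, and the assumption that the string is later passed to an OCR wrapper (for example as the config argument of pytesseract.image_to_string) is not confirmed by the diff.

```python
import re


def build_tesseract_config(psm: int) -> str:
    """Build the Tesseract option string without tessedit_char_whitelist."""
    return (
        f"--psm {psm} "                  # page segmentation mode chosen by the caller
        "--oem 1 "                       # OEM 1 = LSTM engine
        "-c load_system_dawg=false "     # VINs are not dictionary words
        "-c load_freq_dawg=false"
    )


def strip_non_alphanumeric(ocr_text: str) -> str:
    """Hypothetical post-OCR filter standing in for the removed whitelist.

    Keeps only A-Z and 0-9 after uppercasing. Letters such as I, O, and Q
    are kept here so that a later correction step can map them to digits.
    """
    return re.sub(r"[^A-Z0-9]", "", ocr_text.upper())


if __name__ == "__main__":
    print(build_tesseract_config(7))
    print(strip_non_alphanumeric(" 1hg-cm826\n"))
```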