fix: resolve VIN OCR scanning failures on all images (refs #113)
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 35s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 2m31s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
Root cause: Tesseract fragments VINs into multiple words but candidate extraction required continuous 17-char sequences, rejecting all results.

Changes:
- Fix candidate extraction to concatenate adjacent OCR fragments
- Disable Tesseract dictionaries (VINs are not dictionary words)
- Set OEM 1 (LSTM engine) for better accuracy
- Add PSM 11 (sparse text) and PSM 13 (raw line) fallback modes
- Add Otsu's thresholding as alternative preprocessing pipeline
- Upscale small images to meet Tesseract's 300 DPI requirement
- Remove incorrect B->8 and S->5 transliterations (valid VIN chars)
- Fix pre-existing test bug in check digit expected value

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
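The fragment-concatenation idea described in the root cause can be sketched roughly as follows. This is a standalone illustration, not the project's `extract_candidates`: the names `FRAGMENT_RE` and `max_fragments` are hypothetical, and it returns plain strings, whereas the tests in this commit index `candidates[0][0]`, which suggests the real method returns tuples. It also does not slide a window inside a single fragment longer than 17 characters.

```python
import re

# VINs use digits and capital letters except I, O and Q.
VIN_CHARS = "A-HJ-NPR-Z0-9"
FRAGMENT_RE = re.compile(rf"[{VIN_CHARS}]+")  # hypothetical helper pattern
VIN_LENGTH = 17


def extract_candidates(text: str, max_fragments: int = 4) -> list[str]:
    """Return 17-character VIN candidates, joining adjacent OCR fragments."""
    candidates: list[str] = []
    for line in text.upper().splitlines():
        # Spaces, dashes and any other non-VIN character act as separators.
        fragments = FRAGMENT_RE.findall(line)
        for start in range(len(fragments)):
            joined = ""
            # Concatenate up to max_fragments neighbours starting at `start`.
            for fragment in fragments[start:start + max_fragments]:
                joined += fragment
                if len(joined) == VIN_LENGTH and joined not in candidates:
                    candidates.append(joined)
                if len(joined) >= VIN_LENGTH:
                    break  # adding more fragments can only overshoot
    return candidates


print(extract_candidates("1HGBH 41JXMN 109186"))
print(extract_candidates("1HGBH41J-XMN109186"))
```

Capping the number of joined fragments keeps unrelated words on the same line from being glued into false candidates; the cap of 4 is an arbitrary choice for this sketch.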
@@ -43,9 +43,9 @@ class TestVinValidator:
         result = validator.calculate_check_digit("1HGBH41JXMN109186")
         assert result == "X"
 
-        # 5YJSA1E28HF123456 has check digit 2 at position 9
+        # 5YJSA1E28HF123456 has check digit at position 9
         result = validator.calculate_check_digit("5YJSA1E28HF123456")
-        assert result == "8"  # Verify this is correct for this VIN
+        assert result == "5"
 
     def test_validate_check_digit_valid(self) -> None:
         """Test check digit validation with valid VIN."""
@@ -161,6 +161,27 @@ class TestVinValidator:
         assert len(candidates) >= 1
         assert candidates[0][0] == "1HGBH41JXMN109186"
 
+    def test_extract_candidates_fragmented_vin(self) -> None:
+        """Test candidate extraction handles space-fragmented VINs from OCR."""
+        validator = VinValidator()
+
+        # Tesseract often fragments VINs into multiple words
+        text = "1HGBH 41JXMN 109186"
+        candidates = validator.extract_candidates(text)
+
+        assert len(candidates) >= 1
+        assert candidates[0][0] == "1HGBH41JXMN109186"
+
+    def test_extract_candidates_dash_fragmented_vin(self) -> None:
+        """Test candidate extraction handles dash-separated VINs."""
+        validator = VinValidator()
+
+        text = "1HGBH41J-XMN109186"
+        candidates = validator.extract_candidates(text)
+
+        assert len(candidates) >= 1
+        assert candidates[0][0] == "1HGBH41JXMN109186"
+
     def test_extract_candidates_no_vin(self) -> None:
         """Test candidate extraction with no VIN."""
         validator = VinValidator()