chore: add PDF support to receipt OCR pipeline (refs #182)

The receipt extractor only accepted image MIME types, rejecting PDFs at the OCR layer. Added application/pdf to supported types and PDF-to-image conversion (first page at 300 DPI) before OCR preprocessing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
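
A minimal sketch of the flow described above, not the code from this commit: the names `SUPPORTED_TYPES`, `PDF_RENDER_DPI`, `detect_content_type`, `prepare_for_ocr`, and the `render_first_page` callable are all hypothetical. The renderer is passed in so the sketch stays free of third-party dependencies; in practice it might wrap a library such as pdf2image, which is an assumption about the stack rather than something the commit states.

```python
from typing import Callable, Optional

# Hypothetical constants; the real extractor's names may differ.
SUPPORTED_TYPES = frozenset(
    {"image/heic", "image/jpeg", "image/png", "application/pdf"}
)
PDF_RENDER_DPI = 300  # the commit message states first page at 300 DPI


def detect_content_type(data: bytes) -> Optional[str]:
    """Guess the MIME type from leading magic bytes; None if unrecognized."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    # HEIC files carry an ISO BMFF 'ftyp' box with a HEIC brand at offset 8.
    if data[4:8] == b"ftyp" and data[8:12] in (b"heic", b"heix"):
        return "image/heic"
    return None


def prepare_for_ocr(
    data: bytes,
    render_first_page: Callable[[bytes, int], bytes],
    content_type: Optional[str] = None,
) -> bytes:
    """Return image bytes ready for OCR preprocessing.

    PDFs are rasterized (first page only) via the injected renderer;
    supported image types pass through unchanged.
    """
    content_type = content_type or detect_content_type(data)
    if content_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported content type: {content_type}")
    if content_type == "application/pdf":
        return render_first_page(data, PDF_RENDER_DPI)
    return data
```

Injecting the renderer keeps the type check and the PDF branch testable without a PDF backend installed; a caller would supply a function that takes the PDF bytes and a DPI and returns the encoded first-page image.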
2026-02-13 21:22:40 -06:00
parent 83bacf0e2f
commit 653c535165
3 changed files with 33 additions and 4 deletions
--- a/ocr/app/extractors/maintenance_receipt_extractor.py
+++ b/ocr/app/extractors/maintenance_receipt_extractor.py
@@ -98,7 +98,7 @@ class MaintenanceReceiptExtractor:
        """Extract maintenance receipt fields from an image.

        Args:
-            image_bytes: Raw image bytes (HEIC, JPEG, PNG).
+            image_bytes: Raw image or PDF bytes (HEIC, JPEG, PNG, PDF).
            content_type: MIME type (auto-detected if not provided).

        Returns: