feat: Receipt OCR Pipeline #69

Closed
opened 2026-02-01 18:47:42 +00:00 by egullickson · 0 comments

Overview

Implement receipt-specific OCR extraction in the OCR service, extracting fuel purchase data: date, total amount, gallons, price per unit, and station name.

Parent Issue: #12 (OCR-powered smart capture)
Priority: P2 - Fuel Receipt OCR
Dependencies: OCR Service Container Setup, Core OCR API Integration

Scope

Receipt Extraction Endpoint

POST /extract/receipt
Content-Type: multipart/form-data

Request:
  - file: image (HEIC, JPEG, PNG)
  - receipt_type: "fuel" (optional, for specialized extraction)

Response:
{
  "success": true,
  "receiptType": "fuel",
  "extractedFields": {
    "merchantName": { "value": "Shell", "confidence": 0.92 },
    "transactionDate": { "value": "2024-01-15", "confidence": 0.88 },
    "totalAmount": { "value": 45.67, "confidence": 0.95 },
    "fuelQuantity": { "value": 12.5, "confidence": 0.90 },
    "pricePerUnit": { "value": 3.65, "confidence": 0.87 },
    "fuelGrade": { "value": "87", "confidence": 0.75 }
  },
  "rawText": "SHELL\n123 MAIN ST\n...",
  "processingTimeMs": 1450
}
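The response shape above could be assembled by a small helper. This is a minimal sketch rather than the service's actual code: `ExtractedField` and `build_receipt_response` are hypothetical names that do not appear in the issue.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ExtractedField:
    """One extracted value plus its confidence in [0, 1] (hypothetical helper)."""
    value: Any
    confidence: float


def build_receipt_response(receipt_type, fields, raw_text, elapsed_ms):
    """Assemble a dict shaped like the example response in this issue."""
    return {
        "success": True,
        "receiptType": receipt_type,
        "extractedFields": {
            name: {"value": f.value, "confidence": round(f.confidence, 2)}
            for name, f in fields.items()
        },
        "rawText": raw_text,
        "processingTimeMs": int(elapsed_ms),
    }


response = build_receipt_response(
    "fuel",
    {
        "merchantName": ExtractedField("Shell", 0.92),
        "totalAmount": ExtractedField(45.67, 0.95),
    },
    raw_text="SHELL\n123 MAIN ST",
    elapsed_ms=1450,
)
print(json.dumps(response, indent=2))
```

Fields that could not be extracted are simply left out of `fields` in this sketch; whether the real endpoint omits them or returns null values is not specified in the issue.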

Image Preprocessing Pipeline

Input Image
    ↓
HEIC Conversion (pillow-heif) if needed
    ↓
Grayscale conversion
    ↓
Perspective correction (if needed)
    ↓
High contrast enhancement
    ↓
Adaptive thresholding (receipt-optimized)
    ↓
OCR with Tesseract
    ↓
Field extraction with regex + NLP
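
To illustrate the adaptive thresholding step, here is a dependency-free sketch of mean-based adaptive thresholding on a grayscale image stored as nested lists. The function name `adaptive_threshold` and the `window` and `offset` parameters are assumptions for illustration; the real preprocessor would more likely rely on an imaging library, which the issue does not name for this step.

```python
def adaptive_threshold(gray, window=3, offset=5):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Each pixel is compared with the mean of its local window minus `offset`,
    so dark text stays dark even where the background brightness varies.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the window at the image borders.
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            local = [gray[j][i] for j in ys for i in xs]
            mean = sum(local) / len(local)
            # White (255) if the pixel is brighter than the local cutoff.
            row.append(255 if gray[y][x] > mean - offset else 0)
        out.append(row)
    return out


# A faded receipt patch: light background with one darker "ink" pixel.
faded = [
    [200, 200, 200],
    [200, 120, 200],
    [200, 200, 200],
]
binary = adaptive_threshold(faded)
```

Because the cutoff is computed per neighbourhood, a global brightness change (for example a shadow across the receipt) shifts the cutoff with it, which is why this family of methods suits low-contrast thermal paper.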

Field Extraction Patterns

Date Patterns:

DATE_PATTERNS = [
    r'\d{1,2}/\d{1,2}/\d{2,4}',      # 01/15/2024
    r'\d{1,2}-\d{1,2}-\d{2,4}',      # 01-15-2024
    r'[A-Za-z]{3}\s+\d{1,2},?\s+\d{4}',  # Jan 15, 2024
]
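
A sketch of how matches from these patterns might be normalized to the `YYYY-MM-DD` form used in the response. It assumes US-style month-first ordering, and the names `STRPTIME_FORMATS` and `extract_date` are hypothetical.

```python
import re
from datetime import datetime

DATE_PATTERNS = [
    r'\d{1,2}/\d{1,2}/\d{2,4}',          # 01/15/2024
    r'\d{1,2}-\d{1,2}-\d{2,4}',          # 01-15-2024
    r'[A-Za-z]{3}\s+\d{1,2},?\s+\d{4}',  # Jan 15, 2024
]

# Assumption: month comes before day, as on most US receipts.
STRPTIME_FORMATS = [
    '%m/%d/%Y', '%m/%d/%y',
    '%m-%d-%Y', '%m-%d-%y',
    '%b %d, %Y', '%b %d %Y',
]


def extract_date(text):
    """Return the first parseable date in `text` as YYYY-MM-DD, or None."""
    for pattern in DATE_PATTERNS:
        for match in re.finditer(pattern, text):
            # Collapse runs of whitespace so 'Jan  15, 2024' still parses.
            candidate = re.sub(r'\s+', ' ', match.group(0))
            for fmt in STRPTIME_FORMATS:
                try:
                    return datetime.strptime(candidate, fmt).date().isoformat()
                except ValueError:
                    continue  # wrong format or impossible date; try the next
    return None
```

Letting `strptime` validate the match rejects OCR noise such as `13/45/2024`, which the regex alone would accept.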

Amount Patterns:

TOTAL_PATTERNS = [
    r'\bTOTAL[:\s]*\$?([\d,]+\.\d{2})',   # \b keeps SUBTOTAL from matching
    r'\bAMOUNT[:\s]*\$?([\d,]+\.\d{2})',
    r'\bSALE[:\s]*\$?([\d,]+\.\d{2})',
]

GALLONS_PATTERNS = [
    r'([\d.]+)\s*(?:GAL|GALLONS?)',
    r'GALLONS?[:\s]*([\d.]+)',
]

PRICE_PATTERNS = [
    r'\$?([\d.]+)/GAL',
    r'PRICE[:\s]*\$?([\d.]+)',
]
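
One way these amount patterns could be applied, with a cross-check that gallons × price roughly equals the total. This is a sketch under assumptions: `first_number`, `extract_amounts`, and the 0.02 tolerance are illustrative, and a word boundary is added to the total patterns so that `SUBTOTAL` lines are skipped.

```python
import re

# Patterns from the issue; \b is added here so SUBTOTAL does not match TOTAL.
TOTAL_PATTERNS = [
    r'\bTOTAL[:\s]*\$?([\d,]+\.\d{2})',
    r'\bAMOUNT[:\s]*\$?([\d,]+\.\d{2})',
    r'\bSALE[:\s]*\$?([\d,]+\.\d{2})',
]
GALLONS_PATTERNS = [
    r'([\d.]+)\s*(?:GAL|GALLONS?)',
    r'GALLONS?[:\s]*([\d.]+)',
]
PRICE_PATTERNS = [
    r'\$?([\d.]+)/GAL',
    r'PRICE[:\s]*\$?([\d.]+)',
]


def first_number(patterns, text):
    """Return the first captured number as a float, or None."""
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return float(match.group(1).replace(',', ''))
            except ValueError:
                continue  # e.g. '1.2.3' or a lone '.' from OCR noise
    return None


def extract_amounts(text, tolerance=0.02):
    """Extract total, gallons, and price, plus a consistency flag."""
    total = first_number(TOTAL_PATTERNS, text)
    gallons = first_number(GALLONS_PATTERNS, text)
    price = first_number(PRICE_PATTERNS, text)
    consistent = None  # unknown unless all three values were found
    if None not in (total, gallons, price):
        consistent = abs(gallons * price - total) <= tolerance
    return {'total': total, 'gallons': gallons,
            'pricePerUnit': price, 'consistent': consistent}


SAMPLE = "SHELL\nGALLONS: 12.500\nPRICE: $3.654\nSUBTOTAL 45.67\nTOTAL $45.67"
result = extract_amounts(SAMPLE)
```

The `consistent` flag is one possible input to per-field confidence scoring: when the three values agree arithmetically, each is more likely to have been read correctly.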

Fuel Grade Patterns:

GRADE_PATTERNS = [
    r'(?:REGULAR|REG)\s*(\d{2})',         # REGULAR 87
    r'(?:PLUS|MID)\s*(\d{2})',            # PLUS 89
    r'(?:PREMIUM|PREM|SUPER)\s*(\d{2})',  # PREMIUM 93
    r'UNLEADED\s*(\d{2})',                # UNLEADED 87
    r'DIESEL',                            # no octane number, so no capture group
]
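
Since the `DIESEL` pattern has no capture group while the others capture an octane number, a caller has to handle both cases. A minimal sketch, with `detect_fuel_grade` as a hypothetical name:

```python
import re

GRADE_PATTERNS = [
    r'(?:REGULAR|REG)\s*(\d{2})',
    r'(?:PLUS|MID)\s*(\d{2})',
    r'(?:PREMIUM|PREM|SUPER)\s*(\d{2})',
    r'UNLEADED\s*(\d{2})',
    r'DIESEL',
]


def detect_fuel_grade(text):
    """Return the octane number as a string, 'DIESEL', or None."""
    for pattern in GRADE_PATTERNS:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            # lastindex is None when the pattern has no capture group (DIESEL).
            return match.group(1) if match.lastindex else 'DIESEL'
    return None
```

Note that `REG` followed by two digits could also match a register number on some receipts, so the result is probably best paired with a lower confidence when only the abbreviated form matched.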

Merchant Detection

  • Common gas station names: Shell, Chevron, Exxon, Mobil, BP, etc.
  • Address parsing for station location
  • Logo detection (optional future enhancement)
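
Name-based merchant detection could start as a lookup against known brands near the top of the OCR text. This is a sketch under assumptions: `KNOWN_STATIONS` holds only the brands named above, and restricting the search to the first five lines is a heuristic, not something the issue specifies.

```python
import re

# Brands listed in the issue; a production list would be longer.
KNOWN_STATIONS = ['Shell', 'Chevron', 'Exxon', 'Mobil', 'BP']


def detect_merchant(raw_text, max_lines=5):
    """Return the first known brand found in the leading lines, or None."""
    for line in raw_text.splitlines()[:max_lines]:
        for name in KNOWN_STATIONS:
            # Word boundaries avoid matching a brand inside a longer word.
            if re.search(rf'\b{re.escape(name)}\b', line, re.IGNORECASE):
                return name
    return None
```

The word-boundary match is deliberately strict, so a fused header such as `EXXONMOBIL` would need its own entry in the list.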

Directory Structure

ocr/app/
├── extractors/
│   ├── receipt_extractor.py   # Receipt-specific logic
│   └── fuel_receipt.py        # Fuel receipt specialization
├── preprocessors/
│   └── receipt_preprocessor.py  # Receipt-optimized preprocessing
└── patterns/
    ├── __init__.py
    ├── date_patterns.py
    ├── currency_patterns.py
    └── fuel_patterns.py

Test Cases

| Input | Expected Output |
|-------|-----------------|
| Clear gas station receipt | All fields extracted, confidence > 85% |
| Faded thermal receipt | Most fields extracted, lower confidence |
| Crumpled receipt | Best effort, some fields may fail |
| Non-fuel receipt | receiptType: "unknown", generic extraction |
| Foreign receipt | Date/amount extraction, localized formats |

Acceptance Criteria

  • Endpoint accepts HEIC, JPEG, PNG images
  • HEIC conversion works correctly
  • Preprocessing optimized for thermal receipts
  • Date extraction with multiple format support
  • Total amount extraction accurate
  • Gallons/liters quantity extraction
  • Price per unit extraction
  • Fuel grade detection
  • Merchant name extraction
  • Confidence scoring for each field
  • Processing time < 3 seconds
  • Unit tests for pattern matching
  • Integration tests with sample receipts

Technical Notes

  • Thermal receipts often have low contrast, so preprocessing is critical
  • Amount values may need currency symbol handling
  • Consider imperial/metric unit detection
  • Some receipts list multiple transactions; extract the latest or largest
  • Log extraction attempts for pattern improvement
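
The imperial/metric note could be handled by capturing the unit alongside the quantity. A minimal sketch, assuming the hypothetical names `QUANTITY_PATTERN`, `extract_quantity`, and `to_gallons`, and assuming US liquid gallons:

```python
import re

LITERS_PER_GALLON = 3.785411784  # US liquid gallon

# Longer unit names come first so 'GALLONS' is not cut short at 'GAL'.
QUANTITY_PATTERN = re.compile(
    r'(\d+(?:\.\d+)?)\s*(GALLONS?|GAL|LITERS?|LITRES?|L)\b',
    re.IGNORECASE,
)


def extract_quantity(text):
    """Return {'value': float, 'unit': 'gallons' | 'liters'} or None."""
    match = QUANTITY_PATTERN.search(text)
    if not match:
        return None
    unit = 'gallons' if match.group(2).upper().startswith('G') else 'liters'
    return {'value': float(match.group(1)), 'unit': unit}


def to_gallons(quantity):
    """Convert an extract_quantity() result to US gallons."""
    if quantity['unit'] == 'gallons':
        return quantity['value']
    return quantity['value'] / LITERS_PER_GALLON
```

Returning the unit with the value lets the caller decide whether to convert or to store the quantity as printed on the receipt.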

Out of Scope

  • Camera capture UI (see #12c)
  • FuelLogForm integration (see #12g)
  • Maintenance receipt extraction (Phase 2)
egullickson added the status/backlog and type/feature labels 2026-02-01 18:48:38 +00:00
egullickson added status/in-progress and removed status/backlog labels 2026-02-02 02:35:47 +00:00
egullickson added status/review and removed status/in-progress labels 2026-02-02 02:43:53 +00:00
Reference: egullickson/motovaultpro#69