motovaultpro/ocr/app/main.py
Eric Gullickson 3eb54211cb
feat: add owner's manual OCR pipeline (refs #71)
Implement async PDF processing for owner's manuals with maintenance
schedule extraction:

- Add PDF preprocessor with PyMuPDF for text/scanned PDF handling
- Add maintenance pattern matching (mileage, time, fluid specs)
- Add service name mapping to maintenance subtypes
- Add table detection and parsing for schedule tables
- Add manual extractor orchestrating the complete pipeline
- Add POST /extract/manual endpoint for async job submission
- Add Redis job queue support for manual extraction jobs
- Add progress tracking during processing
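
The mileage and time pattern matching listed above could look roughly like the sketch below. The regular expressions, the `Interval` dataclass, and `parse_interval` are hypothetical names chosen for illustration; the patterns this commit actually adds are not shown on this page.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns (assumption): a number with optional thousands
# separators followed by a mileage unit, and a number followed by a time unit.
MILEAGE_RE = re.compile(r"(\d{1,3}(?:,\d{3})+|\d+)\s*(?:miles|mi)\b", re.IGNORECASE)
TIME_RE = re.compile(r"(\d+)\s*(months?|years?)\b", re.IGNORECASE)


@dataclass
class Interval:
    """A maintenance interval; either field may be missing."""
    miles: Optional[int] = None
    months: Optional[int] = None


def parse_interval(text: str) -> Interval:
    """Extract the first mileage and the first time interval found in text."""
    interval = Interval()
    mileage = MILEAGE_RE.search(text)
    if mileage:
        # Drop thousands separators before converting to int.
        interval.miles = int(mileage.group(1).replace(",", ""))
    time = TIME_RE.search(text)
    if time:
        value = int(time.group(1))
        # Normalize years to months so callers compare a single unit.
        is_years = time.group(2).lower().startswith("year")
        interval.months = value * 12 if is_years else value
    return interval
```

For example, `parse_interval("Replace engine oil every 5,000 miles or 6 months")` yields an `Interval` with `miles=5000` and `months=6`.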

Processing pipeline:
1. Analyze PDF structure (text layer vs scanned)
2. Find maintenance schedule sections
3. Extract text or OCR scanned pages at 300 DPI
4. Detect and parse maintenance tables
5. Normalize service names and extract intervals
6. Return structured maintenance schedules with confidence scores
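
Steps 5 and 6 can be illustrated with a minimal sketch of service-name normalization that also returns a confidence score. The alias table, the subtype names, the `normalize_service` helper, and the 1.0 / 0.8 scores are all assumptions made for this example, not the mapping or scoring used by the service.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table (assumption); the real mapping is not shown here.
SUBTYPE_ALIASES: Dict[str, List[str]] = {
    "Engine Oil": ["engine oil", "oil change", "motor oil"],
    "Air Filter": ["air filter", "air cleaner element"],
    "Coolant": ["engine coolant", "coolant", "antifreeze"],
}


def normalize_service(raw: str) -> Tuple[Optional[str], float]:
    """Map a raw service label to (subtype, confidence).

    An exact alias match scores 1.0, an alias found inside a longer
    phrase scores 0.8, and anything else returns (None, 0.0).
    """
    # Lowercase and collapse runs of whitespace left over from OCR.
    text = " ".join(raw.lower().split())
    best: Tuple[Optional[str], float] = (None, 0.0)
    for subtype, aliases in SUBTYPE_ALIASES.items():
        for alias in aliases:
            if text == alias:
                return subtype, 1.0
            if alias in text and best[1] < 0.8:
                best = (subtype, 0.8)
    return best
```

A fixed two-level score keeps the sketch deterministic; a real pipeline would more likely combine OCR confidence with match quality.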

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 21:30:20 -06:00

65 lines
1.8 KiB
Python

"""OCR Service FastAPI Application."""
import logging
from contextlib import asynccontextmanager
from typing import AsyncIterator

from fastapi import FastAPI

from app.config import settings
from app.routers import extract_router, jobs_router
from app.services import job_queue

# Configure logging
logging.basicConfig(
    level=getattr(logging, settings.log_level.upper(), logging.INFO),
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger(__name__)


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    """Application lifespan handler for startup/shutdown."""
    # Startup
    logger.info("OCR service starting up")
    yield
    # Shutdown
    logger.info("OCR service shutting down")
    await job_queue.close()


app = FastAPI(
    title="MotoVaultPro OCR Service",
    description="OCR processing service for vehicle documents",
    version="1.0.0",
    lifespan=lifespan,
)

# Include routers
app.include_router(extract_router)
app.include_router(jobs_router)


@app.get("/health")
async def health_check() -> dict:
    """Health check endpoint for container orchestration."""
    return {"status": "healthy"}


@app.get("/")
async def root() -> dict:
    """Root endpoint with service information."""
    return {
        "service": "mvp-ocr",
        "version": "1.0.0",
        "log_level": settings.log_level,
        "endpoints": [
            "POST /extract - Synchronous OCR extraction",
            "POST /extract/vin - VIN-specific extraction with validation",
            "POST /extract/receipt - Receipt extraction (fuel, general)",
            "POST /extract/manual - Owner's manual extraction (async)",
            "POST /jobs - Submit async OCR job",
            "GET /jobs/{job_id} - Get async job status",
        ],
    }