feat: OCR Service Container Setup #64

Closed
opened 2026-02-01 18:46:02 +00:00 by egullickson · 2 comments
Owner

Overview

Add a Python OCR container (mvp-ocr) to the docker-compose architecture. This container is the foundation for all OCR functionality.

Parent Issue: #12 (OCR-powered smart capture)
Priority: P0 - Foundation
Dependencies: None

Scope

Container Setup

  • Add mvp-ocr service to docker-compose.yml
  • Python 3.11-slim base image
  • FastAPI framework with uvicorn
  • Health check endpoint at /health

Core Dependencies (requirements.txt)

# API Framework
fastapi>=0.100.0
uvicorn[standard]>=0.23.0
python-multipart>=0.0.6

# File Detection & Handling
python-magic>=0.4.27
pillow>=10.0.0
pillow-heif>=0.13.0

# Image Preprocessing
opencv-python-headless>=4.8.0
numpy>=1.24.0

# OCR Engines
pytesseract>=0.3.10

# Note: PaddleOCR deferred to later issue if needed for fallback

System Package Requirements

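apt-get update
apt-get install -y tesseract-ocr tesseract-ocr-eng libtesseract-dev
apt-get install -y libheif-examples libheif-dev
# libgl1 replaces libgl1-mesa-glx, a transitional package dropped in Debian 12 (bookworm)
apt-get install -y libgl1 libglib2.0-0
apt-get install -y libmagic1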

Directory Structure

ocr/
├── Dockerfile
├── requirements.txt
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI app with health endpoint
│   └── config.py        # Environment configuration
└── tests/
    └── test_health.py
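
The layout above reserves app/config.py for environment configuration, and the compose service sets LOG_LEVEL. A minimal sketch of what such a module might look like, using only the standard library. The `Settings` class, the `load_settings` function, and the list of accepted levels (uvicorn's log level names) are assumptions for illustration and are not taken from PR #72; port 8000 is inferred from the health check URL.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the service accepts the same level names uvicorn does.
VALID_LOG_LEVELS = ("critical", "error", "warning", "info", "debug", "trace")


@dataclass(frozen=True)
class Settings:
    """Immutable runtime settings for the OCR service (hypothetical shape)."""

    log_level: str = "info"
    port: int = 8000  # inferred from the health check URL in the compose file


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, failing fast on bad values.

    Passing `env` explicitly keeps the function easy to unit test;
    when omitted, the real process environment is used.
    """
    source = os.environ if env is None else env
    log_level = source.get("LOG_LEVEL", "info").strip().lower()
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(
            f"LOG_LEVEL must be one of {VALID_LOG_LEVELS}, got {log_level!r}"
        )
    return Settings(log_level=log_level)
```

Validating at startup means a typo in the compose file surfaces as an immediate container failure rather than as silently missing logs.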

Docker Compose Integration

mvp-ocr:
  build:
    context: ./ocr
    dockerfile: Dockerfile
  container_name: mvp-ocr
  restart: unless-stopped
  environment:
    - LOG_LEVEL=info
  networks:
    - mvp-network
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
    interval: 30s
    timeout: 10s
    retries: 3
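
One thing to watch with the health check above: slim Python base images typically do not ship curl, and curl is not in the system package list. If it is not installed in the Dockerfile, the probe fails even when the app is up. A possible alternative is a probe written against the Python standard library. The sketch below is an assumption about how that could look, not what PR #72 does; the function names are hypothetical.

```python
import json
import urllib.request


def body_reports_healthy(status_code: int, raw_body: bytes) -> bool:
    """Return True only for HTTP 200 with a JSON object {"status": "healthy"}."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(raw_body.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Probe the health endpoint; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return body_reports_healthy(response.status, response.read())
    except OSError:
        # URLError and HTTPError are OSError subclasses, as are timeouts
        # and refused connections.
        return False


def main() -> int:
    """Exit code for a Docker health check: 0 is healthy, 1 is unhealthy."""
    return 0 if is_healthy() else 1
```

Saved as a script that ends with `sys.exit(main())`, this could replace the curl command in the `test:` entry. Installing curl in the image is the other straightforward option.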

Acceptance Criteria

  • mvp-ocr container builds successfully
  • Container starts and passes health check
  • FastAPI app responds at /health with {"status": "healthy"}
  • pillow-heif can load HEIC images (unit test)
  • Tesseract can process a test image (unit test)
  • Container integrates with existing mvp-network
  • make setup and make rebuild work with 6 containers

Technical Notes

  • See docs/ocr-pipeline-tech-stack.md for the full architecture
  • HEIC conversion happens server-side via pillow-heif (confirmed decision)
  • Start minimal; PaddleOCR can be added later if Tesseract accuracy proves insufficient
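
For context on the HEIC note: the issue relies on python-magic for file detection and pillow-heif for decoding, and the sketch below is neither of those. It is a dependency-free illustration of what identifies a HEIC/HEIF upload at the byte level, namely an ISO base media `ftyp` box whose major brand belongs to the HEIF family. The function name and the brand list are assumptions for illustration, and the list is not exhaustive.

```python
# Major brands commonly seen in HEIC/HEIF files (illustrative, not exhaustive).
HEIF_MAJOR_BRANDS = frozenset(
    {b"heic", b"heix", b"hevc", b"hevx", b"mif1", b"msf1"}
)


def looks_like_heif(header: bytes) -> bool:
    """Return True if `header` starts with an 'ftyp' box naming a HEIF brand.

    Layout of the first 12 bytes of an ISO base media file:
      bytes 0-3   box size (big-endian)
      bytes 4-7   box type, b"ftyp"
      bytes 8-11  major brand
    Only the major brand is checked; compatible brands are ignored.
    """
    if len(header) < 12 or header[4:8] != b"ftyp":
        return False
    return header[8:12] in HEIF_MAJOR_BRANDS
```

A check like this only needs the first 12 bytes of an upload, so it can run before the file is handed to a decoder.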

Out of Scope

  • OCR endpoints (see #12b)
  • Celery/async job queue (see #12b)
  • VIN/receipt specific logic (see #12d, #12f)
egullickson added the status/backlog and type/feature labels 2026-02-01 18:48:34 +00:00
egullickson added status/in-progress and removed status/backlog labels 2026-02-01 19:00:36 +00:00
egullickson added status/review and removed status/in-progress labels 2026-02-01 19:06:32 +00:00
Author
Owner

Implementation Complete

PR #72 implements the OCR service container with all acceptance criteria met:

Verified

  • mvp-ocr container builds successfully
  • Container starts and passes health check (healthy status)
  • FastAPI app responds at /health with {"status": "healthy"}
  • pillow-heif can load HEIC images (unit test passed)
  • Tesseract can process test images (unit test passed)
  • Container integrates with existing backend network
  • docker-compose.yml now has 6 services configured

Files Created

  • ocr/Dockerfile - Python 3.11-slim with system dependencies
  • ocr/requirements.txt - FastAPI, pillow-heif, pytesseract, opencv
  • ocr/app/main.py - FastAPI app with health endpoint
  • ocr/app/config.py - Environment configuration
  • ocr/tests/test_health.py - Unit tests for health, HEIC, and Tesseract

Ready for review.

Author
Owner

CI/CD Fix Pushed

Fixed the deployment error: unable to prepare context: path "/opt/motovaultpro/ocr" not found.

Root cause: The OCR service was configured with build: context: ./ocr, which works for local development. CI/CD, however, deploys pre-built images from the registry, and the ocr/ directory isn't synced to the deploy path, so the build context does not exist there.

Fix: Updated all CI/CD workflows to build, push, and deploy the OCR image from the registry:

| File | Change |
|------|--------|
| staging.yaml | Build and push ocr:$SHA image, pull OCR image, include in health checks |
| production.yaml | Pull OCR image, start mvp-ocr as shared service |
| docker-compose.staging.yml | Added mvp-ocr with image: ${OCR_IMAGE} override |
| docker-compose.blue-green.yml | Added mvp-ocr with image: ${OCR_IMAGE} override |

The OCR service is a shared service (like postgres/redis), not part of blue-green deployment.


Reference: egullickson/motovaultpro#64