[Chore]: Upgrade container image versions and migrate Promtail to Grafana Alloy #95

Closed
opened 2026-02-05 23:30:30 +00:00 by egullickson · 2 comments
Owner

Problem / User Need

After implementing the unified logging system (#80-#87), the Promtail container fails with a Docker API version incompatibility error:

mvp-promtail | level=error ts=2026-02-05T03:27:10.039025327Z caller=refresh.go:99
component=docker_discovery discovery=docker config=containers/unix:///var/run/docker.sock:80
msg="Unable to refresh target groups"
err="error while listing containers: Error response from daemon: client version 1.42 is too old.
Minimum supported API version is 1.44, please upgrade your client to a newer version"

Root cause: Docker Engine v29 raised the minimum supported API version to 1.44. Promtail 2.9.0 (deployed in #86) embeds Docker client API v1.42, which is below this minimum. Additionally, several container images deployed in the logging stack are significantly outdated.
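
To make the failure concrete, the minimum-version check can be modelled in a few lines of Python. This is a minimal sketch, not Docker's actual implementation: `parse_api_version` and `client_is_supported` are hypothetical helper names, and the two version strings are taken from the error message above.

```python
def parse_api_version(version: str) -> tuple:
    """Split a Docker API version such as "1.42" into (major, minor) integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def client_is_supported(client: str, daemon_minimum: str) -> bool:
    """Return True when the client API version meets the daemon's minimum."""
    return parse_api_version(client) >= parse_api_version(daemon_minimum)


# Promtail 2.9.0 (client API 1.42) against Docker Engine v29 (minimum 1.44).
print(client_is_supported("1.42", "1.44"))  # prints False
```

Comparing integer tuples rather than floats or raw strings keeps versions such as 1.9 and 1.10 in the right order.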

Beyond the logging stack, the Python OCR container uses Python 3.11, which should be upgraded to 3.12 or later.

Current State vs Target State

| Component | Current Image | Current Version | Target Image | Target Version | Reason |
|-----------|---------------|-----------------|--------------|----------------|--------|
| Promtail | `grafana/promtail:2.9.0` | 2.9.0 | `grafana/alloy:v1.12.2` | 1.12.2 | Promtail deprecated (EOL March 2, 2026). Alloy is the official replacement. Fixes the Docker API v1.44 error. |
| Loki | `grafana/loki:2.9.0` | 2.9.0 | `grafana/loki:3.6.1` | 3.6.1 | Major version upgrade. Current version uses deprecated BoltDB shipper + schema v11. |
| Grafana | `grafana/grafana:10.0.0` | 10.0.0 | `grafana/grafana:12.4.0` | 12.4.0 | Two major versions behind. Note: `grafana/grafana-oss` repo deprecated as of 12.4.0 - use `grafana/grafana`. |
| Python (OCR) | `python:3.11-slim` | 3.11.x | `python:3.13-slim` | 3.13.x | Performance improvements, better error messages, security fixes. |

Proposed Solution

1. Replace Promtail with Grafana Alloy

Promtail is officially deprecated by Grafana Labs. LTS ends February 28, 2026. EOL is March 2, 2026. The official successor is Grafana Alloy (OpenTelemetry Collector distribution with programmable pipelines).

Migration approach:

  • Replace mvp-promtail container with mvp-alloy using grafana/alloy:v1.12.2
  • Convert config/promtail/config.yml to Alloy config format using alloy convert --source-format=promtail --output=config/alloy/config.alloy config/promtail/config.yml
  • Alloy uses discovery.docker + loki.source.docker for container log collection (replaces docker_sd_configs)
  • This resolves the Docker API v1.44 compatibility error

Current Promtail config (to be converted):

scrape_configs:
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s

Alloy equivalent (using native Docker components):

discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}

discovery.relabel "containers" {
  targets = discovery.docker.containers.targets
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.containers.output
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://mvp-loki:3100/loki/api/v1/push"
  }
}
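
The first relabel rule strips the leading slash that Docker places in front of container names. A rough Python sketch of that rule follows. It assumes the usual relabeling semantics, in which the regex must match the whole source value and the first capture group becomes the label value; `container_label` is a hypothetical helper name used only for illustration.

```python
import re
from typing import Optional

# Same pattern as the first relabel rule. Relabel regexes are fully anchored,
# so re.fullmatch is the closest Python equivalent.
CONTAINER_NAME = re.compile(r"/(.*)")


def container_label(meta_name: str) -> Optional[str]:
    """Return the value the "container" label would receive, or None if the rule does not match."""
    match = CONTAINER_NAME.fullmatch(meta_name)
    return match.group(1) if match else None


print(container_label("/mvp-alloy"))  # prints mvp-alloy
```

A value without the leading slash does not match, so in this model the rule leaves the label unset.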

2. Upgrade Loki 2.9.0 to 3.6.1

Breaking changes to address:

| Change | Impact | Action Required |
|--------|--------|-----------------|
| Schema v11 + BoltDB shipper deprecated | Storage incompatible with Structured Metadata | Migrate to `tsdb` index + `v13` schema, OR set `allow_structured_metadata: false` initially |
| `service_name` label auto-assigned | New label on all ingested logs | Review label limits (default reduced from 30 to 15) |
| Max line size default 256KB | Logs over 256KB truncated | Acceptable for this project |
| Metrics prefix change | `cortex_*` renamed to `loki_*` | No dashboards reference these yet |
| Bloom block format change (3.3+) | Old bloom blocks incompatible | No bloom blocks exist (fresh install) |

Recommended Loki config migration (current uses boltdb-shipper + schema: v11):

# Current (2.9.0)
schema_config:
  configs:
    - from: 2020-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11

# Target (3.6.1) - Option A: clean cutover to tsdb + v13 schema
schema_config:
  configs:
    - from: 2020-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

# Option B: keep boltdb-shipper + v11 temporarily and set
# allow_structured_metadata: false under limits_config

Since this is a fresh logging stack (deployed Feb 2026), there is no existing data to migrate. A clean cutover to tsdb + v13 schema is the simplest approach.
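
The compatibility rule behind the two options can be sketched as a small check. This is a simplified model of the rule described in the breaking-changes table, not Loki's own validation code: `schema_is_compatible` is a hypothetical name, and the dictionaries mirror the `schema_config` entries shown above.

```python
def schema_is_compatible(period: dict, allow_structured_metadata: bool = True) -> bool:
    """Simplified model: structured metadata needs a tsdb index with schema v13.

    Older schemas are only acceptable when structured metadata is disabled.
    """
    uses_tsdb_v13 = period.get("store") == "tsdb" and period.get("schema") == "v13"
    return uses_tsdb_v13 or not allow_structured_metadata


current = {"store": "boltdb-shipper", "object_store": "filesystem", "schema": "v11"}
target = {"store": "tsdb", "object_store": "filesystem", "schema": "v13"}

print(schema_is_compatible(current))  # prints False
print(schema_is_compatible(target))   # prints True
print(schema_is_compatible(current, allow_structured_metadata=False))  # prints True
```

The third call corresponds to the fallback path: the old schema is kept and structured metadata is switched off.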

3. Upgrade Grafana 10.0.0 to 12.4.0

  • Two major versions behind (10 -> 11 -> 12)
  • The Loki datasource provisioning config (config/grafana/datasources/loki.yml) should remain compatible
  • Note: Starting with 12.4.0, grafana/grafana-oss Docker Hub repo is deprecated. Use grafana/grafana instead.
  • Default admin password configuration remains the same via GF_SECURITY_ADMIN_PASSWORD

4. Upgrade Python 3.11 to 3.13

  • OCR container base image python:3.11-slim to python:3.13-slim
  • Python 3.13 latest stable: 3.13.11
  • Key improvements: performance, better error messages, security fixes
  • OCR dependencies (Tesseract bindings, Pillow) have Python 3.13 support
  • Verify all pip dependencies in OCR Dockerfile install cleanly on 3.13

Non-goals / Out of Scope

  • Migrating Grafana Alloy to collect metrics (only log collection for now)
  • OpenTelemetry tracing integration via Alloy
  • Upgrading Node.js or npm versions (separate issue if needed)
  • Upgrading Traefik version
  • Upgrading PostgreSQL or Redis versions

Acceptance Criteria

Promtail to Alloy Migration

  • mvp-promtail container replaced with mvp-alloy using grafana/alloy:v1.12.2
  • Alloy config created at config/alloy/config.alloy (replaces config/promtail/config.yml)
  • Docker service discovery (discovery.docker) works without API version errors
  • Container logs from all 6 application containers appear in Loki
  • Old Promtail config files cleaned up

Loki Upgrade

  • Loki upgraded to grafana/loki:3.6.1
  • Loki config updated for v3.6 compatibility (tsdb + v13 schema)
  • Loki starts without errors and accepts log pushes
  • 30-day retention policy preserved
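
Loki expresses retention as a duration in hours, so the 30-day policy corresponds to the 720h value kept in the Loki config. A minimal sketch of the conversion, with `retention_hours` as a hypothetical helper name:

```python
from datetime import timedelta


def retention_hours(days: int) -> str:
    """Convert a retention period in days to an hour-based duration string."""
    hours = timedelta(days=days) // timedelta(hours=1)
    return f"{hours}h"


print(retention_hours(30))  # prints 720h
```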

Grafana Upgrade

  • Grafana upgraded to grafana/grafana:12.4.0
  • Loki datasource auto-provisioned and functional
  • Log queries return results in Explore view
  • IP whitelist middleware still restricts access to RFC1918 ranges

Python Upgrade

  • OCR container base image updated to python:3.13-slim
  • All OCR dependencies install successfully
  • OCR functionality works end-to-end (VIN image processing)

General

  • All 9 containers start without errors
  • docker compose up -d succeeds cleanly
  • CI/CD pipelines updated if image references exist there
  • Documentation updated (CLAUDE.md, README.md, docs/LOGGING.md) with new version numbers
  • Container count remains 9 (6 application + 3 logging)

Files to Modify

  • docker-compose.yml - Update image versions, rename mvp-promtail to mvp-alloy
  • config/promtail/config.yml - Remove (replaced by Alloy config)
  • Create: config/alloy/config.alloy - New Alloy configuration
  • config/loki/config.yml - Update for Loki 3.6 compatibility
  • backend/ocr/Dockerfile (or equivalent) - Update Python base image
  • docs/LOGGING.md - Update version references and Alloy details
  • README.md - Update if container names referenced
  • CLAUDE.md - Update if version/container details referenced
  • .gitea/workflows/staging.yaml - Update if image versions referenced
  • .gitea/workflows/production.yaml - Update if image versions referenced

Test Plan

Smoke tests:

  • All 9 containers start without errors in docker compose logs
  • No Docker API version errors from mvp-alloy
  • Loki accepts writes from Alloy at /loki/api/v1/push

Functional tests:

  • Generate application traffic and verify logs appear in Grafana
  • Query logs by container name, service label, and log level
  • Correlation IDs (requestId) still propagate correctly
  • OCR processes a VIN image successfully with Python 3.13
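
The label queries in the functional tests could be scripted against Loki's HTTP API. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint. It assumes the `mvp-loki:3100` address from the Alloy config and the `container` and `service` labels set by the relabel rules; `query_range_url` is a hypothetical helper and the label values are illustrative.

```python
from urllib.parse import urlencode

# Host and port as used in the Alloy config in this issue.
LOKI_BASE_URL = "http://mvp-loki:3100"


def query_range_url(label: str, value: str, limit: int = 20) -> str:
    """Build a query_range URL for a single-label LogQL stream selector."""
    selector = f'{{{label}="{value}"}}'  # for example {container="mvp-alloy"}
    params = urlencode({"query": selector, "limit": limit})
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{params}"


print(query_range_url("container", "mvp-alloy"))
```

Fetching the URL from a host on the backend network and checking that the result contains at least one stream would cover the "logs appear" checks without opening Grafana.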

Regression tests:

  • Grafana IP whitelist still blocks public access
  • Log rotation still functions (10m x 3 per container)
  • CI/CD pipelines deploy without errors
egullickson added the status backlog and type chore labels 2026-02-05 23:30:35 +00:00
egullickson added the status in-progress label and removed the status backlog label 2026-02-05 23:33:23 +00:00
Author
Owner

Plan: Container Image Upgrades and Promtail-to-Alloy Migration

Phase: Planning | Agent: Orchestrator | Status: AWAITING_REVIEW


Sub-Issues Created

| Sub-Issue | Title | Type | Dependencies |
|-----------|-------|------|--------------|
| #96 | Update base image mirror script | chore | None (first) |
| #97 | Replace Promtail with Grafana Alloy | chore | Blocked by #96 |
| #98 | Upgrade Loki 2.9.0 to 3.6.1 | chore | Blocked by #96 |
| #99 | Upgrade Grafana 10.0.0 to 12.4.0 | chore | Blocked by #96, best after #98 |
| #100 | Upgrade OCR Python 3.11 to 3.13 | chore | Blocked by #96 |
| #101 | Update documentation | docs | Blocked by #97-#100 |
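
The dependency column can be checked mechanically. The sketch below encodes the table as a graph and verifies that a proposed execution order respects every "blocked by" edge. It assumes Python 3.9+ for `graphlib`, treats "best after #98" as a hard dependency for simplicity, and uses the hypothetical names `BLOCKED_BY` and `is_valid_order`.

```python
from graphlib import TopologicalSorter

# Sub-issue number -> sub-issues that must land first (from the table above).
BLOCKED_BY = {
    96: set(),
    97: {96},
    98: {96},
    99: {96, 98},  # "best after #98" is modelled as a hard dependency here
    100: {96},
    101: {97, 98, 99, 100},
}


def is_valid_order(order: list) -> bool:
    """Return True when every sub-issue appears after all of its blockers."""
    position = {issue: index for index, issue in enumerate(order)}
    return all(
        position[blocker] < position[issue]
        for issue, blockers in BLOCKED_BY.items()
        for blocker in blockers
    )


# prepare() raises graphlib.CycleError if the dependencies are circular.
TopologicalSorter(BLOCKED_BY).prepare()

print(is_valid_order([96, 97, 98, 99, 100, 101]))  # prints True
```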

Decision Critic Results

Four key decisions were evaluated through a 7-step structured critique:

Decision 1: Loki Schema Migration - VERIFIED with nuance

  • Proposed: Clean cutover to tsdb + v13 schema
  • Finding: boltdb-shipper+v11 is deprecated in Loki 3.6.1 but still works. Clean cutover is correct for a fresh install (C1 VERIFIED - no historical data). Fallback: if tsdb causes issues, temporarily set allow_structured_metadata: false with v11.

Decision 2: Container Naming - STAND

  • Proposed: Rename mvp-promtail to mvp-alloy
  • Finding: All references identified (A5 VERIFIED). Only 4 files need name changes. Clarity benefit outweighs change cost (J2).

Decision 3: Execution Ordering - REVISED

  • Original: Mirror -> Alloy+Loki -> Grafana -> Python -> Docs
  • Revised: Mirror -> Alloy (URGENT, lands first) -> Loki -> Grafana -> Python -> Docs
  • Rationale: Alloy migration is the critical fix for the Docker API error. Must be isolatable if later milestones fail. Python upgrade (C7 UNCERTAIN - 3.13 wheel availability) should NOT block the Alloy fix.

Decision 4: Loki Volume Handling - STAND

  • Proposed: Clear mvp_loki_data volume
  • Finding: Fresh install with days of data at most (C1 VERIFIED). No compliance requirements. Add a note in the migration step to verify the volume is clearable.

Codebase Analysis Summary

10 files require changes across 6 categories:

| Category | Files | Changes |
|----------|-------|---------|
| Container orchestration | `docker-compose.yml` | 3 image versions + rename service |
| Build infrastructure | `scripts/ci/mirror-base-images.sh` | 4 image version lines |
| CI/CD pipeline | `.gitea/workflows/production.yaml` | 1 container name reference (line 174) |
| Service configuration | `config/promtail/config.yml` (DELETE), `config/alloy/config.alloy` (CREATE), `config/loki/config.yml` | Schema migration + new Alloy config |
| Container build | `ocr/Dockerfile` | 1 base image version (line 7) |
| Documentation | `docs/LOGGING.md`, `CLAUDE.md`, `.ai/context.json` | Promtail->Alloy references, version updates |

Hardcoded references found: mvp-loki:3100 in 3 locations (Alloy config, Grafana datasource, Loki healthcheck) - no change needed, container name preserved.


Milestone Plan

Milestone 1: Update Base Image Mirror Script (#96)

Agent: Platform Agent | Files: 1 | Risk: Low

  1. Edit scripts/ci/mirror-base-images.sh:
    • Line 17: python:3.11-slim -> python:3.13-slim
    • Line 20: grafana/loki:2.9.0 -> grafana/loki:3.6.1
    • Line 21: grafana/promtail:2.9.0 -> grafana/alloy:v1.12.2
    • Line 22: grafana/grafana:10.0.0 -> grafana/grafana:12.4.0
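
The four substitutions can also be expressed as a lookup table, which doubles as a way to find stale image references in other files. This is a sketch under the assumption that image references appear as plain strings; `IMAGE_UPGRADES`, `upgrade_image_refs`, and `stale_refs` are hypothetical names, and the sample line is illustrative rather than copied from the mirror script.

```python
# Old image reference -> new image reference, as listed in this milestone.
IMAGE_UPGRADES = {
    "python:3.11-slim": "python:3.13-slim",
    "grafana/loki:2.9.0": "grafana/loki:3.6.1",
    "grafana/promtail:2.9.0": "grafana/alloy:v1.12.2",
    "grafana/grafana:10.0.0": "grafana/grafana:12.4.0",
}


def upgrade_image_refs(text: str) -> str:
    """Replace every old image reference in the text with its target."""
    for old, new in IMAGE_UPGRADES.items():
        text = text.replace(old, new)
    return text


def stale_refs(text: str) -> list:
    """List old image references that still appear in the text."""
    return [old for old in IMAGE_UPGRADES if old in text]


sample = "image: grafana/promtail:2.9.0"  # illustrative line
print(upgrade_image_refs(sample))  # prints image: grafana/alloy:v1.12.2
```

Running `stale_refs` over the compose file, the workflows, and the docs after the change gives a quick check for missed references.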

Verification: Mirror script syntax is valid bash.


Milestone 2: Replace Promtail with Grafana Alloy (#97)

Agent: Platform Agent | Files: 4 | Risk: Medium (critical fix)

  1. Edit docker-compose.yml (lines 289-307):

    • Rename service mvp-promtail -> mvp-alloy
    • Image: grafana/promtail:2.9.0 -> grafana/alloy:v1.12.2
    • Container name: mvp-promtail -> mvp-alloy
    • Config mount: ./config/promtail/config.yml:/etc/promtail/config.yml:ro -> ./config/alloy/config.alloy:/etc/alloy/config.alloy
    • Command: -config.file=/etc/promtail/config.yml -> run --server.http.listen-addr=0.0.0.0:12345 --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy
    • Keep: Docker socket mount, container logs mount, depends_on mvp-loki, backend network
  2. Create config/alloy/config.alloy with Docker discovery, relabeling, and Loki push configuration.

  3. Delete config/promtail/ directory.

  4. Edit .gitea/workflows/production.yaml (line 174):

    • mvp-promtail -> mvp-alloy in shared services start command.

Verification: Alloy container starts, no Docker API version errors, logs appear in Loki.


Milestone 3: Upgrade Loki to 3.6.1 (#98)

Agent: Platform Agent | Files: 2 | Risk: Medium (schema migration)

  1. Edit docker-compose.yml (line 269):

    • grafana/loki:2.9.0 -> grafana/loki:3.6.1
  2. Rewrite config/loki/config.yml:

    • Schema: v11 -> v13
    • Store: boltdb-shipper -> tsdb
    • Storage paths: boltdb-shipper-active -> tsdb-index, boltdb-shipper-cache -> tsdb-cache
    • Keep: retention 720h, filesystem object store, port 3100, auth disabled
    • Fallback: If tsdb causes startup errors, revert to v11 schema with allow_structured_metadata: false
  3. Note: mvp_loki_data volume should be cleared on deployment since storage format changes. Add step to production deployment or document manual volume clear.

Verification: Loki starts, /ready healthcheck passes, accepts pushes from Alloy.


Milestone 4: Upgrade Grafana to 12.4.0 (#99)

Agent: Platform Agent | Files: 1 | Risk: Low

  1. Edit docker-compose.yml (line 311):

    • grafana/grafana:10.0.0 -> grafana/grafana:12.4.0
  2. Verify config/grafana/datasources/loki.yml works unchanged (A3 VERIFIED - apiVersion 1 stable since Grafana 5.x).

Verification: Grafana starts, healthcheck passes, Loki datasource provisioned, Explore view returns log queries.


Milestone 5: Upgrade Python OCR to 3.13 (#100)

Agent: Platform Agent | Files: 1-2 | Risk: Medium (C7 UNCERTAIN)

  1. Edit ocr/Dockerfile (line 7):

    • python:3.11-slim -> python:3.13-slim
  2. Build and verify all pip dependencies install (especially opencv-python-headless, pillow-heif, PyMuPDF).

    • If build fails: Pin to python:3.12-slim as safe fallback. Update mirror script accordingly.

Verification: Container builds, health endpoint responds, VIN image processing works.


Milestone 6: Documentation Updates (#101)

Agent: Technical Writer | Files: 3-4 | Risk: Low

  1. docs/LOGGING.md:

    • Replace "Promtail" with "Alloy" in architecture diagram
    • Update troubleshooting: docker logs mvp-promtail -> docker logs mvp-alloy
    • Add migration note about Alloy replacing Promtail
  2. CLAUDE.md (line 176):

    • "3 logging (Loki, Promtail, Grafana)" -> "3 logging (Loki, Alloy, Grafana)"
  3. .ai/context.json:

    • Fix container_count from 5 to 9
    • Update container references if present
  4. README.md: Update architecture description if Promtail is mentioned.

Verification: All documentation references are consistent with deployed state.


Risk Register

| ID | Risk | Likelihood | Impact | Mitigation |
|----|------|------------|--------|------------|
| R1 | Python 3.13 dependency build failure | Medium | Low (isolated) | Fallback to 3.12-slim; does not block Alloy fix |
| R2 | Logging downtime during Loki+Alloy migration | Low | Medium | Deploy Alloy first (Milestone 2), Loki upgrade separate (Milestone 3) |
| R3 | Loki tsdb schema startup errors | Low | Medium | Fallback to v11 schema with allow_structured_metadata: false |
| R4 | Missed mvp-promtail reference in codebase | Very Low | Low | Three parallel codebase analyses confirmed all references |

Execution Notes

  • Branch: issue-95-upgrade-container-images
  • Commit convention: chore: {description} (refs #95)
  • Single PR linking to #95, with sub-issues referenced in commit messages
  • Milestone ordering is strict: 1 -> 2 -> 3 -> 4 -> 5 -> 6 (Alloy must land before Loki upgrade)
  • QR post-implementation review after each milestone, results posted as issue comments

Verdict: PLAN READY FOR REVIEW | Next: QR plan-completeness

## Plan: Container Image Upgrades and Promtail-to-Alloy Migration **Phase**: Planning | **Agent**: Orchestrator | **Status**: AWAITING_REVIEW --- ### Sub-Issues Created | Sub-Issue | Title | Type | Dependencies | |-----------|-------|------|--------------| | #96 | Update base image mirror script | chore | None (first) | | #97 | Replace Promtail with Grafana Alloy | chore | Blocked by #96 | | #98 | Upgrade Loki 2.9.0 to 3.6.1 | chore | Blocked by #96 | | #99 | Upgrade Grafana 10.0.0 to 12.4.0 | chore | Blocked by #96, best after #98 | | #100 | Upgrade OCR Python 3.11 to 3.13 | chore | Blocked by #96 | | #101 | Update documentation | docs | Blocked by #97-#100 | --- ### Decision Critic Results Four key decisions were evaluated through a 7-step structured critique: **Decision 1: Loki Schema Migration** - VERIFIED with nuance - Proposed: Clean cutover to tsdb + v13 schema - Finding: boltdb-shipper+v11 is deprecated in Loki 3.6.1 but still works. Clean cutover is correct for a fresh install (C1 VERIFIED - no historical data). Fallback: if tsdb causes issues, temporarily set `allow_structured_metadata: false` with v11. **Decision 2: Container Naming** - STAND - Proposed: Rename `mvp-promtail` to `mvp-alloy` - Finding: All references identified (A5 VERIFIED). Only 4 files need name changes. Clarity benefit outweighs change cost (J2). **Decision 3: Execution Ordering** - REVISED - Original: Mirror -> Alloy+Loki -> Grafana -> Python -> Docs - Revised: Mirror -> Alloy (URGENT, lands first) -> Loki -> Grafana -> Python -> Docs - Rationale: Alloy migration is the critical fix for the Docker API error. Must be isolatable if later milestones fail. Python upgrade (C7 UNCERTAIN - 3.13 wheel availability) should NOT block the Alloy fix. **Decision 4: Loki Volume Handling** - STAND - Proposed: Clear `mvp_loki_data` volume - Finding: Fresh install with days of data at most (C1 VERIFIED). No compliance requirements. Add a note in the migration step to verify the volume is clearable. 
--- ### Codebase Analysis Summary **10 files require changes** across 4 categories: | Category | Files | Changes | |----------|-------|---------| | Container orchestration | `docker-compose.yml` | 3 image versions + rename service | | Build infrastructure | `scripts/ci/mirror-base-images.sh` | 4 image version lines | | CI/CD pipeline | `.gitea/workflows/production.yaml` | 1 container name reference (line 174) | | Service configuration | `config/promtail/config.yml` (DELETE), `config/alloy/config.alloy` (CREATE), `config/loki/config.yml` | Schema migration + new Alloy config | | Container build | `ocr/Dockerfile` | 1 base image version (line 7) | | Documentation | `docs/LOGGING.md`, `CLAUDE.md`, `.ai/context.json` | Promtail->Alloy references, version updates | **Hardcoded references found**: `mvp-loki:3100` in 3 locations (Alloy config, Grafana datasource, Loki healthcheck) - no change needed, container name preserved. --- ### Milestone Plan #### Milestone 1: Update Base Image Mirror Script (#96) **Agent**: Platform Agent | **Files**: 1 | **Risk**: Low 1. Edit `scripts/ci/mirror-base-images.sh`: - Line 17: `python:3.11-slim` -> `python:3.13-slim` - Line 20: `grafana/loki:2.9.0` -> `grafana/loki:3.6.1` - Line 21: `grafana/promtail:2.9.0` -> `grafana/alloy:v1.12.2` - Line 22: `grafana/grafana:10.0.0` -> `grafana/grafana:12.4.0` **Verification**: Mirror script syntax is valid bash. --- #### Milestone 2: Replace Promtail with Grafana Alloy (#97) **Agent**: Platform Agent | **Files**: 4 | **Risk**: Medium (critical fix) 1. 
Edit `docker-compose.yml` (lines 289-307):
   - Rename service `mvp-promtail` -> `mvp-alloy`
   - Image: `grafana/promtail:2.9.0` -> `grafana/alloy:v1.12.2`
   - Container name: `mvp-promtail` -> `mvp-alloy`
   - Config mount: `./config/promtail/config.yml:/etc/promtail/config.yml:ro` -> `./config/alloy/config.alloy:/etc/alloy/config.alloy`
   - Command: `-config.file=/etc/promtail/config.yml` -> `run --server.http.listen-addr=0.0.0.0:12345 --storage.path=/var/lib/alloy/data /etc/alloy/config.alloy`
   - Keep: Docker socket mount, container logs mount, depends_on mvp-loki, backend network
2. Create `config/alloy/config.alloy` with Docker discovery, relabeling, and Loki push configuration.
3. Delete `config/promtail/` directory.
4. Edit `.gitea/workflows/production.yaml` (line 174):
   - `mvp-promtail` -> `mvp-alloy` in shared services start command.

**Verification**: Alloy container starts, no Docker API version errors, logs appear in Loki.

---

#### Milestone 3: Upgrade Loki to 3.6.1 (#98)

**Agent**: Platform Agent | **Files**: 2 | **Risk**: Medium (schema migration)

1. Edit `docker-compose.yml` (line 269):
   - `grafana/loki:2.9.0` -> `grafana/loki:3.6.1`
2. Rewrite `config/loki/config.yml`:
   - Schema: `v11` -> `v13`
   - Store: `boltdb-shipper` -> `tsdb`
   - Storage paths: `boltdb-shipper-active` -> `tsdb-index`, `boltdb-shipper-cache` -> `tsdb-cache`
   - Keep: retention 720h, filesystem object store, port 3100, auth disabled
   - **Fallback**: If tsdb causes startup errors, revert to v11 schema with `allow_structured_metadata: false`
3. Note: `mvp_loki_data` volume should be cleared on deployment since storage format changes. Add step to production deployment or document manual volume clear.

**Verification**: Loki starts, /ready healthcheck passes, accepts pushes from Alloy.

---

#### Milestone 4: Upgrade Grafana to 12.4.0 (#99)

**Agent**: Platform Agent | **Files**: 1 | **Risk**: Low

1. Edit `docker-compose.yml` (line 311):
   - `grafana/grafana:10.0.0` -> `grafana/grafana:12.4.0`
2. Verify `config/grafana/datasources/loki.yml` works unchanged (A3 VERIFIED - apiVersion 1 stable since Grafana 5.x).

**Verification**: Grafana starts, healthcheck passes, Loki datasource provisioned, Explore view returns log queries.

---

#### Milestone 5: Upgrade Python OCR to 3.13 (#100)

**Agent**: Platform Agent | **Files**: 1-2 | **Risk**: Medium (C7 UNCERTAIN)

1. Edit `ocr/Dockerfile` (line 7):
   - `python:3.11-slim` -> `python:3.13-slim`
2. Build and verify all pip dependencies install (especially opencv-python-headless, pillow-heif, PyMuPDF).
   - **If build fails**: Pin to `python:3.12-slim` as safe fallback. Update mirror script accordingly.

**Verification**: Container builds, health endpoint responds, VIN image processing works.

---

#### Milestone 6: Documentation Updates (#101)

**Agent**: Technical Writer | **Files**: 3-4 | **Risk**: Low

1. `docs/LOGGING.md`:
   - Replace "Promtail" with "Alloy" in architecture diagram
   - Update troubleshooting: `docker logs mvp-promtail` -> `docker logs mvp-alloy`
   - Add migration note about Alloy replacing Promtail
2. `CLAUDE.md` (line 176):
   - "3 logging (Loki, Promtail, Grafana)" -> "3 logging (Loki, Alloy, Grafana)"
3. `.ai/context.json`:
   - Fix `container_count` from 5 to 9
   - Update container references if present
4. `README.md`: Update architecture description if Promtail is mentioned.

**Verification**: All documentation references are consistent with deployed state.

---

### Risk Register

| ID | Risk | Likelihood | Impact | Mitigation |
|----|------|-----------|--------|------------|
| R1 | Python 3.13 dependency build failure | Medium | Low (isolated) | Fallback to 3.12-slim; does not block Alloy fix |
| R2 | Logging downtime during Loki+Alloy migration | Low | Medium | Deploy Alloy first (Milestone 2), Loki upgrade separate (Milestone 3) |
| R3 | Loki tsdb schema startup errors | Low | Medium | Fallback to v11 schema with allow_structured_metadata: false |
| R4 | Missed mvp-promtail reference in codebase | Very Low | Low | Three parallel codebase analyses confirmed all references |

### Execution Notes

- **Branch**: `issue-95-upgrade-container-images`
- **Commit convention**: `chore: {description} (refs #95)`
- **Single PR** linking to #95, with sub-issues referenced in commit messages
- **Milestone ordering is strict**: 1 -> 2 -> 3 -> 4 -> 5 -> 6 (Alloy must land before Loki upgrade)
- **QR post-implementation review** after each milestone, results posted as issue comments

---

*Verdict*: PLAN READY FOR REVIEW | *Next*: QR plan-completeness
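
The commit convention in the execution notes (`chore: {description} (refs #95)`) could be checked mechanically before pushing. The following is only a sketch: the function name `check_commit_subject` is hypothetical and not part of the repository, and it assumes the subject line must match the convention exactly, with a non-empty description.

```shell
# Hypothetical helper (not part of the repository): returns 0 when a commit
# subject follows the plan's convention "chore: {description} (refs #95)".
check_commit_subject() {
    subject=$1
    # Require "chore: ", a non-empty description, then the literal
    # "(refs #95)" at the end of the line.
    printf '%s\n' "$subject" | grep -Eq '^chore: .+ \(refs #95\)$'
}

# Example use against the most recent commit on the branch:
#   check_commit_subject "$(git log -1 --format=%s)" || echo "subject does not match convention"
```

A check like this would fit a local `commit-msg` hook or a CI step, but where to run it is left open here.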

QR Review: Plan Review Cycle Results

Phase: Plan-Review | Agent: Quality Reviewer + Technical Writer | Status: PASS


Gate 1: QR plan-completeness - PASS_WITH_CONCERNS

Valid finding: Add Alloy config validation command as a fallback step in Milestone 2.

Dismissed: Most concerns (missing verification steps, dependency ordering) are already addressed in the plan. Each milestone has explicit verification criteria, and the execution notes state "Milestone ordering is strict: 1 -> 2 -> 3 -> 4 -> 5 -> 6."


Gate 2: TW plan-scrub - PASS_WITH_CONCERNS

Valid finding: Add rollback note to execution section.

Dismissed: Temporal contamination and terminology concerns are not applicable - the plan clearly distinguishes "Create" (new files) from "Edit" (existing files), and uses consistent container name terminology throughout.


Gate 3: QR plan-code (RULE 0/1/2) - PASS_WITH_CONCERNS

RULE 0 findings - addressed by existing plan:

  • Docker socket mount: Already in the plan ("Keep: Docker socket mount"). The current docker-compose.yml already mounts /var/run/docker.sock:/var/run/docker.sock:ro.
  • Loki schema migration: Decision Critic already evaluated this (C1 VERIFIED - fresh install, no historical data). Volume clear is documented.

RULE 1 findings - addressed by codebase analysis:

  • CI/CD references: Three parallel codebase analyses confirmed mvp-promtail appears only in docker-compose.yml, production.yaml line 174, config/promtail/config.yml, and docs/LOGGING.md. All are covered in milestones 2 and 6.
  • Alloy port 12345: This is Alloy's internal HTTP/metrics port. It does not need to be exposed. No existing health checks or dashboards target Promtail's port 9080 either.
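
The reference scan behind the RULE 1 finding could be repeated after Milestones 2 and 6 to confirm that nothing still names the old container. This is a minimal sketch, assuming a plain recursive `grep` is enough; the function name `find_stale_refs` is hypothetical, and the scan also descends into `.git` unless pointed at a narrower directory.

```shell
# Hypothetical helper (not part of the repository): print every line under a
# directory that still mentions a stale name, as file:line:text.
# Returns 1 if any match is found, 0 if the tree is clean.
find_stale_refs() {
    root=$1
    name=${2:-mvp-promtail}
    # -r recurse into directories, -n show line numbers, -F treat the name
    # as a fixed string rather than a regular expression.
    if grep -rnF -- "$name" "$root"; then
        return 1
    fi
    return 0
}

# Example use from the repository root:
#   find_stale_refs ./docs || echo "stale mvp-promtail references remain"
```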

RULE 2 finding - already mitigated:

  • Python 3.13 compatibility: Decision Critic flagged this as C7 UNCERTAIN. Plan includes fallback to python:3.12-slim.

Gate 4: QR plan-docs - PASS_WITH_CONCERNS

Valid findings to incorporate into Milestone 6:

  1. Additional files need "5-container" -> "9-container" updates:
    • docs/DATABASE-SCHEMA.md line 3
    • config/CLAUDE.md line 3
  2. .ai/context.json needs two changes (line 3 architecture string + line 293 container_count)
  3. Migration note in docs/LOGGING.md should include Alloy version number and migration reason
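
The `container_count` fix in `.ai/context.json` can be verified without a JSON parser. The sketch below assumes the key appears as `"container_count": <integer>` on a single line; the actual layout of the file is not shown in this issue, and the function name `check_container_count` is hypothetical.

```shell
# Hypothetical helper (not part of the repository): succeed only if the file
# contains "container_count": <expected>. Assumes the key and its integer
# value sit on one line; only the first occurrence is checked.
check_container_count() {
    file=$1
    expected=${2:-9}
    # Print the integer that follows "container_count": and nothing else.
    actual=$(sed -n 's/.*"container_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)
    [ "$actual" = "$expected" ]
}

# Example use from the repository root:
#   check_container_count .ai/context.json 9 || echo "container_count is not 9"
```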

Plan Revisions

Incorporating valid findings from all four gates:

Milestone 2 addition: Add Alloy config syntax validation step:

docker run --rm -v ./config/alloy:/etc/alloy grafana/alloy:v1.12.2 fmt /etc/alloy/config.alloy

Milestone 6 additions:

  • Add docs/DATABASE-SCHEMA.md and config/CLAUDE.md to documentation updates (fix "5-container" references)
  • Specify .ai/context.json has two update locations
  • Include Alloy version and migration reason in LOGGING.md note

Execution notes addition:
"If any milestone fails during execution, the branch preserves all prior milestone commits. The urgent Alloy fix (Milestone 2) is positioned early so it can be cherry-picked if later milestones encounter issues."


Verdict: PASS | Next: Create branch and begin execution

Reference: egullickson/motovaultpro#95