# Unified Logging System
MotoVaultPro uses a unified logging system with centralized log aggregation.
## Overview

- **Single Control**: One `LOG_LEVEL` environment variable controls all containers
- **Correlation IDs**: A `requestId` field traces requests across services
- **Centralized Aggregation**: Grafana + Loki for log querying and visualization
## LOG_LEVEL Values

Each `LOG_LEVEL` value maps to a service-native setting. For PostgreSQL, the column lists the statement classes logged and the slow-query duration threshold:

| Level | Frontend | Backend | PostgreSQL | Redis | Traefik |
|---|---|---|---|---|---|
| DEBUG | debug | debug | all queries, 0 ms | debug | DEBUG |
| INFO | info | info | DDL only, 500 ms | verbose | INFO |
| WARN | warn | warn | errors, 1000 ms | notice | WARN |
| ERROR | error | error | errors only | warning | ERROR |
## Environment Defaults
| Environment | LOG_LEVEL | Purpose |
|---|---|---|
| Development | DEBUG | Full debugging locally |
| Staging | DEBUG | Full debugging in staging |
| Production | INFO | Standard production logging |
## Correlation IDs

All logs include a `requestId` field (UUID v4) for tracing requests across services:

- **Traefik**: Forwards the `X-Request-Id` header if present
- **Backend**: Generates a UUID if `X-Request-Id` is missing and includes it in all log entries
- **Frontend**: Includes the `requestId` in API call logs
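To make the backend behaviour concrete, here is a minimal Python sketch; the function name `resolve_request_id` is illustrative, not the project's actual helper:

```python
import uuid

def resolve_request_id(headers: dict) -> str:
    """Reuse the incoming X-Request-Id if present; otherwise mint a UUID v4.

    Mirrors the rule described above: Traefik forwards the header when a
    client supplies it, and the backend generates one when it is missing.
    """
    incoming = headers.get("X-Request-Id")
    if incoming:
        return incoming
    return str(uuid.uuid4())
```

Every subsequent log line for the request then carries this value in its `requestId` field.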
### Example Log Entry

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00.000Z",
  "requestId": "550e8400-e29b-41d4-a716-446655440000",
  "msg": "Request processed",
  "method": "GET",
  "path": "/api/vehicles",
  "status": 200,
  "duration": 45
}
```
## Grafana Access

- **URL**: https://logs.motovaultpro.com
- **Default credentials**: `admin` / `admin` (change on first login)
## Dashboards
Four provisioned dashboards are available in the MotoVaultPro folder:
| Dashboard | Purpose | Key Panels |
|---|---|---|
| Application Overview | System-wide health at a glance | Container log volume, error rate gauge, log level distribution, container health status, request count |
| API Performance | Backend latency and throughput analysis | Request rate, response time percentiles (p50/p95/p99), status code distribution, slowest endpoints |
| Error Investigation | Debugging and root cause analysis | Error log stream, errors by container/endpoint, stack trace viewer, correlation ID lookup, recent 5xx responses |
| Infrastructure | Container-level logs and platform monitoring | Per-container throughput, PostgreSQL/Redis/Traefik/OCR logs, Loki ingestion rate |
All dashboards refresh every 30 seconds and default to a 1-hour time window. Dashboard JSON files live in `config/grafana/dashboards/` and are provisioned via `config/grafana/provisioning/dashboards.yml`.
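As a rough sketch of what such a provisioning file can look like, following Grafana's file-provisioning schema (the provider name and container path below are assumptions, not this repository's actual values):

```yaml
# Hypothetical dashboards.yml: tells Grafana to load dashboard JSON
# from a directory into the MotoVaultPro folder.
apiVersion: 1
providers:
  - name: motovaultpro-dashboards    # illustrative provider name
    folder: MotoVaultPro
    type: file
    disableDeletion: true
    options:
      path: /etc/grafana/dashboards  # assumed in-container mount path
```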
## Alerting Rules

Grafana Unified Alerting is configured with file-based provisioned rules. Rules are evaluated every minute, and a condition must hold continuously for 5 minutes before the alert fires.
| Alert | Severity | Condition | Description |
|---|---|---|---|
| Error Rate Spike | critical | Error rate > 5% over 5m | Fires when the percentage of error-level logs across all mvp-* containers exceeds 5% |
| Container Silence: mvp-backend | warning | No logs for 5m | Fires when the backend container stops producing logs |
| Container Silence: mvp-postgres | warning | No logs for 5m | Fires when the database container stops producing logs |
| Container Silence: mvp-redis | warning | No logs for 5m | Fires when the cache container stops producing logs |
| 5xx Response Spike | critical | > 10 5xx responses in 5m | Fires when the backend produces more than 10 HTTP 5xx responses within 5 minutes |
Alert configuration files live in `config/grafana/alerting/`:

- `alert-rules.yml` - Alert rule definitions with LogQL queries
- `contact-points.yml` - Notification endpoints (webhook placeholder for future email/Slack)
- `notification-policies.yml` - Routing rules that group alerts by name and severity
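For orientation, here is an abridged sketch of how one of these rules can be expressed in Grafana's alerting provisioning format. The group name, rule `uid`, and the `loki` datasource UID are assumptions; the real `alert-rules.yml` may differ:

```yaml
# Hypothetical excerpt of alert-rules.yml for the Error Rate Spike rule.
apiVersion: 1
groups:
  - orgId: 1
    name: logging-alerts
    folder: MotoVaultPro
    interval: 1m                # evaluation interval
    rules:
      - uid: error-rate-spike
        title: Error Rate Spike
        condition: A
        for: 5m                 # must hold for 5 minutes before firing
        labels:
          severity: critical
        data:
          - refId: A
            datasourceUid: loki # assumes the Loki datasource UID is "loki"
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: >-
                sum(count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
                / sum(count_over_time({container=~"mvp-.*"} [5m])) * 100
```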
## LogQL Query Reference

### Common Debugging Queries
Query by `requestId`:

```logql
{container="mvp-backend"} |= "550e8400-e29b-41d4"
```

Query all errors:

```logql
{container=~"mvp-.*"} | json | level="error"
```

Query slow requests (>500 ms):

```logql
{container="mvp-backend"} | json | msg="Request processed" | duration > 500
```
### Error Analysis

Count errors per container over time:

```logql
sum by (container) (count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
```

Error rate as a percentage:

```logql
sum(count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
  / sum(count_over_time({container=~"mvp-.*"} [5m])) * 100
```
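To make the alert arithmetic concrete, a small sketch of the same ratio computed from two counts (the function name is illustrative):

```python
def error_rate_percent(error_count: int, total_count: int) -> float:
    """Percentage of error-level log lines, matching the LogQL ratio above."""
    if total_count == 0:
        return 0.0  # avoid division by zero when no logs arrived in the window
    return error_count / total_count * 100
```

The Error Rate Spike alert fires when this value stays above 5 for 5 minutes.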
### HTTP Status Analysis

All 5xx responses:

```logql
{container="mvp-backend"} | json | msg="Request processed" | status >= 500
```

Request count by status code:

```logql
sum by (status) (count_over_time({container="mvp-backend"} | json | msg="Request processed" [5m]))
```
### Container-Specific Queries

PostgreSQL errors:

```logql
{container="mvp-postgres"} |~ "ERROR|FATAL|PANIC"
```

Traefik access logs:

```logql
{container="mvp-traefik"} | json
```

OCR processing errors:

```logql
{container="mvp-ocr"} |~ "ERROR|Exception|Traceback"
```
## Configuration

Logging configuration is generated by `scripts/ci/generate-log-config.sh`:

```bash
# Generate DEBUG-level config
./scripts/ci/generate-log-config.sh DEBUG

# Generate INFO-level config
./scripts/ci/generate-log-config.sh INFO
```

This creates `.env.logging`, which docker-compose sources.
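The script's mapping can be pictured as follows. This Python sketch mirrors the LOG_LEVEL Values table above, but the variable names written to `.env.logging` are assumptions, not the script's actual output:

```python
# Hypothetical reimplementation of the mapping in generate-log-config.sh,
# following the LOG_LEVEL Values table.
LEVEL_MAP = {
    "DEBUG": {"backend": "debug", "redis": "debug",   "pg_slow_ms": 0},
    "INFO":  {"backend": "info",  "redis": "verbose", "pg_slow_ms": 500},
    "WARN":  {"backend": "warn",  "redis": "notice",  "pg_slow_ms": 1000},
    "ERROR": {"backend": "error", "redis": "warning", "pg_slow_ms": None},
}

def render_env(level: str) -> str:
    """Render .env.logging contents for one LOG_LEVEL (illustrative keys)."""
    s = LEVEL_MAP[level]
    return "\n".join([
        f"LOG_LEVEL={level}",
        f"BACKEND_LOG_LEVEL={s['backend']}",
        f"REDIS_LOG_LEVEL={s['redis']}",
    ])
```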
## Architecture

```text
+-----------------------------------------------------------------------+
|                            CI/CD PIPELINE                             |
|        LOG_LEVEL --> generate-log-config.sh --> .env.logging          |
+-----------------------------------------------------------------------+
                                    |
                                    v
+-----------------------------------------------------------------------+
|                          APPLICATION LAYER                            |
|   Frontend   Backend     OCR    Postgres    Redis    Traefik          |
|      |          |         |        |          |         |             |
|      +----------+---------+--------+----------+---------+             |
|                           |                                           |
|             Docker Log Driver (json-file, 10m x 3)                    |
+-----------------------------------------------------------------------+
                                    |
                                    v
              Alloy --> Loki (30-day retention) --> Grafana
```
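The Alloy-to-Loki leg of the pipeline can be sketched roughly as below, using Grafana Alloy's Docker components; the socket path and Loki URL are assumptions about this deployment, not the repository's actual Alloy configuration:

```alloy
// Hypothetical Alloy pipeline: discover Docker containers, tail their
// logs, and push them to Loki.
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "app" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"  // assumed service name and port
  }
}
```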
## Troubleshooting

### Logs not appearing in Grafana

- Check that Alloy is running: `docker logs mvp-alloy`
- Check that Loki is healthy: `curl http://localhost:3100/ready`
- Verify log rotation is not too aggressive
### Invalid LOG_LEVEL

Both the frontend and backend log a warning and fall back to `info` if an invalid `LOG_LEVEL` is provided.
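A minimal sketch of that fallback, assuming pino-style level names (the function name is illustrative):

```python
VALID_LEVELS = {"debug", "info", "warn", "error"}

def effective_log_level(raw):
    """Return a usable log level, falling back to 'info' with a warning."""
    level = (raw or "").lower()
    if level not in VALID_LEVELS:
        print(f"Invalid LOG_LEVEL {raw!r}; falling back to 'info'")
        return "info"
    return level
```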