# Unified Logging System
MotoVaultPro uses a unified logging system with centralized log aggregation.
## Overview

- **Single Control**: One `LOG_LEVEL` environment variable controls all containers
- **Correlation IDs**: A `requestId` field traces requests across services
- **Centralized Aggregation**: Grafana + Loki for log querying and visualization
## LOG_LEVEL Values

Each `LOG_LEVEL` value maps to a service-native setting. For PostgreSQL, the column lists the statement classes logged and the slow-query duration threshold:

| Level | Frontend | Backend | PostgreSQL | Redis | Traefik |
|---|---|---|---|---|---|
| DEBUG | debug | debug | all queries, 0 ms | debug | DEBUG |
| INFO | info | info | DDL only, 500 ms | verbose | INFO |
| WARN | warn | warn | errors, 1000 ms | notice | WARN |
| ERROR | error | error | errors only | warning | ERROR |
## Environment Defaults
| Environment | LOG_LEVEL | Purpose |
|---|---|---|
| Development | DEBUG | Full debugging locally |
| Staging | DEBUG | Full debugging in staging |
| Production | INFO | Standard production logging |
## Correlation IDs

All logs include a `requestId` field (UUID v4) for tracing requests across services:

- **Traefik**: Forwards the `X-Request-Id` header if present
- **Backend**: Generates a UUID if `X-Request-Id` is missing and includes it in all log entries
- **Frontend**: Includes the `requestId` in API call logs
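To make the backend behaviour concrete, here is a minimal Python sketch; the function name `resolve_request_id` is illustrative, not the project's actual helper:

```python
import uuid

def resolve_request_id(headers: dict) -> str:
    """Reuse the incoming X-Request-Id if present; otherwise mint a UUID v4.

    Mirrors the rule described above: Traefik forwards the header when a
    client supplies it, and the backend generates one when it is missing.
    """
    incoming = headers.get("X-Request-Id")
    if incoming:
        return incoming
    return str(uuid.uuid4())
```

Every subsequent log line for the request then carries this value in its `requestId` field.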
### Example Log Entry

```json
{
  "level": "info",
  "time": "2024-01-15T10:30:00.000Z",
  "requestId": "550e8400-e29b-41d4-a716-446655440000",
  "msg": "Request processed",
  "method": "GET",
  "path": "/api/vehicles",
  "status": 200,
  "duration": 45
}
```
## Grafana Access

- **URL**: https://logs.motovaultpro.com
- **Default credentials**: `admin` / `admin` (change on first login)
## Dashboards
Four provisioned dashboards are available in the MotoVaultPro folder:
| Dashboard | Purpose | Key Panels |
|---|---|---|
| Application Overview | System-wide health at a glance | Container log volume, error rate gauge, log level distribution, container health status, request count |
| API Performance | Backend latency and throughput analysis | Request rate, response time percentiles (p50/p95/p99), status code distribution, slowest endpoints |
| Error Investigation | Debugging and root cause analysis | Error log stream, errors by container/endpoint, stack trace viewer, correlation ID lookup, recent 5xx responses |
| Infrastructure | Container-level logs and platform monitoring | Per-container throughput, PostgreSQL/Redis/Traefik/OCR logs, Loki ingestion rate |
All dashboards refresh every 30 seconds and default to a 1-hour time window. Dashboard JSON files live in `config/grafana/dashboards/` and are provisioned via `config/grafana/provisioning/dashboards.yml`.
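As a rough sketch of what such a provisioning file can look like, following Grafana's file-provisioning schema (the provider name and container path below are assumptions, not this repository's actual values):

```yaml
# Hypothetical dashboards.yml: tells Grafana to load dashboard JSON
# from a directory into the MotoVaultPro folder.
apiVersion: 1
providers:
  - name: motovaultpro-dashboards    # illustrative provider name
    folder: MotoVaultPro
    type: file
    disableDeletion: true
    options:
      path: /etc/grafana/dashboards  # assumed in-container mount path
```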
## Alerting Rules

Grafana Unified Alerting is configured with file-based provisioned rules. Rules are evaluated every minute, and a condition must hold continuously for 5 minutes before the alert fires.
| Alert | Severity | Condition | Description |
|---|---|---|---|
| Error Rate Spike | critical | Error rate > 5% over 5m | Fires when the percentage of error-level logs across all mvp-* containers exceeds 5% |
| Container Silence: mvp-backend | warning | No logs for 5m | Fires when the backend container stops producing logs |
| Container Silence: mvp-postgres | warning | No logs for 5m | Fires when the database container stops producing logs |
| Container Silence: mvp-redis | warning | No logs for 5m | Fires when the cache container stops producing logs |
| 5xx Response Spike | critical | > 10 5xx responses in 5m | Fires when the backend produces more than 10 HTTP 5xx responses within 5 minutes |
Alert configuration files live in `config/grafana/alerting/`:

- `alert-rules.yml` - Alert rule definitions with LogQL queries
- `contact-points.yml` - Notification endpoints (webhook placeholder for future email/Slack)
- `notification-policies.yml` - Routing rules that group alerts by name and severity
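For orientation, here is an abridged sketch of how one of these rules can be expressed in Grafana's alerting provisioning format. The group name, rule `uid`, and the `loki` datasource UID are assumptions; the real `alert-rules.yml` may differ:

```yaml
# Hypothetical excerpt of alert-rules.yml for the Error Rate Spike rule.
apiVersion: 1
groups:
  - orgId: 1
    name: logging-alerts
    folder: MotoVaultPro
    interval: 1m                # evaluation interval
    rules:
      - uid: error-rate-spike
        title: Error Rate Spike
        condition: A
        for: 5m                 # must hold for 5 minutes before firing
        labels:
          severity: critical
        data:
          - refId: A
            datasourceUid: loki # assumes the Loki datasource UID is "loki"
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: >-
                sum(count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
                / sum(count_over_time({container=~"mvp-.*"} [5m])) * 100
```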
## LogQL Query Reference

### Common Debugging Queries
Query by `requestId`:

```logql
{container="mvp-backend"} |= "550e8400-e29b-41d4"
```

Query all errors:

```logql
{container=~"mvp-.*"} | json | level="error"
```

Query slow requests (>500 ms):

```logql
{container="mvp-backend"} | json | msg="Request processed" | duration > 500
```
### Error Analysis

Count errors per container over time:

```logql
sum by (container) (count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
```

Error rate as a percentage:

```logql
sum(count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
  / sum(count_over_time({container=~"mvp-.*"} [5m])) * 100
```
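To make the alert arithmetic concrete, a small sketch of the same ratio computed from two counts (the function name is illustrative):

```python
def error_rate_percent(error_count: int, total_count: int) -> float:
    """Percentage of error-level log lines, matching the LogQL ratio above."""
    if total_count == 0:
        return 0.0  # avoid division by zero when no logs arrived in the window
    return error_count / total_count * 100
```

The Error Rate Spike alert fires when this value stays above 5 for 5 minutes.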
### HTTP Status Analysis

All 5xx responses:

```logql
{container="mvp-backend"} | json | msg="Request processed" | status >= 500
```

Request count by status code:

```logql
sum by (status) (count_over_time({container="mvp-backend"} | json | msg="Request processed" [5m]))
```
### Container-Specific Queries

PostgreSQL errors:

```logql
{container="mvp-postgres"} |~ "ERROR|FATAL|PANIC"
```

Traefik access logs:

```logql
{container="mvp-traefik"} | json
```

OCR processing errors:

```logql
{container="mvp-ocr"} |~ "ERROR|Exception|Traceback"
```
## Configuration

Logging configuration is generated by `scripts/ci/generate-log-config.sh`:

```bash
# Generate DEBUG-level config
./scripts/ci/generate-log-config.sh DEBUG

# Generate INFO-level config
./scripts/ci/generate-log-config.sh INFO
```

This creates `.env.logging`, which docker-compose sources.
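The script's mapping can be pictured as follows. This Python sketch mirrors the LOG_LEVEL Values table above, but the variable names written to `.env.logging` are assumptions, not the script's actual output:

```python
# Hypothetical reimplementation of the mapping in generate-log-config.sh,
# following the LOG_LEVEL Values table.
LEVEL_MAP = {
    "DEBUG": {"backend": "debug", "redis": "debug",   "pg_slow_ms": 0},
    "INFO":  {"backend": "info",  "redis": "verbose", "pg_slow_ms": 500},
    "WARN":  {"backend": "warn",  "redis": "notice",  "pg_slow_ms": 1000},
    "ERROR": {"backend": "error", "redis": "warning", "pg_slow_ms": None},
}

def render_env(level: str) -> str:
    """Render .env.logging contents for one LOG_LEVEL (illustrative keys)."""
    s = LEVEL_MAP[level]
    return "\n".join([
        f"LOG_LEVEL={level}",
        f"BACKEND_LOG_LEVEL={s['backend']}",
        f"REDIS_LOG_LEVEL={s['redis']}",
    ])
```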
## Architecture

```text
+-----------------------------------------------------------------------+
|                            CI/CD PIPELINE                             |
|        LOG_LEVEL --> generate-log-config.sh --> .env.logging          |
+-----------------------------------------------------------------------+
                                    |
                                    v
+-----------------------------------------------------------------------+
|                          APPLICATION LAYER                            |
|   Frontend   Backend     OCR    Postgres    Redis    Traefik          |
|      |          |         |        |          |         |             |
|      +----------+---------+--------+----------+---------+             |
|                           |                                           |
|             Docker Log Driver (json-file, 10m x 3)                    |
+-----------------------------------------------------------------------+
                                    |
                                    v
              Alloy --> Loki (30-day retention) --> Grafana
```
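The Alloy-to-Loki leg of the pipeline can be sketched roughly as below, using Grafana Alloy's Docker components; the socket path and Loki URL are assumptions about this deployment, not the repository's actual Alloy configuration:

```alloy
// Hypothetical Alloy pipeline: discover Docker containers, tail their
// logs, and push them to Loki.
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "app" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"  // assumed service name and port
  }
}
```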
## Troubleshooting

### Logs not appearing in Grafana

- Check that Alloy is running: `docker logs mvp-alloy`
- Check that Loki is healthy: `curl http://localhost:3100/ready`
- Verify log rotation is not too aggressive
### Invalid LOG_LEVEL

Both the frontend and backend log a warning and fall back to `info` if an invalid `LOG_LEVEL` is provided.
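A minimal sketch of that fallback, assuming pino-style level names (the function name is illustrative):

```python
VALID_LEVELS = {"debug", "info", "warn", "error"}

def effective_log_level(raw):
    """Return a usable log level, falling back to 'info' with a warning."""
    level = (raw or "").lower()
    if level not in VALID_LEVELS:
        print(f"Invalid LOG_LEVEL {raw!r}; falling back to 'info'")
        return "info"
    return level
```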