feat: Add Grafana dashboards and alerting (#105) #112

Merged
egullickson merged 8 commits from issue-105-add-grafana-dashboards into main 2026-02-06 17:44:05 +00:00

## Summary

- Add file-based Grafana dashboard provisioning infrastructure with 4 dashboards and alerting rules for application observability across all 9 containers
- Dashboards: Application Overview, API Performance, Error Investigation, Infrastructure
- Alerting: Error rate spike, container silence, and 5xx spike rules with Grafana Unified Alerting

## Linked issues

- Fixes #105
- Fixes #106
- Fixes #107
- Fixes #108
- Fixes #109
- Fixes #110
- Fixes #111

## Type

- [x] Feature
- [ ] Bug fix
- [ ] Chore / refactor
- [ ] Docs

## Test plan

- [ ] Manual verification

**Commands / steps:**

1. Run `make rebuild` to rebuild containers with new Grafana volume mounts
2. Verify Grafana loads at `https://logs.motovaultpro.com` with "MotoVaultPro" folder containing all 4 dashboards
3. Verify Application Overview dashboard shows container log volume, error rate, and health status
4. Verify API Performance dashboard shows request rate, p50/p95/p99 latency, status codes, and endpoint tables
5. Verify Error Investigation dashboard shows error stream, requestId lookup, and stack trace viewer
6. Verify Infrastructure dashboard shows per-container logs for PostgreSQL, Redis, Traefik, OCR, and Loki
7. Verify alerting rules appear in Grafana Alerting UI (error rate spike, container silence, 5xx spike)

## Checklist

- [x] Acceptance criteria met (from linked issue)
- [x] No secrets committed
- [x] Logging is appropriate (no PII)
- [x] Docs updated (docs/LOGGING.md)

## Quality Review

RULE 0 (Production Reliability): PASS
RULE 1 (Project Conformance): PASS_WITH_CONCERNS (config/CLAUDE.md not updated per plan - non-blocking)
RULE 2 (Structural Quality): PASS

## Files Changed (11)

- `config/grafana/provisioning/dashboards.yml` (NEW) - Dashboard provisioning config
- `config/grafana/dashboards/application-overview.json` (NEW) - Application Overview dashboard
- `config/grafana/dashboards/api-performance.json` (NEW) - API Performance dashboard
- `config/grafana/dashboards/error-investigation.json` (NEW) - Error Investigation dashboard
- `config/grafana/dashboards/infrastructure.json` (NEW) - Infrastructure dashboard
- `config/grafana/alerting/alert-rules.yml` (NEW) - 5 alert rules
- `config/grafana/alerting/contact-points.yml` (NEW) - Default contact point + webhook placeholder
- `config/grafana/alerting/notification-policies.yml` (NEW) - Alert routing policy
- `docker-compose.yml` (MODIFIED) - Added 3 Grafana volume mounts
- `docs/LOGGING.md` (MODIFIED) - Added dashboards, alerting, and LogQL documentation
- `config/grafana/datasources/loki.yml` (MODIFIED) - Added explicit UID
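
The explicit UID on the Loki datasource gives provisioned alert rules and dashboard JSON a stable identifier to reference. The fragment below is a minimal sketch of what such a datasource file might look like. The `uid` value, the datasource name, and the `url` are assumptions for illustration and are not taken from this PR.

```yaml
# Hypothetical sketch of config/grafana/datasources/loki.yml.
# The name, url, and uid values are illustrative assumptions.
apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100   # assumed service name; 3100 is Loki's default HTTP port
    uid: loki               # fixed UID that alert rules and dashboards can point at
```

Without a fixed `uid`, Grafana generates one when it provisions the datasource, so file-based rules could not name the datasource ahead of time.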
egullickson added 6 commits 2026-02-06 16:30:04 +00:00
Add file-based dashboard provisioning config and mount dashboards
directory into Grafana container for auto-loading dashboard JSON files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
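
For readers unfamiliar with file-based provisioning, a dashboard provider file of the kind this commit describes could look roughly like the sketch below. The provider name, scan interval, and in-container path are assumptions; only the "MotoVaultPro" folder name comes from the test plan.

```yaml
# Hypothetical sketch of config/grafana/provisioning/dashboards.yml.
apiVersion: 1

providers:
  - name: motovaultpro            # assumed provider name
    orgId: 1
    folder: MotoVaultPro          # Grafana folder named in the test plan
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30     # assumed rescan interval for the dashboards directory
    allowUiUpdates: false         # keep the JSON files in the repo as the source of truth
    options:
      # Assumed mount point inside the Grafana container for config/grafana/dashboards
      path: /var/lib/grafana/dashboards
```

Grafana loads every dashboard JSON file it finds under `options.path`, which is why the dashboards directory has to be mounted into the container alongside this file.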
Adds file-provisioned dashboard with 5 panels:
- Container Log Volume Over Time (all 9 containers)
- Error Rate Across All Containers (percentage stat)
- Log Level Distribution Per Container (stacked bar chart)
- Container Health Status (green/red per container)
- Total Request Count Over Time (backend requests/min)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log-based dashboard with 6 panels: request rate, response time
distribution (p50/p95/p99), HTTP status code distribution, request
volume by endpoint, slowest endpoints, and status code breakdown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add Grafana alerting rules and documentation (refs #111)
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 36s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 2m36s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
4b2b318aff
Configure Grafana Unified Alerting with file-based provisioned alert
rules, contact points, and notification policies. Add stable UID to
Loki datasource for alert rule references. Update LOGGING.md with
dashboard descriptions, alerting rules table, and LogQL query reference.

Alert rules: Error Rate Spike (critical), Container Silence for
backend/postgres/redis (warning), 5xx Response Spike (critical).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
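
As an illustration of the file-provisioned rule format, the sketch below shows how a rule like "5xx Response Spike" might be expressed. It is an untested sketch under several assumptions: the group name, rule UID, threshold of 10, the `container` and `status` labels in the LogQL query, and the `loki` datasource UID are all hypothetical and not taken from `alert-rules.yml`.

```yaml
# Hypothetical sketch of one rule in config/grafana/alerting/alert-rules.yml.
apiVersion: 1

groups:
  - orgId: 1
    name: motovaultpro-alerts        # assumed group name
    folder: MotoVaultPro
    interval: 1m                     # how often the group is evaluated
    rules:
      - uid: backend-5xx-spike       # assumed rule UID
        title: 5xx Response Spike
        condition: C                 # refId of the expression that decides firing
        data:
          # A: count backend log lines with a 5xx status over the last 5 minutes
          - refId: A
            datasourceUid: loki      # must match the UID set on the Loki datasource
            relativeTimeRange:
              from: 300
              to: 0
            model:
              refId: A
              queryType: instant
              expr: 'sum(count_over_time({container="backend"} | json | status >= 500 [5m]))'
          # B: reduce the query result to a single number
          - refId: B
            datasourceUid: __expr__
            model:
              refId: B
              type: reduce
              expression: A
              reducer: last
          # C: fire when the reduced value exceeds the assumed threshold
          - refId: C
            datasourceUid: __expr__
            model:
              refId: C
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [10]
        for: 5m
        noDataState: OK              # no matching log lines means no 5xx responses
        execErrState: Error
        labels:
          severity: critical
        annotations:
          summary: Backend 5xx responses exceeded the threshold in the last 5 minutes
```

The `datasourceUid` reference is the reason the datasource needs a stable UID: the rule file is written before Grafana starts, so it cannot depend on a generated identifier.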
egullickson added 1 commit 2026-02-06 16:33:11 +00:00
docs: update config/CLAUDE.md with Grafana subdirectories (refs #111)
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 34s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 51s
Deploy to Staging / Verify Staging (pull_request) Successful in 2m36s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
842b0eb945
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
egullickson added 1 commit 2026-02-06 16:51:16 +00:00
fix: resolve staging deployment issues with Traefik, Loki, and Alloy (refs #105)
All checks were successful
Deploy to Staging / Build Images (pull_request) Successful in 1m21s
Deploy to Staging / Deploy to Staging (pull_request) Successful in 48s
Deploy to Staging / Verify Staging (pull_request) Successful in 2m37s
Deploy to Staging / Notify Staging Ready (pull_request) Successful in 8s
Deploy to Staging / Notify Staging Failure (pull_request) Has been skipped
462d306783
- Exclude blue-green.yml from staging Traefik by mounting dynamic-staging/
  directory (only grafana.yml + middleware.yml) instead of dynamic/ which
  contains production-only blue-green routing config
- Disable Loki healthcheck: distroless image has no /bin/sh so CMD-SHELL
  healthchecks cannot execute; Alloy and Grafana verify Loki connectivity
- Fix Alloy healthcheck: replace wget (not in image) with bash /dev/tcp
- Add Grafana staging domain override (logs.staging.motovaultpro.com)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
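
The two healthcheck changes above could be expressed in Compose roughly as follows. This is a sketch, not the PR's actual `docker-compose.yml`: the service names, timing values, and the Alloy port are assumptions (12345 is Alloy's default HTTP listen port and is assumed unchanged here).

```yaml
# Hypothetical sketch of the healthcheck overrides described in this commit.
services:
  loki:
    healthcheck:
      # The distroless image has no shell, so a CMD-SHELL check cannot run.
      disable: true

  alloy:
    healthcheck:
      # Open a TCP connection with bash's /dev/tcp instead of wget,
      # which is not present in the image. Exit status 0 means the port accepted.
      test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/127.0.0.1/12345"]
      interval: 30s
      timeout: 5s
      retries: 3
```

A `/dev/tcp` check only confirms that the port accepts connections, so it is weaker than an HTTP readiness probe but needs nothing beyond bash in the image.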
egullickson merged commit 88db803b6a into main 2026-02-06 17:44:05 +00:00
egullickson deleted branch issue-105-add-grafana-dashboards 2026-02-06 17:44:06 +00:00

Reference: egullickson/motovaultpro#112