feat: Add Grafana dashboards and alerting for application observability #105
Summary
The logging stack (Alloy -> Loki -> Grafana) was implemented but no Grafana dashboards were created. Grafana at
https://logs.motovaultpro.com currently has only the Loki datasource provisioned (config/grafana/datasources/loki.yml), with no dashboards or alerting rules. We need file-based provisioned dashboards to easily spot and debug errors across all 9 containers.

Current State

- Grafana running at https://logs.motovaultpro.com
- Alloy labels: container, service
- Backend structured JSON fields: requestId, method, path, status, duration, ip, userId, error, stack

Requirements
Provisioning Method
File-based provisioning: Dashboard JSON files in config/grafana/dashboards/ with a provisioning YAML config, auto-loaded on container startup. Must be version-controlled and reproducible.

Dashboards (4 total)
1. Application Overview Dashboard
2. API Performance Dashboard
   (percentile panels unwrap the duration field; per-endpoint tables group by the path field)
3. Error Investigation Dashboard
   (errors grouped by the path field; request correlation by requestId)
4. Infrastructure Dashboard
Alerting Rules
Configuration Changes Required
- Create config/grafana/dashboards/ directory
- Create config/grafana/provisioning/dashboards.yml
- Modify docker-compose.yml to mount the dashboards provisioning directory
- Update docker-compose.staging.yml and docker-compose.prod.yml accordingly

Available Log Labels and Fields
Alloy Labels (for LogQL selectors):
- container - Container name (e.g., mvp-backend, mvp-postgres)
- service - Docker Compose service name

Backend Structured JSON Fields (for LogQL JSON parsing):
- level - Log level (info, warn, error, debug)
- time - ISO 8601 timestamp
- requestId - UUID v4 correlation ID
- method - HTTP method
- path - Request URL path
- status - HTTP status code
- duration - Request processing time in ms
- ip - Client IP
- userId - Auth0 user ID
- error - Error message
- stack - Stack trace
- msg - Log message

Example LogQL Queries:
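Representative queries built from the labels and fields above (all three patterns appear later in the plan's dashboard milestones; $requestId is a Grafana template variable, not a literal value):

```logql
# All backend error logs with parsed JSON fields
{container="mvp-backend"} | json | level="error"

# Log volume per container over the last minute
sum by (container) (count_over_time({container=~"mvp-.*"}[1m]))

# Correlate all logs for one request (full-text match, no JSON parse needed)
{container="mvp-backend"} |= "$requestId"
```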
Monitored Containers
- mvp-traefik - Reverse proxy
- mvp-frontend - React SPA
- mvp-backend - Fastify API (19 feature capsules)
- mvp-ocr - Python OCR microservice
- mvp-postgres - PostgreSQL database
- mvp-redis - Redis cache
- mvp-loki - Log storage
- mvp-alloy - Log collector
- mvp-grafana - Visualization

Acceptance Criteria
- Dashboards use the provisioned labels (container, service)
- Dashboard JSON files live in config/grafana/dashboards/

Plan: Grafana Dashboards and Alerting
Phase: Planning | Agent: Planner | Status: AWAITING_REVIEW
Pre-Planning Analysis
Codebase Analysis: Reviewed all Grafana, Alloy, Loki, and backend logging configuration. Current state: Grafana 12.4.0 running with only Loki datasource provisioned. No dashboards, no alerting, no provisioning directory.
Decision Critic Verdict: STAND. All 7 verifiable claims passed. File-based provisioning confirmed supported. Key adjustment: log-based percentile metrics are approximations (no Prometheus), acceptable for current stack. Recommend webhook placeholder for future push alerting.
Sub-Issues Created
Milestone Breakdown
M1: Provisioning Infrastructure (#106)
Agent: Platform Agent
Deliverables:
- config/grafana/provisioning/dashboards.yml - Provider config pointing to /var/lib/grafana/dashboards
- config/grafana/dashboards/.gitkeep - Empty directory for dashboard JSON files
- docker-compose.yml - Add two volume mounts to mvp-grafana:
  - ./config/grafana/provisioning:/etc/grafana/provisioning/dashboards:ro
  - ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro

Files Changed: 3 new, 1 modified
Validation: Grafana starts, logs show "MotoVaultPro" dashboard provider loaded
M2: Application Overview Dashboard (#107)
Agent: Platform Agent
Deliverables:
Create
config/grafana/dashboards/application-overview.json with 5 panels.

Key LogQL patterns:
- {container=~"mvp-.*"} | json | level="error"
- sum by (container) (count_over_time(...))

Files Changed: 1 new
Validation: Dashboard auto-loads, all panels render with data
M3: API Performance Dashboard (#108)
Agent: Platform Agent
Deliverables:
Create
config/grafana/dashboards/api-performance.json with 6 panels (percentiles via quantile_over_time + unwrap duration).

Note: Percentile calculations are log-based approximations. All queries filter on msg="Request processed" to isolate request logs from other backend logs.

Files Changed: 1 new
Validation: Dashboard auto-loads, percentile panels render, endpoint tables populated
M4: Error Investigation Dashboard (#109)
Agent: Platform Agent
Deliverables:
Create
config/grafana/dashboards/error-investigation.json with 7 panels + 1 template variable (stack traces rendered via line_format; correlation via the $requestId variable).

Template variable: requestId (text input, used in panel 6)

Files Changed: 1 new
Validation: Error stream shows errors, requestId lookup works, stack traces rendered
M5: Infrastructure Dashboard (#110)
Agent: Platform Agent
Deliverables:
Create
config/grafana/dashboards/infrastructure.json with 8 panels. Filter patterns include:
- Postgres: |~ "ERROR|WARNING|FATAL"
- Traefik: |~ "level=error|err="
- OCR: |~ "ERROR|error|Exception|Traceback"

Files Changed: 1 new
Validation: All infrastructure containers have dedicated panels rendering data
M6: Alerting Rules and Documentation (#111)
Agent: Platform Agent
Deliverables:
- config/grafana/provisioning/alerting/alert-rules.yml - 3 alert rules
- config/grafana/provisioning/alerting/contact-points.yml - Default Grafana UI + webhook placeholder
- config/grafana/provisioning/alerting/notification-policies.yml - Route all alerts to default contact point
- docker-compose.yml - Mount alerting provisioning directory
- docs/LOGGING.md - Add dashboards section, alerting rules, LogQL reference

Files Changed: 3 new, 2 modified
Validation: Alert rules load on startup, visible in Grafana Alerting UI, docs updated
Execution Order
Total Files Changed
Risk Register
Verdict: AWAITING_REVIEW | Next: Plan review cycle (QR completeness -> TW scrub -> QR code -> QR docs)
TW SCRUB: NEEDS_REVISION
Agent: Technical Writer | Mode: plan-scrub
Findings
F1. CRITICAL -- M1 volume mount path will not serve dashboards to Grafana
Location: M1, deliverable 3
Issue: The plan specifies two volume mounts:
The first mount maps the host directory
config/grafana/provisioning/ to the container path /etc/grafana/provisioning/dashboards/. This means dashboards.yml would appear inside the container at /etc/grafana/provisioning/dashboards/dashboards.yml. Grafana expects the provider configuration YAML to be at /etc/grafana/provisioning/dashboards/*.yml -- so the file location is technically correct.

However, the existing datasource provisioning uses a different pattern:
This maps
config/grafana/datasources/ (which contains loki.yml) directly to the provisioning subdirectory. For consistency and clarity, the dashboard provisioning YAML should follow the same pattern, or dashboards.yml could simply be placed inside config/grafana/dashboards/ alongside the JSON files (Grafana will ignore non-YAML files when loading provider configs, and dashboards.yml points to a path which can be the same directory or a subdirectory).

Suggested fix: Clarify the exact host directory layout. Recommended approach matching existing convention:
- config/grafana/dashboards/dashboards.yml -- provider config (points to /var/lib/grafana/dashboards)
- config/grafana/dashboards/*.json -- dashboard JSON files
- Mount ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards:ro for the provider config
- Mount ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro for the JSON files

Or separate directories to avoid confusion. Either way, the plan must be explicit about which files live in which host directory.
F2. HIGH -- M6 alerting volume mount not specified in M1
Location: M1 deliverable 3 vs M6 deliverable 4
Issue: M1 defines the docker-compose volume mounts but only lists two (dashboard provisioning + dashboard JSON files). M6 deliverable 4 says "Modify docker-compose.yml - Mount alerting provisioning directory" -- but M1 claims to be the complete provisioning infrastructure milestone. This creates ambiguity: does M1 deliver the complete volume mount set, or does M6 add another mount later?
The alerting files at
config/grafana/provisioning/alerting/ need a mount to /etc/grafana/provisioning/alerting/ inside the container. This mount should be specified in M1 (the infrastructure milestone) rather than deferred to M6, since M1's purpose is "Provisioning Infrastructure."

Suggested fix: Add the alerting provisioning mount to M1 deliverable 3:
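A sketch of the missing mount, assuming the mvp-grafana service shape used elsewhere in the plan (not verbatim repo config):

```yaml
# docker-compose.yml -- add to the mvp-grafana service volumes
services:
  mvp-grafana:
    volumes:
      - ./config/grafana/provisioning/alerting:/etc/grafana/provisioning/alerting:ro
```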
Then remove deliverable 4 from M6, or change M6 deliverable 4 to "Verify alerting files load via mount created in M1."
F3. MEDIUM -- LogQL quantile_over_time query syntax incorrect
Location: M3 (API Performance Dashboard), deliverable 2
Issue: The issue body's quantile_over_time query pattern has two syntax problems:

1. The unwrap expression must come before the range selector in LogQL. The correct order is: | unwrap duration | __error__="" [5m]. The __error__="" filter after unwrap is best practice to drop entries where unwrap fails (non-numeric duration values).
2. by () with empty parentheses is valid LogQL (it means "aggregate all into one series"), but should be documented as intentional since it differs from the by (container) pattern used elsewhere.

Suggested fix: Correct the query pattern in M3 to:
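With the unwrap ordering fixed and the error filter added, the p50 pattern (matching the query later adopted in the revised M3) reads:

```logql
quantile_over_time(0.50,
  {container="mvp-backend"}
    | json
    | msg="Request processed"
    | unwrap duration
    | __error__=""
  [5m]
)
```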
Drop by () since it is the default behavior when no grouping clause is specified.

F4. LOW -- Terminology inconsistency: "provisioning" directory naming
Location: M1, M6
Issue: The plan uses
config/grafana/provisioning/ as a host directory that maps to different container paths. This creates a naming collision with Grafana's internal /etc/grafana/provisioning/ directory. The host file config/grafana/provisioning/dashboards.yml is a provider config, while config/grafana/provisioning/alerting/ contains alerting rules -- these serve different Grafana subsystems but share a parent directory on the host.

Suggested fix: Consider one of:
(a) config/grafana/dashboards/ for dashboard provider config + JSON, config/grafana/alerting/ for alerting YAML files (parallels the existing config/grafana/datasources/ pattern)

Option (a) is recommended for consistency with the existing config/grafana/datasources/ convention.

F5. LOW -- "Files Changed" counts in M1 are misleading
Location: M1
Issue: M1 says "Files Changed: 3 new, 1 modified" but deliverable 2 is a .gitkeep file which will be deleted once dashboard JSONs are added in M2. Creating a file that exists only between milestones is unnecessary if M1 and M2 are implemented sequentially. If they were parallel (they cannot be -- M2 depends on M1), the .gitkeep would serve a purpose.

Suggested fix: Remove .gitkeep from M1 deliverables. The directory will be created implicitly when M2 writes the first dashboard JSON. Update the count to "2 new, 1 modified."

F6. INFO -- duration field is numeric (milliseconds), not a Loki duration type
Location: M3 panel 2 (Response Time Distribution)
Issue: The backend logs duration as an integer (milliseconds): duration: Date.now() - (request.startTime || Date.now()). The unwrap duration in LogQL will extract this as a numeric value, which is correct. However, the plan should note that the resulting p50/p95/p99 values are in milliseconds, and Grafana panel units should be set to ms to display correctly. This is not an error, but an implementor could miss the unit configuration.

Suggested fix: Add a note in M3: "Panel unit: milliseconds (ms). The duration field is logged as integer milliseconds by the backend."

Considered But Not Flagged
- {container="mvp-backend"} |= "$requestId" for template variable: Valid. Grafana interpolates $requestId before sending to Loki. The |= operator performs a full-text substring match, which is correct for correlating by request ID without requiring JSON parsing first (faster).
- sum by (container) (count_over_time({container=~"mvp-.*"}[1m])): Valid LogQL. Stream selector with regex, range aggregation, grouped by label.
- {container="mvp-backend"} | json | level="error" | line_format "{{.error}}\n{{.stack}}": Valid LogQL. JSON parse, label filter, then format output. The Go template syntax {{.error}} is correct for line_format.
- msg="Request processed" filter: Confirmed matches backend code exactly (logging.plugin.ts line 20: logger.info('Request processed', {...})).
- Alloy labels container and service: Confirmed from config/alloy/config.alloy -- container is extracted from __meta_docker_container_name (stripped leading /), service from the compose service label.
- Staging/prod compose inheritance claim: Confirmed. Both use the -f docker-compose.yml -f docker-compose.{staging,prod}.yml overlay pattern. Neither overrides Grafana volumes, so base compose mounts are inherited.
- Existing config/monitoring/ directory: Contains Prometheus-style alert rules (service-health.yml). The plan creates Grafana-native alerting in a separate directory (config/grafana/provisioning/alerting/), which is correct since these are different alerting systems (Prometheus vs Grafana unified alerting).
- Temporal contamination: No violations detected. The plan describes deliverables and file paths without referencing past changes or using change-relative language.
Execution order diagram: Correctly shows M1 as prerequisite for all others, with M2-M5 parallelizable and M6 depending on all dashboards existing.
Verdict: 2 findings require revision before implementation (F1, F2). F3 should also be corrected to prevent implementor confusion with LogQL syntax. F4-F6 are recommendations that can be addressed during implementation.
Plan Review Cycle Results
Phase: Plan-Review | Agent: Quality Reviewer + Technical Writer | Status: PASS_WITH_CONCERNS
1. QR Plan-Completeness: PASS_WITH_CONCERNS
All 11 acceptance criteria are covered. 4 findings:
Not Flagged: Mobile+desktop requirement not applicable (Grafana is ops tooling, not user-facing app). Execution order correct. File changes comprehensive.
2. TW Plan-Scrub: NEEDS_REVISION
3 findings requiring revision:
- Fix quantile_over_time syntax: correct unwrap placement, add the __error__="" filter

Lower priority: Host dir naming convention (LOW), unnecessary .gitkeep (LOW), document duration field unit in ms (INFO).

Verified correct: msg="Request processed" matches backend code exactly. Alloy labels confirmed. Staging/prod compose inheritance confirmed.

3. QR Plan-Code (RULE 0/1/2): PASS
No blocking RULE 0/1/2 violations.
- :ro mounts correct, no security concerns

Non-blocking notes: Verify msg="Request processed" exists in backend logs (confirmed). Consider documenting the alert threshold tuning process.

4. QR Plan-Docs: NEEDS_CHANGES
4 findings:
- Missing config/CLAUDE.md update: add grafana/dashboards/ and grafana/provisioning/ to the config navigation table

Consolidated Action Items
Must fix before implementation:
- Fix quantile_over_time syntax in M3
- Add the config/CLAUDE.md update to M6 deliverables

Should fix:
6. Add Decision Log section to plan
7. Add Policy Defaults (refresh rate, time range, eval interval)
8. Add inline YAML comments to provisioning files
9. Add "how to add/modify dashboards" to LOGGING.md update
10. Document
duration field unit (milliseconds)

Verdict: PASS_WITH_CONCERNS | Next: Address findings, then revise plan
Revised Plan: Grafana Dashboards and Alerting (v2)
Phase: Planning | Agent: Planner | Status: IN_PROGRESS
All review findings addressed. Changes marked with [REV].
Decision Log [REV - added per QR completeness finding]
- Percentiles: log-based quantile_over_time (approximation; no Prometheus in the stack)

Policy Defaults [REV - added per QR completeness finding]
- Dashboard refresh 30s, default time range 6h
- Provider updateIntervalSeconds and allowUiUpdates set explicitly in dashboards.yml
- duration field unit: milliseconds (ms) on every panel that unwraps the duration field

Milestone Breakdown (Revised)
M1: Provisioning Infrastructure (#106)
Agent: Platform Agent
[REV - consolidated ALL provisioning mounts here, fixed paths]
Host directory structure:
Container mount mapping:
docker-compose.yml mvp-grafana volumes (final state):
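One plausible mapping consistent with the host layout and container paths named in this milestone (a sketch, not verbatim repo config; the single-file bind mount for dashboards.yml is an assumption):

```yaml
# docker-compose.yml -- mvp-grafana volumes, final state per M1
services:
  mvp-grafana:
    volumes:
      - ./config/grafana/datasources:/etc/grafana/provisioning/datasources:ro  # existing datasource provisioning
      - ./config/grafana/provisioning/dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml:ro  # provider config
      - ./config/grafana/provisioning/alerting:/etc/grafana/provisioning/alerting:ro  # alerting provisioning (M6 files)
      - ./config/grafana/dashboards:/var/lib/grafana/dashboards:ro  # dashboard JSON files
```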
Provisioning YAML (config/grafana/provisioning/dashboards.yml) [REV - added inline comments]

Files Changed: 2 new files, 1 modified
- config/grafana/provisioning/dashboards.yml (NEW)
- config/grafana/provisioning/alerting/.gitkeep (NEW - placeholder for M6 alert files)
- docker-compose.yml (MODIFY - add 3 volume mounts to mvp-grafana)

Validation: Grafana starts, logs show "MotoVaultPro" dashboard provider loaded, alerting provisioning directory exists
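The provider config named above might look like the following (provider name taken from the validation note; interval and flag values are illustrative assumptions):

```yaml
# config/grafana/provisioning/dashboards.yml
apiVersion: 1
providers:
  - name: MotoVaultPro            # appears in Grafana startup logs when loaded
    orgId: 1
    type: file
    disableDeletion: true         # provisioned dashboards cannot be deleted from the UI
    updateIntervalSeconds: 30     # how often Grafana re-scans the path for changes
    allowUiUpdates: false         # files are the source of truth; UI edits are not persisted
    options:
      path: /var/lib/grafana/dashboards
```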
M2: Application Overview Dashboard (#107)
Agent: Platform Agent
Create
config/grafana/dashboards/application-overview.json with 5 panels:
1. Log volume by container: sum by (container) (count_over_time({container=~"mvp-.*"}[1m]))
2. Error rate (%): sum(count_over_time({container=~"mvp-.*"} | json | level="error" [5m])) / sum(count_over_time({container=~"mvp-.*"}[5m])) * 100
3. Log counts by container and level: sum by (container, level) (count_over_time({container=~"mvp-.*"} | json [5m]))
4. Container liveness: count_over_time({container="mvp-backend"}[5m]) > 0 (one per container)
5. Request throughput: count_over_time({container="mvp-backend"} | json | msg="Request processed" [1m])

Dashboard settings: Refresh 30s, default time range 6h.
Files Changed: 1 new
Validation: Dashboard auto-loads, all panels render
M3: API Performance Dashboard (#108)
Agent: Platform Agent
Create
config/grafana/dashboards/api-performance.json with 6 panels:
1. Request rate: rate({container="mvp-backend"} | json | msg="Request processed" [1m])
2. p50/p95/p99 latency: quantile_over_time(0.50, {container="mvp-backend"} | json | msg="Request processed" | unwrap duration | __error__="" [5m]) by () (plus 0.95 and 0.99)
3. Status code distribution: sum by (status) (count_over_time({container="mvp-backend"} | json | msg="Request processed" [5m]))
4. Slowest endpoints: topk(10, avg by (path) (avg_over_time({container="mvp-backend"} | json | msg="Request processed" | unwrap duration | __error__="" [5m])))
5. Requests by endpoint: sum by (path) (count_over_time({container="mvp-backend"} | json | msg="Request processed" [5m]))
6. Status by endpoint: sum by (path, status) (count_over_time({container="mvp-backend"} | json | msg="Request processed" [5m]))

Note: Percentile panels display log-based approximations. The duration field is in milliseconds (from the Pino logger); __error__="" filters parse failures.

Dashboard settings: Refresh 30s, default time range 6h.
Files Changed: 1 new
Validation: Dashboard auto-loads, percentile panels render, endpoint tables populated
M4: Error Investigation Dashboard (#109)
Agent: Platform Agent
Create
config/grafana/dashboards/error-investigation.json with 7 panels + 1 template variable:
1. Live error stream: {container=~"mvp-.*"} | json | level="error"
2. Total error count: sum(count_over_time({container=~"mvp-.*"} | json | level="error" [1m]))
3. Errors by container: sum by (container) (count_over_time({container=~"mvp-.*"} | json | level="error" [5m]))
4. Errors by endpoint: sum by (path) (count_over_time({container="mvp-backend"} | json | level="error" [5m]))
5. Error messages with stack traces: {container="mvp-backend"} | json | level="error" | line_format "{{.error}}\n{{.stack}}"
6. Request correlation (by requestId): {container="mvp-backend"} |= "$requestId"
7. 5xx responses: {container="mvp-backend"} | json | msg="Request processed" | status >= 500

Template variable: requestId (text input)

Dashboard settings: Refresh 30s, default time range 6h.
Files Changed: 1 new
Validation: Error stream works, requestId lookup works, stack traces visible
M5: Infrastructure Dashboard (#110)
Agent: Platform Agent
Create
config/grafana/dashboards/infrastructure.json with 8 panels:
1. Log rate by container: sum by (container) (rate({container=~"mvp-.*"}[1m]))
2. Postgres errors: {container="mvp-postgres"} |~ "ERROR|WARNING|FATAL"
3. Redis logs: {container="mvp-redis"}
4. Traefik logs: {container="mvp-traefik"}
5. Traefik errors: {container="mvp-traefik"} |~ "level=error|err="
6. OCR logs: {container="mvp-ocr"}
7. OCR errors: {container="mvp-ocr"} |~ "ERROR|error|Exception|Traceback"
8. Loki ingestion rate: sum(rate({container="mvp-loki"}[1m]))

Dashboard settings: Refresh 30s, default time range 6h.
Files Changed: 1 new
Validation: All infrastructure containers have panels rendering data
M6: Alerting Rules and Documentation (#111)
Agent: Platform Agent
[REV - explicit contact point/notification policy creation, detailed docs scope, config/CLAUDE.md update]
Alerting files (in
config/grafana/provisioning/alerting/):

alert-rules.yml [REV - inline comments]:
1. High error rate: sum(count_over_time({container=~"mvp-.*"} | json | level="error" [5m])) / sum(count_over_time({container=~"mvp-.*"}[5m])) * 100 > 5 | severity: critical | for: 5m | eval: 1m
2. Container silent: count_over_time({container="mvp-backend"}[5m]) == 0 (per critical container: backend, postgres, redis) | severity: warning | for: 5m | eval: 1m
3. 5xx spike: sum(count_over_time({container="mvp-backend"} | json | msg="Request processed" | status >= 500 [5m])) > 10 | severity: critical | for: 5m | eval: 1m

contact-points.yml: default Grafana UI contact point + webhook placeholder
notification-policies.yml: route all alerts to the default contact point

Documentation updates:
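Sketches of the two supporting files (schema follows Grafana's alerting provisioning format; orgId, uid, and the webhook URL are placeholder assumptions):

```yaml
# config/grafana/provisioning/alerting/contact-points.yml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: default-webhook
    receivers:
      - uid: default-webhook-uid
        type: webhook
        settings:
          url: https://example.invalid/alerts   # placeholder for future push alerting
---
# config/grafana/provisioning/alerting/notification-policies.yml (separate file in practice)
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-webhook   # route all alerts to the default contact point
```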
- docs/LOGGING.md - Add these sections after the existing "Grafana Access" section: dashboards overview, alerting rules, LogQL reference, and how to add/modify dashboards (drop JSON into config/grafana/dashboards/, container restart to reload)
- config/CLAUDE.md [REV - new deliverable]: Update subdirectories table:
  - grafana/dashboards/ - "Provisioned Grafana dashboard JSON files"
  - grafana/provisioning/ - "Grafana provisioning configs (dashboards, alerting)"

Files Changed: 3 new (replace .gitkeep from M1), 3 modified
- config/grafana/provisioning/alerting/alert-rules.yml (NEW)
- config/grafana/provisioning/alerting/contact-points.yml (NEW)
- config/grafana/provisioning/alerting/notification-policies.yml (NEW)
- docs/LOGGING.md (MODIFY)
- config/CLAUDE.md (MODIFY)
- config/grafana/provisioning/alerting/.gitkeep (replaced by real files)

Validation: Alert rules load on startup, visible in Grafana Alerting UI, docs updated, config/CLAUDE.md navigation updated
Execution Order (unchanged)
Total Files Changed (revised)
Risk Register (unchanged)
- Percentile accuracy: mitigated by the __error__="" filter, documented as an approximation

Verdict: APPROVED | Next: Create branch and begin execution
Final Review: Quality Agent RULE 0/1/2
Phase: Review | Agent: Quality Agent | Status: PASS_WITH_MINOR_CONCERNS
RULE 0: Production Reliability (CRITICAL)
VERDICT: PASS
- :ro (read-only) flag correctly applied to all new mounts
- | __error__="" filters present on unwrap queries
- noDataState and execErrState explicitly set
- datasourceUid: loki used consistently

RULE 1: Project Conformance (HIGH)
VERDICT: PASS_WITH_CONCERNS
Conforming:
- File placement follows the existing convention (config/grafana/{subdirectory}/)
- Documentation (docs/LOGGING.md) updated comprehensively

Concern: config/CLAUDE.md not updated
Plan v2 M6 deliverable 5 specified adding
grafana/dashboards/ and grafana/provisioning/ to the config/CLAUDE.md subdirectories table. This was not done. Non-blocking - affects future AI context efficiency only.

Acceptance Criteria (all met):
RULE 2: Structural Quality (SHOULD_FIX)
VERDICT: PASS
Files Changed (11 total)
Non-Blocking Action Item
Update
config/CLAUDE.md with grafana/dashboards/ and grafana/provisioning/ rows (can be a follow-up).

Verdict: PASS_WITH_MINOR_CONCERNS | Next: Merge PR