motovaultpro/ETL-FIXES.md
2025-12-14 13:55:39 -06:00

# ETL Fixes Plan (Multi-Agent Dispatch) — Vehicle Dropdown Data
## Purpose
Fix the ETL that populates vehicle dropdown data so it is:
- Clean (no duplicate dimension rows, no duplicate fact rows).
- Year-accurate for trims (no “impossible” year/make/model/trim combinations).
- Rerunnable across environments.
- Limited to a configurable year window (default **2000-2026**) with **no API-level filtering changes**.
This plan is written to be dispatched to multiple AI agents working in parallel.
## Scope
Backend vehicle dropdowns (Year → Make → Model → Trim → Engine → Transmission).
In-scope:
- ETL logic and output SQL generation (`data/make-model-import/etl_generate_sql.py`).
- Import script behavior (`data/make-model-import/import_data.sh`).
- ETL schema migration used by the import (`data/make-model-import/migrations/001_create_vehicle_database.sql`).
- Data quality validation harness (new script(s)).
- Documentation updates for rerun workflow.
Out-of-scope:
- Any API filtering logic changes. The API must continue to reflect whatever data exists in the DB.
- Network calls or new scraping. **Use local scraped data only.**
## Current Data Contract (as used by backend)
Backend dropdowns currently query:
- `public.vehicle_options`
- `public.engines` (joined by `engine_id`)
- `public.transmissions` (joined by `transmission_id`)
Primary call sites:
- `backend/src/features/platform/data/vehicle-data.repository.ts`
- `backend/src/features/platform/domain/vehicle-data.service.ts`
- Dropdown routes: `backend/src/features/vehicles/api/vehicles.routes.ts`
## Requirements (Confirmed)
### Year range behavior
- Data outside the configured year window must **not be loaded**.
- Default year window: **2000-2026** (configurable).
- No API changes to filter years.
### Missing data defaults
- If **no trim exists**, map to trim `"Base"`.
- If **no detailed engine spec exists**, default to one of: `Gas` / `Diesel` / `Electric` / `Hybrid`.
  - If local scraped data indicates EV → show `Electric`.
  - If it indicates Diesel → show `Diesel`.
  - If it indicates Hybrid (including mild / plug-in) → show `Hybrid`.
  - Otherwise, default to `Gas`.
- If **no specific transmission data** exists for a `(year, make, model)`, show **both** `Manual` and `Automatic`.
- If a detailed engine spec is known, **always use the detailed engine spec** (do not replace it with the fuel-type default label).
### Transmission granularity
- Transmission dropdown should be correct at the `(year, make, model)` level (trim-specific not required).
## Observed Defects (Root Causes)
### 1) Massive duplicate dimension rows
Examples:
- `data/make-model-import/output/02_transmissions.sql` contains repeated values like:
  - `(1,'1-Speed Automatic')`, `(2,'1-Speed Automatic')`, …
Reason:
- ETL dedupes transmissions on a raw tuple `(gearbox_string, speed, drive_type)` but *stores* only a simplified display string, so many distinct raw tuples collapse to the same output `type`.
Similarly for engines:
- `data/make-model-import/output/01_engines.sql` has many repeated engine display names.
Reason:
- ETL assigns IDs per raw scraped engine record (30,066), even though the UI-facing engine name collapses to far fewer distinct names.
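
One way to avoid both kinds of duplication is to key dimension IDs on the final display string instead of the raw record. The sketch below is only an illustration under that assumption; `build_dimension` is a hypothetical helper, not a function that exists in `etl_generate_sql.py`.

```python
def build_dimension(display_names):
    """Assign one sequential ID per distinct display name.

    Names are compared case-insensitively with whitespace collapsed, so
    '1-Speed Automatic' and '1-speed  automatic' share a single row.
    (Hypothetical helper for illustration.)
    """
    ids = {}   # normalized name -> id
    rows = []  # (id, display name), in first-seen order
    for name in display_names:
        cleaned = " ".join(name.split())
        key = cleaned.lower()
        if key not in ids:
            ids[key] = len(ids) + 1
            rows.append((ids[key], cleaned))
    return ids, rows


raw = ["1-Speed Automatic", "1-Speed Automatic", "6-Speed Manual", "1-speed automatic"]
ids, rows = build_dimension(raw)
print(rows)  # [(1, '1-Speed Automatic'), (2, '6-Speed Manual')]
```

Because the lookup key is the normalized display name, any number of raw tuples that collapse to the same string resolve to the same ID.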
### 2) Inaccurate year/make/model/trim mappings (dropdown integrity break)
Example:
- User can select `1992 Chevrolet Corvette Z06` which never existed.
Root cause:
- `data/make-model-import/makes-filter/*.json` includes trims/submodels that appear to be “all-time” variants, not year-accurate.
  - Example evidence: `data/make-model-import/makes-filter/chevrolet.json` contains `Z06` for 1992 Corvette.
  - Resulting DB evidence: `data/make-model-import/output/03_vehicle_options.sql` includes `(1992,'Chevrolet','Corvette','Z06',...)`.
### 3) Duplicate rows in `vehicle_options`
Example evidence:
- `data/make-model-import/output/03_vehicle_options.sql` shows repeated identical rows for the same year/make/model/trim/engine/transmission.
Root causes:
- No dedupe at the fact level prior to SQL generation.
- Dimension ID strategy makes it difficult to dedupe correctly.
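
Once dimension IDs are stable per display name, fact-level dedupe reduces to dropping repeated tuples. A minimal sketch, assuming each row is already a hashable `(year, make, model, trim, engine_id, transmission_id)` tuple (`dedupe_options` is a hypothetical name):

```python
def dedupe_options(rows):
    """Drop repeated fact tuples while keeping first-seen order."""
    # dict preserves insertion order, so this is an order-stable dedupe.
    return list(dict.fromkeys(rows))


rows = [
    (2015, "Chevrolet", "Corvette", "Base", 1, 2),
    (2015, "Chevrolet", "Corvette", "Base", 1, 2),  # exact repeat
    (2015, "Chevrolet", "Corvette", "Base", 1, 3),
]
print(len(dedupe_options(rows)))  # 2
```

This only works if two rows that look identical in the UI also carry identical IDs, which is why the dimension fix has to come first.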
## Data Sources (Local Only)
Inputs in `data/make-model-import/`:
- `makes-filter/*.json`: provides coverage for makes/models by year, but trims/engines are not reliable for year accuracy.
- `automobiles.json`: contains “model pages” with names that include year or year ranges (e.g., `2013-2019`, `2021-Present`).
- `engines.json`: engine records keyed to `automobile_id` with specs and “Transmission Specs”.
- `brands.json`: make name metadata (ALL CAPS) + `id` used by `automobiles.json.brand_id`.
## Target ETL Strategy (Baseline Grid + Evidence Overlay)
We cannot use network sources, so the best available path is:
1) **Baseline coverage** from `makes-filter` for `(year, make, model)` within `[MIN_YEAR, MAX_YEAR]`.
2) **Year-accuracy overlay** from `automobiles.json` + `engines.json`:
   - Parse each automobile into:
     - Canonical make (via `brand_id → brands.json` mapping).
     - Model name and an inferred trim/variant string.
     - Year range (start/end) from the automobile name.
   - Use these to build **evidence sets**:
     - Which trims are evidenced for a `(make, model)` and which year ranges they apply to.
     - Which engines/transmissions are evidenced (via `engines.json`) for that automobile entry.
3) Generate `vehicle_options`:
   - For each baseline `(year, make, model)`:
     - If overlay evidence exists for that `(make, model, year)`:
       - Use evidenced trims for that year (trim defaults to `Base` if missing).
       - Engines: use detailed engine display names when available; else fuel-type fallback label.
       - Transmissions: derive from engine specs when available; else fallback to `Manual`+`Automatic`.
     - If no overlay evidence exists:
       - Create a single trim entry, `Base`.
       - Engine default label `Gas` (or other fuel label if you can infer it locally without guessing; otherwise `Gas`).
       - Transmission fallback `Manual`+`Automatic`.
This approach ensures:
- Completeness: you still have a working dropdown for all year/make/model combos in-range.
- Accuracy improvements where the scraped evidence supports it (especially trims by year).
- No invented trims like `Z06` in years where there is no overlay evidence for `Z06` in that year range.
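
The generation step can be sketched roughly as follows. The shape of the `evidence` structure, the function name, and the sample values are all assumptions made for illustration; they are not taken from the existing ETL or the local data files.

```python
from itertools import product

FALLBACK_TRANSMISSIONS = ("Manual", "Automatic")


def generate_options(baseline, evidence, min_year=2000, max_year=2026):
    """Combine the baseline grid with overlay evidence.

    baseline: iterable of (year, make, model).
    evidence: dict mapping (make, model) to a list of dicts with keys
        trim, year_start, year_end, engines, transmissions
        (hypothetical structure).
    Returns sorted, de-duplicated tuples of
        (year, make, model, trim, engine_name, transmission_name).
    """
    rows = set()
    for year, make, model in baseline:
        if not (min_year <= year <= max_year):
            continue  # outside the configured window: never loaded
        matches = [
            e for e in evidence.get((make, model), [])
            if e["year_start"] <= year <= e["year_end"]
        ]
        if not matches:
            matches = [{}]  # no evidence for this year: defaults only
        for e in matches:
            trim = e.get("trim") or "Base"
            engines = e.get("engines") or ["Gas"]
            transmissions = e.get("transmissions") or FALLBACK_TRANSMISSIONS
            for engine, transmission in product(engines, transmissions):
                rows.add((year, make, model, trim, engine, transmission))
    return sorted(rows)


# Illustrative sample values only, not read from the local data.
baseline = [(1992, "Chevrolet", "Corvette"), (2006, "Chevrolet", "Corvette")]
evidence = {
    ("Chevrolet", "Corvette"): [
        {"trim": "Z06", "year_start": 2006, "year_end": 2013,
         "engines": ["V8 7.0L"], "transmissions": ["6-Speed Manual"]},
    ],
}
for row in generate_options(baseline, evidence, min_year=1990):
    print(row)
```

With these inputs the 1992 rows carry only `Base` / `Gas` and the two fallback transmissions, while `Z06` appears only for 2006. The sketch applies transmissions per evidence entry; aggregating them per `(year, make, model)` would be a small change if that is preferred.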
## Engine & Transmission Normalization Rules
### Engine display name
Use existing ETL display logic as a base (from `etl_generate_sql.py`) but change the ID strategy:
- If you can create a detailed engine display string (e.g., `V8 5.7L`, `L4 2.0L Turbo`), use it.
- Only use default labels when detailed specs are not available:
  - `Electric` if fuel indicates electric.
  - `Diesel` if fuel indicates diesel.
  - `Hybrid` if fuel indicates any hybrid variant.
  - Else `Gas`.
Fuel mapping should be derived from `engines.json → specs → Engine Specs → Fuel:` which currently includes values like:
- `Electric`
- `Diesel`
- `Hybrid`, `Hybrid Gasoline`, `Mild Hybrid`, `Mild Hybrid Diesel`, `Plug-In Hybrid`, etc.
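
A minimal sketch of the fallback mapping, assuming the `Fuel:` value arrives as a plain string (the function names are hypothetical). Hybrid is tested first so that a value such as `Mild Hybrid Diesel` maps to `Hybrid` and not `Diesel`.

```python
def fuel_fallback_label(fuel):
    """Map a raw 'Fuel:' value to Gas / Diesel / Electric / Hybrid."""
    text = (fuel or "").strip().lower()
    if "hybrid" in text:  # covers mild, plug-in, hybrid gasoline, hybrid diesel
        return "Hybrid"
    if "electric" in text:
        return "Electric"
    if "diesel" in text:
        return "Diesel"
    return "Gas"  # default when nothing else is indicated


def engine_display_name(detailed_name, fuel):
    """Prefer the detailed spec string; fall back to the fuel label."""
    if detailed_name and detailed_name.strip():
        return detailed_name.strip()
    return fuel_fallback_label(fuel)


print(engine_display_name("V8 5.7L", "Gasoline"))  # V8 5.7L
print(engine_display_name(None, "Mild Hybrid Diesel"))  # Hybrid
```

The detailed name always wins when it can be built, matching the requirement that the fuel label never replaces a known engine spec.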
### Transmission display name
Normalize to a small set of UI-friendly strings:
- Prefer `"{N}-Speed Manual"` or `"{N}-Speed Automatic"` when speed and type are known.
- Preserve `CVT`.
- If unknown for a `(year, make, model)`, provide both `Manual` and `Automatic`.
Important: transmission table IDs must be keyed by the **final display name**, not the raw tuple.
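
The normalization could look roughly like this. The raw gearbox string formats are an assumption (the actual "Transmission Specs" text in `engines.json` may differ), and both function names are hypothetical.

```python
import re

FALLBACK_TRANSMISSIONS = ["Manual", "Automatic"]


def normalize_transmission(gearbox):
    """Return a UI-friendly display name, or None if unclassifiable."""
    text = (gearbox or "").lower()
    if "cvt" in text or "continuously variable" in text:
        return "CVT"
    # 'automatic' is checked first so that strings like
    # '6-speed automatic with manual mode' are not labeled Manual.
    if "automatic" in text:
        kind = "Automatic"
    elif "manual" in text:
        kind = "Manual"
    else:
        return None
    speed = re.search(r"(\d+)\s*-?\s*speed", text)
    return f"{speed.group(1)}-Speed {kind}" if speed else kind


def transmissions_for(raw_values):
    """Distinct display names for one (year, make, model), with fallback."""
    names = {normalize_transmission(v) for v in raw_values}
    names.discard(None)
    return sorted(names) if names else list(FALLBACK_TRANSMISSIONS)


print(transmissions_for(["6-speed manual, RWD", "6 Speed Manual"]))  # ['6-Speed Manual']
print(transmissions_for([]))  # ['Manual', 'Automatic']
```

Feeding these display names into a name-keyed ID map is what keeps `02_transmissions.sql` free of repeated values.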
## Schema + Import Requirements (Rerunnable + Clean)
### Migration changes
Update `data/make-model-import/migrations/001_create_vehicle_database.sql` to:
- Match actual stored columns (current migration defines extra columns not populated by ETL).
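- Enforce uniqueness to prevent duplicates:
  - `engines`: unique on normalized name (e.g., a unique index on `LOWER(name)`; in PostgreSQL this must be a `CREATE UNIQUE INDEX`, because a table-level `UNIQUE` constraint accepts only column names, not expressions).
  - `transmissions`: unique on normalized type (e.g., a unique index on `LOWER(type)`).
  - `vehicle_options`: unique on `(year, make, model, trim, engine_id, transmission_id)`.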
### Import script changes
Update `data/make-model-import/import_data.sh` so reruns are consistent:
- Either:
  - `TRUNCATE vehicle_options, engines, transmissions RESTART IDENTITY CASCADE;` before import, then insert, OR
  - Use `INSERT ... ON CONFLICT DO NOTHING` with deterministic IDs (more complex).
Given constraints and large volume, truncation + re-import is simplest and most deterministic for dev environments.
## Validation / QA Harness (New)
Add a new script (recommended location: `data/make-model-import/qa_validate.py`) plus a small SQL file or inline queries.
Must-check assertions:
1) **Year window enforced**
   - `MIN(year) >= MIN_YEAR` and `MAX(year) <= MAX_YEAR`.
2) **No dimension duplicates**
   - `SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1;` returns 0 rows.
   - `SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1;` returns 0 rows.
3) **No fact duplicates**
   - `SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1;` returns 0 rows.
4) **Dropdown integrity sanity**
   - For sampled `(year, make, model)`, trims returned by `get_trims_for_year_make_model()` must match distinct trims in `vehicle_options` for that tuple.
   - For sampled `(year, make, model, trim)`, engines query matches `vehicle_options` join to `engines`.
   - For sampled `(year, make, model)`, transmissions query matches `vehicle_options` join to `transmissions` (plus fallbacks when missing).
Optional (recommended) golden assertions:
- Add a small list of “known invalid historically” checks (like `1992 Corvette Z06`) that must return empty / not present.
- These should be driven by overlay evidence (do not hardcode large historical facts without evidence in local data).
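
For the fast pre-import path, assertions 1 to 3 can be checked on in-memory rows before any database is involved. This is a sketch only: it assumes the generated rows have already been parsed into Python tuples, and `validate` / `duplicates` are hypothetical names.

```python
from collections import Counter


def duplicates(values):
    """Return the sorted values that occur more than once."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


def validate(engines, transmissions, options, min_year=2000, max_year=2026):
    """Run the year-window and duplicate checks.

    engines / transmissions: lists of (id, name).
    options: (year, make, model, trim, engine_id, transmission_id) tuples.
    Returns a list of failure messages; an empty list means all checks pass.
    """
    failures = []
    years = [row[0] for row in options]
    if years and (min(years) < min_year or max(years) > max_year):
        failures.append(f"year window violated: {min(years)}..{max(years)}")
    for label, dim in (("engines", engines), ("transmissions", transmissions)):
        dupes = duplicates(name.lower() for _, name in dim)
        if dupes:
            failures.append(f"duplicate {label}: {dupes}")
    dupes = duplicates(options)
    if dupes:
        failures.append(f"duplicate vehicle_options rows: {len(dupes)}")
    return failures


problems = validate(
    engines=[(1, "V8 5.7L"), (2, "v8 5.7l")],
    transmissions=[(1, "CVT")],
    options=[(1999, "Chevrolet", "Corvette", "Base", 1, 1)] * 2,
)
print(len(problems))  # 3
```

A script built this way can exit non-zero whenever the returned list is non-empty; the post-import SQL checks remain the authoritative ones.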
## Work Breakdown (Assign to Agents)
### Agent A — ETL Core Refactor
Owner: ETL generation logic.
Deliverables:
- Update `data/make-model-import/etl_generate_sql.py`:
  - Add config: `MIN_YEAR`/`MAX_YEAR` (defaults `2000`/`2026`).
  - Replace current engine/transmission ID assignment with dedup-by-display-name mapping.
  - Remove coupling where an `engine_id` implies an index into `engines.json` for transmission lookup.
  - Implement fuel-type fallback label logic (`Gas/Diesel/Electric/Hybrid`) only when detailed engine spec cannot be built.
  - Dedupe `vehicle_options` rows before writing SQL.
Acceptance:
- Generated `output/01_engines.sql` and `output/02_transmissions.sql` contain only unique values.
- Generated `output/03_vehicle_options.sql` contains no duplicate tuples.
- Output respects `[MIN_YEAR, MAX_YEAR]`.
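
One possible shape for the year-window config is shown below. Reading `MIN_YEAR`/`MAX_YEAR` from environment variables is an assumption; the plan only requires that the window be configurable, and a CLI flag or config file would serve equally well.

```python
import os


def load_year_window(env=None):
    """Read MIN_YEAR / MAX_YEAR, defaulting to 2000 / 2026."""
    env = os.environ if env is None else env
    min_year = int(env.get("MIN_YEAR", "2000"))
    max_year = int(env.get("MAX_YEAR", "2026"))
    if min_year > max_year:
        raise ValueError(f"MIN_YEAR {min_year} is greater than MAX_YEAR {max_year}")
    return min_year, max_year


def in_window(year, window):
    """True when year falls inside the inclusive (min_year, max_year) window."""
    min_year, max_year = window
    return min_year <= year <= max_year


window = load_year_window({})  # empty mapping: defaults apply
print(window, in_window(1992, window))  # (2000, 2026) False
```

Applying `in_window` at the point where baseline rows are read keeps out-of-range data from ever reaching the generated SQL, so no API-side filter is needed.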
### Agent B — Overlay Evidence Builder (Year-Accurate Trims)
Owner: parse `automobiles.json` and build trim/year evidence.
Deliverables:
- Implement parsing in `etl_generate_sql.py` (or a helper module if splitting is allowed) to:
  - Extract year or year range from `automobiles.json.name` (handle `YYYY`, `YYYY-YYYY`, `YYYY-Present`).
  - Map `brand_id → canonical make`.
  - Normalize automobile “model+variant” string.
  - Match against known models for that make (derived from `makes-filter`) to split `model` vs `trim`.
  - Produce an evidence structure: for `(make, model)`, a list of `(trim, year_start, year_end)`.
Acceptance:
- Evidence filtering prevents trims that have no evidenced year overlap from appearing in those years when generating `vehicle_options`.
Notes:
- Matching model vs trim is heuristic; implement conservative logic:
  - Prefer the longest model name match.
  - If the match is ambiguous, do not guess the trim; default to `Base` and log a counter for review.
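
The year extraction and the longest-match split might be sketched as below. The exact layout of `automobiles.json` names is an assumption, as are the helper names; the split assumes the make and year text have already been stripped from the string.

```python
import re

# Matches YYYY, YYYY-YYYY, or YYYY-Present (hyphen or en dash).
YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*[-–]\s*((?:19|20)\d{2}|present))?\b",
    re.IGNORECASE,
)


def parse_years(name, present_year):
    """Return (year_start, year_end), or None if no usable year is found."""
    match = YEAR_RANGE.search(name)
    if not match:
        return None
    start = int(match.group(1))
    end_raw = match.group(2)
    if end_raw is None:
        end = start
    elif end_raw.lower() == "present":
        end = present_year  # e.g. MAX_YEAR
    else:
        end = int(end_raw)
    return (start, end) if start <= end else None


def split_model_trim(text, known_models):
    """Split 'model + variant' using the longest known model prefix.

    Returns (model, trim), or None when no known model matches, so the
    caller can default to 'Base' and count the miss for review.
    """
    lowered = text.lower()
    candidates = [
        m for m in known_models
        if lowered == m.lower() or lowered.startswith(m.lower() + " ")
    ]
    if not candidates:
        return None
    model = max(candidates, key=len)
    trim = text[len(model):].strip() or "Base"
    return model, trim


print(parse_years("Corvette 2021-Present", 2026))  # (2021, 2026)
print(split_model_trim("Silverado 1500 LT", ["Silverado", "Silverado 1500"]))
```

Requiring a word boundary after the model prefix is what stops `Silverado` from claiming `Silverado 1500 LT` with `1500 LT` as the trim.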
### Agent C — DB Migration + Constraints
Owner: schema correctness and preventing duplicates.
Deliverables:
- Update `data/make-model-import/migrations/001_create_vehicle_database.sql`:
  - Align columns to the ETL output (keep only what's used).
  - Add uniqueness constraints (engines/transmissions dims + vehicle_options fact).
- Ensure functions `get_makes_for_year`, `get_models_for_year_make`, `get_trims_for_year_make_model` remain compatible.
Acceptance:
- Rerunning import does not create duplicates even if the ETL output accidentally contains repeats (constraints will reject).
### Agent D — Import Script Rerun Safety
Owner: repeatable import process.
Deliverables:
- Update `data/make-model-import/import_data.sh`:
  - Clear tables deterministically (truncate + restart identity) before import.
  - Import order: schema → engines → transmissions → vehicle_options.
  - Print verification counts and min/max year.
Acceptance:
- Running `./import_data.sh` twice produces identical row counts and no errors.
### Agent E — QA Harness
Owner: automated validation.
Deliverables:
- Add `data/make-model-import/qa_validate.py` with:
  - Connect-free checks using generated SQL files (fast pre-import) AND/OR
  - Post-import checks executed via `docker exec mvp-postgres psql ...` (slower, authoritative).
- Add a short `data/make-model-import/QA_README.md` or extend existing docs with exact commands.
Acceptance:
- QA script fails on duplicates, out-of-range years, and basic dropdown integrity mismatches.
### Agent F (Optional) — Backend/Docs Consistency
Owner: documentation accuracy.
Deliverables:
- Update docs that reference the old normalized `vehicles.*` schema if they conflict with the current `vehicle_options`-based system.
  - Primary references: `docs/VEHICLES-API.md`, `backend/src/features/platform/README.md` (verify claims).
Acceptance:
- Docs correctly describe the actual dropdown data source and rerun steps.
## Rollout Plan
1) Implement ETL refactor + evidence overlay + constraints + rerunnable import.
2) Regenerate SQL (`python3 etl_generate_sql.py` in `data/make-model-import/`).
3) Re-import (`./import_data.sh`).
4) Flush Redis dropdown caches (if needed) and re-test dropdowns.
5) Run QA harness and capture summary output in a `stats.txt` (or similar).
## Acceptance Criteria (End-to-End)
- Years available in dropdown are exactly those loaded (default 2000-2026).
- Makes for a year only include makes with models in that year.
- Models for year+make only include models available for that tuple.
- Trims for year+make+model do not include impossible trims (e.g., no `1992 Corvette Z06` unless local evidence supports it).
- Engines show detailed specs when available; otherwise show one of `Gas/Diesel/Electric/Hybrid`.
- Transmissions show derived options when available; otherwise show both `Manual` and `Automatic`.
- No duplicate dimension rows; no duplicate fact rows.