diff --git a/ETL-FIXES.md b/ETL-FIXES.md
new file mode 100644
index 0000000..39e5ea1
--- /dev/null
+++ b/ETL-FIXES.md
@@ -0,0 +1,275 @@
+# ETL Fixes Plan (Multi‑Agent Dispatch) — Vehicle Dropdown Data
+
+## Purpose
+Fix the ETL that populates vehicle dropdown data so it is:
+- Clean (no duplicate dimension rows, no duplicate fact rows).
+- Year-accurate for trims (no “impossible” year/make/model/trim combinations).
+- Rerunnable across environments.
+- Limited to a configurable year window (default **2000–2026**) with **no API-level filtering changes**.
+
+This plan is written to be dispatched to multiple AI agents working in parallel.
+
+## Scope
+Backend vehicle dropdowns (Year → Make → Model → Trim → Engine → Transmission).
+
+In-scope:
+- ETL logic and output SQL generation (`data/make-model-import/etl_generate_sql.py`).
+- Import script behavior (`data/make-model-import/import_data.sh`).
+- ETL schema migration used by the import (`data/make-model-import/migrations/001_create_vehicle_database.sql`).
+- Data quality validation harness (new script(s)).
+- Documentation updates for rerun workflow.
+
+Out-of-scope:
+- Any API filtering logic changes. The API must continue to reflect whatever data exists in the DB.
+- Network calls or new scraping. **Use local scraped data only.**
+
+## Current Data Contract (as used by backend)
+Backend dropdowns currently query:
+- `public.vehicle_options`
+- `public.engines` (joined by `engine_id`)
+- `public.transmissions` (joined by `transmission_id`)
+
+Primary call sites:
+- `backend/src/features/platform/data/vehicle-data.repository.ts`
+- `backend/src/features/platform/domain/vehicle-data.service.ts`
+- Dropdown routes: `backend/src/features/vehicles/api/vehicles.routes.ts`
+
+## Requirements (Confirmed)
+### Year range behavior
+- Data outside the configured year window must **not be loaded**.
+- Default year window: **2000–2026** (configurable).
+- No API changes to filter years.
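A minimal sketch of how the year window could be applied at load time, so out-of-range rows never reach the generated SQL. The environment variable names and the helper names `load_year_window` and `filter_to_window` are assumptions for illustration; the plan only fixes the defaults (2000 and 2026).

```python
import os
from typing import List, Mapping, Sequence, Tuple

DEFAULT_MIN_YEAR = 2000
DEFAULT_MAX_YEAR = 2026


def load_year_window(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read the year window from a mapping (hypothetical MIN_YEAR/MAX_YEAR keys)."""
    min_year = int(env.get("MIN_YEAR", DEFAULT_MIN_YEAR))
    max_year = int(env.get("MAX_YEAR", DEFAULT_MAX_YEAR))
    if min_year > max_year:
        raise ValueError(f"MIN_YEAR {min_year} is greater than MAX_YEAR {max_year}")
    return min_year, max_year


def filter_to_window(rows: Sequence[tuple], window: Tuple[int, int]) -> List[tuple]:
    """Keep only rows whose first field (the year) falls inside the window."""
    min_year, max_year = window
    return [row for row in rows if min_year <= row[0] <= max_year]


if __name__ == "__main__":
    # Illustrative rows only; real input comes from the local scraped files.
    sample = [(1992, "Chevrolet", "Corvette"), (2005, "Chevrolet", "Corvette")]
    print(filter_to_window(sample, load_year_window({})))
```

Filtering in the ETL rather than in the API keeps the "no API changes" requirement intact: the backend simply reflects whatever was loaded.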
+
+### Missing data defaults
+- If **no trim exists**, map to trim `"Base"`.
+- If **no detailed engine spec exists**, default to one of: `Gas` / `Diesel` / `Electric` / `Hybrid`.
+  - If local scraped data indicates EV → show `Electric`.
+  - If it indicates Diesel → show `Diesel`.
+  - If it indicates Hybrid (including mild / plug-in) → show `Hybrid`.
+  - Else default → `Gas`.
+- If **no specific transmission data** exists for a `(year, make, model)`, show **both** `Manual` and `Automatic`.
+- If a detailed engine spec is known, **always use the detailed engine spec** (do not replace it with the fuel-type default label).
+
+### Transmission granularity
+- Transmission dropdown should be correct at the `(year, make, model)` level (trim-specific not required).
+
+## Observed Defects (Root Causes)
+### 1) Massive duplicate dimension rows
+Examples:
+- `data/make-model-import/output/02_transmissions.sql` contains repeated values like:
+  - `(1,'1-Speed Automatic')`, `(2,'1-Speed Automatic')`, …
+Reason:
+- ETL dedupes transmissions on a raw tuple `(gearbox_string, speed, drive_type)` but *stores* only a simplified display string, so many distinct raw tuples collapse to the same output `type`.
+
+Similarly for engines:
+- `data/make-model-import/output/01_engines.sql` has many repeated engine display names.
+Reason:
+- ETL assigns IDs per raw scraped engine record (30,066), even though the UI-facing engine name collapses to far fewer distinct names.
+
+### 2) Inaccurate year/make/model/trim mappings (dropdown integrity break)
+Example:
+- User can select `1992 Chevrolet Corvette Z06`, which never existed.
+Root cause:
+- `data/make-model-import/makes-filter/*.json` includes trims/submodels that appear to be “all-time” variants, not year-accurate.
+  - Example evidence: `data/make-model-import/makes-filter/chevrolet.json` contains `Z06` for 1992 Corvette.
+  - Resulting DB evidence: `data/make-model-import/output/03_vehicle_options.sql` includes `(1992,'Chevrolet','Corvette','Z06',...)`.
+
+### 3) Duplicate rows in `vehicle_options`
+Example evidence:
+- `data/make-model-import/output/03_vehicle_options.sql` shows repeated identical rows for the same year/make/model/trim/engine/transmission.
+Root causes:
+- No dedupe at the fact level prior to SQL generation.
+- Dimension ID strategy makes it difficult to dedupe correctly.
+
+## Data Sources (Local Only)
+Inputs in `data/make-model-import/`:
+- `makes-filter/*.json`: provides coverage for makes/models by year, but trims/engines are not reliable for year accuracy.
+- `automobiles.json`: contains “model pages” with names that include year or year ranges (e.g., `2013-2019`, `2021-Present`).
+- `engines.json`: engine records keyed to `automobile_id` with specs and “Transmission Specs”.
+- `brands.json`: make name metadata (ALL CAPS) + `id` used by `automobiles.json.brand_id`.
+
+## Target ETL Strategy (Baseline Grid + Evidence Overlay)
+We cannot use network sources, so the best available path is:
+1) **Baseline coverage** from `makes-filter` for `(year, make, model)` within `[MIN_YEAR, MAX_YEAR]`.
+2) **Year-accuracy overlay** from `automobiles.json` + `engines.json`:
+   - Parse each automobile into:
+     - Canonical make (via `brand_id → brands.json` mapping).
+     - Model name and an inferred trim/variant string.
+     - Year range (start/end) from the automobile name.
+   - Use these to build **evidence sets**:
+     - Which trims are evidenced for a `(make, model)` and which year ranges they apply to.
+     - Which engines/transmissions are evidenced (via `engines.json`) for that automobile entry.
+3) Generate `vehicle_options`:
+   - For each baseline `(year, make, model)`:
+     - If overlay evidence exists for that `(make, model, year)`:
+       - Use evidenced trims for that year (trim defaults to `Base` if missing).
+       - Engines: use detailed engine display names when available; else fuel-type fallback label.
+       - Transmissions: derive from engine specs when available; else fallback to `Manual`+`Automatic`.
+     - If no overlay evidence exists:
+       - Create a single row with trim `Base`.
+       - Engine default label `Gas` (or other fuel label if you can infer it locally without guessing; otherwise `Gas`).
+       - Transmission fallback `Manual`+`Automatic`.
+
+This approach ensures:
+- Completeness: you still have a working dropdown for all year/make/model combos in-range.
+- Accuracy improvements where the scraped evidence supports it (especially trims by year).
+- No invented trims like `Z06` in years where there is no overlay evidence for `Z06` in that year range.
+
+## Engine & Transmission Normalization Rules
+### Engine display name
+Use existing ETL display logic as a base (from `etl_generate_sql.py`) but change the ID strategy:
+- If you can create a detailed engine display string (e.g., `V8 5.7L`, `L4 2.0L Turbo`), use it.
+- Only use default labels when detailed specs are not available:
+  - `Electric` if fuel indicates electric.
+  - `Diesel` if fuel indicates diesel.
+  - `Hybrid` if fuel indicates any hybrid variant.
+  - Else `Gas`.
+
+Fuel mapping should be derived from `engines.json → specs → Engine Specs → Fuel:` which currently includes values like:
+- `Electric`
+- `Diesel`
+- `Hybrid`, `Hybrid Gasoline`, `Mild Hybrid`, `Mild Hybrid Diesel`, `Plug-In Hybrid`, etc.
+
+### Transmission display name
+Normalize to a small set of UI-friendly strings:
+- Prefer `"{N}-Speed Manual"` or `"{N}-Speed Automatic"` when speed and type are known.
+- Preserve `CVT`.
+- If unknown for a `(year, make, model)`, provide both `Manual` and `Automatic`.
+
+Important: transmission table IDs must be keyed by the **final display name**, not the raw tuple.
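The normalization rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the existing logic in `etl_generate_sql.py`: all function names are hypothetical, the substring matching is a heuristic, and checking `hybrid` before `diesel` is an assumption made so that a value like `Mild Hybrid Diesel` lands on `Hybrid`.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def fuel_fallback_label(fuel: Optional[str]) -> str:
    """Map a raw fuel string to Gas/Diesel/Electric/Hybrid (fallback label only)."""
    text = (fuel or "").lower()
    if "hybrid" in text:  # assumed to take priority over diesel/electric
        return "Hybrid"
    if "electric" in text:
        return "Electric"
    if "diesel" in text:
        return "Diesel"
    return "Gas"


def normalize_transmission(kind: Optional[str], speeds: Optional[int]) -> Optional[str]:
    """Return a UI display name, or None when the type cannot be determined."""
    text = (kind or "").lower()
    if "cvt" in text:
        return "CVT"
    if "manual" in text:
        base = "Manual"
    elif "auto" in text:
        base = "Automatic"
    else:
        return None
    return f"{speeds}-Speed {base}" if speeds else base


def transmissions_for_model(raw: Iterable[Tuple[Optional[str], Optional[int]]]) -> List[str]:
    """Distinct display names for one (year, make, model); both generic options if none."""
    names = {normalize_transmission(kind, speeds) for kind, speeds in raw}
    names.discard(None)
    return sorted(names) or ["Manual", "Automatic"]


def assign_ids(display_names: Iterable[str]) -> Dict[str, int]:
    """Assign one ID per case-insensitive display name, in first-seen order."""
    ids: Dict[str, int] = {}
    for name in display_names:
        ids.setdefault(name.strip().lower(), len(ids) + 1)
    return ids
```

Keying `assign_ids` on the normalized display name rather than on the raw `(gearbox_string, speed, drive_type)` tuple is what prevents rows like `(1,'1-Speed Automatic')` and `(2,'1-Speed Automatic')` from both being emitted.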
+
+## Schema + Import Requirements (Rerunnable + Clean)
+### Migration changes
+Update `data/make-model-import/migrations/001_create_vehicle_database.sql` to:
+- Match actual stored columns (current migration defines extra columns not populated by ETL).
+- Enforce uniqueness to prevent duplicates:
+  - `engines`: unique on normalized name (e.g., a unique index on `LOWER(name)`; PostgreSQL does not accept an expression inside a table-level `UNIQUE` constraint).
+  - `transmissions`: unique on normalized type (e.g., a unique index on `LOWER(type)`).
+  - `vehicle_options`: unique on `(year, make, model, trim, engine_id, transmission_id)`.
+
+### Import script changes
+Update `data/make-model-import/import_data.sh` so reruns are consistent:
+- Either:
+  - `TRUNCATE vehicle_options, engines, transmissions RESTART IDENTITY CASCADE;` before import, then insert, OR
+  - Use `INSERT ... ON CONFLICT DO NOTHING` with deterministic IDs (more complex).
+
+Given constraints and large volume, truncation + re-import is simplest and most deterministic for dev environments.
+
+## Validation / QA Harness (New)
+Add a new script (recommended location: `data/make-model-import/qa_validate.py`) plus a small SQL file or inline queries.
+
+Must-check assertions:
+1) **Year window enforced**
+   - `MIN(year) >= MIN_YEAR` and `MAX(year) <= MAX_YEAR`.
+2) **No dimension duplicates**
+   - `SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1;` returns 0 rows.
+   - `SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1;` returns 0 rows.
+3) **No fact duplicates**
+   - `SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1;` returns 0 rows.
+4) **Dropdown integrity sanity**
+   - For sampled `(year, make, model)`, trims returned by `get_trims_for_year_make_model()` must match distinct trims in `vehicle_options` for that tuple.
+   - For sampled `(year, make, model, trim)`, engines query matches `vehicle_options` join to `engines`.
+   - For sampled `(year, make, model)`, transmissions query matches `vehicle_options` join to `transmissions` (plus fallbacks when missing).
+
+Optional (recommended) golden assertions:
+- Add a small list of “known invalid historically” checks (like `1992 Corvette Z06`) that must return empty / not present.
+  - These should be driven by overlay evidence (do not hardcode large historical facts without evidence in local data).
+
+## Work Breakdown (Assign to Agents)
+### Agent A — ETL Core Refactor
+Owner: ETL generation logic.
+
+Deliverables:
+- Update `data/make-model-import/etl_generate_sql.py`:
+  - Add config: `MIN_YEAR`/`MAX_YEAR` (defaults `2000`/`2026`).
+  - Replace current engine/transmission ID assignment with dedup-by-display-name mapping.
+  - Remove coupling where an `engine_id` implies an index into `engines.json` for transmission lookup.
+  - Implement fuel-type fallback label logic (`Gas/Diesel/Electric/Hybrid`) only when detailed engine spec cannot be built.
+  - Dedupe `vehicle_options` rows before writing SQL.
+
+Acceptance:
+- Generated `output/01_engines.sql` and `output/02_transmissions.sql` contain only unique values.
+- Generated `output/03_vehicle_options.sql` contains no duplicate tuples.
+- Output respects `[MIN_YEAR, MAX_YEAR]`.
+
+### Agent B — Overlay Evidence Builder (Year-Accurate Trims)
+Owner: parse `automobiles.json` and build trim/year evidence.
+
+Deliverables:
+- Implement parsing in `etl_generate_sql.py` (or a helper module if splitting is allowed) to:
+  - Extract year or year range from `automobiles.json.name` (handle `YYYY`, `YYYY-YYYY`, `YYYY-Present`).
+  - Map `brand_id → canonical make`.
+  - Normalize automobile “model+variant” string.
+  - Match against known models for that make (derived from `makes-filter`) to split `model` vs `trim`.
+  - Produce an evidence structure: for `(make, model)`, a list of `(trim, year_start, year_end)`.
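The year-range extraction and evidence lookup in Agent B's deliverables might look like the sketch below. Several things here are assumptions: the helper names, the choice to take the last year token in the name (so a model number that looks like a year is less likely to be picked up), and the choice to treat `Present` as `MAX_YEAR`. The sample names in the usage lines are made up for illustration and are not taken from `automobiles.json`.

```python
import re
from typing import List, Optional, Tuple

# Matches YYYY, YYYY-YYYY, or YYYY-Present for years starting with 19 or 20.
_YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*-\s*((?:19|20)\d{2}|present))?\b", re.IGNORECASE
)

Evidence = Tuple[str, int, int]  # (trim, year_start, year_end)


def parse_year_range(name: str, max_year: int = 2026) -> Optional[Tuple[int, int]]:
    """Extract (start, end) from an automobile name; None if no usable year is found."""
    matches = list(_YEAR_RANGE.finditer(name))
    if not matches:
        return None
    match = matches[-1]  # assumption: the year range is the last year-like token
    start = int(match.group(1))
    end_text = match.group(2)
    if end_text is None:
        end = start
    elif end_text.lower() == "present":
        end = max_year  # assumption: open-ended ranges are capped at MAX_YEAR
    else:
        end = int(end_text)
    return (start, end) if start <= end else None


def trims_for_year(evidence: List[Evidence], year: int) -> List[str]:
    """Trims evidenced for a year; an empty trim maps to Base, as does no evidence."""
    trims = {trim or "Base" for trim, start, end in evidence if start <= year <= end}
    return sorted(trims) or ["Base"]


if __name__ == "__main__":
    print(parse_year_range("Example Model GT 2013-2019"))
    print(trims_for_year([("GT", 2013, 2019)], 2010))
```

With this shape, a trim only appears for a given year when some evidence range covers that year, which is the behavior the acceptance check for this agent asks for.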
+
+Acceptance:
+- Evidence filtering prevents trims that have no evidenced year overlap from appearing in those years when generating `vehicle_options`.
+
+Notes:
+- Matching model vs trim is heuristic; implement conservative logic:
+  - Prefer the longest model name match.
+  - If the match is ambiguous, do not guess the trim; default to `Base` and log a counter for review.
+
+### Agent C — DB Migration + Constraints
+Owner: schema correctness and preventing duplicates.
+
+Deliverables:
+- Update `data/make-model-import/migrations/001_create_vehicle_database.sql`:
+  - Align columns to the ETL output (keep only what’s used).
+  - Add uniqueness constraints (engines/transmissions dims + vehicle_options fact).
+  - Ensure functions `get_makes_for_year`, `get_models_for_year_make`, `get_trims_for_year_make_model` remain compatible.
+
+Acceptance:
+- Rerunning import does not create duplicates even if the ETL output accidentally contains repeats (constraints will reject).
+
+### Agent D — Import Script Rerun Safety
+Owner: repeatable import process.
+
+Deliverables:
+- Update `data/make-model-import/import_data.sh`:
+  - Clear tables deterministically (truncate + restart identity) before import.
+  - Import order: schema → engines → transmissions → vehicle_options.
+  - Print verification counts and min/max year.
+
+Acceptance:
+- Running `./import_data.sh` twice produces identical row counts and no errors.
+
+### Agent E — QA Harness
+Owner: automated validation.
+
+Deliverables:
+- Add `data/make-model-import/qa_validate.py` with:
+  - Connect-free checks using generated SQL files (fast pre-import) AND/OR
+  - Post-import checks executed via `docker exec mvp-postgres psql ...` (slower, authoritative).
+- Add a short `data/make-model-import/QA_README.md` or extend existing docs with exact commands.
+
+Acceptance:
+- QA script fails on duplicates, out-of-range years, and basic dropdown integrity mismatches.
+
+### Agent F (Optional) — Backend/Docs Consistency
+Owner: documentation accuracy.
+
+Deliverables:
+- Update docs that reference the old normalized `vehicles.*` schema if they conflict with the current `vehicle_options`-based system.
+  - Primary references: `docs/VEHICLES-API.md`, `backend/src/features/platform/README.md` (verify claims).
+
+Acceptance:
+- Docs correctly describe the actual dropdown data source and rerun steps.
+
+## Rollout Plan
+1) Implement ETL refactor + evidence overlay + constraints + rerunnable import.
+2) Regenerate SQL (`python3 etl_generate_sql.py` in `data/make-model-import/`).
+3) Re-import (`./import_data.sh`).
+4) Flush Redis dropdown caches (if needed) and re-test dropdowns.
+5) Run QA harness and capture summary output in a `stats.txt` (or similar).
+
+## Acceptance Criteria (End-to-End)
+- Years available in dropdown are exactly those loaded (default 2000–2026).
+- Makes for a year only include makes with models in that year.
+- Models for year+make only include models available for that tuple.
+- Trims for year+make+model do not include impossible trims (e.g., no `1992 Corvette Z06` unless local evidence supports it).
+- Engines show detailed specs when available; otherwise show one of `Gas/Diesel/Electric/Hybrid`.
+- Transmissions show derived options when available; otherwise show both `Manual` and `Automatic`.
+- No duplicate dimension rows; no duplicate fact rows.
+
diff --git a/docs/PROMPTS.md b/docs/PROMPTS.md
index 0a8adf1..c9e531c 100644
--- a/docs/PROMPTS.md
+++ b/docs/PROMPTS.md
@@ -48,4 +48,7 @@ Your task is to create a plan to fix a previous ETL process for importing Automo
 *** DATA MISSING HANDLING ***
 - If no Trim exists, map it to "Base"
 - If no specific engine is available default to "Gas" "Diesel" or "Electric"
-- If no specific transmission data is available default to "Manual" or "Automatic"
\ No newline at end of file
+- If no specific transmission data is available default to "Manual" or "Automatic"
+
+*** CRITICAL ***
+- Make no assumptions. Ask for clarification on anything not clear.
\ No newline at end of file