ETL Fix Plan complete. Not implemented.
275
ETL-FIXES.md
Normal file
@@ -0,0 +1,275 @@
# ETL Fixes Plan (Multi-Agent Dispatch): Vehicle Dropdown Data

## Purpose
Fix the ETL that populates vehicle dropdown data so it is:
- Clean (no duplicate dimension rows, no duplicate fact rows).
- Year-accurate for trims (no “impossible” year/make/model/trim combinations).
- Rerunnable across environments.
- Limited to a configurable year window (default **2000–2026**) with **no API-level filtering changes**.

This plan is written to be dispatched to multiple AI agents working in parallel.

## Scope
Backend vehicle dropdowns (Year → Make → Model → Trim → Engine → Transmission).

In-scope:
- ETL logic and output SQL generation (`data/make-model-import/etl_generate_sql.py`).
- Import script behavior (`data/make-model-import/import_data.sh`).
- ETL schema migration used by the import (`data/make-model-import/migrations/001_create_vehicle_database.sql`).
- Data quality validation harness (new script(s)).
- Documentation updates for rerun workflow.

Out-of-scope:
- Any API filtering logic changes. The API must continue to reflect whatever data exists in the DB.
- Network calls or new scraping. **Use local scraped data only.**

## Current Data Contract (as used by backend)
Backend dropdowns currently query:
- `public.vehicle_options`
- `public.engines` (joined by `engine_id`)
- `public.transmissions` (joined by `transmission_id`)

Primary call sites:
- `backend/src/features/platform/data/vehicle-data.repository.ts`
- `backend/src/features/platform/domain/vehicle-data.service.ts`
- Dropdown routes: `backend/src/features/vehicles/api/vehicles.routes.ts`

## Requirements (Confirmed)
### Year range behavior
- Data outside the configured year window must **not be loaded**.
- Default year window: **2000–2026** (configurable).
- No API changes to filter years.

### Missing data defaults
- If **no trim exists**, map to trim `"Base"`.
- If **no detailed engine spec exists**, default to one of: `Gas` / `Diesel` / `Electric` / `Hybrid`.
  - If local scraped data indicates EV → show `Electric`.
  - If it indicates Diesel → show `Diesel`.
  - If it indicates Hybrid (including mild / plug-in) → show `Hybrid`.
  - Otherwise → show `Gas`.
- If **no specific transmission data** exists for a `(year, make, model)`, show **both** `Manual` and `Automatic`.
- If a detailed engine spec is known, **always use the detailed engine spec** (do not replace it with the fuel-type default label).

### Transmission granularity
- Transmission dropdown should be correct at the `(year, make, model)` level (trim-specific not required).

## Observed Defects (Root Causes)
### 1) Massive duplicate dimension rows
Examples:
- `data/make-model-import/output/02_transmissions.sql` contains repeated values like:
  - `(1,'1-Speed Automatic')`, `(2,'1-Speed Automatic')`, …

Reason:
- ETL dedupes transmissions on a raw tuple `(gearbox_string, speed, drive_type)` but *stores* only a simplified display string, so many distinct raw tuples collapse to the same output `type`.

Similarly for engines:
- `data/make-model-import/output/01_engines.sql` has many repeated engine display names.

Reason:
- ETL assigns one ID per raw scraped engine record (30,066 records), even though the UI-facing engine name collapses to far fewer distinct names.

### 2) Inaccurate year/make/model/trim mappings (dropdown integrity break)
Example:
- A user can select `1992 Chevrolet Corvette Z06`, which never existed.

Root cause:
- `data/make-model-import/makes-filter/*.json` includes trims/submodels that appear to be “all-time” variants, not year-accurate.
  - Example evidence: `data/make-model-import/makes-filter/chevrolet.json` contains `Z06` for 1992 Corvette.
  - Resulting DB evidence: `data/make-model-import/output/03_vehicle_options.sql` includes `(1992,'Chevrolet','Corvette','Z06',...)`.

### 3) Duplicate rows in `vehicle_options`
Example evidence:
- `data/make-model-import/output/03_vehicle_options.sql` shows repeated identical rows for the same year/make/model/trim/engine/transmission.

Root causes:
- No dedupe at the fact level prior to SQL generation.
- The per-raw-record dimension ID strategy makes it difficult to dedupe correctly, because identical display values carry different IDs.

## Data Sources (Local Only)
Inputs in `data/make-model-import/`:
- `makes-filter/*.json`: provides coverage for makes/models by year, but trims/engines are not reliable for year accuracy.
- `automobiles.json`: contains “model pages” with names that include year or year ranges (e.g., `2013-2019`, `2021-Present`).
- `engines.json`: engine records keyed to `automobile_id` with specs and “Transmission Specs”.
- `brands.json`: make name metadata (ALL CAPS) + `id` used by `automobiles.json.brand_id`.

## Target ETL Strategy (Baseline Grid + Evidence Overlay)
We cannot use network sources, so the best available path is:
1) **Baseline coverage** from `makes-filter` for `(year, make, model)` within `[MIN_YEAR, MAX_YEAR]`.
2) **Year-accuracy overlay** from `automobiles.json` + `engines.json`:
   - Parse each automobile into:
     - Canonical make (via `brand_id → brands.json` mapping).
     - Model name and an inferred trim/variant string.
     - Year range (start/end) from the automobile name.
   - Use these to build **evidence sets**:
     - Which trims are evidenced for a `(make, model)` and which year ranges they apply to.
     - Which engines/transmissions are evidenced (via `engines.json`) for that automobile entry.
3) Generate `vehicle_options`:
   - For each baseline `(year, make, model)`:
     - If overlay evidence exists for that `(make, model, year)`:
       - Use evidenced trims for that year (trim defaults to `Base` if missing).
       - Engines: use detailed engine display names when available; else fuel-type fallback label.
       - Transmissions: derive from engine specs when available; else fallback to `Manual`+`Automatic`.
     - If no overlay evidence exists:
       - Create a single row with trim `Base`.
       - Engine: default label `Gas`, unless another fuel label can be inferred from local data without guessing.
       - Transmission fallback `Manual`+`Automatic`.

This approach ensures:
- Completeness: you still have a working dropdown for all year/make/model combos in-range.
- Accuracy improvements where the scraped evidence supports it (especially trims by year).
- No invented trims like `Z06` in years where there is no overlay evidence for `Z06` in that year range.
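
The generation step above can be sketched roughly as follows. This is a minimal illustration, not the ETL's actual code: the `build_rows` name, the shape of `baseline` and `evidence`, and the sample data are all assumptions. It also assumes the `Manual`+`Automatic` fallback is stored as one row per transmission, and it simplifies the engine fallback to `Gas`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical shapes, chosen only for this sketch:
#   baseline: set of (year, make, model) taken from makes-filter
#   evidence: (make, model) -> list of (trim, year_start, year_end, engines, transmissions)
Evidence = Dict[Tuple[str, str], List[Tuple[str, int, int, List[str], List[str]]]]
Row = Tuple[int, str, str, str, str, str]

FALLBACK_TRANSMISSIONS = ["Manual", "Automatic"]


def build_rows(baseline: Set[Tuple[int, str, str]], evidence: Evidence,
               min_year: int = 2000, max_year: int = 2026) -> List[Row]:
    rows: Set[Row] = set()  # a set dedupes fact rows as they are produced
    for year, make, model in baseline:
        if not (min_year <= year <= max_year):
            continue  # outside the configured window: never loaded
        # Keep only evidence entries whose year range covers this year.
        hits = [e for e in evidence.get((make, model), []) if e[1] <= year <= e[2]]
        if not hits:
            # No overlay evidence: trim Base, engine Gas, both transmissions.
            for trans in FALLBACK_TRANSMISSIONS:
                rows.add((year, make, model, "Base", "Gas", trans))
            continue
        for trim, _start, _end, engines, transmissions in hits:
            for engine in engines or ["Gas"]:
                for trans in transmissions or FALLBACK_TRANSMISSIONS:
                    rows.add((year, make, model, trim or "Base", engine, trans))
    return sorted(rows)
```

With evidence that places `Z06` only in a later year range, a 1992 Corvette baseline entry falls through to the `Base` fallback instead of inheriting the trim.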

## Engine & Transmission Normalization Rules
### Engine display name
Use existing ETL display logic as a base (from `etl_generate_sql.py`) but change the ID strategy:
- If you can create a detailed engine display string (e.g., `V8 5.7L`, `L4 2.0L Turbo`), use it.
- Only use default labels when detailed specs are not available:
  - `Electric` if fuel indicates electric.
  - `Diesel` if fuel indicates diesel.
  - `Hybrid` if fuel indicates any hybrid variant.
  - Else `Gas`.

Fuel mapping should be derived from `engines.json → specs → Engine Specs → Fuel:` which currently includes values like:
- `Electric`
- `Diesel`
- `Hybrid`, `Hybrid Gasoline`, `Mild Hybrid`, `Mild Hybrid Diesel`, `Plug-In Hybrid`, etc.
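
One way the fallback labelling might look is sketched below. The function names are hypothetical, and the check order is an assumption: `Hybrid` is tested first so that a value such as `Mild Hybrid Diesel` lands on `Hybrid` rather than `Diesel`, which matches the "any hybrid variant" rule but should be confirmed.

```python
from typing import Optional


def fuel_fallback_label(fuel: Optional[str]) -> str:
    """Map a raw Fuel value to one of Gas / Diesel / Electric / Hybrid."""
    text = (fuel or "").strip().lower()
    if "hybrid" in text:  # assumption: any hybrid variant wins over diesel
        return "Hybrid"
    if "electric" in text:
        return "Electric"
    if "diesel" in text:
        return "Diesel"
    return "Gas"  # default when nothing else is indicated


def engine_label(detailed_name: Optional[str], fuel: Optional[str]) -> str:
    """Prefer the detailed display name; use the fuel label only as a fallback."""
    if detailed_name and detailed_name.strip():
        return detailed_name.strip()
    return fuel_fallback_label(fuel)
```

For example, `engine_label("V8 5.7L", "Diesel")` keeps the detailed name, while `engine_label(None, "Plug-In Hybrid")` falls back to `Hybrid`.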

### Transmission display name
Normalize to a small set of UI-friendly strings:
- Prefer `"{N}-Speed Manual"` or `"{N}-Speed Automatic"` when speed and type are known.
- Preserve `CVT`.
- If unknown for a `(year, make, model)`, provide both `Manual` and `Automatic`.

Important: transmission table IDs must be keyed by the **final display name**, not the raw tuple.
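
A minimal sketch of keying IDs by the final display name, assuming a hypothetical `transmission_display` helper and `DimensionIds` class. The substring tests on the raw gearbox string are a simplification; the real strings in `engines.json` may need more careful parsing.

```python
from typing import Dict, Optional


def transmission_display(kind: Optional[str], speeds: Optional[int]) -> Optional[str]:
    """Collapse a raw gearbox description into a UI-friendly string."""
    text = (kind or "").lower()
    if "cvt" in text:
        return "CVT"
    if "manual" in text:
        base = "Manual"
    elif "auto" in text:
        base = "Automatic"
    else:
        return None  # unknown: the caller applies the Manual + Automatic fallback
    return f"{speeds}-Speed {base}" if speeds else base


class DimensionIds:
    """Hand out one ID per distinct display name (case-insensitive)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self.names: Dict[int, str] = {}

    def get(self, name: str) -> int:
        key = name.strip().lower()
        if key not in self._ids:
            new_id = len(self._ids) + 1
            self._ids[key] = new_id
            self.names[new_id] = name.strip()
        return self._ids[key]
```

Two raw tuples that differ only in drive type now resolve to the same display string and therefore the same ID, which removes the repeated `'1-Speed Automatic'` rows at the source.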

## Schema + Import Requirements (Rerunnable + Clean)
### Migration changes
Update `data/make-model-import/migrations/001_create_vehicle_database.sql` to:
- Match actual stored columns (current migration defines extra columns not populated by ETL).
- Enforce uniqueness to prevent duplicates:
  - `engines`: unique on normalized name (e.g., a unique index on `LOWER(name)`; PostgreSQL does not accept an expression inside a table-level `UNIQUE (...)` constraint, so this must be a `CREATE UNIQUE INDEX`).
  - `transmissions`: unique on normalized type (e.g., a unique index on `LOWER(type)`).
  - `vehicle_options`: unique on `(year, make, model, trim, engine_id, transmission_id)`.
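
The effect of a case-insensitive unique index can be tried locally without a database server. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for PostgreSQL; the index name `ux_engines_name` and the two-column table are assumptions, and the real migration should be verified against the project's Postgres instance.

```python
import sqlite3


def duplicate_is_rejected() -> bool:
    """Return True if a case-variant duplicate engine name is refused."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE engines (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    # Expression index: uniqueness is enforced on LOWER(name), not on name.
    conn.execute("CREATE UNIQUE INDEX ux_engines_name ON engines (LOWER(name))")
    conn.execute("INSERT INTO engines (name) VALUES ('V8 5.7L')")
    try:
        conn.execute("INSERT INTO engines (name) VALUES ('v8 5.7l')")
    except sqlite3.IntegrityError:
        return True  # second insert differs only by case and is blocked
    finally:
        conn.close()
    return False
```

The `CREATE UNIQUE INDEX ... ON engines (LOWER(name))` statement has the same form in PostgreSQL.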

### Import script changes
Update `data/make-model-import/import_data.sh` so reruns are consistent:
- Either:
  - `TRUNCATE vehicle_options, engines, transmissions RESTART IDENTITY CASCADE;` before import, then insert, OR
  - Use `INSERT ... ON CONFLICT DO NOTHING` with deterministic IDs (more complex).

Given constraints and large volume, truncation + re-import is simplest and most deterministic for dev environments.

## Validation / QA Harness (New)
Add a new script (recommended location: `data/make-model-import/qa_validate.py`) plus a small SQL file or inline queries.

Must-check assertions:
1) **Year window enforced**
   - `MIN(year) >= MIN_YEAR` and `MAX(year) <= MAX_YEAR`.
2) **No dimension duplicates**
   - `SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1;` returns 0 rows.
   - `SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1;` returns 0 rows.
3) **No fact duplicates**
   - `SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1;` returns 0 rows.
4) **Dropdown integrity sanity**
   - For sampled `(year, make, model)`, trims returned by `get_trims_for_year_make_model()` must match distinct trims in `vehicle_options` for that tuple.
   - For sampled `(year, make, model, trim)`, engines query matches `vehicle_options` join to `engines`.
   - For sampled `(year, make, model)`, transmissions query matches `vehicle_options` join to `transmissions` (plus fallbacks when missing).
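
Assertions 1 to 3 can also be checked before import, on in-memory rows. The sketch below is one possible shape for that pre-import pass; the `validate` function, its arguments, and the row layout are assumptions rather than the contents of `qa_validate.py`.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

# year, make, model, trim, engine_id, transmission_id
Row = Tuple[int, str, str, str, int, int]


def duplicates(values: Iterable) -> list:
    """Return the values that occur more than once."""
    return [v for v, n in Counter(values).items() if n > 1]


def validate(rows: Sequence[Row], engine_names: Sequence[str],
             transmission_types: Sequence[str],
             min_year: int = 2000, max_year: int = 2026) -> List[str]:
    errors: List[str] = []
    years = [r[0] for r in rows]
    if years and (min(years) < min_year or max(years) > max_year):
        errors.append(f"year window violated: {min(years)}..{max(years)}")
    # Dimension checks compare lowercased names, mirroring LOWER(name)/LOWER(type).
    for label, names in (("engines", engine_names), ("transmissions", transmission_types)):
        dupes = duplicates(n.lower() for n in names)
        if dupes:
            errors.append(f"duplicate {label}: {dupes}")
    fact_dupes = duplicates(rows)
    if fact_dupes:
        errors.append(f"duplicate vehicle_options rows: {len(fact_dupes)}")
    return errors
```

An empty list means the three checks passed; the dropdown integrity checks in item 4 still need the database functions and are not covered here.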

Optional (recommended) golden assertions:
- Add a small list of “known invalid historically” checks (like `1992 Corvette Z06`) that must return empty / not present.
  - These should be driven by overlay evidence (do not hardcode large historical facts without evidence in local data).

## Work Breakdown (Assign to Agents)
### Agent A: ETL Core Refactor
Owner: ETL generation logic.

Deliverables:
- Update `data/make-model-import/etl_generate_sql.py`:
  - Add config: `MIN_YEAR`/`MAX_YEAR` (defaults `2000`/`2026`).
  - Replace current engine/transmission ID assignment with dedup-by-display-name mapping.
  - Remove coupling where an `engine_id` implies an index into `engines.json` for transmission lookup.
  - Implement fuel-type fallback label logic (`Gas/Diesel/Electric/Hybrid`) only when detailed engine spec cannot be built.
  - Dedupe `vehicle_options` rows before writing SQL.
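
A small sketch of the year-window config and the fact-level dedupe. Reading `MIN_YEAR`/`MAX_YEAR` from environment variables is an assumption (the plan only says they are configurable), and both function names are hypothetical.

```python
import os
from typing import Iterable, List, Mapping, Optional, Tuple


def year_window(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Read MIN_YEAR/MAX_YEAR, defaulting to 2000/2026."""
    env = os.environ if env is None else env
    lo = int(env.get("MIN_YEAR", "2000"))
    hi = int(env.get("MAX_YEAR", "2026"))
    if lo > hi:
        raise ValueError(f"MIN_YEAR {lo} is greater than MAX_YEAR {hi}")
    return lo, hi


def dedupe_rows(rows: Iterable[tuple], lo: int, hi: int) -> List[tuple]:
    """Drop out-of-window rows and exact repeats, keeping first-seen order."""
    seen = set()
    out: List[tuple] = []
    for row in rows:
        if not (lo <= row[0] <= hi):  # row[0] is the year
            continue
        if row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out
```

Keeping first-seen order makes the generated SQL stable between runs as long as the inputs are read in the same order.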

Acceptance:
- Generated `output/01_engines.sql` and `output/02_transmissions.sql` contain only unique values.
- Generated `output/03_vehicle_options.sql` contains no duplicate tuples.
- Output respects `[MIN_YEAR, MAX_YEAR]`.

### Agent B: Overlay Evidence Builder (Year-Accurate Trims)
Owner: parse `automobiles.json` and build trim/year evidence.

Deliverables:
- Implement parsing in `etl_generate_sql.py` (or a helper module if splitting is allowed) to:
  - Extract year or year range from `automobiles.json.name` (handle `YYYY`, `YYYY-YYYY`, `YYYY-Present`).
  - Map `brand_id → canonical make`.
  - Normalize automobile “model+variant” string.
  - Match against known models for that make (derived from `makes-filter`) to split `model` vs `trim`.
  - Produce an evidence structure: for `(make, model)`, a list of `(trim, year_start, year_end)`.
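
The year extraction could be sketched as below. The regular expression, the `parse_year_range` name, and the choice to resolve `Present` to a caller-supplied `open_end` year are assumptions. The sketch takes the last year-like token in the name so that a model number that looks like a year (for instance a model called `2008`) is less likely to be mistaken for the range; this should be checked against real `automobiles.json` names.

```python
import re
from typing import Optional, Tuple

# A 19xx/20xx year, optionally followed by "-YYYY" or "-Present".
_YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*[-–]\s*((?:19|20)\d{2}|present))?\b",
    re.IGNORECASE,
)


def parse_year_range(name: str, open_end: int = 2026) -> Optional[Tuple[int, int]]:
    """Return (start, end) parsed from an automobile name, or None."""
    matches = list(_YEAR_RANGE.finditer(name))
    if not matches:
        return None
    match = matches[-1]  # assumption: the range is the last year-like token
    start = int(match.group(1))
    end_raw = match.group(2)
    if end_raw is None:
        end = start  # single year: YYYY
    elif end_raw.lower() == "present":
        end = open_end  # open range: YYYY-Present
    else:
        end = int(end_raw)  # closed range: YYYY-YYYY
    if end < start:
        return None  # malformed range: better no evidence than wrong evidence
    return start, end
```

Passing `MAX_YEAR` as `open_end` keeps open-ended ranges inside the configured window.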

Acceptance:
- Evidence filtering prevents trims that have no evidenced year overlap from appearing in those years when generating `vehicle_options`.

Notes:
- Matching model vs trim is heuristic; implement conservative logic:
  - Prefer the longest model name match.
  - If the match is ambiguous, do not guess the trim; default to `Base` and increment a counter for later review.
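
A conservative longest-match split might look like the following sketch. The `split_model_trim` name and the `review` counter are hypothetical, and only prefix matches on whole words are accepted; a string that matches no known model is counted and returned as `None` so the caller can fall back to `Base`.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def split_model_trim(variant: str, known_models: Iterable[str],
                     review: Optional[Counter] = None) -> Optional[Tuple[str, str]]:
    """Split a "model+variant" string into (model, trim) by longest model prefix."""
    text = " ".join(variant.split())  # collapse repeated whitespace
    lowered = text.lower()
    best: Optional[str] = None
    for model in known_models:
        candidate = " ".join(model.split())
        low = candidate.lower()
        # Accept an exact match or a prefix that ends on a word boundary.
        if lowered == low or lowered.startswith(low + " "):
            if best is None or len(candidate) > len(best):
                best = candidate
    if best is None:
        if review is not None:
            review["unmatched"] += 1  # logged for manual review, no guessing
        return None
    trim = text[len(best):].strip()
    return best, trim or "Base"
```

With known models `Silverado` and `Silverado 1500`, the string `Silverado 1500 LTZ` resolves to model `Silverado 1500` and trim `LTZ`, not model `Silverado` with trim `1500 LTZ`.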

### Agent C: DB Migration + Constraints
Owner: schema correctness and preventing duplicates.

Deliverables:
- Update `data/make-model-import/migrations/001_create_vehicle_database.sql`:
  - Align columns to the ETL output (keep only what’s used).
  - Add uniqueness constraints (engines/transmissions dims + vehicle_options fact).
  - Ensure functions `get_makes_for_year`, `get_models_for_year_make`, `get_trims_for_year_make_model` remain compatible.

Acceptance:
- Rerunning the import does not create duplicates even if the ETL output accidentally contains repeats (the constraints reject them).

### Agent D: Import Script Rerun Safety
Owner: repeatable import process.

Deliverables:
- Update `data/make-model-import/import_data.sh`:
  - Clear tables deterministically (truncate + restart identity) before import.
  - Import order: schema → engines → transmissions → vehicle_options.
  - Print verification counts and min/max year.

Acceptance:
- Running `./import_data.sh` twice produces identical row counts and no errors.

### Agent E: QA Harness
Owner: automated validation.

Deliverables:
- Add `data/make-model-import/qa_validate.py` with:
  - Connection-free checks using generated SQL files (fast pre-import) AND/OR
  - Post-import checks executed via `docker exec mvp-postgres psql ...` (slower, authoritative).
- Add a short `data/make-model-import/QA_README.md` or extend existing docs with exact commands.

Acceptance:
- QA script fails on duplicates, out-of-range years, and basic dropdown integrity mismatches.

### Agent F (Optional): Backend/Docs Consistency
Owner: documentation accuracy.

Deliverables:
- Update docs that reference the old normalized `vehicles.*` schema if they conflict with the current `vehicle_options`-based system.
  - Primary references: `docs/VEHICLES-API.md`, `backend/src/features/platform/README.md` (verify claims).

Acceptance:
- Docs correctly describe the actual dropdown data source and rerun steps.

## Rollout Plan
1) Implement ETL refactor + evidence overlay + constraints + rerunnable import.
2) Regenerate SQL (`python3 etl_generate_sql.py` in `data/make-model-import/`).
3) Re-import (`./import_data.sh`).
4) Flush Redis dropdown caches (if needed) and re-test dropdowns.
5) Run QA harness and capture summary output in a `stats.txt` (or similar).

## Acceptance Criteria (End-to-End)
- Years available in dropdown are exactly those loaded (default 2000–2026).
- Makes for a year only include makes with models in that year.
- Models for year+make only include models available for that tuple.
- Trims for year+make+model do not include impossible trims (e.g., no `1992 Corvette Z06` unless local evidence supports it).
- Engines show detailed specs when available; otherwise show one of `Gas/Diesel/Electric/Hybrid`.
- Transmissions show derived options when available; otherwise show both `Manual` and `Automatic`.
- No duplicate dimension rows; no duplicate fact rows.

@@ -48,4 +48,7 @@ Your task is to create a plan to fix a previous ETL process for importing Automo
*** DATA MISSING HANDLING ***
- If no Trim exists, map it to "Base"
- If no specific engine is available default to "Gas" "Diesel" or "Electric"
- If no specific transmission data is available default to "Manual" or "Automatic"

*** CRITICAL ***
- Make no assumptions. Ask for clarification on anything not clear.