ETL Fixes Plan (Multi‑Agent Dispatch) — Vehicle Dropdown Data
Purpose
Fix the ETL that populates vehicle dropdown data so it is:
- Clean (no duplicate dimension rows, no duplicate fact rows).
- Year-accurate for trims (no “impossible” year/make/model/trim combinations).
- Rerunnable across environments.
- Limited to a configurable year window (default 2000–2026) with no API-level filtering changes.
This plan is written to be dispatched to multiple AI agents working in parallel.
Scope
Backend vehicle dropdowns (Year → Make → Model → Trim → Engine → Transmission).
In-scope:
- ETL logic and output SQL generation (data/make-model-import/etl_generate_sql.py).
- Import script behavior (data/make-model-import/import_data.sh).
- ETL schema migration used by the import (data/make-model-import/migrations/001_create_vehicle_database.sql).
- Data quality validation harness (new script(s)).
- Documentation updates for rerun workflow.
Out-of-scope:
- Any API filtering logic changes. The API must continue to reflect whatever data exists in the DB.
- Network calls or new scraping. Use local scraped data only.
Current Data Contract (as used by backend)
Backend dropdowns currently query:
- public.vehicle_options
- public.engines (joined by engine_id)
- public.transmissions (joined by transmission_id)
Primary call sites:
- backend/src/features/platform/data/vehicle-data.repository.ts
- backend/src/features/platform/domain/vehicle-data.service.ts
- Dropdown routes: backend/src/features/vehicles/api/vehicles.routes.ts
Requirements (Confirmed)
Year range behavior
- Data outside the configured year window must not be loaded.
- Default year window: 2000–2026 (configurable).
- No API changes to filter years.
Missing data defaults
- If no trim exists, map to trim "Base".
- If no detailed engine spec exists, default to one of: Gas / Diesel / Electric / Hybrid.
  - If local scraped data indicates EV → show Electric.
  - If indicates Diesel → show Diesel.
  - If indicates Hybrid (including mild / plug-in) → show Hybrid.
  - Else default → Gas.
- If no specific transmission data exists for a (year, make, model), show both Manual and Automatic.
- If a detailed engine spec is known, always use the detailed engine spec (do not replace it with the fuel-type default label).
Transmission granularity
- Transmission dropdown should be correct at the (year, make, model) level (trim-specific not required).
Observed Defects (Root Causes)
1) Massive duplicate dimension rows
Examples:
- data/make-model-import/output/02_transmissions.sql contains repeated values like: (1,'1-Speed Automatic'), (2,'1-Speed Automatic'), …
Reason:
- ETL dedupes transmissions on a raw tuple (gearbox_string, speed, drive_type) but stores only a simplified display string, so many distinct raw tuples collapse to the same output type.
Similarly for engines:
- data/make-model-import/output/01_engines.sql has many repeated engine display names.
Reason:
- ETL assigns IDs per raw scraped engine record (30,066), even though the UI-facing engine name collapses to far fewer distinct names.
2) Inaccurate year/make/model/trim mappings (dropdown integrity break)
Example:
- User can select 1992 Chevrolet Corvette Z06, which never existed.
Root cause:
- data/make-model-import/makes-filter/*.json includes trims/submodels that appear to be “all-time” variants, not year-accurate.
  - Example evidence: data/make-model-import/makes-filter/chevrolet.json contains Z06 for 1992 Corvette.
  - Resulting DB evidence: data/make-model-import/output/03_vehicle_options.sql includes (1992,'Chevrolet','Corvette','Z06',...).
3) Duplicate rows in vehicle_options
Example evidence:
- data/make-model-import/output/03_vehicle_options.sql shows repeated identical rows for the same year/make/model/trim/engine/transmission.
Root causes:
- No dedupe at the fact level prior to SQL generation.
- Dimension ID strategy makes it difficult to dedupe correctly.
Data Sources (Local Only)
Inputs in data/make-model-import/:
- makes-filter/*.json: provides coverage for makes/models by year, but trims/engines are not reliable for year accuracy.
- automobiles.json: contains “model pages” with names that include year or year ranges (e.g., 2013-2019, 2021-Present).
- engines.json: engine records keyed to automobile_id with specs and “Transmission Specs”.
- brands.json: make name metadata (ALL CAPS) + id used by automobiles.json.brand_id.
Target ETL Strategy (Baseline Grid + Evidence Overlay)
We cannot use network sources, so the best available path is:
- Baseline coverage from makes-filter for (year, make, model) within [MIN_YEAR, MAX_YEAR].
- Year-accuracy overlay from automobiles.json + engines.json:
  - Parse each automobile into:
    - Canonical make (via brand_id → brands.json mapping).
    - Model name and an inferred trim/variant string.
    - Year range (start/end) from the automobile name.
  - Use these to build evidence sets:
    - Which trims are evidenced for a (make, model) and which year ranges they apply to.
    - Which engines/transmissions are evidenced (via engines.json) for that automobile entry.
- Generate vehicle_options:
  - For each baseline (year, make, model):
    - If overlay evidence exists for that (make, model, year):
      - Use evidenced trims for that year (trim defaults to Base if missing).
      - Engines: use detailed engine display names when available; else fuel-type fallback label.
      - Transmissions: derive from engine specs when available; else fallback to Manual + Automatic.
    - If no overlay evidence exists:
      - Create a single row with trim Base.
      - Engine default label Gas (or other fuel label if you can infer it locally without guessing; otherwise Gas).
      - Transmission fallback Manual + Automatic.
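The trim part of the strategy above can be sketched as follows. This is a minimal illustration, not the actual ETL code: the evidence shape, the function names (trims_for, build_rows), and the sample data are all assumptions, and engines/transmissions are left out to keep it short.

```python
from typing import Dict, Iterable, List, Set, Tuple

MIN_YEAR, MAX_YEAR = 2000, 2026  # default window from this plan

# Hypothetical evidence shape: (make, model) -> [(trim, year_start, year_end), ...]
Evidence = Dict[Tuple[str, str], List[Tuple[str, int, int]]]


def trims_for(year: int, make: str, model: str, evidence: Evidence) -> List[str]:
    """Return trims evidenced for this year, or ["Base"] when none apply."""
    trims: Set[str] = {
        trim or "Base"  # an evidenced entry without a trim maps to Base
        for trim, start, end in evidence.get((make, model), [])
        if start <= year <= end
    }
    return sorted(trims) or ["Base"]


def build_rows(
    baseline: Iterable[Tuple[int, str, str]], evidence: Evidence
) -> List[Tuple[int, str, str, str]]:
    """Expand the baseline grid into (year, make, model, trim) rows."""
    rows = set()  # a set drops repeated tuples
    for year, make, model in baseline:
        if not MIN_YEAR <= year <= MAX_YEAR:
            continue  # out-of-window years are never loaded
        for trim in trims_for(year, make, model, evidence):
            rows.add((year, make, model, trim))
    return sorted(rows)
```

With illustrative evidence {("Chevrolet", "Corvette"): [("Z06", 2006, 2013)]}, a 2001 Corvette gets only Base, a 2006 Corvette gets Z06, and a 1992 Corvette is dropped by the year window.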
This approach ensures:
- Completeness: you still have a working dropdown for all year/make/model combos in-range.
- Accuracy improvements where the scraped evidence supports it (especially trims by year).
- No invented trims like Z06 in years where there is no overlay evidence for Z06 in that year range.
Engine & Transmission Normalization Rules
Engine display name
Use existing ETL display logic as a base (from etl_generate_sql.py) but change the ID strategy:
- If you can create a detailed engine display string (e.g., V8 5.7L, L4 2.0L Turbo), use it.
- Only use default labels when detailed specs are not available:
  - Electric if fuel indicates electric.
  - Diesel if fuel indicates diesel.
  - Hybrid if fuel indicates any hybrid variant.
  - Else Gas.
Fuel mapping should be derived from engines.json → specs → Engine Specs → Fuel: which currently includes values like:
- Electric
- Diesel
- Hybrid, Hybrid Gasoline, Mild Hybrid, Mild Hybrid Diesel, Plug-In Hybrid, etc.
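A minimal sketch of the fallback label mapping, assuming the raw fuel value arrives as a plain string (or is missing). The function name fuel_fallback_label is hypothetical. It also assumes that hybrid takes precedence over diesel, so that a value like Mild Hybrid Diesel maps to Hybrid; the plan does not state this order explicitly.

```python
from typing import Optional


def fuel_fallback_label(fuel: Optional[str]) -> str:
    """Map a raw fuel string to Gas / Diesel / Electric / Hybrid.

    Only meant for engines where no detailed display string can be built.
    """
    text = (fuel or "").strip().lower()
    if "hybrid" in text:  # assumption: checked first, so "Mild Hybrid Diesel" -> Hybrid
        return "Hybrid"
    if "electric" in text:
        return "Electric"
    if "diesel" in text:
        return "Diesel"
    return "Gas"  # default when nothing else is indicated
```

Substring matching keeps the function tolerant of new variants in the scraped data, at the cost of occasionally over-matching; an explicit lookup table would be stricter.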
Transmission display name
Normalize to a small set of UI-friendly strings:
- Prefer "{N}-Speed Manual" or "{N}-Speed Automatic" when speed and type are known.
- Preserve CVT.
- If unknown for a (year, make, model), provide both Manual and Automatic.
Important: transmission table IDs must be keyed by the final display name, not the raw tuple.
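One way to key IDs by the final display name is sketched below. The inputs to transmission_display (a free-text gearbox description and a speed count) are assumptions about the raw data, and DimensionRegistry is a hypothetical helper, not existing ETL code.

```python
from typing import Dict, Optional


def transmission_display(kind: Optional[str], speeds: Optional[int]) -> Optional[str]:
    """Normalize a raw gearbox description to a UI-friendly string."""
    text = (kind or "").lower()
    if "cvt" in text:
        return "CVT"
    if "manual" in text:
        base = "Manual"
    elif "auto" in text:
        base = "Automatic"
    else:
        return None  # unknown: caller applies the Manual + Automatic fallback
    return f"{speeds}-Speed {base}" if speeds else base


class DimensionRegistry:
    """Assigns one ID per distinct display name (case-insensitive)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self.names: Dict[int, str] = {}

    def id_for(self, display_name: str) -> int:
        key = display_name.strip().lower()
        if key not in self._ids:
            new_id = len(self._ids) + 1  # IDs follow first-seen order
            self._ids[key] = new_id
            self.names[new_id] = display_name.strip()
        return self._ids[key]
```

Two raw records that both normalize to 1-Speed Automatic receive the same ID, so the generated transmissions file would contain that value once. The same registry could serve engines, keyed by engine display name.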
Schema + Import Requirements (Rerunnable + Clean)
Migration changes
Update data/make-model-import/migrations/001_create_vehicle_database.sql to:
- Match actual stored columns (current migration defines extra columns not populated by ETL).
- Enforce uniqueness to prevent duplicates:
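  - engines: unique on normalized name (e.g., a unique index on LOWER(name); PostgreSQL accepts expressions in CREATE UNIQUE INDEX but not in a table-level UNIQUE constraint).
  - transmissions: unique on normalized type (e.g., a unique index on LOWER(type)).
  - vehicle_options: unique on (year, make, model, trim, engine_id, transmission_id).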
Import script changes
Update data/make-model-import/import_data.sh so reruns are consistent:
- Either:
  - TRUNCATE vehicle_options, engines, transmissions RESTART IDENTITY CASCADE; before import, then insert, OR
  - Use INSERT ... ON CONFLICT DO NOTHING with deterministic IDs (more complex).
Given constraints and large volume, truncation + re-import is simplest and most deterministic for dev environments.
Validation / QA Harness (New)
Add a new script (recommended location: data/make-model-import/qa_validate.py) plus a small SQL file or inline queries.
Must-check assertions:
- Year window enforced
  - MIN(year) >= MIN_YEAR and MAX(year) <= MAX_YEAR.
- No dimension duplicates
  - SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1; returns 0 rows.
  - SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1; returns 0 rows.
- No fact duplicates
  - SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1; returns 0 rows.
- Dropdown integrity sanity
  - For sampled (year, make, model), trims returned by get_trims_for_year_make_model() must match distinct trims in vehicle_options for that tuple.
  - For sampled (year, make, model, trim), engines query matches vehicle_options join to engines.
  - For sampled (year, make, model), transmissions query matches vehicle_options join to transmissions (plus fallbacks when missing).
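The duplicate and year-window assertions could be wrapped roughly as below. This sketch takes a sqlite3 connection purely as a local stand-in so that it runs without a database server; the real harness would run the same queries against PostgreSQL. The names CHECKS and run_checks are hypothetical, and the dropdown integrity checks are not covered.

```python
import sqlite3
from typing import List

# Each query must return zero rows for the check to pass.
CHECKS = {
    "duplicate engines": (
        "SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1"
    ),
    "duplicate transmissions": (
        "SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1"
    ),
    "duplicate vehicle_options": (
        "SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) "
        "FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1"
    ),
}


def run_checks(
    conn: sqlite3.Connection, min_year: int = 2000, max_year: int = 2026
) -> List[str]:
    """Return the labels of failed checks (empty list means all passed)."""
    failures = []
    for label, sql in CHECKS.items():
        if conn.execute(sql).fetchall():
            failures.append(label)
    low, high = conn.execute(
        "SELECT MIN(year), MAX(year) FROM vehicle_options"
    ).fetchone()
    # MIN/MAX are NULL on an empty table; treat that as "nothing out of range".
    if low is not None and (low < min_year or high > max_year):
        failures.append("year window")
    return failures
```

A caller would exit non-zero when the returned list is non-empty, which gives the fail-on-duplicates and fail-on-out-of-range behavior required of the QA script.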
Optional (recommended) golden assertions:
- Add a small list of “known invalid historically” checks (like 1992 Corvette Z06) that must return empty / not present.
  - These should be driven by overlay evidence (do not hardcode large historical facts without evidence in local data).
Work Breakdown (Assign to Agents)
Agent A — ETL Core Refactor
Owner: ETL generation logic.
Deliverables:
- Update data/make-model-import/etl_generate_sql.py:
  - Add config: MIN_YEAR / MAX_YEAR (defaults 2000 / 2026).
  - Replace current engine/transmission ID assignment with dedup-by-display-name mapping.
  - Remove coupling where an engine_id implies an index into engines.json for transmission lookup.
  - Implement fuel-type fallback label logic (Gas / Diesel / Electric / Hybrid) only when detailed engine spec cannot be built.
  - Dedupe vehicle_options rows before writing SQL.
Acceptance:
- Generated output/01_engines.sql and output/02_transmissions.sql contain only unique values.
- Generated output/03_vehicle_options.sql contains no duplicate tuples.
- Output respects [MIN_YEAR, MAX_YEAR].
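A minimal sketch of the year-window config and fact-level dedupe for Agent A. Reading the window from environment variables named MIN_YEAR and MAX_YEAR is an assumption (the plan only says the window is configurable), and dedupe_in_window is a hypothetical helper name.

```python
import os
from typing import Iterable, List, Tuple

# Assumption: the window is configured through environment variables.
MIN_YEAR = int(os.environ.get("MIN_YEAR", "2000"))
MAX_YEAR = int(os.environ.get("MAX_YEAR", "2026"))

# (year, make, model, trim, engine_id, transmission_id)
Row = Tuple[int, str, str, str, int, int]


def dedupe_in_window(
    rows: Iterable[Row], min_year: int = MIN_YEAR, max_year: int = MAX_YEAR
) -> List[Row]:
    """Drop out-of-window and repeated rows, keeping first-seen order."""
    seen = set()
    kept: List[Row] = []
    for row in rows:
        if not min_year <= row[0] <= max_year:
            continue
        if row in seen:
            continue
        seen.add(row)
        kept.append(row)
    return kept
```

Keeping first-seen order makes the generated SQL stable between runs as long as the inputs are read in the same order, which helps when diffing output files.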
Agent B — Overlay Evidence Builder (Year-Accurate Trims)
Owner: parse automobiles.json and build trim/year evidence.
Deliverables:
- Implement parsing in etl_generate_sql.py (or a helper module if splitting is allowed) to:
  - Extract year or year range from automobiles.json.name (handle YYYY, YYYY-YYYY, YYYY-Present).
  - Map brand_id → canonical make.
  - Normalize automobile “model+variant” string.
  - Match against known models for that make (derived from makes-filter) to split model vs trim.
  - Produce an evidence structure: for (make, model), a list of (trim, year_start, year_end).
Acceptance:
- Evidence filtering prevents trims that have no evidenced year overlap from appearing in those years when generating vehicle_options.
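The year extraction for Agent B might look like the sketch below, covering the three name shapes listed in the deliverables (YYYY, YYYY-YYYY, YYYY-Present). The function name parse_year_range and the present_year parameter are assumptions; how "Present" should resolve (for example to MAX_YEAR) is a decision the plan leaves open.

```python
import re
from typing import Optional, Tuple

# A range is tried first so that a model number that looks like a year
# (followed by a real range) does not win over the range itself.
_RANGE = re.compile(
    r"\b((?:19|20)\d{2})\s*[-\u2013]\s*((?:19|20)\d{2}|present)\b", re.IGNORECASE
)
_SINGLE = re.compile(r"\b((?:19|20)\d{2})\b")


def parse_year_range(name: str, present_year: int) -> Optional[Tuple[int, int]]:
    """Extract (start, end) from an automobile name, or None if no year is found."""
    match = _RANGE.search(name)
    if match:
        start = int(match.group(1))
        end_text = match.group(2)
        end = present_year if end_text.lower() == "present" else int(end_text)
    else:
        single = _SINGLE.search(name)
        if not single:
            return None
        start = end = int(single.group(1))
    return (start, end) if start <= end else None
```

Names without any recognizable year return None, which the caller can count and skip rather than guess a range.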
Notes:
- Matching model vs trim is heuristic; implement conservative logic:
  - Prefer the longest model name match.
  - If ambiguity, do not guess trim; default to Base and log a counter for review.
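The longest-match rule can be sketched as follows, assuming the make and year text have already been stripped from the automobile name. split_model_and_trim is a hypothetical name, and treating "no known model matches" as the ambiguous case is one possible reading of the note above.

```python
from typing import Iterable, Optional, Tuple


def split_model_and_trim(
    text: str, known_models: Iterable[str]
) -> Optional[Tuple[str, str]]:
    """Split "model + variant" text using the longest known model prefix.

    Returns (model, trim), with trim "Base" when nothing follows the model,
    or None when no known model matches (caller should count and skip).
    """
    normalized = " ".join(text.split())  # collapse repeated whitespace
    lowered = normalized.lower()
    best: Optional[str] = None
    for model in known_models:
        candidate = model.lower()
        # Require a word boundary so "Sierra" does not match "Sierras".
        if lowered == candidate or lowered.startswith(candidate + " "):
            if best is None or len(model) > len(best):
                best = model
    if best is None:
        return None
    trim = normalized[len(best):].strip()
    return best, (trim or "Base")
```

For example, with known models Sierra and Sierra 1500, the text Sierra 1500 AT4X splits into model Sierra 1500 and trim AT4X rather than model Sierra and trim 1500 AT4X.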
Agent C — DB Migration + Constraints
Owner: schema correctness and preventing duplicates.
Deliverables:
- Update data/make-model-import/migrations/001_create_vehicle_database.sql:
  - Align columns to the ETL output (keep only what’s used).
  - Add uniqueness constraints (engines/transmissions dims + vehicle_options fact).
- Ensure functions get_makes_for_year, get_models_for_year_make, get_trims_for_year_make_model remain compatible.
Acceptance:
- Rerunning import does not create duplicates even if the ETL output accidentally contains repeats (constraints will reject).
Agent D — Import Script Rerun Safety
Owner: repeatable import process.
Deliverables:
- Update data/make-model-import/import_data.sh:
  - Clear tables deterministically (truncate + restart identity) before import.
  - Import order: schema → engines → transmissions → vehicle_options.
  - Print verification counts and min/max year.
Acceptance:
- Running ./import_data.sh twice produces identical row counts and no errors.
Agent E — QA Harness
Owner: automated validation.
Deliverables:
- Add data/make-model-import/qa_validate.py with:
  - Connect-free checks using generated SQL files (fast pre-import) AND/OR
  - Post-import checks executed via docker exec mvp-postgres psql ... (slower, authoritative).
- Add a short data/make-model-import/QA_README.md or extend existing docs with exact commands.
Acceptance:
- QA script fails on duplicates, out-of-range years, and basic dropdown integrity mismatches.
Agent F (Optional) — Backend/Docs Consistency
Owner: documentation accuracy.
Deliverables:
- Update docs that reference the old normalized vehicles.* schema if they conflict with the current vehicle_options-based system.
  - Primary references: docs/VEHICLES-API.md, backend/src/features/platform/README.md (verify claims).
Acceptance:
- Docs correctly describe the actual dropdown data source and rerun steps.
Rollout Plan
- Implement ETL refactor + evidence overlay + constraints + rerunnable import.
- Regenerate SQL (python3 etl_generate_sql.py in data/make-model-import/).
- Re-import (./import_data.sh).
- Flush Redis dropdown caches (if needed) and re-test dropdowns.
- Run QA harness and capture summary output in a stats.txt (or similar).
Status Update (completed)
- ETL rewritten to use makes-filter as baseline (year/make/model + trims/engines) and overlay evidence only to prune impossible year/trim combos and enrich engines/transmissions.
- Engines/transmissions now deduped by display name; vehicle_options deduped on full key.
- Uniqueness constraints added to prevent duplicates on import.
- Import script made rerunnable (truncate + restart identity) and prints year range.
- QA script added and validated (duplicates=0, year range 2000–2026).
- Example issue (GMC Sierra 1500 AT4X 6.2L V8) now present via baseline engines for that trim/year and Automatic/Manual fallback when transmissions are absent.
Acceptance Criteria (End-to-End)
- Years available in dropdown are exactly those loaded (default 2000–2026).
- Makes for a year only include makes with models in that year.
- Models for year+make only include models available for that tuple.
- Trims for year+make+model do not include impossible trims (e.g., no 1992 Corvette Z06 unless local evidence supports it).
- Engines show detailed specs when available; otherwise show one of Gas / Diesel / Electric / Hybrid.
- Transmissions show derived options when available; otherwise show both Manual and Automatic.
- No duplicate dimension rows; no duplicate fact rows.