motovaultpro/ETL-FIXES.md at 1fc69b7779d50115927fff5d5961418bf4b162c2

egullickson/motovaultpro

Eric Gullickson 1fc69b7779 Before updates to NHTSA

2025-12-14 14:53:45 -06:00

ETL Fixes Plan (Multi-Agent Dispatch): Vehicle Dropdown Data

Purpose

Fix the ETL that populates vehicle dropdown data so it is:

Clean (no duplicate dimension rows, no duplicate fact rows).
Year-accurate for trims (no “impossible” year/make/model/trim combinations).
Rerunnable across environments.
Limited to a configurable year window (default 2000–2026) with no API-level filtering changes.

This plan is written to be dispatched to multiple AI agents working in parallel.

Scope

Backend vehicle dropdowns (Year → Make → Model → Trim → Engine → Transmission).

In-scope:

ETL logic and output SQL generation (data/make-model-import/etl_generate_sql.py).
Import script behavior (data/make-model-import/import_data.sh).
ETL schema migration used by the import (data/make-model-import/migrations/001_create_vehicle_database.sql).
Data quality validation harness (new script(s)).
Documentation updates for rerun workflow.

Out-of-scope:

Any API filtering logic changes. The API must continue to reflect whatever data exists in the DB.
Network calls or new scraping. Use local scraped data only.

Current Data Contract (as used by backend)

Backend dropdowns currently query:

public.vehicle_options
public.engines (joined by engine_id)
public.transmissions (joined by transmission_id)

Primary call sites:

backend/src/features/platform/data/vehicle-data.repository.ts
backend/src/features/platform/domain/vehicle-data.service.ts
Dropdown routes: backend/src/features/vehicles/api/vehicles.routes.ts

Requirements (Confirmed)

Year range behavior

Data outside the configured year window must not be loaded.
Default year window: 2000–2026 (configurable).
No API changes to filter years.
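
A minimal sketch of how the year window could be made configurable. The source fixes the defaults (2000 and 2026) but not the mechanism, so reading environment variables named MIN_YEAR and MAX_YEAR is an assumption, as are the helper names.

```python
import os

DEFAULT_MIN_YEAR = 2000
DEFAULT_MAX_YEAR = 2026


def load_year_window(env=None):
    """Return (min_year, max_year). The env var names are an assumption."""
    env = os.environ if env is None else env
    min_year = int(env.get("MIN_YEAR", DEFAULT_MIN_YEAR))
    max_year = int(env.get("MAX_YEAR", DEFAULT_MAX_YEAR))
    if min_year > max_year:
        raise ValueError(f"MIN_YEAR {min_year} exceeds MAX_YEAR {max_year}")
    return min_year, max_year


def in_window(year, window):
    """True when year falls inside the inclusive window."""
    min_year, max_year = window
    return min_year <= year <= max_year
```

Filtering at load time with a helper like in_window keeps out-of-range rows out of the generated SQL, so the API needs no year filter of its own.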

Missing data defaults

If no trim exists, map to trim "Base".
If no detailed engine spec exists, default to one of: Gas / Diesel / Electric / Hybrid.
- If local scraped data indicates EV → show Electric.
- If indicates Diesel → show Diesel.
- If indicates Hybrid (including mild / plug-in) → show Hybrid.
- Else default → Gas.
If no specific transmission data exists for a (year, make, model), show both Manual and Automatic.
If a detailed engine spec is known, always use the detailed engine spec (do not replace it with the fuel-type default label).

Transmission granularity

Transmission dropdown should be correct at the (year, make, model) level (trim-specific not required).

Observed Defects (Root Causes)

1) Massive duplicate dimension rows

Examples:

data/make-model-import/output/02_transmissions.sql contains repeated values like:
- (1,'1-Speed Automatic'), (2,'1-Speed Automatic'), …

Reason:
The ETL dedupes transmissions on a raw tuple (gearbox_string, speed, drive_type) but stores only a simplified display string, so many distinct raw tuples collapse to the same output type.

Similarly for engines:

data/make-model-import/output/01_engines.sql has many repeated engine display names.

Reason:
The ETL assigns IDs per raw scraped engine record (30,066), even though the UI-facing engine name collapses to far fewer distinct names.

2) Inaccurate year/make/model/trim mappings (dropdown integrity break)

Example:

A user can select a 1992 Chevrolet Corvette Z06, which never existed.

Root cause:
data/make-model-import/makes-filter/*.json includes trims/submodels that appear to be “all-time” variants, not year-accurate.
- Example evidence: data/make-model-import/makes-filter/chevrolet.json contains Z06 for 1992 Corvette.
- Resulting DB evidence: data/make-model-import/output/03_vehicle_options.sql includes (1992,'Chevrolet','Corvette','Z06',...).

3) Duplicate rows in `vehicle_options`

Example evidence:

data/make-model-import/output/03_vehicle_options.sql shows repeated identical rows for the same year/make/model/trim/engine/transmission.

Root causes:
No dedupe at the fact level prior to SQL generation.
Dimension ID strategy makes it difficult to dedupe correctly.
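
Fact-level dedupe is cheap once dimension IDs are stable. The sketch below assumes each row is a plain tuple of (year, make, model, trim, engine_id, transmission_id); the function name is hypothetical and not taken from etl_generate_sql.py.

```python
def dedupe_rows(rows):
    """Drop repeated tuples while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique
```

Keeping first-seen order makes the generated SQL stable between runs, which helps when diffing output files.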

Data Sources (Local Only)

Inputs in data/make-model-import/:

makes-filter/*.json: provides coverage for makes/models by year, but trims/engines are not reliable for year accuracy.
automobiles.json: contains “model pages” with names that include year or year ranges (e.g., 2013-2019, 2021-Present).
engines.json: engine records keyed to automobile_id with specs and “Transmission Specs”.
brands.json: make name metadata (ALL CAPS) + id used by automobiles.json.brand_id.

Target ETL Strategy (Baseline Grid + Evidence Overlay)

We cannot use network sources, so the best available path is:

Baseline coverage from makes-filter for (year, make, model) within [MIN_YEAR, MAX_YEAR].
Year-accuracy overlay from automobiles.json + engines.json:
- Parse each automobile into:
  - Canonical make (via brand_id → brands.json mapping).
  - Model name and an inferred trim/variant string.
  - Year range (start/end) from the automobile name.
- Use these to build evidence sets:
  - Which trims are evidenced for a (make, model) and which year ranges they apply to.
  - Which engines/transmissions are evidenced (via engines.json) for that automobile entry.
Generate vehicle_options:
- For each baseline (year, make, model):
  - If overlay evidence exists for that (make, model, year):
    - Use evidenced trims for that year (trim defaults to Base if missing).
    - Engines: use detailed engine display names when available; else fuel-type fallback label.
    - Transmissions: derive from engine specs when available; else fallback to Manual+Automatic.
  - If no overlay evidence exists:
    - Create a single row with trim Base.
    - Engine default label Gas (or other fuel label if you can infer it locally without guessing; otherwise Gas).
    - Transmission fallback Manual+Automatic.

This approach ensures:

Completeness: you still have a working dropdown for all year/make/model combos in-range.
Accuracy improvements where the scraped evidence supports it (especially trims by year).
No invented trims like Z06 in years where there is no overlay evidence for Z06 in that year range.
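
The generation rules above can be sketched as follows. The shape of the evidence structure (a dict keyed by (make, model) holding dicts with trim, year_start, year_end, engines, and transmissions) is an assumption made for illustration, and the fuel-type fallback is simplified to Gas.

```python
from itertools import product

FALLBACK_TRANSMISSIONS = ("Manual", "Automatic")


def options_for(year, make, model, evidence):
    """Yield (year, make, model, trim, engine, transmission) rows.

    evidence maps (make, model) to a list of dicts; this shape is hypothetical.
    """
    matches = [
        entry for entry in evidence.get((make, model), [])
        if entry["year_start"] <= year <= entry["year_end"]
    ]
    if not matches:
        # No overlay evidence for this year: a single Base row with fallbacks.
        matches = [{"trim": None, "engines": [], "transmissions": []}]
    for entry in matches:
        trim = entry["trim"] or "Base"
        engines = entry["engines"] or ["Gas"]
        transmissions = entry["transmissions"] or FALLBACK_TRANSMISSIONS
        for engine, transmission in product(engines, transmissions):
            yield (year, make, model, trim, engine, transmission)
```

With evidence that places a trim in a limited year range, a baseline year outside that range yields only the Base row, so the trim cannot leak into years that lack evidence.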

Engine & Transmission Normalization Rules

Engine display name

Use existing ETL display logic as a base (from etl_generate_sql.py) but change the ID strategy:

If you can create a detailed engine display string (e.g., V8 5.7L, L4 2.0L Turbo), use it.
Only use default labels when detailed specs are not available:
- Electric if fuel indicates electric.
- Diesel if fuel indicates diesel.
- Hybrid if fuel indicates any hybrid variant.
- Else Gas.

Fuel mapping should be derived from engines.json → specs → Engine Specs → Fuel: which currently includes values like:

Electric
Diesel
Hybrid, Hybrid Gasoline, Mild Hybrid, Mild Hybrid Diesel, Plug-In Hybrid, etc.
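
A minimal sketch of the fallback label logic, assuming the raw fuel value arrives as a string or None. Both function names are hypothetical. Checking for "hybrid" first is a design choice so that values such as Mild Hybrid Diesel map to Hybrid rather than Diesel.

```python
def fuel_fallback_label(fuel):
    """Map a raw fuel value to Gas / Diesel / Electric / Hybrid."""
    text = (fuel or "").strip().lower()
    if "hybrid" in text:  # checked first: "Mild Hybrid Diesel" -> Hybrid
        return "Hybrid"
    if "electric" in text:
        return "Electric"
    if "diesel" in text:
        return "Diesel"
    return "Gas"


def engine_label(detailed_name, fuel):
    """Prefer the detailed display name; use the fuel label only as fallback."""
    return detailed_name or fuel_fallback_label(fuel)
```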

Transmission display name

Normalize to a small set of UI-friendly strings:

Prefer "{N}-Speed Manual" or "{N}-Speed Automatic" when speed and type are known.
Preserve CVT.
If unknown for a (year, make, model), provide both Manual and Automatic.

Important: transmission table IDs must be keyed by the final display name, not the raw tuple.
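
One way to key IDs by the final display name is a small registry that hands out an ID the first time it sees a name. This is a sketch under assumptions: the (speed, kind) inputs stand in for whatever the ETL extracts from the raw tuple, and transmission_display and IdRegistry are hypothetical names.

```python
def transmission_display(speed, kind):
    """Build a UI string from a speed count and a type; either may be None."""
    if kind and kind.strip().upper() == "CVT":
        return "CVT"
    if kind is None:
        return None
    label = kind.strip().title()
    if label not in ("Manual", "Automatic"):
        return None  # unknown type: the caller applies the Manual+Automatic fallback
    return f"{speed}-Speed {label}" if speed else label


class IdRegistry:
    """Assign one integer ID per display name, compared case-insensitively."""

    def __init__(self):
        self._ids = {}
        self.names = []

    def get_id(self, name):
        key = name.lower()
        if key not in self._ids:
            self.names.append(name)
            self._ids[key] = len(self.names)
        return self._ids[key]
```

Because the registry is keyed on the lowercased display string, many raw tuples that render as 1-Speed Automatic share a single ID. The same class could serve the engines table.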

Schema + Import Requirements (Rerunnable + Clean)

Migration changes

Update data/make-model-import/migrations/001_create_vehicle_database.sql to:

Match actual stored columns (current migration defines extra columns not populated by ETL).
Enforce uniqueness to prevent duplicates:
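- engines: unique on normalized name (e.g., CREATE UNIQUE INDEX ... ON engines (LOWER(name)); PostgreSQL needs a unique index here because a UNIQUE constraint cannot contain an expression).
- transmissions: unique on normalized type (e.g., CREATE UNIQUE INDEX ... ON transmissions (LOWER(type))).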
- vehicle_options: unique on (year, make, model, trim, engine_id, transmission_id).
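
The case-insensitive uniqueness rule can be tried locally before touching the real migration. The sketch below uses Python's built-in sqlite3 as a stand-in for PostgreSQL, which is an assumption for convenience only; the index name is hypothetical and the table is reduced to two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE engines (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Expression index: uniqueness is enforced on the lowercased name.
    CREATE UNIQUE INDEX engines_name_uq ON engines (LOWER(name));
    """
)
conn.execute("INSERT INTO engines (name) VALUES ('V8 5.7L')")
try:
    # Differs from the first row only by case, so the index rejects it.
    conn.execute("INSERT INTO engines (name) VALUES ('v8 5.7l')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```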

Import script changes

Update data/make-model-import/import_data.sh so reruns are consistent:

Either:
- TRUNCATE vehicle_options, engines, transmissions RESTART IDENTITY CASCADE; before import, then insert, OR
- Use INSERT ... ON CONFLICT DO NOTHING with deterministic IDs (more complex).

Given constraints and large volume, truncation + re-import is simplest and most deterministic for dev environments.

Validation / QA Harness (New)

Add a new script (recommended location: data/make-model-import/qa_validate.py) plus a small SQL file or inline queries.

Must-check assertions:

Year window enforced
- MIN(year) >= MIN_YEAR and MAX(year) <= MAX_YEAR.
No dimension duplicates
- SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1; returns 0 rows.
- SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1; returns 0 rows.
No fact duplicates
- SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1; returns 0 rows.
Dropdown integrity sanity
- For sampled (year, make, model), trims returned by get_trims_for_year_make_model() must match distinct trims in vehicle_options for that tuple.
- For sampled (year, make, model, trim), engines query matches vehicle_options join to engines.
- For sampled (year, make, model), transmissions query matches vehicle_options join to transmissions (plus fallbacks when missing).
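
The year-window and duplicate assertions can also be expressed as a fast pre-import check over in-memory rows. This sketch assumes the generated SQL has already been parsed into Python lists, which is not shown; function names are hypothetical.

```python
from collections import Counter


def find_duplicates(values, key=lambda v: v):
    """Return the keys that occur more than once."""
    counts = Counter(key(v) for v in values)
    return sorted(k for k, n in counts.items() if n > 1)


def run_checks(engine_names, transmission_types, option_rows, min_year, max_year):
    """Return a list of failure messages; an empty list means all checks pass.

    option_rows are (year, make, model, trim, engine_id, transmission_id) tuples.
    """
    failures = []
    years = [row[0] for row in option_rows]
    if years and (min(years) < min_year or max(years) > max_year):
        failures.append(f"year range {min(years)}-{max(years)} outside window")
    if find_duplicates(engine_names, key=str.lower):
        failures.append("duplicate engine names")
    if find_duplicates(transmission_types, key=str.lower):
        failures.append("duplicate transmission types")
    if find_duplicates(option_rows):
        failures.append("duplicate vehicle_options rows")
    return failures
```

A QA script could exit non-zero whenever the returned list is non-empty, leaving the SQL queries above as the authoritative post-import check.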

Optional (recommended) golden assertions:

Add a small list of “known invalid historically” checks (like 1992 Corvette Z06) that must return empty / not present.
- These should be driven by overlay evidence (do not hardcode large historical facts without evidence in local data).

Work Breakdown (Assign to Agents)

Agent A — ETL Core Refactor

Owner: ETL generation logic.

Deliverables:

Update data/make-model-import/etl_generate_sql.py:
- Add config: MIN_YEAR/MAX_YEAR (defaults 2000/2026).
- Replace current engine/transmission ID assignment with dedup-by-display-name mapping.
- Remove coupling where an engine_id implies an index into engines.json for transmission lookup.
- Implement fuel-type fallback label logic (Gas/Diesel/Electric/Hybrid) only when detailed engine spec cannot be built.
- Dedupe vehicle_options rows before writing SQL.

Acceptance:

Generated output/01_engines.sql and output/02_transmissions.sql contain only unique values.
Generated output/03_vehicle_options.sql contains no duplicate tuples.
Output respects [MIN_YEAR, MAX_YEAR].

Agent B — Overlay Evidence Builder (Year-Accurate Trims)

Owner: parse automobiles.json and build trim/year evidence.

Deliverables:

Implement parsing in etl_generate_sql.py (or a helper module if splitting is allowed) to:
- Extract year or year range from automobiles.json.name (handle YYYY, YYYY-YYYY, YYYY-Present).
- Map brand_id → canonical make.
- Normalize automobile “model+variant” string.
- Match against known models for that make (derived from makes-filter) to split model vs trim.
- Produce an evidence structure: for (make, model), a list of (trim, year_start, year_end).
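
A sketch of the year-range extraction for the three name shapes listed above (YYYY, YYYY-YYYY, YYYY-Present). Treating Present as a caller-supplied year (for example MAX_YEAR) is an assumption, and the function names are hypothetical.

```python
import re

# A 4-digit year, optionally followed by "-YYYY" or "-Present" (hyphen or en dash).
YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*[-\u2013]\s*((?:19|20)\d{2}|Present))?\b",
    re.IGNORECASE,
)


def parse_year_range(name, present_year):
    """Return (start, end) from the first year token in name, or None."""
    match = YEAR_RANGE.search(name)
    if match is None:
        return None
    start = int(match.group(1))
    end_token = match.group(2)
    if end_token is None:
        end = start
    elif end_token.lower() == "present":
        end = present_year
    else:
        end = int(end_token)
    if end < start:
        return None  # malformed range: leave it for manual review
    return start, end


def years_in_window(year_range, min_year, max_year):
    """Expand a (start, end) range, clipped to the configured window."""
    start, end = year_range
    return list(range(max(start, min_year), min(end, max_year) + 1))
```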

Acceptance:

Evidence filtering prevents trims that have no evidenced year overlap from appearing in those years when generating vehicle_options.

Notes:

Matching model vs trim is heuristic; implement conservative logic:
- Prefer the longest model name match.
- If ambiguity, do not guess trim; default to Base and log a counter for review.
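
The conservative matching rule can be sketched as a longest-prefix match against the known models for a make. The function name and the review counter key are hypothetical, and the sketch assumes model names in known_models already have normalized whitespace.

```python
from collections import Counter

review_counts = Counter()


def split_model_and_trim(text, known_models):
    """Return (model, trim) using the longest known model that prefixes text."""
    normalized = " ".join(text.split())
    lowered = normalized.lower()
    candidates = [
        model for model in known_models
        if lowered == model.lower() or lowered.startswith(model.lower() + " ")
    ]
    if not candidates:
        # No confident match: do not guess a trim, count it for review.
        review_counts["unmatched_model"] += 1
        return None, "Base"
    best = max(candidates, key=len)
    trim = normalized[len(best):].strip()
    return best, trim or "Base"
```

Requiring a trailing space after the model prefix stops Sierra from matching a longer, unrelated name that merely starts with the same letters.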

Agent C — DB Migration + Constraints

Owner: schema correctness and preventing duplicates.

Deliverables:

Update data/make-model-import/migrations/001_create_vehicle_database.sql:
- Align columns to the ETL output (keep only what’s used).
- Add uniqueness constraints (engines/transmissions dims + vehicle_options fact).
- Ensure functions get_makes_for_year, get_models_for_year_make, get_trims_for_year_make_model remain compatible.

Acceptance:

Rerunning the import does not create duplicates even if the ETL output accidentally contains repeats (the uniqueness constraints reject them).

Agent D — Import Script Rerun Safety

Owner: repeatable import process.

Deliverables:

Update data/make-model-import/import_data.sh:
- Clear tables deterministically (truncate + restart identity) before import.
- Import order: schema → engines → transmissions → vehicle_options.
- Print verification counts and min/max year.

Acceptance:

Running ./import_data.sh twice produces identical row counts and no errors.

Agent E — QA Harness

Owner: automated validation.

Deliverables:

Add data/make-model-import/qa_validate.py with:
- Connection-free checks against the generated SQL files (fast, pre-import) AND/OR
- Post-import checks executed via docker exec mvp-postgres psql ... (slower, authoritative).
Add a short data/make-model-import/QA_README.md or extend existing docs with exact commands.

Acceptance:

QA script fails on duplicates, out-of-range years, and basic dropdown integrity mismatches.

Agent F (Optional) — Backend/Docs Consistency

Owner: documentation accuracy.

Deliverables:

Update docs that reference the old normalized vehicles.* schema if they conflict with the current vehicle_options based system.
- Primary references: docs/VEHICLES-API.md, backend/src/features/platform/README.md (verify claims).

Acceptance:

Docs correctly describe the actual dropdown data source and rerun steps.

Rollout Plan

Implement ETL refactor + evidence overlay + constraints + rerunnable import.
Regenerate SQL (python3 etl_generate_sql.py in data/make-model-import/).
Re-import (./import_data.sh).
Flush Redis dropdown caches (if needed) and re-test dropdowns.
Run QA harness and capture summary output in a stats.txt (or similar).

Status Update (completed)

ETL rewritten to use makes-filter as baseline (year/make/model + trims/engines) and overlay evidence only to prune impossible year/trim combos and enrich engines/transmissions.
Engines/transmissions now deduped by display name; vehicle_options deduped on full key.
Uniqueness constraints added to prevent duplicates on import.
Import script made rerunnable (truncate + restart identity) and prints year range.
QA script added and validated (duplicates=0, year range 2000–2026).
Example issue (GMC Sierra 1500 AT4X 6.2L V8) now present via baseline engines for that trim/year and Automatic/Manual fallback when transmissions are absent.

Acceptance Criteria (End-to-End)

Years available in dropdown are exactly those loaded (default 2000–2026).
Makes for a year only include makes with models in that year.
Models for year+make only include models available for that tuple.
Trims for year+make+model do not include impossible trims (e.g., no 1992 Corvette Z06 unless local evidence supports it).
Engines show detailed specs when available; otherwise show one of Gas/Diesel/Electric/Hybrid.
Transmissions show derived options when available; otherwise show both Manual and Automatic.
No duplicate dimension rows; no duplicate fact rows.
