motovaultpro/ETL-FIXES.md
2025-12-14 14:53:45 -06:00

ETL Fixes Plan (Multi-Agent Dispatch): Vehicle Dropdown Data

Purpose

Fix the ETL that populates vehicle dropdown data so it is:

  • Clean (no duplicate dimension rows, no duplicate fact rows).
  • Year-accurate for trims (no “impossible” year/make/model/trim combinations).
  • Rerunnable across environments.
  • Limited to a configurable year window (default 2000-2026) with no API-level filtering changes.

This plan is written to be dispatched to multiple AI agents working in parallel.

Scope

Backend vehicle dropdowns (Year → Make → Model → Trim → Engine → Transmission).

In-scope:

  • ETL logic and output SQL generation (data/make-model-import/etl_generate_sql.py).
  • Import script behavior (data/make-model-import/import_data.sh).
  • ETL schema migration used by the import (data/make-model-import/migrations/001_create_vehicle_database.sql).
  • Data quality validation harness (new script(s)).
  • Documentation updates for rerun workflow.

Out-of-scope:

  • Any API filtering logic changes. The API must continue to reflect whatever data exists in the DB.
  • Network calls or new scraping. Use local scraped data only.

Current Data Contract (as used by backend)

Backend dropdowns currently query:

  • public.vehicle_options
  • public.engines (joined by engine_id)
  • public.transmissions (joined by transmission_id)

Primary call sites:

  • backend/src/features/platform/data/vehicle-data.repository.ts
  • backend/src/features/platform/domain/vehicle-data.service.ts
  • Dropdown routes: backend/src/features/vehicles/api/vehicles.routes.ts

Requirements (Confirmed)

Year range behavior

  • Data outside the configured year window must not be loaded.
  • Default year window: 2000-2026 (configurable).
  • No API changes to filter years.
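
As a minimal sketch of a configurable window, the ETL could read the bounds once and filter on them before any rows are generated. The environment variable names `MIN_YEAR` and `MAX_YEAR` and the helper names below are assumptions for illustration; the plan only states that the window is configurable with defaults 2000 and 2026.

```python
import os
from typing import Mapping, Tuple


def load_year_window(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read the year window, falling back to the plan's defaults (2000, 2026).

    Reading from environment variables is an assumption of this sketch.
    """
    min_year = int(env.get("MIN_YEAR", "2000"))
    max_year = int(env.get("MAX_YEAR", "2026"))
    if min_year > max_year:
        raise ValueError(f"MIN_YEAR {min_year} is greater than MAX_YEAR {max_year}")
    return min_year, max_year


def in_window(year: int, window: Tuple[int, int]) -> bool:
    """True when the year falls inside the inclusive window."""
    low, high = window
    return low <= year <= high
```

Filtering at load time keeps the API untouched: rows outside the window never reach the database, so the dropdown queries need no year filter.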

Missing data defaults

  • If no trim exists, map to trim "Base".
  • If no detailed engine spec exists, default to one of: Gas / Diesel / Electric / Hybrid.
    • If local scraped data indicates EV → show Electric.
    • If it indicates Diesel → show Diesel.
    • If it indicates Hybrid (including mild / plug-in) → show Hybrid.
    • Else default → Gas.
  • If no specific transmission data exists for a (year, make, model), show both Manual and Automatic.
  • If a detailed engine spec is known, always use the detailed engine spec (do not replace it with the fuel-type default label).

Transmission granularity

  • Transmission dropdown should be correct at the (year, make, model) level (trim-specific not required).

Observed Defects (Root Causes)

1) Massive duplicate dimension rows

Examples:

  • data/make-model-import/output/02_transmissions.sql contains repeated values like:
    • (1,'1-Speed Automatic'), (2,'1-Speed Automatic'), …

Reason:

  • ETL dedupes transmissions on a raw tuple (gearbox_string, speed, drive_type) but stores only a simplified display string, so many distinct raw tuples collapse to the same output type.

Similarly for engines:

  • data/make-model-import/output/01_engines.sql has many repeated engine display names.

Reason:

  • ETL assigns IDs per raw scraped engine record (30,066), even though the UI-facing engine name collapses to far fewer distinct names.

2) Inaccurate year/make/model/trim mappings (dropdown integrity break)

Example:

  • User can select 1992 Chevrolet Corvette Z06, which never existed.

Root cause:

  • data/make-model-import/makes-filter/*.json includes trims/submodels that appear to be “all-time” variants, not year-accurate.
    • Example evidence: data/make-model-import/makes-filter/chevrolet.json contains Z06 for 1992 Corvette.
    • Resulting DB evidence: data/make-model-import/output/03_vehicle_options.sql includes (1992,'Chevrolet','Corvette','Z06',...).

3) Duplicate rows in vehicle_options

Example evidence:

  • data/make-model-import/output/03_vehicle_options.sql shows repeated identical rows for the same year/make/model/trim/engine/transmission.

Root causes:

  • No dedupe at the fact level prior to SQL generation.
  • Dimension ID strategy makes it difficult to dedupe correctly.

Data Sources (Local Only)

Inputs in data/make-model-import/:

  • makes-filter/*.json: provides coverage for makes/models by year, but trims/engines are not reliable for year accuracy.
  • automobiles.json: contains “model pages” with names that include year or year ranges (e.g., 2013-2019, 2021-Present).
  • engines.json: engine records keyed to automobile_id with specs and “Transmission Specs”.
  • brands.json: make name metadata (ALL CAPS) + id used by automobiles.json.brand_id.

Target ETL Strategy (Baseline Grid + Evidence Overlay)

We cannot use network sources, so the best available path is:

  1. Baseline coverage from makes-filter for (year, make, model) within [MIN_YEAR, MAX_YEAR].
  2. Year-accuracy overlay from automobiles.json + engines.json:
    • Parse each automobile into:
      • Canonical make (via brand_id → brands.json mapping).
      • Model name and an inferred trim/variant string.
      • Year range (start/end) from the automobile name.
    • Use these to build evidence sets:
      • Which trims are evidenced for a (make, model) and which year ranges they apply to.
      • Which engines/transmissions are evidenced (via engines.json) for that automobile entry.
  3. Generate vehicle_options:
    • For each baseline (year, make, model):
      • If overlay evidence exists for that (make, model, year):
        • Use evidenced trims for that year (trim defaults to Base if missing).
        • Engines: use detailed engine display names when available; else fuel-type fallback label.
        • Transmissions: derive from engine specs when available; else fallback to Manual+Automatic.
      • If no overlay evidence exists:
        • Create a single row with trim Base.
        • Engine default label Gas (or other fuel label if you can infer it locally without guessing; otherwise Gas).
        • Transmission fallback Manual+Automatic.

This approach ensures:

  • Completeness: you still have a working dropdown for all year/make/model combos in-range.
  • Accuracy improvements where the scraped evidence supports it (especially trims by year).
  • No invented trims like Z06 in years where there is no overlay evidence for Z06 in that year range.
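
The baseline-plus-overlay steps above can be sketched for the trim dimension as follows. The evidence shape (a map from (make, model) to a list of (trim, year_start, year_end)) follows the structure described for Agent B; the function names and the sample year spans are illustrative assumptions, not values taken from the local data.

```python
from typing import Dict, Iterable, List, Tuple

MIN_YEAR, MAX_YEAR = 2000, 2026  # default window from the requirements

# Assumed evidence shape: (make, model) -> [(trim, year_start, year_end)]
Evidence = Dict[Tuple[str, str], List[Tuple[str, int, int]]]


def trims_for_year(evidence: Evidence, make: str, model: str, year: int) -> List[str]:
    """Return trims evidenced for this year; an empty trim maps to 'Base'."""
    spans = evidence.get((make, model), [])
    trims = {trim or "Base" for trim, start, end in spans if start <= year <= end}
    # No overlay evidence for this year: fall back to a single Base trim.
    return sorted(trims) or ["Base"]


def build_trim_rows(
    baseline: Iterable[Tuple[int, str, str]],
    evidence: Evidence,
    min_year: int = MIN_YEAR,
    max_year: int = MAX_YEAR,
) -> List[Tuple[int, str, str, str]]:
    """Expand baseline (year, make, model) tuples into deduped trim rows."""
    rows = set()  # a set removes duplicate fact rows before SQL generation
    for year, make, model in baseline:
        if not (min_year <= year <= max_year):
            continue  # out-of-window data is never loaded
        for trim in trims_for_year(evidence, make, model, year):
            rows.add((year, make, model, trim))
    return sorted(rows)
```

With a baseline containing a Corvette for 1992, 2002 (twice), and 2010, and illustrative evidence that Z06 applies to 2001-2004, this yields Base and Z06 for 2002, only Base for 2010, and nothing for 1992. Engines and transmissions would be attached per row in the same loop.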

Engine & Transmission Normalization Rules

Engine display name

Use existing ETL display logic as a base (from etl_generate_sql.py) but change the ID strategy:

  • If you can create a detailed engine display string (e.g., V8 5.7L, L4 2.0L Turbo), use it.
  • Only use default labels when detailed specs are not available:
    • Electric if fuel indicates electric.
    • Diesel if fuel indicates diesel.
    • Hybrid if fuel indicates any hybrid variant.
    • Else Gas.

Fuel mapping should be derived from the engines.json → specs → Engine Specs → Fuel field, which currently includes values like:

  • Electric
  • Diesel
  • Hybrid, Hybrid Gasoline, Mild Hybrid, Mild Hybrid Diesel, Plug-In Hybrid, etc.
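
A minimal sketch of the label rules above, assuming the raw Fuel value arrives as a plain string. The function names are hypothetical, and checking for "hybrid" before "diesel" is an assumption made so that values such as Mild Hybrid Diesel land on Hybrid, in line with "any hybrid variant".

```python
from typing import Optional


def fuel_fallback_label(fuel: Optional[str]) -> str:
    """Map a raw Fuel string to one of Gas / Diesel / Electric / Hybrid."""
    text = (fuel or "").strip().lower()
    # Hybrid is checked first so "Mild Hybrid Diesel" is labeled Hybrid.
    if "hybrid" in text:
        return "Hybrid"
    if "electric" in text:
        return "Electric"
    if "diesel" in text:
        return "Diesel"
    return "Gas"  # default when nothing else is indicated


def engine_display_name(detailed: Optional[str], fuel: Optional[str]) -> str:
    """Prefer the detailed display string; use the fuel label only as a fallback."""
    if detailed and detailed.strip():
        return detailed.strip()
    return fuel_fallback_label(fuel)
```

For example, `engine_display_name("V8 5.7L", "Diesel")` keeps the detailed string, while `engine_display_name(None, "Plug-In Hybrid")` falls back to the Hybrid label.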

Transmission display name

Normalize to a small set of UI-friendly strings:

  • Prefer "{N}-Speed Manual" or "{N}-Speed Automatic" when speed and type are known.
  • Preserve CVT.
  • If unknown for a (year, make, model), provide both Manual and Automatic.

Important: transmission table IDs must be keyed by the final display name, not the raw tuple.
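
The following sketch shows one way to normalize raw transmission tuples and key IDs by the final display name. The raw tuple layout (gearbox_string, speed, drive_type) comes from the defect description; the function names and the substring matching on "cvt", "manual", and "auto" are assumptions about the scraped strings.

```python
from typing import Dict, Iterable, List, Optional


def transmission_display(gearbox: Optional[str], speeds: Optional[int]) -> Optional[str]:
    """Collapse a raw gearbox description to a UI-friendly display name."""
    text = (gearbox or "").lower()
    if "cvt" in text:
        return "CVT"
    # Ambiguous strings (for example automated manuals) are not special-cased here.
    if "manual" in text:
        kind = "Manual"
    elif "auto" in text:
        kind = "Automatic"
    else:
        return None  # unknown: the caller applies the Manual + Automatic fallback
    return f"{speeds}-Speed {kind}" if speeds else kind


def assign_ids(display_names: Iterable[str]) -> Dict[str, int]:
    """Assign one ID per distinct display name, compared case-insensitively."""
    ids: Dict[str, int] = {}
    for name in display_names:
        key = name.lower()  # mirrors the LOWER(type) uniqueness rule
        if key not in ids:
            ids[key] = len(ids) + 1
    return ids


def with_fallback(names: List[str]) -> List[str]:
    """Offer both Manual and Automatic when nothing specific is known."""
    return names if names else ["Manual", "Automatic"]
```

Because IDs are derived from the display name, raw tuples that differ only in drive type (for example a 1-speed automatic in FWD and RWD) map to a single transmissions row.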

Schema + Import Requirements (Rerunnable + Clean)

Migration changes

Update data/make-model-import/migrations/001_create_vehicle_database.sql to:

  • Match actual stored columns (current migration defines extra columns not populated by ETL).
  • Enforce uniqueness to prevent duplicates:
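    • engines: unique on normalized name. PostgreSQL does not accept an expression inside a UNIQUE table constraint, so use a unique index (e.g., CREATE UNIQUE INDEX ON engines (LOWER(name))).
    • transmissions: unique on normalized type, again via a unique index (e.g., CREATE UNIQUE INDEX ON transmissions (LOWER(type))).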
    • vehicle_options: unique on (year, make, model, trim, engine_id, transmission_id).

Import script changes

Update data/make-model-import/import_data.sh so reruns are consistent:

  • Either:
    • TRUNCATE vehicle_options, engines, transmissions RESTART IDENTITY CASCADE; before import, then insert, OR
    • Use INSERT ... ON CONFLICT DO NOTHING with deterministic IDs (more complex).

Given constraints and large volume, truncation + re-import is simplest and most deterministic for dev environments.

Validation / QA Harness (New)

Add a new script (recommended location: data/make-model-import/qa_validate.py) plus a small SQL file or inline queries.

Must-check assertions:

  1. Year window enforced
    • MIN(year) >= MIN_YEAR and MAX(year) <= MAX_YEAR.
  2. No dimension duplicates
    • SELECT LOWER(name), COUNT(*) FROM engines GROUP BY 1 HAVING COUNT(*) > 1; returns 0 rows.
    • SELECT LOWER(type), COUNT(*) FROM transmissions GROUP BY 1 HAVING COUNT(*) > 1; returns 0 rows.
  3. No fact duplicates
    • SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) FROM vehicle_options GROUP BY 1,2,3,4,5,6 HAVING COUNT(*) > 1; returns 0 rows.
  4. Dropdown integrity sanity
    • For sampled (year, make, model), trims returned by get_trims_for_year_make_model() must match distinct trims in vehicle_options for that tuple.
    • For sampled (year, make, model, trim), engines query matches vehicle_options join to engines.
    • For sampled (year, make, model), transmissions query matches vehicle_options join to transmissions (plus fallbacks when missing).
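
Assertions 1 to 3 could be wired into the QA script roughly as shown here. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in connection so it runs without a database server, whereas the authoritative checks would run against PostgreSQL. The `CHECKS` dictionary and `run_checks` are hypothetical names, and the queries spell out the GROUP BY columns instead of using ordinals.

```python
import sqlite3
from typing import Dict

# Each query returns one row per duplicated key; zero rows means the check passes.
CHECKS = {
    "engine_duplicates": (
        "SELECT LOWER(name), COUNT(*) FROM engines "
        "GROUP BY LOWER(name) HAVING COUNT(*) > 1"
    ),
    "transmission_duplicates": (
        "SELECT LOWER(type), COUNT(*) FROM transmissions "
        "GROUP BY LOWER(type) HAVING COUNT(*) > 1"
    ),
    "fact_duplicates": (
        "SELECT year, make, model, trim, engine_id, transmission_id, COUNT(*) "
        "FROM vehicle_options "
        "GROUP BY year, make, model, trim, engine_id, transmission_id "
        "HAVING COUNT(*) > 1"
    ),
}


def run_checks(conn: sqlite3.Connection, min_year: int, max_year: int) -> Dict[str, int]:
    """Return a failure count per check; all zeros means the data is clean."""
    failures = {name: len(conn.execute(sql).fetchall()) for name, sql in CHECKS.items()}
    low, high = conn.execute("SELECT MIN(year), MAX(year) FROM vehicle_options").fetchone()
    # An empty table yields (None, None) and is treated as a failure.
    in_window = low is not None and low >= min_year and high <= max_year
    failures["year_window"] = 0 if in_window else 1
    return failures
```

A caller can exit non-zero when any value in the returned dictionary is greater than zero, which gives the "QA script fails on duplicates and out-of-range years" behavior.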

Optional (recommended) golden assertions:

  • Add a small list of “known invalid historically” checks (like 1992 Corvette Z06) that must return empty / not present.
    • These should be driven by overlay evidence (do not hardcode large historical facts without evidence in local data).

Work Breakdown (Assign to Agents)

Agent A — ETL Core Refactor

Owner: ETL generation logic.

Deliverables:

  • Update data/make-model-import/etl_generate_sql.py:
    • Add config: MIN_YEAR/MAX_YEAR (defaults 2000/2026).
    • Replace current engine/transmission ID assignment with dedup-by-display-name mapping.
    • Remove coupling where an engine_id implies an index into engines.json for transmission lookup.
    • Implement fuel-type fallback label logic (Gas/Diesel/Electric/Hybrid) only when detailed engine spec cannot be built.
    • Dedupe vehicle_options rows before writing SQL.

Acceptance:

  • Generated output/01_engines.sql and output/02_transmissions.sql contain only unique values.
  • Generated output/03_vehicle_options.sql contains no duplicate tuples.
  • Output respects [MIN_YEAR, MAX_YEAR].

Agent B — Overlay Evidence Builder (Year-Accurate Trims)

Owner: parse automobiles.json and build trim/year evidence.

Deliverables:

  • Implement parsing in etl_generate_sql.py (or a helper module if splitting is allowed) to:
    • Extract year or year range from automobiles.json.name (handle YYYY, YYYY-YYYY, YYYY-Present).
    • Map brand_id → canonical make.
    • Normalize automobile “model+variant” string.
    • Match against known models for that make (derived from makes-filter) to split model vs trim.
    • Produce an evidence structure: for (make, model), a list of (trim, year_start, year_end).

Acceptance:

  • Evidence filtering prevents trims that have no evidenced year overlap from appearing in those years when generating vehicle_options.

Notes:

  • Matching model vs trim is heuristic; implement conservative logic:
    • Prefer the longest model name match.
    • If the match is ambiguous, do not guess the trim; default to Base and log a counter for review.
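
A minimal sketch of the two parsing steps, assuming the year or year range is the last year-like token in the automobile name and that the model+variant string starts with the model name. Both assumptions and both function names are illustrative; the actual layout of automobiles.json names should be verified against the local data.

```python
import re
from typing import List, Optional, Tuple

# Matches YYYY, YYYY-YYYY, or YYYY-Present (hyphen or en dash).
YEAR_RANGE = re.compile(
    r"\b((?:19|20)\d{2})(?:\s*[-–]\s*((?:19|20)\d{2}|present))?\b",
    re.IGNORECASE,
)


def parse_year_range(name: str, present_year: int) -> Optional[Tuple[int, int]]:
    """Extract (start, end) from a name; 'Present' resolves to present_year."""
    matches = list(YEAR_RANGE.finditer(name))
    if not matches:
        return None
    # Take the last match so a numeric model name is less likely to win.
    start_raw, end_raw = matches[-1].groups()
    start = int(start_raw)
    if end_raw is None:
        end = start
    elif end_raw.lower() == "present":
        end = present_year
    else:
        end = int(end_raw)
    return (start, end) if start <= end else None


def split_model_and_trim(text: str, known_models: List[str]) -> Optional[Tuple[str, str]]:
    """Split 'model + variant' using the longest known model name as prefix."""
    lowered = text.lower()
    for model in sorted(known_models, key=len, reverse=True):
        prefix = model.lower()
        if lowered == prefix or lowered.startswith(prefix + " "):
            trim = text[len(model):].strip()
            return model, trim or "Base"
    return None  # no known model matched: caller counts this for review
```

Trying longer model names first means "Sierra 1500 AT4X" resolves to model "Sierra 1500" with trim "AT4X", not to model "Sierra" with trim "1500 AT4X".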

Agent C — DB Migration + Constraints

Owner: schema correctness and preventing duplicates.

Deliverables:

  • Update data/make-model-import/migrations/001_create_vehicle_database.sql:
    • Align columns to the ETL output (keep only what is used).
    • Add uniqueness constraints (engines/transmissions dims + vehicle_options fact).
    • Ensure functions get_makes_for_year, get_models_for_year_make, get_trims_for_year_make_model remain compatible.

Acceptance:

  • Rerunning import does not create duplicates even if the ETL output accidentally contains repeats (constraints will reject).

Agent D — Import Script Rerun Safety

Owner: repeatable import process.

Deliverables:

  • Update data/make-model-import/import_data.sh:
    • Clear tables deterministically (truncate + restart identity) before import.
    • Import order: schema → engines → transmissions → vehicle_options.
    • Print verification counts and min/max year.

Acceptance:

  • Running ./import_data.sh twice produces identical row counts and no errors.

Agent E — QA Harness

Owner: automated validation.

Deliverables:

  • Add data/make-model-import/qa_validate.py with:
    • Connect-free checks using generated SQL files (fast pre-import) AND/OR
    • Post-import checks executed via docker exec mvp-postgres psql ... (slower, authoritative).
  • Add a short data/make-model-import/QA_README.md or extend existing docs with exact commands.

Acceptance:

  • QA script fails on duplicates, out-of-range years, and basic dropdown integrity mismatches.

Agent F (Optional) — Backend/Docs Consistency

Owner: documentation accuracy.

Deliverables:

  • Update docs that reference the old normalized vehicles.* schema if they conflict with the current vehicle_options-based system.
    • Primary references: docs/VEHICLES-API.md, backend/src/features/platform/README.md (verify claims).

Acceptance:

  • Docs correctly describe the actual dropdown data source and rerun steps.

Rollout Plan

  1. Implement ETL refactor + evidence overlay + constraints + rerunnable import.
  2. Regenerate SQL (python3 etl_generate_sql.py in data/make-model-import/).
  3. Re-import (./import_data.sh).
  4. Flush Redis dropdown caches (if needed) and re-test dropdowns.
  5. Run QA harness and capture summary output in a stats.txt (or similar).

Status Update (completed)

  • ETL rewritten to use makes-filter as baseline (year/make/model + trims/engines) and overlay evidence only to prune impossible year/trim combos and enrich engines/transmissions.
  • Engines/transmissions now deduped by display name; vehicle_options deduped on full key.
  • Uniqueness constraints added to prevent duplicates on import.
  • Import script made rerunnable (truncate + restart identity) and prints year range.
  • QA script added and validated (duplicates=0, year range 2000-2026).
  • Example issue (GMC Sierra 1500 AT4X 6.2L V8) now present via baseline engines for that trim/year and Automatic/Manual fallback when transmissions are absent.

Acceptance Criteria (End-to-End)

  • Years available in dropdown are exactly those loaded (default 2000-2026).
  • Makes for a year only include makes with models in that year.
  • Models for year+make only include models available for that tuple.
  • Trims for year+make+model do not include impossible trims (e.g., no 1992 Corvette Z06 unless local evidence supports it).
  • Engines show detailed specs when available; otherwise show one of Gas/Diesel/Electric/Hybrid.
  • Transmissions show derived options when available; otherwise show both Manual and Automatic.
  • No duplicate dimension rows; no duplicate fact rows.