# ETL Fix V2: Year-Accurate Vehicle Dropdown Data

## Executive Summary

This document provides a complete implementation plan for fixing the vehicle dropdown ETL to produce year-accurate data. The fix addresses impossible year/trim combinations (e.g., "1992 Corvette Z06") by using the NHTSA VPIC API for authoritative Year/Make/Model validation and automobiles.json for evidence-based trim data with year ranges.

---

## Problem Statement

### Current Issues

1. **Year-inaccurate trims**: The `makes-filter/*.json` files contain ALL trims ever made for a model, applied to EVERY year
2. **Impossible combinations**: Users can select "1992 Corvette Z06" (Z06 didn't exist until 2001)
3. **Data bloat**: 400 records for the 2000 Corvette with 20 trims instead of ~3-4

### Root Cause

The `makes-filter/*.json` data structure does NOT have year-specific trims. Example from `chevrolet.json`:

```json
{
  "year": "2025",
  "models": [{
    "name": "corvette",
    "submodels": ["LT", "35th Anniversary Edition", "427", "Z06", "ZR1", ...]
  }]
}
```

The same `submodels` array is repeated for every year, making ALL trims appear valid for ALL years.

---

## Solution Architecture

### Data Sources (Priority Order)

1. **NHTSA VPIC API** - Authoritative Year/Make/Model validation
2. **automobiles.json** - Primary trim source with year-range evidence
3. **makes-filter/*.json** - Engine data enrichment
4. **Defaults** - "Base" trim, "Gas" engine, "Manual"/"Automatic" transmission

### Year Range

- Minimum: 1990
- Maximum: 2026

---

## Implementation Steps

### Phase 1: Create NHTSA Data Fetcher

**Create file:** `data/make-model-import/nhtsa_fetch.py`

```python
#!/usr/bin/env python3
"""
NHTSA VPIC API Data Fetcher

Fetches authoritative Year/Make/Model data from the US government database.
"""
import json
import os
import time
from pathlib import Path
from typing import Dict, List, Set
import urllib.request
import urllib.error


class NHTSAFetcher:
    BASE_URL = "https://vpic.nhtsa.dot.gov/api/vehicles"
    CACHE_DIR = Path("nhtsa_cache")
    OUTPUT_FILE = Path("nhtsa_vehicles.json")

    def __init__(self):
        self.min_year = int(os.getenv("MIN_YEAR", "1990"))
        self.max_year = int(os.getenv("MAX_YEAR", "2026"))
        self.request_delay = 0.1  # 100ms between requests
        # Makes we care about (from makes-filter)
        self.target_makes = self._load_target_makes()

    def _load_target_makes(self) -> Set[str]:
        """Load makes from makes-filter directory."""
        makes_dir = Path("makes-filter")
        makes = set()
        for f in makes_dir.glob("*.json"):
            make_name = f.stem.replace("_", " ").title()
            makes.add(make_name)
        return makes

    def fetch_url(self, url: str) -> dict:
        """Fetch JSON from URL with error handling."""
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return json.loads(response.read().decode())
        except urllib.error.URLError as e:
            print(f"  Error fetching {url}: {e}")
            return {"Results": []}

    def get_all_makes(self) -> List[Dict]:
        """Fetch all makes for passenger cars and trucks."""
        makes = []
        for vehicle_type in ["car", "truck"]:
            url = f"{self.BASE_URL}/GetMakesForVehicleType/{vehicle_type}?format=json"
            data = self.fetch_url(url)
            makes.extend(data.get("Results", []))
        return makes

    def get_models_for_make_year(self, make: str, year: int) -> List[str]:
        """Fetch models for a specific make and year."""
        cache_file = self.CACHE_DIR / f"{make.lower().replace(' ', '_')}_{year}.json"
        # Check cache first
        if cache_file.exists():
            with open(cache_file) as f:
                return json.load(f)

        url = f"{self.BASE_URL}/GetModelsForMakeYear/make/{make}/modelyear/{year}?format=json"
        time.sleep(self.request_delay)
        data = self.fetch_url(url)
        models = list(set(
            r.get("Model_Name", "")
            for r in data.get("Results", [])
            if r.get("Model_Name")
        ))

        # Cache result
        self.CACHE_DIR.mkdir(exist_ok=True)
        with open(cache_file, "w") as f:
            json.dump(models, f)
        return models

    def fetch_all_data(self) -> Dict:
        """Fetch all Year/Make/Model data."""
        print("Fetching NHTSA data...")

        # Filter to target makes
        all_makes = self.get_all_makes()
        target_make_names = [
            m["MakeName"] for m in all_makes
            if m["MakeName"].title() in self.target_makes
            or m["MakeName"].upper() in ["BMW", "GMC", "RAM"]
        ]
        print(f"Found {len(target_make_names)} matching makes")

        result = {}
        for year in range(self.min_year, self.max_year + 1):
            result[str(year)] = {}
            for make in target_make_names:
                models = self.get_models_for_make_year(make, year)
                if models:
                    # Normalize make name
                    make_normalized = make.title()
                    if make.upper() in ["BMW", "GMC", "RAM"]:
                        make_normalized = make.upper()
                    result[str(year)][make_normalized] = sorted(models)
            print(f"  Year {year}: {sum(len(v) for v in result[str(year)].values())} models")

        # Save output
        with open(self.OUTPUT_FILE, "w") as f:
            json.dump(result, f, indent=2)
        print(f"Saved to {self.OUTPUT_FILE}")
        return result


if __name__ == "__main__":
    NHTSAFetcher().fetch_all_data()
```

### Phase 2: Refactor ETL Script

**Modify file:** `data/make-model-import/etl_generate_sql.py`

Key changes:

#### 2.1 Load NHTSA data as primary source

```python
def load_nhtsa_data(self):
    """Load NHTSA Year/Make/Model data."""
    nhtsa_file = Path("nhtsa_vehicles.json")
    if not nhtsa_file.exists():
        raise FileNotFoundError("Run nhtsa_fetch.py first to generate nhtsa_vehicles.json")
    with open(nhtsa_file) as f:
        self.nhtsa_data = json.load(f)
    print(f"  Loaded NHTSA data for {len(self.nhtsa_data)} years")
```

#### 2.2 Build trim evidence from automobiles.json

```python
def build_trim_evidence(self):
    """
    Parse automobiles.json to build year-range evidence for trims.
    """
    self.trim_evidence: Dict[Tuple[str, str], List[Dict]] = defaultdict(list)
    brand_lookup = {
        b.get("id"): self.get_canonical_make_name(b.get("name", ""))
        for b in self.brands_data
    }

    for auto in self.automobiles_data:
        brand_id = auto.get("brand_id")
        make = brand_lookup.get(brand_id)
        if not make:
            continue

        name = auto.get("name", "")
        year_range = self.parse_year_range_from_name(name)
        if not year_range:
            continue
        year_start, year_end = year_range

        # Extract model and trim from name
        model, trim = self.extract_model_trim_from_name(name, make)
        if not model:
            continue

        self.trim_evidence[(make, model)].append({
            "trim": trim or "Base",
            "year_start": year_start,
            "year_end": year_end,
            "source_name": name
        })

    print(f"  Built trim evidence for {len(self.trim_evidence)} make/model combinations")


def extract_model_trim_from_name(self, name: str, make: str) -> Tuple[Optional[str], Optional[str]]:
    """
    Extract model and trim from automobile name.

    Examples:
        "CHEVROLET Corvette Z06 2021-Present" -> ("Corvette", "Z06")
        "2020 Chevrolet Corvette C8 Stingray" -> ("Corvette", "Stingray")
        "FORD F-150 Raptor 2021-Present" -> ("F-150", "Raptor")
    """
    # Remove the make name wherever it appears (it is not always a prefix,
    # e.g. "2020 Chevrolet Corvette C8 Stingray" starts with the year)
    clean = re.sub(rf"\b{re.escape(make)}\b\s*", "", name, flags=re.IGNORECASE)
    # Remove year/range patterns
    clean = re.sub(r"\d{4}(-\d{4}|-Present)?", "", clean)
    # Remove common suffixes
    clean = re.sub(r"\s*(Photos|engines|full specs|&).*$", "", clean, flags=re.IGNORECASE)
    # Clean up extra spaces
    clean = " ".join(clean.split())

    # Try to match against known models, longest name first so the most
    # specific model wins
    known_models = self.known_models_by_make.get(make, set())
    for model in sorted(known_models, key=len, reverse=True):
        pattern = re.compile(rf"^{re.escape(model)}\b\s*(.*)", re.IGNORECASE)
        match = pattern.match(clean)
        if match:
            trim = match.group(1).strip()
            # Remove generation codes like C5, C6, C7, C8
            trim = re.sub(r"^C\d+\s*", "", trim)
            return (model, trim if trim else None)
    return (None, None)
```

#### 2.3 New trim resolution logic

```python
def get_trims_for_vehicle(self, year:
                          int, make: str, model: str) -> List[str]:
    """
    Get valid trims for a year/make/model combination.
    Uses evidence from automobiles.json, falls back to "Base".
    """
    evidence = self.trim_evidence.get((make, model), [])
    valid_trims = set()
    for entry in evidence:
        if entry['year_start'] <= year <= entry['year_end']:
            valid_trims.add(entry['trim'])
    # Always include "Base" as an option
    valid_trims.add("Base")
    return sorted(valid_trims)
```

#### 2.4 Updated vehicle record building

```python
def build_vehicle_records(self):
    """Build vehicle records using NHTSA for Y/M/M, evidence for trims."""
    print("\n  Building vehicle option records...")
    records = []
    for year_str, makes in self.nhtsa_data.items():
        year = int(year_str)
        if year < self.min_year or year > self.max_year:
            continue
        for make, models in makes.items():
            for model in models:
                # Get valid trims from evidence
                trims = self.get_trims_for_vehicle(year, make, model)
                # Get engines from makes-filter (or default)
                engines = self.get_engines_for_vehicle(year, make, model)
                # Default transmissions
                transmissions = ["Manual", "Automatic"]
                for trim in trims:
                    for engine in engines:
                        for trans in transmissions:
                            records.append({
                                "year": year,
                                "make": make,
                                "model": model,
                                "trim": trim,
                                "engine_name": engine,
                                "trans_name": trans
                            })

    # Deduplicate
    unique_set = set()
    deduped = []
    for r in records:
        key = (r["year"], r["make"].lower(), r["model"].lower(),
               r["trim"].lower(), r["engine_name"].lower(), r["trans_name"].lower())
        if key not in unique_set:
            unique_set.add(key)
            deduped.append(r)
    self.vehicle_records = deduped
    print(f"  Vehicle records: {len(self.vehicle_records):,}")


def get_engines_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """Get engines from makes-filter or use defaults."""
    # Try to find in makes-filter data
    for baseline in self.baseline_records:
        if (baseline['year'] == year
                and baseline['make'].lower() == make.lower()
                and baseline['model'].lower() == model.lower()):
            engines = []
            for trim_data in \
                    baseline.get('trims', []):
                engines.extend(trim_data.get('engines', []))
            if engines:
                return list(set(engines))

    # Default based on make/model patterns; match "ev" as a whole word so
    # models such as "Chevette" are not misclassified as electric
    model_lower = model.lower()
    if ('electric' in model_lower
            or 'ev' in model_lower.split()
            or 'lightning' in model_lower):
        return ["Electric"]
    return ["Gas"]
```

### Phase 3: Update Import Script

**Modify file:** `data/make-model-import/import_data.sh`

Add NHTSA cache check:

```bash
#!/bin/bash
# Import generated SQL files into PostgreSQL database
set -e

echo "=========================================="
echo " Automotive Database Import"
echo "=========================================="

# Check NHTSA cache freshness
NHTSA_FILE="nhtsa_vehicles.json"
CACHE_AGE_DAYS=30

if [ ! -f "$NHTSA_FILE" ]; then
    echo "NHTSA data not found. Fetching..."
    python3 nhtsa_fetch.py
elif [ $(find "$NHTSA_FILE" -mtime +$CACHE_AGE_DAYS 2>/dev/null | wc -l) -gt 0 ]; then
    echo "NHTSA cache is stale (>$CACHE_AGE_DAYS days). Refreshing..."
    python3 nhtsa_fetch.py
else
    echo "Using cached NHTSA data"
fi

# Continue with existing import logic...
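# As a hypothetical sketch only -- the sql_output/ directory and DB_NAME
# variable below are assumptions, not names taken from this plan -- the
# elided logic might apply each generated SQL file in order:
#
#   for sql_file in sql_output/*.sql; do
#       echo "Importing $sql_file"
#       psql -d "$DB_NAME" -f "$sql_file"
#   done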
```

### Phase 4: Update QA Validation

**Modify file:** `data/make-model-import/qa_validate.py`

Add invalid combination checks:

```python
def check_invalid_combinations():
    """Verify known invalid combinations do not exist."""
    invalid_combos = [
        # (year, make, model, trim) - known to be invalid
        (1992, 'Chevrolet', 'Corvette', 'Z06'),                       # Z06 started 2001
        (2000, 'Chevrolet', 'Corvette', '35th Anniversary Edition'),  # Was 1988
        (2000, 'Chevrolet', 'Corvette', 'Stingray'),                  # Stingray started 2014
        (1995, 'Ford', 'Mustang', 'Mach-E'),                          # Mach-E is 2021+
    ]
    issues = []
    for year, make, model, trim in invalid_combos:
        # String interpolation is safe here only because the values come
        # from the hardcoded list above, never from user input
        query = f"""
            SELECT COUNT(*) FROM vehicle_options
            WHERE year = {year} AND make = '{make}'
              AND model = '{model}' AND trim = '{trim}'
        """
        count = int(run_psql(query).strip())
        if count > 0:
            issues.append(f"Invalid combo found: {year} {make} {model} {trim}")
    return issues


def check_trim_coverage():
    """Report on trim coverage statistics."""
    query = """
        SELECT
            COUNT(DISTINCT (year, make, model)) AS total_models,
            COUNT(DISTINCT (year, make, model))
                FILTER (WHERE trim = 'Base') AS has_base_trim,
            COUNT(DISTINCT (year, make, model))
                FILTER (WHERE trim != 'Base') AS has_specific_trims
        FROM vehicle_options
    """
    result = run_psql(query).strip()
    print(f"Trim coverage: {result}")
```

---

## Files Summary

| File | Action | Purpose |
|------|--------|---------|
| `data/make-model-import/nhtsa_fetch.py` | CREATE | Fetch Year/Make/Model from NHTSA API |
| `data/make-model-import/etl_generate_sql.py` | MODIFY | Use NHTSA data, evidence-based trims |
| `data/make-model-import/import_data.sh` | MODIFY | Add NHTSA cache refresh |
| `data/make-model-import/qa_validate.py` | MODIFY | Add invalid combo checks |

---

## Execution Order

```bash
# 1. Navigate to ETL directory
cd data/make-model-import

# 2. Fetch NHTSA data (creates nhtsa_vehicles.json)
python3 nhtsa_fetch.py

# 3. Generate SQL files
python3 etl_generate_sql.py

# 4. Import to database
./import_data.sh

# 5.
# Validate results
python3 qa_validate.py
```

---

## Expected Results

### Before

- 2000 Corvette: 400 records, 20 trims (most invalid)
- Total records: ~1,675,335
- Many impossible year/trim combinations

### After

- 2000 Corvette: ~8 records (Base, Coupe, Convertible)
- 2015 Corvette: ~20 records (Stingray, Z06, Grand Sport, Base)
- Total records: ~400,000-600,000
- No invalid year/trim combinations

### Validation Checks

1. No 1992 Corvette Z06
2. No 2000 Corvette Stingray
3. No 1995 Mustang Mach-E
4. Year range: 1990-2026

---

## Data Source Coverage

**automobiles.json trim coverage (samples):**

| Model | Entries | Trims Found |
|-------|---------|-------------|
| Civic | 67 | Si, Type R, eHEV, Sedan, Hatchback |
| Mustang | 38 | GT, Dark Horse, Mach-E GT, GTD |
| Accord | 35 | Sedan, Coupe (various years) |
| Corvette | 31 | Z06, ZR1, Stingray, Grand Sport |
| Camaro | 29 | ZL1, Convertible, Coupe |
| F-150 | 19 | Lightning, Raptor, Tremor |

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MIN_YEAR` | 1990 | Minimum year to include |
| `MAX_YEAR` | 2026 | Maximum year to include |

---

## Troubleshooting

### NHTSA API Rate Limiting

The script includes 100ms delays between requests. If you encounter rate limiting:

- Increase `request_delay` in `nhtsa_fetch.py`
- Use cached data in the `nhtsa_cache/` directory

### Missing Models

If NHTSA returns fewer models than expected:

- Check if the make name matches exactly
- Some brands (BMW, GMC) need uppercase handling
- Verify the year range is supported (NHTSA has data back to ~1995)

### Cache Refresh

To force a refresh of NHTSA data:

```bash
rm nhtsa_vehicles.json
rm -rf nhtsa_cache/
python3 nhtsa_fetch.py
```
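---

## Appendix: Year-Range Parsing Sketch

Phase 2.2 calls a `parse_year_range_from_name` helper that is not defined in this plan. A minimal standalone sketch, assuming source names embed either a `YYYY-Present` suffix (capped here at the plan's default `MAX_YEAR` of 2026), a `YYYY-YYYY` range, or a bare year; the actual helper in `etl_generate_sql.py` may differ:

```python
import re
from typing import Optional, Tuple

MAX_YEAR = 2026  # mirrors the plan's default MAX_YEAR


def parse_year_range_from_name(name: str) -> Optional[Tuple[int, int]]:
    """Extract (year_start, year_end) from an automobiles.json name."""
    # "2021-Present" -> open-ended range capped at MAX_YEAR
    m = re.search(r"(\d{4})-Present", name, re.IGNORECASE)
    if m:
        return (int(m.group(1)), MAX_YEAR)
    # "2010-2014" -> explicit range
    m = re.search(r"(\d{4})-(\d{4})", name)
    if m:
        return (int(m.group(1)), int(m.group(2)))
    # Bare year, e.g. "2020 Chevrolet Corvette C8 Stingray"
    m = re.search(r"\b(19|20)\d{2}\b", name)
    if m:
        year = int(m.group(0))
        return (year, year)
    # No year evidence in the name
    return None
```

Names with no recognizable year return `None`, which `build_trim_evidence` above already skips with its `if not year_range: continue` guard.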