Before updates to NHTSA

This commit is contained in:
Eric Gullickson
2025-12-14 14:53:45 -06:00
parent 61e87bb9ad
commit 1fc69b7779
12 changed files with 1680503 additions and 1156458 deletions

ETL-FIX-V2.md (new file, 518 lines)
@@ -0,0 +1,518 @@
# ETL Fix V2: Year-Accurate Vehicle Dropdown Data
## Executive Summary
This document provides a complete implementation plan for fixing the vehicle dropdown ETL to produce year-accurate data. The fix addresses impossible year/trim combinations (e.g., "1992 Corvette Z06") by using the NHTSA VPIC API for authoritative Year/Make/Model validation and automobiles.json for evidence-based trim data with year ranges.
---
## Problem Statement
### Current Issues
1. **Year-inaccurate trims**: The `makes-filter/*.json` files contain ALL trims ever made for a model, applied to EVERY year
2. **Impossible combinations**: Users can select "1992 Corvette Z06" (Z06 didn't exist until 2001)
3. **Data bloat**: 400 records for the 2000 Corvette across 20 trims, when only ~3-4 trims actually existed that year
### Root Cause
The `makes-filter/*.json` data structure does NOT have year-specific trims. Example from `chevrolet.json`:
```json
{
  "year": "2025",
  "models": [{
    "name": "corvette",
    "submodels": ["LT", "35th Anniversary Edition", "427", "Z06", "ZR1", ...]
  }]
}
```
The same `submodels` array is repeated for every year, making ALL trims appear valid for ALL years.
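The shape of the problem can be reproduced in a few lines. The sample records below are illustrative, not taken from the real file, but mirror the structure shown above:

```python
# Minimal reproduction of the problem: the same submodels list is attached
# to every year entry, so year information never constrains the trims.
data = [
    {"year": "1992", "models": [{"name": "corvette", "submodels": ["Z06", "ZR1"]}]},
    {"year": "2024", "models": [{"name": "corvette", "submodels": ["Z06", "ZR1"]}]},
]
trims_by_year = {
    entry["year"]: set(entry["models"][0]["submodels"]) for entry in data
}
# 1992 "offers" the exact same trim set as 2024 -- including Z06
print(trims_by_year["1992"] == trims_by_year["2024"])  # True
```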
---
## Solution Architecture
### Data Sources (Priority Order)
1. **NHTSA VPIC API** - Authoritative Year/Make/Model validation
2. **automobiles.json** - Primary trim source with year-range evidence
3. **makes-filter/*.json** - Engine data enrichment
4. **Defaults** - "Base" trim, "Gas" engine, "Manual"/"Automatic" transmission
### Year Range
- Minimum: 1990
- Maximum: 2026
---
## Implementation Steps
### Phase 1: Create NHTSA Data Fetcher
**Create file:** `data/make-model-import/nhtsa_fetch.py`
```python
#!/usr/bin/env python3
"""
NHTSA VPIC API Data Fetcher

Fetches authoritative Year/Make/Model data from the US government database.
"""
import json
import os
import time
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Dict, List, Set


class NHTSAFetcher:
    BASE_URL = "https://vpic.nhtsa.dot.gov/api/vehicles"
    CACHE_DIR = Path("nhtsa_cache")
    OUTPUT_FILE = Path("nhtsa_vehicles.json")

    def __init__(self):
        self.min_year = int(os.getenv("MIN_YEAR", "1990"))
        self.max_year = int(os.getenv("MAX_YEAR", "2026"))
        self.request_delay = 0.1  # 100ms between requests
        # Makes we care about (from makes-filter)
        self.target_makes = self._load_target_makes()

    def _load_target_makes(self) -> Set[str]:
        """Load makes from the makes-filter directory."""
        makes_dir = Path("makes-filter")
        makes = set()
        for f in makes_dir.glob("*.json"):
            make_name = f.stem.replace("_", " ").title()
            makes.add(make_name)
        return makes

    def fetch_url(self, url: str) -> dict:
        """Fetch JSON from a URL with error handling."""
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return json.loads(response.read().decode())
        except urllib.error.URLError as e:
            print(f"  Error fetching {url}: {e}")
            return {"Results": []}

    def get_all_makes(self) -> List[Dict]:
        """Fetch all makes for passenger cars and trucks."""
        makes = []
        for vehicle_type in ["car", "truck"]:
            url = f"{self.BASE_URL}/GetMakesForVehicleType/{vehicle_type}?format=json"
            data = self.fetch_url(url)
            makes.extend(data.get("Results", []))
        return makes

    def get_models_for_make_year(self, make: str, year: int) -> List[str]:
        """Fetch models for a specific make and year."""
        cache_file = self.CACHE_DIR / f"{make.lower().replace(' ', '_')}_{year}.json"
        # Check cache first
        if cache_file.exists():
            with open(cache_file) as f:
                return json.load(f)
        # URL-encode the make so names with spaces (e.g. "Land Rover") work
        make_enc = urllib.parse.quote(make)
        url = f"{self.BASE_URL}/GetModelsForMakeYear/make/{make_enc}/modelyear/{year}?format=json"
        time.sleep(self.request_delay)
        data = self.fetch_url(url)
        models = list(set(r.get("Model_Name", "") for r in data.get("Results", []) if r.get("Model_Name")))
        # Cache result
        self.CACHE_DIR.mkdir(exist_ok=True)
        with open(cache_file, "w") as f:
            json.dump(models, f)
        return models

    def fetch_all_data(self) -> Dict:
        """Fetch all Year/Make/Model data."""
        print("Fetching NHTSA data...")
        # Filter to target makes
        all_makes = self.get_all_makes()
        target_make_names = [m["MakeName"] for m in all_makes
                             if m["MakeName"].title() in self.target_makes
                             or m["MakeName"].upper() in ["BMW", "GMC", "RAM"]]
        print(f"Found {len(target_make_names)} matching makes")
        result = {}
        for year in range(self.min_year, self.max_year + 1):
            result[str(year)] = {}
            for make in target_make_names:
                models = self.get_models_for_make_year(make, year)
                if models:
                    # Normalize make name (acronym brands stay uppercase)
                    make_normalized = make.title()
                    if make.upper() in ["BMW", "GMC", "RAM"]:
                        make_normalized = make.upper()
                    result[str(year)][make_normalized] = sorted(models)
            print(f"  Year {year}: {sum(len(v) for v in result[str(year)].values())} models")
        # Save output
        with open(self.OUTPUT_FILE, "w") as f:
            json.dump(result, f, indent=2)
        print(f"Saved to {self.OUTPUT_FILE}")
        return result


if __name__ == "__main__":
    NHTSAFetcher().fetch_all_data()
```
### Phase 2: Refactor ETL Script
**Modify file:** `data/make-model-import/etl_generate_sql.py`
Key changes:
#### 2.1 Load NHTSA data as primary source
```python
def load_nhtsa_data(self):
    """Load NHTSA Year/Make/Model data."""
    nhtsa_file = Path("nhtsa_vehicles.json")
    if not nhtsa_file.exists():
        raise FileNotFoundError("Run nhtsa_fetch.py first to generate nhtsa_vehicles.json")
    with open(nhtsa_file) as f:
        self.nhtsa_data = json.load(f)
    print(f"  Loaded NHTSA data for {len(self.nhtsa_data)} years")
```
#### 2.2 Build trim evidence from automobiles.json
```python
def build_trim_evidence(self):
    """
    Parse automobiles.json to build year-range evidence for trims.
    """
    self.trim_evidence: Dict[Tuple[str, str], List[Dict]] = defaultdict(list)
    brand_lookup = {b.get("id"): self.get_canonical_make_name(b.get("name", ""))
                    for b in self.brands_data}
    for auto in self.automobiles_data:
        brand_id = auto.get("brand_id")
        make = brand_lookup.get(brand_id)
        if not make:
            continue
        name = auto.get("name", "")
        year_range = self.parse_year_range_from_name(name)
        if not year_range:
            continue
        year_start, year_end = year_range
        # Extract model and trim from name
        model, trim = self.extract_model_trim_from_name(name, make)
        if not model:
            continue
        self.trim_evidence[(make, model)].append({
            "trim": trim or "Base",
            "year_start": year_start,
            "year_end": year_end,
            "source_name": name
        })
    print(f"  Built trim evidence for {len(self.trim_evidence)} make/model combinations")


def extract_model_trim_from_name(self, name: str, make: str) -> Tuple[Optional[str], Optional[str]]:
    """
    Extract model and trim from an automobile name.

    Examples:
        "CHEVROLET Corvette Z06 2021-Present" -> ("Corvette", "Z06")
        "2020 Chevrolet Corvette C8 Stingray" -> ("Corvette", "Stingray")
        "FORD F-150 Raptor 2021-Present" -> ("F-150", "Raptor")
    """
    # Remove make prefix
    clean = re.sub(rf"^{re.escape(make)}\s+", "", name, flags=re.IGNORECASE)
    # Remove year/range patterns
    clean = re.sub(r"\d{4}(-\d{4}|-Present)?", "", clean)
    # Remove common suffixes
    clean = re.sub(r"\s*(Photos|engines|full specs|&).*$", "", clean, flags=re.IGNORECASE)
    # Clean up extra spaces
    clean = " ".join(clean.split())
    # Try to match against known models (longest names first)
    known_models = self.known_models_by_make.get(make, set())
    for model in sorted(known_models, key=len, reverse=True):
        pattern = re.compile(rf"^{re.escape(model)}\b\s*(.*)", re.IGNORECASE)
        match = pattern.match(clean)
        if match:
            trim = match.group(1).strip()
            # Remove generation codes like C5, C6, C7, C8
            trim = re.sub(r"^C\d+\s*", "", trim)
            return (model, trim if trim else None)
    return (None, None)
```
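The name-cleaning steps above can be exercised in isolation. The helper below is a standalone sketch of just the stripping pipeline (make prefix, year ranges, marketing suffixes), without the known-model matching that follows it:

```python
import re

def clean_name(name: str, make: str) -> str:
    """Standalone version of the stripping steps in extract_model_trim_from_name."""
    # Remove make prefix
    clean = re.sub(rf"^{re.escape(make)}\s+", "", name, flags=re.IGNORECASE)
    # Remove year/range patterns like "2021-Present" or "2014-2019"
    clean = re.sub(r"\d{4}(-\d{4}|-Present)?", "", clean)
    # Remove common marketing suffixes
    clean = re.sub(r"\s*(Photos|engines|full specs|&).*$", "", clean, flags=re.IGNORECASE)
    # Collapse extra whitespace
    return " ".join(clean.split())

print(clean_name("CHEVROLET Corvette Z06 2021-Present", "Chevrolet"))  # Corvette Z06
print(clean_name("FORD F-150 Raptor 2021-Present", "Ford"))            # F-150 Raptor
```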
#### 2.3 New trim resolution logic
```python
def get_trims_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """
    Get valid trims for a year/make/model combination.

    Uses evidence from automobiles.json, falls back to "Base".
    """
    evidence = self.trim_evidence.get((make, model), [])
    valid_trims = set()
    for entry in evidence:
        if entry['year_start'] <= year <= entry['year_end']:
            valid_trims.add(entry['trim'])
    # Always include "Base" as an option
    valid_trims.add("Base")
    return sorted(valid_trims)
```
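The year-window filtering is the heart of the fix, and it is easy to sanity-check in isolation. The evidence rows below are illustrative, not real data:

```python
# Sample evidence rows in the same shape as self.trim_evidence entries
evidence = [
    {"trim": "Z06", "year_start": 2001, "year_end": 2004},
    {"trim": "Stingray", "year_start": 2014, "year_end": 2026},
]

def trims_for_year(year):
    """Same filtering as get_trims_for_vehicle, minus the class plumbing."""
    valid = {e["trim"] for e in evidence if e["year_start"] <= year <= e["year_end"]}
    valid.add("Base")  # always offer a fallback trim
    return sorted(valid)

print(trims_for_year(1992))  # ['Base']            -- no Z06 in 1992
print(trims_for_year(2003))  # ['Base', 'Z06']
print(trims_for_year(2015))  # ['Base', 'Stingray']
```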
#### 2.4 Updated vehicle record building
```python
def build_vehicle_records(self):
    """Build vehicle records using NHTSA for Y/M/M, evidence for trims."""
    print("\n  Building vehicle option records...")
    records = []
    for year_str, makes in self.nhtsa_data.items():
        year = int(year_str)
        if year < self.min_year or year > self.max_year:
            continue
        for make, models in makes.items():
            for model in models:
                # Get valid trims from evidence
                trims = self.get_trims_for_vehicle(year, make, model)
                # Get engines from makes-filter (or default)
                engines = self.get_engines_for_vehicle(year, make, model)
                # Default transmissions
                transmissions = ["Manual", "Automatic"]
                for trim in trims:
                    for engine in engines:
                        for trans in transmissions:
                            records.append({
                                "year": year,
                                "make": make,
                                "model": model,
                                "trim": trim,
                                "engine_name": engine,
                                "trans_name": trans
                            })
    # Deduplicate
    unique_set = set()
    deduped = []
    for r in records:
        key = (r["year"], r["make"].lower(), r["model"].lower(),
               r["trim"].lower(), r["engine_name"].lower(), r["trans_name"].lower())
        if key not in unique_set:
            unique_set.add(key)
            deduped.append(r)
    self.vehicle_records = deduped
    print(f"  Vehicle records: {len(self.vehicle_records):,}")


def get_engines_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """Get engines from makes-filter or use defaults."""
    # Try to find in makes-filter data
    for baseline in self.baseline_records:
        if (baseline['year'] == year and
                baseline['make'].lower() == make.lower() and
                baseline['model'].lower() == model.lower()):
            engines = []
            for trim_data in baseline.get('trims', []):
                engines.extend(trim_data.get('engines', []))
            if engines:
                return list(set(engines))
    # Default based on model name patterns; match "ev" as a whole word
    # so models like "Chevelle" are not misclassified as electric
    model_lower = model.lower()
    if 'electric' in model_lower or 'lightning' in model_lower or re.search(r'\bev\b', model_lower):
        return ["Electric"]
    return ["Gas"]
```
### Phase 3: Update Import Script
**Modify file:** `data/make-model-import/import_data.sh`
Add NHTSA cache check:
```bash
#!/bin/bash
# Import generated SQL files into PostgreSQL database
set -e

echo "=========================================="
echo "  Automotive Database Import"
echo "=========================================="

# Check NHTSA cache freshness
NHTSA_FILE="nhtsa_vehicles.json"
CACHE_AGE_DAYS=30

if [ ! -f "$NHTSA_FILE" ]; then
    echo "NHTSA data not found. Fetching..."
    python3 nhtsa_fetch.py
elif [ $(find "$NHTSA_FILE" -mtime +$CACHE_AGE_DAYS 2>/dev/null | wc -l) -gt 0 ]; then
    echo "NHTSA cache is stale (>$CACHE_AGE_DAYS days). Refreshing..."
    python3 nhtsa_fetch.py
else
    echo "Using cached NHTSA data"
fi

# Continue with existing import logic...
```
### Phase 4: Update QA Validation
**Modify file:** `data/make-model-import/qa_validate.py`
Add invalid combination checks:
```python
def check_invalid_combinations():
    """Verify known invalid combinations do not exist."""
    invalid_combos = [
        # (year, make, model, trim) - known to be invalid
        (1992, 'Chevrolet', 'Corvette', 'Z06'),  # Z06 started 2001
        (2000, 'Chevrolet', 'Corvette', '35th Anniversary Edition'),  # Was 1988
        (2000, 'Chevrolet', 'Corvette', 'Stingray'),  # Stingray started 2014
        (1995, 'Ford', 'Mustang', 'Mach-E'),  # Mach-E is 2021+
    ]
    issues = []
    for year, make, model, trim in invalid_combos:
        query = f"""
            SELECT COUNT(*) FROM vehicle_options
            WHERE year = {year}
              AND make = '{make}'
              AND model = '{model}'
              AND trim = '{trim}'
        """
        count = int(run_psql(query).strip())
        if count > 0:
            issues.append(f"Invalid combo found: {year} {make} {model} {trim}")
    return issues


def check_trim_coverage():
    """Report on trim coverage statistics."""
    query = """
        SELECT
            COUNT(DISTINCT (year, make, model)) AS total_models,
            COUNT(DISTINCT (year, make, model)) FILTER (WHERE trim = 'Base') AS base_only,
            COUNT(DISTINCT (year, make, model)) FILTER (WHERE trim != 'Base') AS has_specific_trims
        FROM vehicle_options
    """
    result = run_psql(query).strip()
    print(f"Trim coverage: {result}")
```
---
## Files Summary
| File | Action | Purpose |
|------|--------|---------|
| `data/make-model-import/nhtsa_fetch.py` | CREATE | Fetch Year/Make/Model from NHTSA API |
| `data/make-model-import/etl_generate_sql.py` | MODIFY | Use NHTSA data, evidence-based trims |
| `data/make-model-import/import_data.sh` | MODIFY | Add NHTSA cache refresh |
| `data/make-model-import/qa_validate.py` | MODIFY | Add invalid combo checks |
---
## Execution Order
```bash
# 1. Navigate to ETL directory
cd data/make-model-import
# 2. Fetch NHTSA data (creates nhtsa_vehicles.json)
python3 nhtsa_fetch.py
# 3. Generate SQL files
python3 etl_generate_sql.py
# 4. Import to database
./import_data.sh
# 5. Validate results
python3 qa_validate.py
```
---
## Expected Results
### Before
- 2000 Corvette: 400 records, 20 trims (most invalid)
- Total records: ~1,675,335
- Many impossible year/trim combinations
### After
- 2000 Corvette: ~8 records (Base, Coupe, Convertible)
- 2015 Corvette: ~20 records (Stingray, Z06, Grand Sport, Base)
- Total records: ~400,000-600,000
- No invalid year/trim combinations
### Validation Checks
1. No 1992 Corvette Z06
2. No 2000 Corvette Stingray
3. No 1995 Mustang Mach-E
4. Year range: 1990-2026
---
## Data Source Coverage
**automobiles.json trim coverage (samples):**
| Model | Entries | Trims Found |
|-------|---------|-------------|
| Civic | 67 | Si, Type R, eHEV, Sedan, Hatchback |
| Mustang | 38 | GT, Dark Horse, Mach-E GT, GTD |
| Accord | 35 | Sedan, Coupe (various years) |
| Corvette | 31 | Z06, ZR1, Stingray, Grand Sport |
| Camaro | 29 | ZL1, Convertible, Coupe |
| F-150 | 19 | Lightning, Raptor, Tremor |
---
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MIN_YEAR` | 1990 | Minimum year to include |
| `MAX_YEAR` | 2026 | Maximum year to include |
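Both variables are read with `os.getenv` in `nhtsa_fetch.py`, so they can be overridden per invocation. This standalone snippet shows the parsing and the resulting year window (the override values are examples):

```python
import os

# Example overrides; in practice these come from the shell environment,
# e.g. `MIN_YEAR=2020 MAX_YEAR=2022 python3 nhtsa_fetch.py`
os.environ["MIN_YEAR"] = "2020"
os.environ["MAX_YEAR"] = "2022"

min_year = int(os.getenv("MIN_YEAR", "1990"))
max_year = int(os.getenv("MAX_YEAR", "2026"))
print(list(range(min_year, max_year + 1)))  # [2020, 2021, 2022]
```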
---
## Troubleshooting
### NHTSA API Rate Limiting
The script includes 100ms delays between requests. If you encounter rate limiting:
- Increase `request_delay` in `nhtsa_fetch.py`
- Use cached data in `nhtsa_cache/` directory
### Missing Models
If NHTSA returns fewer models than expected:
- Check if the make name matches exactly
- Some brands (BMW, GMC) need uppercase handling
- Verify the year range is supported (NHTSA has data back to ~1995)
### Cache Refresh
To force refresh NHTSA data:
```bash
rm nhtsa_vehicles.json
rm -rf nhtsa_cache/
python3 nhtsa_fetch.py
```