# ETL Fix V2: Year-Accurate Vehicle Dropdown Data

## Executive Summary

This document provides a complete implementation plan for fixing the vehicle dropdown ETL to produce year-accurate data. The fix addresses impossible year/trim combinations (e.g., "1992 Corvette Z06") by using the NHTSA VPIC API for authoritative Year/Make/Model validation and automobiles.json for evidence-based trim data with year ranges.

---

## Problem Statement

### Current Issues

1. **Year-inaccurate trims**: The `makes-filter/*.json` files contain ALL trims ever made for a model, applied to EVERY year
2. **Impossible combinations**: Users can select "1992 Corvette Z06" (Z06 didn't exist until 2001)
3. **Data bloat**: 400 records for 2000 Corvette with 20 trims instead of ~3-4

### Root Cause

The `makes-filter/*.json` data structure does NOT have year-specific trims. Example from `chevrolet.json`:

```json
{
  "year": "2025",
  "models": [{
    "name": "corvette",
    "submodels": ["LT", "35th Anniversary Edition", "427", "Z06", "ZR1", ...]
  }]
}
```

The same `submodels` array is repeated for every year, making ALL trims appear valid for ALL years.
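The scale of the problem is plain cross-product arithmetic: every trim ever made is crossed with every engine and transmission, regardless of year. A toy illustration with hypothetical option counts (20 trims × 10 engines × 2 transmissions reproduces the 400-record figure cited above):

```python
from itertools import product

# Hypothetical option counts for one year of one model, mirroring the
# "2000 Corvette" example: 20 trims x 10 engines x 2 transmissions.
trims = [f"trim_{i}" for i in range(20)]
engines = [f"engine_{i}" for i in range(10)]
transmissions = ["Manual", "Automatic"]

# One dropdown record per combination, valid for the year or not
records = list(product(trims, engines, transmissions))
print(len(records))  # 400
```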

---

## Solution Architecture

### Data Sources (Priority Order)

1. **NHTSA VPIC API** - Authoritative Year/Make/Model validation
2. **automobiles.json** - Primary trim source with year-range evidence
3. **makes-filter/*.json** - Engine data enrichment
4. **Defaults** - "Base" trim, "Gas" engine, "Manual"/"Automatic" transmission

### Year Range

- Minimum: 1990
- Maximum: 2026

---

## Implementation Steps

### Phase 1: Create NHTSA Data Fetcher

**Create file:** `data/make-model-import/nhtsa_fetch.py`

```python
#!/usr/bin/env python3
"""
NHTSA VPIC API Data Fetcher
Fetches authoritative Year/Make/Model data from the US government database.
"""

import json
import os
import time
from pathlib import Path
from typing import Dict, List, Set
import urllib.request
import urllib.error


class NHTSAFetcher:
    BASE_URL = "https://vpic.nhtsa.dot.gov/api/vehicles"
    CACHE_DIR = Path("nhtsa_cache")
    OUTPUT_FILE = Path("nhtsa_vehicles.json")

    def __init__(self):
        self.min_year = int(os.getenv("MIN_YEAR", "1990"))
        self.max_year = int(os.getenv("MAX_YEAR", "2026"))
        self.request_delay = 0.1  # 100ms between requests

        # Makes we care about (from makes-filter)
        self.target_makes = self._load_target_makes()

    def _load_target_makes(self) -> Set[str]:
        """Load makes from makes-filter directory."""
        makes_dir = Path("makes-filter")
        makes = set()
        for f in makes_dir.glob("*.json"):
            make_name = f.stem.replace("_", " ").title()
            makes.add(make_name)
        return makes

    def fetch_url(self, url: str) -> dict:
        """Fetch JSON from URL with error handling."""
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return json.loads(response.read().decode())
        except urllib.error.URLError as e:
            print(f"  Error fetching {url}: {e}")
            return {"Results": []}

    def get_all_makes(self) -> List[Dict]:
        """Fetch all makes for passenger cars and trucks."""
        makes = []
        for vehicle_type in ["car", "truck"]:
            url = f"{self.BASE_URL}/GetMakesForVehicleType/{vehicle_type}?format=json"
            data = self.fetch_url(url)
            makes.extend(data.get("Results", []))
        return makes

    def get_models_for_make_year(self, make: str, year: int) -> List[str]:
        """Fetch models for a specific make and year."""
        cache_file = self.CACHE_DIR / f"{make.lower().replace(' ', '_')}_{year}.json"

        # Check cache first
        if cache_file.exists():
            with open(cache_file) as f:
                return json.load(f)

        url = f"{self.BASE_URL}/GetModelsForMakeYear/make/{make}/modelyear/{year}?format=json"
        time.sleep(self.request_delay)
        data = self.fetch_url(url)

        models = list(set(r.get("Model_Name", "")
                          for r in data.get("Results", [])
                          if r.get("Model_Name")))

        # Cache result
        self.CACHE_DIR.mkdir(exist_ok=True)
        with open(cache_file, "w") as f:
            json.dump(models, f)

        return models

    def fetch_all_data(self) -> Dict:
        """Fetch all Year/Make/Model data."""
        print("Fetching NHTSA data...")

        # Filter to target makes (dedupe: the car and truck lists overlap)
        all_makes = self.get_all_makes()
        target_make_names = sorted({m["MakeName"] for m in all_makes
                                    if m["MakeName"].title() in self.target_makes
                                    or m["MakeName"].upper() in ["BMW", "GMC", "RAM"]})

        print(f"Found {len(target_make_names)} matching makes")

        result = {}
        for year in range(self.min_year, self.max_year + 1):
            result[str(year)] = {}
            for make in target_make_names:
                models = self.get_models_for_make_year(make, year)
                if models:
                    # Normalize make name
                    make_normalized = make.title()
                    if make.upper() in ["BMW", "GMC", "RAM"]:
                        make_normalized = make.upper()
                    result[str(year)][make_normalized] = sorted(models)
            print(f"  Year {year}: {sum(len(v) for v in result[str(year)].values())} models")

        # Save output
        with open(self.OUTPUT_FILE, "w") as f:
            json.dump(result, f, indent=2)

        print(f"Saved to {self.OUTPUT_FILE}")
        return result


if __name__ == "__main__":
    NHTSAFetcher().fetch_all_data()
```
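The resulting `nhtsa_vehicles.json` is keyed year → make → sorted model list. A minimal sketch of consuming it (the sample data below is illustrative, not real API output):

```python
# Sample matching the shape fetch_all_data() writes
sample = {
    "1992": {"Chevrolet": ["Corvette"]},
    "2001": {"BMW": ["3-Series"], "Chevrolet": ["Camaro", "Corvette"]},
}

def models_for(data: dict, year: int, make: str) -> list:
    """Look up the models recorded for a given year/make (empty if absent)."""
    return data.get(str(year), {}).get(make, [])

print(models_for(sample, 2001, "Chevrolet"))  # ['Camaro', 'Corvette']
```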

### Phase 2: Refactor ETL Script

**Modify file:** `data/make-model-import/etl_generate_sql.py`

Key changes:

#### 2.1 Load NHTSA data as primary source

```python
def load_nhtsa_data(self):
    """Load NHTSA Year/Make/Model data."""
    nhtsa_file = Path("nhtsa_vehicles.json")
    if not nhtsa_file.exists():
        raise FileNotFoundError("Run nhtsa_fetch.py first to generate nhtsa_vehicles.json")

    with open(nhtsa_file) as f:
        self.nhtsa_data = json.load(f)
    print(f"  Loaded NHTSA data for {len(self.nhtsa_data)} years")
```

#### 2.2 Build trim evidence from automobiles.json

```python
def build_trim_evidence(self):
    """Parse automobiles.json to build year-range evidence for trims."""
    # Requires: re, collections.defaultdict, typing Tuple/Optional in this module
    self.trim_evidence: Dict[Tuple[str, str], List[Dict]] = defaultdict(list)

    brand_lookup = {b.get("id"): self.get_canonical_make_name(b.get("name", ""))
                    for b in self.brands_data}

    for auto in self.automobiles_data:
        brand_id = auto.get("brand_id")
        make = brand_lookup.get(brand_id)
        if not make:
            continue

        name = auto.get("name", "")
        year_range = self.parse_year_range_from_name(name)
        if not year_range:
            continue

        year_start, year_end = year_range

        # Extract model and trim from name
        model, trim = self.extract_model_trim_from_name(name, make)
        if not model:
            continue

        self.trim_evidence[(make, model)].append({
            "trim": trim or "Base",
            "year_start": year_start,
            "year_end": year_end,
            "source_name": name
        })

    print(f"  Built trim evidence for {len(self.trim_evidence)} make/model combinations")

def extract_model_trim_from_name(self, name: str, make: str) -> Tuple[Optional[str], Optional[str]]:
    """
    Extract model and trim from automobile name.
    Examples:
        "CHEVROLET Corvette Z06 2021-Present"  -> ("Corvette", "Z06")
        "2020 Chevrolet Corvette C8 Stingray"  -> ("Corvette", "Stingray")
        "FORD F-150 Raptor 2021-Present"       -> ("F-150", "Raptor")
    """
    # Remove year/range patterns first, so names that lead with a year
    # (e.g. "2020 Chevrolet ...") still expose the make prefix afterwards
    clean = re.sub(r"\d{4}(-\d{4}|-Present)?", "", name)
    clean = " ".join(clean.split())

    # Remove make prefix
    clean = re.sub(rf"^{re.escape(make)}\s+", "", clean, flags=re.IGNORECASE)

    # Remove common suffixes
    clean = re.sub(r"\s*(Photos|engines|full specs|&).*$", "", clean, flags=re.IGNORECASE)

    # Clean up extra spaces
    clean = " ".join(clean.split())

    # Try to match against known models (longest first, so "F-150 Lightning"
    # is tried before "F-150")
    known_models = self.known_models_by_make.get(make, set())

    for model in sorted(known_models, key=len, reverse=True):
        pattern = re.compile(rf"^{re.escape(model)}\b\s*(.*)", re.IGNORECASE)
        match = pattern.match(clean)
        if match:
            trim = match.group(1).strip()
            # Remove generation codes like C5, C6, C7, C8
            trim = re.sub(r"^C\d+\s*", "", trim)
            return (model, trim if trim else None)

    return (None, None)
```
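`parse_year_range_from_name`, called by `build_trim_evidence` above, is not shown in this plan. A minimal sketch of what it needs to handle, assuming 2026 as the ceiling for open-ended "Present" ranges (both the helper and the constant name are illustrative):

```python
import re
from typing import Optional, Tuple

MAX_YEAR = 2026  # assumed ceiling for "Present"

def parse_year_range_from_name(name: str) -> Optional[Tuple[int, int]]:
    """Parse "1997-2004", "2021-Present", or a bare year out of a listing name."""
    m = re.search(r"(\d{4})\s*-\s*(Present|\d{4})", name)
    if m:
        end = MAX_YEAR if m.group(2) == "Present" else int(m.group(2))
        return (int(m.group(1)), end)
    m = re.search(r"\b(19|20)\d{2}\b", name)  # bare year, e.g. "2020 Chevrolet ..."
    if m:
        year = int(m.group(0))
        return (year, year)
    return None  # no year evidence; the caller skips this entry
```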

#### 2.3 New trim resolution logic

```python
def get_trims_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """
    Get valid trims for a year/make/model combination.
    Uses evidence from automobiles.json, falls back to "Base".
    """
    evidence = self.trim_evidence.get((make, model), [])
    valid_trims = set()

    for entry in evidence:
        if entry['year_start'] <= year <= entry['year_end']:
            valid_trims.add(entry['trim'])

    # Always include "Base" as an option
    valid_trims.add("Base")

    return sorted(valid_trims)
```
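Standalone, the year-window filtering above behaves like this (the evidence entries are hypothetical):

```python
# Hypothetical trim evidence keyed (make, model), with inclusive year windows
trim_evidence = {
    ("Chevrolet", "Corvette"): [
        {"trim": "Z06", "year_start": 2001, "year_end": 2026},
        {"trim": "Stingray", "year_start": 2014, "year_end": 2026},
    ]
}

def get_trims(year: int, make: str, model: str) -> list:
    """Trims whose evidence window covers `year`, plus the "Base" fallback."""
    valid = {e["trim"] for e in trim_evidence.get((make, model), [])
             if e["year_start"] <= year <= e["year_end"]}
    valid.add("Base")
    return sorted(valid)

print(get_trims(1992, "Chevrolet", "Corvette"))  # ['Base'] - no 1992 Z06
```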

#### 2.4 Updated vehicle record building

```python
def build_vehicle_records(self):
    """Build vehicle records using NHTSA for Y/M/M, evidence for trims."""
    print("\n  Building vehicle option records...")
    records = []

    for year_str, makes in self.nhtsa_data.items():
        year = int(year_str)
        if year < self.min_year or year > self.max_year:
            continue

        for make, models in makes.items():
            for model in models:
                # Get valid trims from evidence
                trims = self.get_trims_for_vehicle(year, make, model)

                # Get engines from makes-filter (or default)
                engines = self.get_engines_for_vehicle(year, make, model)

                # Default transmissions
                transmissions = ["Manual", "Automatic"]

                for trim in trims:
                    for engine in engines:
                        for trans in transmissions:
                            records.append({
                                "year": year,
                                "make": make,
                                "model": model,
                                "trim": trim,
                                "engine_name": engine,
                                "trans_name": trans
                            })

    # Deduplicate
    unique_set = set()
    deduped = []
    for r in records:
        key = (r["year"], r["make"].lower(), r["model"].lower(),
               r["trim"].lower(), r["engine_name"].lower(), r["trans_name"].lower())
        if key not in unique_set:
            unique_set.add(key)
            deduped.append(r)

    self.vehicle_records = deduped
    print(f"  Vehicle records: {len(self.vehicle_records):,}")

def get_engines_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """Get engines from makes-filter or use defaults."""
    # Try to find in makes-filter data
    for baseline in self.baseline_records:
        if (baseline['year'] == year and
                baseline['make'].lower() == make.lower() and
                baseline['model'].lower() == model.lower()):
            engines = []
            for trim_data in baseline.get('trims', []):
                engines.extend(trim_data.get('engines', []))
            if engines:
                return list(set(engines))

    # Default based on make/model patterns; match "ev" as a whole word so
    # names like "Chevelle" are not misclassified as electric
    model_lower = model.lower()
    if ('electric' in model_lower or 'lightning' in model_lower
            or re.search(r'\bev\b', model_lower)):
        return ["Electric"]

    return ["Gas"]
```
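`get_engines_for_vehicle` scans `self.baseline_records` once per NHTSA model, which is O(records × models) over the full run. If that becomes slow, the baseline can be pre-indexed by key; a sketch with hypothetical record data:

```python
from collections import defaultdict

def build_engine_index(baseline_records: list) -> dict:
    """Index baseline engine lists by (year, make, model) for O(1) lookup."""
    index = defaultdict(set)
    for b in baseline_records:
        key = (b["year"], b["make"].lower(), b["model"].lower())
        for trim_data in b.get("trims", []):
            index[key].update(trim_data.get("engines", []))
    return index

# Hypothetical baseline records; duplicate engines collapse into the set
baseline = [{"year": 2000, "make": "Chevrolet", "model": "Corvette",
             "trims": [{"engines": ["5.7L V8"]}, {"engines": ["5.7L V8"]}]}]
index = build_engine_index(baseline)
print(sorted(index[(2000, "chevrolet", "corvette")]))  # ['5.7L V8']
```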

### Phase 3: Update Import Script

**Modify file:** `data/make-model-import/import_data.sh`

Add NHTSA cache check:

```bash
#!/bin/bash
# Import generated SQL files into PostgreSQL database

set -e

echo "=========================================="
echo " Automotive Database Import"
echo "=========================================="

# Check NHTSA cache freshness
NHTSA_FILE="nhtsa_vehicles.json"
CACHE_AGE_DAYS=30

if [ ! -f "$NHTSA_FILE" ]; then
    echo "NHTSA data not found. Fetching..."
    python3 nhtsa_fetch.py
elif [ "$(find "$NHTSA_FILE" -mtime +$CACHE_AGE_DAYS 2>/dev/null | wc -l)" -gt 0 ]; then
    echo "NHTSA cache is stale (>$CACHE_AGE_DAYS days). Refreshing..."
    python3 nhtsa_fetch.py
else
    echo "Using cached NHTSA data"
fi

# Continue with existing import logic...
```

### Phase 4: Update QA Validation

**Modify file:** `data/make-model-import/qa_validate.py`

Add invalid combination checks:

```python
def check_invalid_combinations():
    """Verify known invalid combinations do not exist."""
    invalid_combos = [
        # (year, make, model, trim) - known to be invalid
        (1992, 'Chevrolet', 'Corvette', 'Z06'),                       # Z06 started 2001
        (2000, 'Chevrolet', 'Corvette', '35th Anniversary Edition'),  # Was 1988 only
        (2000, 'Chevrolet', 'Corvette', 'Stingray'),                  # Stingray returned 2014
        (1995, 'Ford', 'Mustang', 'Mach-E'),                          # Mach-E is 2021+
    ]

    issues = []
    for year, make, model, trim in invalid_combos:
        query = f"""
            SELECT COUNT(*) FROM vehicle_options
            WHERE year = {year}
              AND make = '{make}'
              AND model = '{model}'
              AND trim = '{trim}'
        """
        count = int(run_psql(query).strip())
        if count > 0:
            issues.append(f"Invalid combo found: {year} {make} {model} {trim}")

    return issues

def check_trim_coverage():
    """Report on trim coverage statistics."""
    query = """
        SELECT
            COUNT(DISTINCT (year, make, model)) AS total_models,
            COUNT(DISTINCT (year, make, model)) FILTER (WHERE trim = 'Base') AS base_only,
            COUNT(DISTINCT (year, make, model)) FILTER (WHERE trim != 'Base') AS has_specific_trims
        FROM vehicle_options
    """
    result = run_psql(query).strip()
    print(f"Trim coverage: {result}")
```
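The interpolated SQL above is only safe because `invalid_combos` is a hardcoded list. If the combos ever come from external data, a small quoting helper (hypothetical, covering int and string literals only) keeps the `run_psql` approach injection-safe:

```python
def sql_literal(value) -> str:
    """Render an int or string as a SQL literal, doubling embedded quotes."""
    if isinstance(value, int):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

print(sql_literal("O'Fallon"))  # 'O''Fallon'
```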

---

## Files Summary

| File | Action | Purpose |
|------|--------|---------|
| `data/make-model-import/nhtsa_fetch.py` | CREATE | Fetch Year/Make/Model from NHTSA API |
| `data/make-model-import/etl_generate_sql.py` | MODIFY | Use NHTSA data, evidence-based trims |
| `data/make-model-import/import_data.sh` | MODIFY | Add NHTSA cache refresh |
| `data/make-model-import/qa_validate.py` | MODIFY | Add invalid combo checks |

---

## Execution Order

```bash
# 1. Navigate to ETL directory
cd data/make-model-import

# 2. Fetch NHTSA data (creates nhtsa_vehicles.json)
python3 nhtsa_fetch.py

# 3. Generate SQL files
python3 etl_generate_sql.py

# 4. Import to database
./import_data.sh

# 5. Validate results
python3 qa_validate.py
```

---

## Expected Results

### Before
- 2000 Corvette: 400 records, 20 trims (most invalid)
- Total records: ~1,675,335
- Many impossible year/trim combinations

### After
- 2000 Corvette: ~8 records (Base, Coupe, Convertible)
- 2015 Corvette: ~20 records (Stingray, Z06, Grand Sport, Base)
- Total records: ~400,000-600,000
- No invalid year/trim combinations

### Validation Checks
1. No 1992 Corvette Z06
2. No 2000 Corvette Stingray
3. No 1995 Mustang Mach-E
4. Year range: 1990-2026

---

## Data Source Coverage

**automobiles.json trim coverage (samples):**

| Model | Entries | Trims Found |
|-------|---------|-------------|
| Civic | 67 | Si, Type R, eHEV, Sedan, Hatchback |
| Mustang | 38 | GT, Dark Horse, Mach-E GT, GTD |
| Accord | 35 | Sedan, Coupe (various years) |
| Corvette | 31 | Z06, ZR1, Stingray, Grand Sport |
| Camaro | 29 | ZL1, Convertible, Coupe |
| F-150 | 19 | Lightning, Raptor, Tremor |

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MIN_YEAR` | 1990 | Minimum year to include |
| `MAX_YEAR` | 2026 | Maximum year to include |

---

## Troubleshooting

### NHTSA API Rate Limiting
The script includes 100ms delays between requests. If you encounter rate limiting:
- Increase `request_delay` in `nhtsa_fetch.py`
- Reuse the cached responses in the `nhtsa_cache/` directory

### Missing Models
If NHTSA returns fewer models than expected:
- Check that the make name matches NHTSA's spelling exactly
- Some brands (BMW, GMC, RAM) need uppercase handling
- Verify the year is within NHTSA's coverage (model data is sparse before ~1995)

### Cache Refresh
To force a refresh of the NHTSA data:

```bash
rm nhtsa_vehicles.json
rm -rf nhtsa_cache/
python3 nhtsa_fetch.py
```