# ETL Fix V2: Year-Accurate Vehicle Dropdown Data

## Executive Summary

This document provides a complete implementation plan for fixing the vehicle dropdown ETL to produce year-accurate data. The fix addresses impossible year/trim combinations (e.g., "1992 Corvette Z06") by using the NHTSA VPIC API for authoritative Year/Make/Model validation and automobiles.json for evidence-based trim data with year ranges.

---

## Problem Statement

### Current Issues

1. **Year-inaccurate trims**: the `makes-filter/*.json` files contain ALL trims ever made for a model, applied to EVERY year
2. **Impossible combinations**: users can select "1992 Corvette Z06" (the Z06 trim didn't exist until 2001)
3. **Data bloat**: 400 records for the 2000 Corvette with 20 trims instead of ~3-4

### Root Cause

The `makes-filter/*.json` data structure does NOT have year-specific trims. Example from `chevrolet.json`:

```json
{
  "year": "2025",
  "models": [{
    "name": "corvette",
    "submodels": ["LT", "35th Anniversary Edition", "427", "Z06", "ZR1", ...]
  }]
}
```

The same `submodels` array is repeated for every year, making ALL trims appear valid for ALL years.
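The defect is easy to demonstrate in isolation. A minimal sketch, where the inline list stands in for a slice of `chevrolet.json` (the real file is far larger):

```python
# Illustrative slice of makes-filter/chevrolet.json: the same submodels
# list is copied verbatim into every year entry.
filter_data = [
    {"year": "1992", "models": [{"name": "corvette",
                                 "submodels": ["LT", "Z06", "ZR1"]}]},
    {"year": "2025", "models": [{"name": "corvette",
                                 "submodels": ["LT", "Z06", "ZR1"]}]},
]

# Collect the trim list per year for one model
trims_by_year = {
    entry["year"]: tuple(model["submodels"])
    for entry in filter_data
    for model in entry["models"]
    if model["name"] == "corvette"
}

# Every year carries an identical trim list, so "1992 Corvette Z06" looks
# valid even though the Z06 trim did not appear until 2001.
assert len(set(trims_by_year.values())) == 1
assert "Z06" in trims_by_year["1992"]
```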

---

## Solution Architecture

### Data Sources (Priority Order)

1. **NHTSA VPIC API** - Authoritative Year/Make/Model validation
2. **automobiles.json** - Primary trim source with year-range evidence
3. **makes-filter/*.json** - Engine data enrichment
4. **Defaults** - "Base" trim, "Gas" engine, "Manual"/"Automatic" transmission

### Year Range

- Minimum: 1990
- Maximum: 2026

---
## Implementation Steps

### Phase 1: Create NHTSA Data Fetcher

**Create file:** `data/make-model-import/nhtsa_fetch.py`
```python
#!/usr/bin/env python3
"""
NHTSA VPIC API Data Fetcher

Fetches authoritative Year/Make/Model data from the US government database.
"""

import json
import os
import time
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Dict, List, Set


class NHTSAFetcher:
    BASE_URL = "https://vpic.nhtsa.dot.gov/api/vehicles"
    CACHE_DIR = Path("nhtsa_cache")
    OUTPUT_FILE = Path("nhtsa_vehicles.json")

    def __init__(self):
        self.min_year = int(os.getenv("MIN_YEAR", "1990"))
        self.max_year = int(os.getenv("MAX_YEAR", "2026"))
        self.request_delay = 0.1  # 100 ms between requests

        # Makes we care about (from makes-filter)
        self.target_makes = self._load_target_makes()

    def _load_target_makes(self) -> Set[str]:
        """Load makes from the makes-filter directory."""
        makes_dir = Path("makes-filter")
        makes = set()
        for f in makes_dir.glob("*.json"):
            make_name = f.stem.replace("_", " ").title()
            makes.add(make_name)
        return makes

    def fetch_url(self, url: str) -> dict:
        """Fetch JSON from a URL with error handling."""
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return json.loads(response.read().decode())
        except urllib.error.URLError as e:
            print(f"  Error fetching {url}: {e}")
            return {"Results": []}

    def get_all_makes(self) -> List[Dict]:
        """Fetch all makes for passenger cars and trucks."""
        makes = []
        for vehicle_type in ["car", "truck"]:
            url = f"{self.BASE_URL}/GetMakesForVehicleType/{vehicle_type}?format=json"
            data = self.fetch_url(url)
            makes.extend(data.get("Results", []))
        return makes

    def get_models_for_make_year(self, make: str, year: int) -> List[str]:
        """Fetch models for a specific make and year, with on-disk caching."""
        cache_file = self.CACHE_DIR / f"{make.lower().replace(' ', '_')}_{year}.json"

        # Check cache first
        if cache_file.exists():
            with open(cache_file) as f:
                return json.load(f)

        # URL-encode the make name (e.g. "Alfa Romeo" contains a space)
        make_enc = urllib.parse.quote(make)
        url = f"{self.BASE_URL}/GetModelsForMakeYear/make/{make_enc}/modelyear/{year}?format=json"
        time.sleep(self.request_delay)
        data = self.fetch_url(url)

        models = list({r.get("Model_Name", "")
                       for r in data.get("Results", []) if r.get("Model_Name")})

        # Cache the result
        self.CACHE_DIR.mkdir(exist_ok=True)
        with open(cache_file, "w") as f:
            json.dump(models, f)

        return models

    def fetch_all_data(self) -> Dict:
        """Fetch all Year/Make/Model data."""
        print("Fetching NHTSA data...")

        # Filter to target makes; deduplicate, since a make can appear under
        # both the "car" and "truck" vehicle types
        all_makes = self.get_all_makes()
        target_make_names = sorted({
            m["MakeName"] for m in all_makes
            if m["MakeName"].title() in self.target_makes
            or m["MakeName"].upper() in ["BMW", "GMC", "RAM"]
        })

        print(f"Found {len(target_make_names)} matching makes")

        result = {}
        for year in range(self.min_year, self.max_year + 1):
            result[str(year)] = {}
            for make in target_make_names:
                models = self.get_models_for_make_year(make, year)
                if models:
                    # Normalize make name (acronym makes stay uppercase)
                    make_normalized = make.title()
                    if make.upper() in ["BMW", "GMC", "RAM"]:
                        make_normalized = make.upper()
                    result[str(year)][make_normalized] = sorted(models)
            print(f"  Year {year}: {sum(len(v) for v in result[str(year)].values())} models")

        # Save output
        with open(self.OUTPUT_FILE, "w") as f:
            json.dump(result, f, indent=2)

        print(f"Saved to {self.OUTPUT_FILE}")
        return result


if __name__ == "__main__":
    NHTSAFetcher().fetch_all_data()
```
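Downstream phases consume `nhtsa_vehicles.json` keyed year → make → sorted model list. A small shape check can catch a bad fetch early; the sample dict below is illustrative, not fetched from the API:

```python
import json

# Illustrative sample of the fetcher's output shape
sample = {"2015": {"Chevrolet": ["Camaro", "Corvette"], "GMC": ["Sierra"]}}

def validate_shape(data: dict) -> bool:
    """Check the year -> make -> sorted model list structure."""
    for year, makes in data.items():
        assert year.isdigit(), f"non-numeric year key: {year!r}"
        for make, models in makes.items():
            assert isinstance(models, list), f"models for {make} not a list"
            assert models == sorted(models), f"models for {make} not sorted"
    return True

# Round-trip through JSON, as the ETL script will read it from disk
assert validate_shape(json.loads(json.dumps(sample)))
```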

### Phase 2: Refactor ETL Script

**Modify file:** `data/make-model-import/etl_generate_sql.py`

Key changes:

#### 2.1 Load NHTSA data as primary source
```python
def load_nhtsa_data(self):
    """Load NHTSA Year/Make/Model data."""
    nhtsa_file = Path("nhtsa_vehicles.json")
    if not nhtsa_file.exists():
        raise FileNotFoundError("Run nhtsa_fetch.py first to generate nhtsa_vehicles.json")

    with open(nhtsa_file) as f:
        self.nhtsa_data = json.load(f)
    print(f"  Loaded NHTSA data for {len(self.nhtsa_data)} years")
```

#### 2.2 Build trim evidence from automobiles.json
```python
def build_trim_evidence(self):
    """
    Parse automobiles.json to build year-range evidence for trims.
    """
    self.trim_evidence: Dict[Tuple[str, str], List[Dict]] = defaultdict(list)

    brand_lookup = {b.get("id"): self.get_canonical_make_name(b.get("name", ""))
                    for b in self.brands_data}

    for auto in self.automobiles_data:
        brand_id = auto.get("brand_id")
        make = brand_lookup.get(brand_id)
        if not make:
            continue

        name = auto.get("name", "")
        year_range = self.parse_year_range_from_name(name)
        if not year_range:
            continue

        year_start, year_end = year_range

        # Extract model and trim from name
        model, trim = self.extract_model_trim_from_name(name, make)
        if not model:
            continue

        self.trim_evidence[(make, model)].append({
            "trim": trim or "Base",
            "year_start": year_start,
            "year_end": year_end,
            "source_name": name
        })

    print(f"  Built trim evidence for {len(self.trim_evidence)} make/model combinations")

def extract_model_trim_from_name(self, name: str, make: str) -> Tuple[Optional[str], Optional[str]]:
    """
    Extract model and trim from an automobile name.
    Examples:
        "CHEVROLET Corvette Z06 2021-Present" -> ("Corvette", "Z06")
        "2020 Chevrolet Corvette C8 Stingray" -> ("Corvette", "Stingray")
        "FORD F-150 Raptor 2021-Present" -> ("F-150", "Raptor")
    """
    # Remove year/range patterns first, so names that lead with a year
    # ("2020 Chevrolet ...") still expose the make prefix afterwards
    clean = re.sub(r"\d{4}(-\d{4}|-Present)?", "", name)
    clean = " ".join(clean.split())

    # Remove make prefix
    clean = re.sub(rf"^{re.escape(make)}\s+", "", clean, flags=re.IGNORECASE)

    # Remove common suffixes
    clean = re.sub(r"\s*(Photos|engines|full specs|&).*$", "", clean, flags=re.IGNORECASE)

    # Collapse extra spaces
    clean = " ".join(clean.split())

    # Try to match against known models, longest name first so a longer
    # model name is never shadowed by a shorter prefix
    known_models = self.known_models_by_make.get(make, set())

    for model in sorted(known_models, key=len, reverse=True):
        pattern = re.compile(rf"^{re.escape(model)}\b\s*(.*)", re.IGNORECASE)
        match = pattern.match(clean)
        if match:
            trim = match.group(1).strip()
            # Remove generation codes like C5, C6, C7, C8
            trim = re.sub(r"^C\d+\s*", "", trim)
            return (model, trim if trim else None)

    return (None, None)
```
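The extraction steps can be exercised standalone. This sketch re-implements the core pipeline (years stripped before the make prefix, longest model matched first, generation codes dropped) outside the ETL class; the make and model inputs are hypothetical fixtures:

```python
import re

def extract_model_trim(name, make, known_models):
    """Standalone version of the extraction steps above."""
    # Strip years first so "2020 Chevrolet ..." still exposes the make prefix
    clean = " ".join(re.sub(r"\d{4}(-\d{4}|-Present)?", "", name).split())
    clean = re.sub(rf"^{re.escape(make)}\s+", "", clean, flags=re.IGNORECASE)
    for model in sorted(known_models, key=len, reverse=True):
        m = re.match(rf"^{re.escape(model)}\b\s*(.*)", clean, re.IGNORECASE)
        if m:
            trim = re.sub(r"^C\d+\s*", "", m.group(1).strip())  # drop C5..C8 codes
            return (model, trim or None)
    return (None, None)

print(extract_model_trim("CHEVROLET Corvette Z06 2021-Present",
                         "Chevrolet", {"Corvette"}))   # ('Corvette', 'Z06')
print(extract_model_trim("2020 Chevrolet Corvette C8 Stingray",
                         "Chevrolet", {"Corvette"}))   # ('Corvette', 'Stingray')
```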

#### 2.3 New trim resolution logic
```python
def get_trims_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """
    Get valid trims for a year/make/model combination.
    Uses evidence from automobiles.json, falls back to "Base".
    """
    evidence = self.trim_evidence.get((make, model), [])
    valid_trims = set()

    for entry in evidence:
        if entry['year_start'] <= year <= entry['year_end']:
            valid_trims.add(entry['trim'])

    # Always include "Base" as an option
    valid_trims.add("Base")

    return sorted(valid_trims)
```
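A usage sketch of this resolution logic, with hypothetical evidence entries standing in for data parsed from automobiles.json:

```python
# Hypothetical evidence entries (real ones come from automobiles.json)
trim_evidence = {
    ("Chevrolet", "Corvette"): [
        {"trim": "Z06", "year_start": 2001, "year_end": 2004},
        {"trim": "Stingray", "year_start": 2014, "year_end": 2019},
    ]
}

def get_trims_for_vehicle(year, make, model):
    evidence = trim_evidence.get((make, model), [])
    valid = {e["trim"] for e in evidence
             if e["year_start"] <= year <= e["year_end"]}
    valid.add("Base")  # always offer a default trim
    return sorted(valid)

print(get_trims_for_vehicle(1992, "Chevrolet", "Corvette"))  # ['Base']
print(get_trims_for_vehicle(2015, "Chevrolet", "Corvette"))  # ['Base', 'Stingray']
```

A 1992 Corvette no longer picks up the Z06 trim because no evidence entry's year range covers 1992.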

#### 2.4 Updated vehicle record building
```python
def build_vehicle_records(self):
    """Build vehicle records using NHTSA for Y/M/M, evidence for trims."""
    print("\n  Building vehicle option records...")
    records = []

    for year_str, makes in self.nhtsa_data.items():
        year = int(year_str)
        if year < self.min_year or year > self.max_year:
            continue

        for make, models in makes.items():
            for model in models:
                # Get valid trims from evidence
                trims = self.get_trims_for_vehicle(year, make, model)

                # Get engines from makes-filter (or default)
                engines = self.get_engines_for_vehicle(year, make, model)

                # Default transmissions
                transmissions = ["Manual", "Automatic"]

                for trim in trims:
                    for engine in engines:
                        for trans in transmissions:
                            records.append({
                                "year": year,
                                "make": make,
                                "model": model,
                                "trim": trim,
                                "engine_name": engine,
                                "trans_name": trans
                            })

    # Deduplicate on a case-insensitive key
    unique_set = set()
    deduped = []
    for r in records:
        key = (r["year"], r["make"].lower(), r["model"].lower(),
               r["trim"].lower(), r["engine_name"].lower(), r["trans_name"].lower())
        if key not in unique_set:
            unique_set.add(key)
            deduped.append(r)

    self.vehicle_records = deduped
    print(f"  Vehicle records: {len(self.vehicle_records):,}")

def get_engines_for_vehicle(self, year: int, make: str, model: str) -> List[str]:
    """Get engines from makes-filter or use defaults."""
    # Try to find in makes-filter data
    for baseline in self.baseline_records:
        if (baseline['year'] == year and
                baseline['make'].lower() == make.lower() and
                baseline['model'].lower() == model.lower()):
            engines = []
            for trim_data in baseline.get('trims', []):
                engines.extend(trim_data.get('engines', []))
            if engines:
                return list(set(engines))

    # Default based on make/model patterns; match "ev" as a whole word so
    # substrings (e.g. in "Chevette") are not misclassified as electric
    model_lower = model.lower()
    if ('electric' in model_lower or 'lightning' in model_lower
            or re.search(r'\bev\b', model_lower)):
        return ["Electric"]

    return ["Gas"]
```
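The triple loop expands each (year, make, model) into one record per trim × engine × transmission combination, which is why evidence-based trim lists shrink the total so sharply. A minimal sketch of the expansion and the case-insensitive dedup key (sample values are hypothetical):

```python
# Expansion for one hypothetical (year, make, model)
trims = ["Base", "Z06"]
engines = ["Gas"]
transmissions = ["Manual", "Automatic"]

records = [{"trim": t, "engine_name": e, "trans_name": tr}
           for t in trims for e in engines for tr in transmissions]
# One record per combination: 2 trims x 1 engine x 2 transmissions
assert len(records) == len(trims) * len(engines) * len(transmissions)  # 4

# Case-insensitive dedup key, as in build_vehicle_records
seen = {(r["trim"].lower(), r["engine_name"].lower(), r["trans_name"].lower())
        for r in records}
assert len(seen) == 4  # no case-variant duplicates in this sample
```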

### Phase 3: Update Import Script

**Modify file:** `data/make-model-import/import_data.sh`

Add an NHTSA cache check:
```bash
#!/bin/bash
# Import generated SQL files into PostgreSQL database

set -e

echo "=========================================="
echo " Automotive Database Import"
echo "=========================================="

# Check NHTSA cache freshness
NHTSA_FILE="nhtsa_vehicles.json"
CACHE_AGE_DAYS=30

if [ ! -f "$NHTSA_FILE" ]; then
    echo "NHTSA data not found. Fetching..."
    python3 nhtsa_fetch.py
elif [ "$(find "$NHTSA_FILE" -mtime +$CACHE_AGE_DAYS 2>/dev/null | wc -l)" -gt 0 ]; then
    echo "NHTSA cache is stale (>$CACHE_AGE_DAYS days). Refreshing..."
    python3 nhtsa_fetch.py
else
    echo "Using cached NHTSA data"
fi

# Continue with existing import logic...
```
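The staleness branch can be exercised in isolation by backdating a throwaway file past the cache window. This sketch assumes GNU `touch`/`find` (as on typical Linux hosts); the temp file is not the real cache:

```shell
# Exercise the staleness check with a file backdated past the cache window
NHTSA_FILE=$(mktemp)
CACHE_AGE_DAYS=30
touch -d "40 days ago" "$NHTSA_FILE"

# Same test the import script uses: find prints the file only if its
# mtime is more than CACHE_AGE_DAYS days old
if [ "$(find "$NHTSA_FILE" -mtime +$CACHE_AGE_DAYS | wc -l)" -gt 0 ]; then
    echo "stale"
else
    echo "fresh"
fi
rm -f "$NHTSA_FILE"
```

This prints `stale` for the 40-day-old file; a freshly created file takes the `fresh` branch.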

### Phase 4: Update QA Validation

**Modify file:** `data/make-model-import/qa_validate.py`

Add invalid combination checks:
```python
def check_invalid_combinations():
    """Verify known invalid combinations do not exist."""
    invalid_combos = [
        # (year, make, model, trim) - known to be invalid
        (1992, 'Chevrolet', 'Corvette', 'Z06'),                       # Z06 started 2001
        (2000, 'Chevrolet', 'Corvette', '35th Anniversary Edition'),  # Was 1988
        (2000, 'Chevrolet', 'Corvette', 'Stingray'),                  # Stingray started 2014
        (1995, 'Ford', 'Mustang', 'Mach-E'),                          # Mach-E is 2021+
    ]

    issues = []
    for year, make, model, trim in invalid_combos:
        query = f"""
            SELECT COUNT(*) FROM vehicle_options
            WHERE year = {year}
              AND make = '{make}'
              AND model = '{model}'
              AND trim = '{trim}'
        """
        count = int(run_psql(query).strip())
        if count > 0:
            issues.append(f"Invalid combo found: {year} {make} {model} {trim}")

    return issues


def check_trim_coverage():
    """Report on trim coverage statistics."""
    query = """
        SELECT
            COUNT(DISTINCT (year, make, model)) AS total_models,
            COUNT(DISTINCT (year, make, model)) FILTER (WHERE trim = 'Base') AS base_only,
            COUNT(DISTINCT (year, make, model)) FILTER (WHERE trim != 'Base') AS has_specific_trims
        FROM vehicle_options
    """
    result = run_psql(query).strip()
    print(f"Trim coverage: {result}")
```
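The coverage logic can be unit-tested without a database. A pure-Python sketch over hypothetical rows standing in for `vehicle_options`:

```python
# Hypothetical rows standing in for the vehicle_options table
rows = [
    {"year": 2000, "make": "Chevrolet", "model": "Corvette", "trim": "Base"},
    {"year": 2015, "make": "Chevrolet", "model": "Corvette", "trim": "Base"},
    {"year": 2015, "make": "Chevrolet", "model": "Corvette", "trim": "Z06"},
]

total = {(r["year"], r["make"], r["model"]) for r in rows}
with_specific = {(r["year"], r["make"], r["model"])
                 for r in rows if r["trim"] != "Base"}
base_only = total - with_specific  # models whose ONLY trim is Base

print(len(total), len(with_specific), len(base_only))  # 2 1 1
```

Note one subtlety: because the ETL always emits a "Base" row, the SQL's `FILTER (WHERE trim = 'Base')` counts every model, so a strict base-only metric needs the set-difference form shown here (total minus models with a specific trim).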

---

## Files Summary

| File | Action | Purpose |
|------|--------|---------|
| `data/make-model-import/nhtsa_fetch.py` | CREATE | Fetch Year/Make/Model from NHTSA API |
| `data/make-model-import/etl_generate_sql.py` | MODIFY | Use NHTSA data, evidence-based trims |
| `data/make-model-import/import_data.sh` | MODIFY | Add NHTSA cache refresh |
| `data/make-model-import/qa_validate.py` | MODIFY | Add invalid combo checks |

---

## Execution Order
```bash
# 1. Navigate to ETL directory
cd data/make-model-import

# 2. Fetch NHTSA data (creates nhtsa_vehicles.json)
python3 nhtsa_fetch.py

# 3. Generate SQL files
python3 etl_generate_sql.py

# 4. Import to database
./import_data.sh

# 5. Validate results
python3 qa_validate.py
```

---

## Expected Results

### Before

- 2000 Corvette: 400 records, 20 trims (most invalid)
- Total records: ~1,675,335
- Many impossible year/trim combinations

### After

- 2000 Corvette: ~8 records (Base, Coupe, Convertible)
- 2015 Corvette: ~20 records (Stingray, Z06, Grand Sport, Base)
- Total records: ~400,000-600,000
- No invalid year/trim combinations

### Validation Checks

1. No 1992 Corvette Z06
2. No 2000 Corvette Stingray
3. No 1995 Mustang Mach-E
4. Year range: 1990-2026

---
## Data Source Coverage

**automobiles.json trim coverage (samples):**

| Model | Entries | Trims Found |
|-------|---------|-------------|
| Civic | 67 | Si, Type R, eHEV, Sedan, Hatchback |
| Mustang | 38 | GT, Dark Horse, Mach-E GT, GTD |
| Accord | 35 | Sedan, Coupe (various years) |
| Corvette | 31 | Z06, ZR1, Stingray, Grand Sport |
| Camaro | 29 | ZL1, Convertible, Coupe |
| F-150 | 19 | Lightning, Raptor, Tremor |

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MIN_YEAR` | 1990 | Minimum year to include |
| `MAX_YEAR` | 2026 | Maximum year to include |

---
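Since `NHTSAFetcher.__init__` reads these variables at construction time, the year window can be narrowed for a quick test run without editing code. A minimal sketch (the override values are illustrative):

```python
import os

# Narrow the window for a quick test run
os.environ["MIN_YEAR"] = "2020"
os.environ["MAX_YEAR"] = "2021"

# Same pattern NHTSAFetcher.__init__ uses to read the window
min_year = int(os.getenv("MIN_YEAR", "1990"))
max_year = int(os.getenv("MAX_YEAR", "2026"))
print(list(range(min_year, max_year + 1)))  # [2020, 2021]
```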

## Troubleshooting

### NHTSA API Rate Limiting

The script includes 100ms delays between requests. If you encounter rate limiting:

- Increase `request_delay` in `nhtsa_fetch.py`
- Use cached data in the `nhtsa_cache/` directory

### Missing Models

If NHTSA returns fewer models than expected:

- Check that the make name matches exactly
- Some brands (BMW, GMC) need uppercase handling
- Verify the year range is supported (NHTSA has data back to ~1995)

### Cache Refresh

To force a refresh of the NHTSA data:

```bash
rm nhtsa_vehicles.json
rm -rf nhtsa_cache/
python3 nhtsa_fetch.py
```