vPIC ETL Implementation Plan v2
Overview
Extract vehicle dropdown data from the NHTSA vPIC database for model years 2022 and later (MY2022+) to supplement existing VehAPI data. This revised plan uses a make-specific extraction approach with proper VIN schema parsing.
Key Changes from v1
- Limit to VehAPI makes only - Only extract the 48 makes that exist in VehAPI data
- VIN schema-based extraction - Extract directly from VIN patterns, not defs_model
- Proper field formatting - Match VehAPI display string formats
- Make-specific logic - Handle different manufacturers' data patterns
Critical Discovery: WMI Linkage
Must use wmi_make junction table (many-to-many), NOT wmi.makeid (one-to-many):
-- CORRECT: via wmi_make (finds all makes including Toyota, Hyundai, etc.)
FROM vpic.make m
JOIN vpic.wmi_make wm ON wm.makeid = m.id
JOIN vpic.wmi w ON w.id = wm.wmiid
-- WRONG: via wmi.makeid (misses many major brands)
FROM vpic.make m
JOIN vpic.wmi w ON w.makeid = m.id
Make Availability Summary
| Status | Count | Makes |
|---|---|---|
| Available (2022+ schemas) | 46 | See table below |
| No 2022+ data | 2 | Hummer (discontinued 2010), Scion (discontinued 2016) |
Per-Make Analysis
Group 1: Japanese Manufacturers (Honda/Acura, Toyota/Lexus, Nissan/Infiniti)
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Acura | Acura | Acura | 48 | Ready |
| Honda | Honda | Honda | 238 | Ready |
| Lexus | Lexus | Lexus | 90 | Ready |
| Toyota | Toyota | Toyota | 152 | Ready |
| Infiniti | INFINITI | Infiniti | 76 | Ready |
| Nissan | Nissan | Nissan | 85 | Ready |
| Mazda | Mazda | Mazda | 37 | Ready |
| Mitsubishi | Mitsubishi | Mitsubishi | 11 | Ready |
| Subaru | Subaru | Subaru | 75 | Ready |
| Isuzu | Isuzu | Isuzu | 11 | Ready |
Group 2: Korean Manufacturers (Hyundai/Kia/Genesis)
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Genesis | Genesis | Genesis | 74 | Ready |
| Hyundai | Hyundai | Hyundai | 177 | Ready |
| Kia | Kia | Kia | 72 | Ready |
Group 3: American - GM (Chevrolet, GMC, Buick, Cadillac)
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Buick | Buick | Buick | 20 | Ready |
| Cadillac | Cadillac | Cadillac | 50 | Ready |
| Chevrolet | Chevrolet | Chevrolet | 185 | Ready |
| GMC | GMC | GMC | 107 | Ready |
| Oldsmobile | Oldsmobile | Oldsmobile | 1 | Limited |
| Pontiac | Pontiac | Pontiac | 5 | Limited (2022-2024) |
Group 4: American - Stellantis (Chrysler, Dodge, Jeep, Ram, Fiat)
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Chrysler | Chrysler | Chrysler | 81 | Ready |
| Dodge | Dodge | Dodge | 86 | Ready |
| FIAT | FIAT | Fiat | 91 | Ready (case diff) |
| Jeep | Jeep | Jeep | 81 | Ready |
| RAM | RAM | Ram | 81 | Ready (case diff) |
| Plymouth | Plymouth | Plymouth | 4 | Limited |
Group 5: American - Ford
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Ford | Ford | Ford | 108 | Ready |
| Lincoln | Lincoln | Lincoln | 21 | Ready |
| Mercury | Mercury | Mercury | 0 | No data (discontinued 2011) |
Group 6: EV Startups
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Polestar | Polestar | Polestar | 12 | Ready |
| Rivian | Rivian | RIVIAN | 10 | Ready (case diff) |
| Tesla | Tesla | Tesla | 14 | Ready |
Group 7: German Manufacturers
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Audi | Audi | Audi | 55 | Ready |
| BMW | BMW | BMW | 61 | Ready |
| Mercedes-Benz | Mercedes-Benz | Mercedes-Benz | 39 | Ready |
| MINI | MINI | MINI | 10 | Ready |
| Porsche | Porsche | Porsche | 23 | Ready |
| smart | smart | smart | 5 | Ready |
| Volkswagen | Volkswagen | Volkswagen | 134 | Ready |
Group 8: European Luxury
| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|---|---|---|---|---|
| Bentley | Bentley | Bentley | 48 | Ready |
| Ferrari | Ferrari | Ferrari | 9 | Ready |
| Jaguar | Jaguar | Jaguar | 17 | Ready |
| Lamborghini | Lamborghini | Lamborghini | 10 | Ready |
| Lotus | Lotus | Lotus | 5 | Ready |
| Maserati | Maserati | Maserati | 19 | Ready |
| McLaren | McLaren | McLaren | 4 | Ready |
| Volvo | Volvo | Volvo | 80 | Ready |
Group 9: Discontinued (No 2022+ Data)
| Make | VehAPI Name | Reason | Action |
|---|---|---|---|
| Hummer | Hummer | Discontinued 2010 (new EV under GMC) | Skip - use existing VehAPI |
| Scion | Scion | Discontinued 2016 | Skip - use existing VehAPI |
| Saab | Saab | Discontinued 2012 | Limited schemas (9) |
| Mercury | Mercury | Discontinued 2011 | No schemas |
Extraction Architecture
Data Flow
vPIC VIN Schemas → Pattern Extraction → Format Transformation → SQLite Pairs
↓
Filter by:
- 48 VehAPI makes only
- Year >= 2022
- Vehicle types (exclude motorcycles, trailers, buses)
Core Query Strategy
For each allowed make:
- Find WMIs linked to that make
- Get VIN schemas for years 2022+
- Extract from patterns:
  - Model (from schema name or pattern)
  - Trim (Element: Trim)
  - Displacement (Element: DisplacementL)
  - Horsepower (Element: EngineHP)
  - Cylinders (Element: EngineCylinders)
  - Engine Config (Element: EngineConfiguration)
  - Transmission Style (Element: TransmissionStyle)
  - Transmission Speeds (Element: TransmissionSpeeds)
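The per-schema steps above amount to pivoting pattern rows into one record per VIN key. A minimal sketch, assuming rows arrive as (keys, element_code, value) tuples with lookup IDs already resolved to display values. The function name `pivot_patterns` and the key strings in the usage are hypothetical and not part of vPIC or the existing script:

```python
from collections import defaultdict

# Element codes named in the extraction list above.
WANTED_ELEMENTS = {
    "Trim", "DisplacementL", "EngineHP", "EngineCylinders",
    "EngineConfiguration", "TransmissionStyle", "TransmissionSpeeds",
}


def pivot_patterns(rows):
    """Group (keys, element_code, value) rows into one dict per VIN key.

    Assumes lookup IDs were already resolved to display values upstream.
    """
    by_key = defaultdict(dict)
    for keys, code, value in rows:
        # Skip elements the plan does not extract, and empty values.
        if code in WANTED_ELEMENTS and value not in (None, ""):
            by_key[keys][code] = value
    return dict(by_key)


# Hypothetical pattern rows; the key strings are placeholders.
rows = [
    ("KEY_A", "DisplacementL", "3.5"),
    ("KEY_A", "EngineHP", "290"),
    ("KEY_A", "EngineCylinders", "6"),
    ("KEY_B", "TransmissionSpeeds", "10"),
    ("KEY_B", "BodyClass", "SUV"),  # ignored: not in WANTED_ELEMENTS
]
configs = pivot_patterns(rows)
```

Grouping by the pattern key mirrors the `GROUP BY sd.schema_id, p.keys` used in the SQL template, so attributes that share a VIN position stay together.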
Acura Extraction Template
This pattern applies to Honda/Acura and similar well-structured manufacturers.
Sample VIN Schema: Acura MDX 2025 (schema_id: 26929)
| Element | Code | Values |
|---|---|---|
| Trim | Trim | MDX, Technology, SH-AWD, SH-AWD Technology, SH-AWD A-Spec, SH-AWD Advance, SH-AWD A-Spec Advance, SH-AWD TYPE S ADVANCE |
| Displacement | DisplacementL | 3.5, 3.0 |
| Horsepower | EngineHP | 290, 355 |
| Cylinders | EngineCylinders | 6 |
| Engine Config | EngineConfiguration | V-Shaped |
| Trans Style | TransmissionStyle | Automatic |
| Trans Speeds | TransmissionSpeeds | 10 |
Output Format
Engine Display (match VehAPI):
{DisplacementL}L {EngineHP} hp V{EngineCylinders}
→ "3.5L 290 hp V6"
Transmission Display (match VehAPI):
{TransmissionSpeeds}-Speed {TransmissionStyle}
→ "10-Speed Automatic"
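The two display formats can be sketched as small helpers. This is an illustration under assumptions, not the final implementation: the `CONFIG_PREFIX` mapping (including the "In-Line" value), the "4cyl" fallback when no configuration is known, and the handling of missing horsepower are inferred from the validated examples in this plan.

```python
from typing import Optional

# Assumed mapping from vPIC EngineConfiguration names to the display prefix.
CONFIG_PREFIX = {"V-Shaped": "V", "In-Line": "I"}


def format_engine_display(disp: Optional[str], hp: Optional[str],
                          cyl: Optional[str], config: Optional[str]) -> str:
    """Build an engine string such as "3.5L 290 hp V6", skipping missing parts."""
    parts = []
    if disp:
        parts.append(f"{float(disp):.1f}L")      # "2" becomes "2.0L"
    if hp:
        parts.append(f"{int(float(hp))} hp")
    if cyl:
        prefix = CONFIG_PREFIX.get(config or "")
        # Assumption: fall back to "<n>cyl" when the configuration is unknown.
        parts.append(f"{prefix}{cyl}" if prefix else f"{cyl}cyl")
    return " ".join(parts)


def format_trans_display(style: Optional[str], speeds: Optional[str]) -> str:
    """Build a transmission string such as "10-Speed Automatic"."""
    if style and speeds:
        return f"{speeds}-Speed {style}"
    return style or ""


print(format_engine_display("3.5", "290", "6", "V-Shaped"))  # 3.5L 290 hp V6
print(format_trans_display("Automatic", "10"))               # 10-Speed Automatic
```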
Extraction SQL Template
WITH schema_data AS (
    SELECT DISTINCT
        vs.id AS schema_id,
        vs.name AS schema_name,
        wvs.yearfrom,
        COALESCE(wvs.yearto, 2027) AS yearto,
        m.name AS make_name
    FROM vpic.make m
    -- Link via the wmi_make junction table; wmi.makeid misses many major brands
    JOIN vpic.wmi_make wm ON wm.makeid = m.id
    JOIN vpic.wmi w ON w.id = wm.wmiid
    JOIN vpic.wmi_vinschema wvs ON w.id = wvs.wmiid
    JOIN vpic.vinschema vs ON wvs.vinschemaid = vs.id
    WHERE LOWER(m.name) IN ('acura', 'honda', ...) -- VehAPI makes
      -- Parenthesized so the make filter applies to both year conditions
      AND (wvs.yearfrom >= 2022 OR wvs.yearto >= 2022)
),
trim_data AS (
    SELECT DISTINCT sd.schema_id, p.attributeid AS trim
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    WHERE e.code = 'Trim'
),
engine_data AS (
    SELECT DISTINCT
        sd.schema_id,
        MAX(CASE WHEN e.code = 'DisplacementL' THEN p.attributeid END) AS displacement,
        MAX(CASE WHEN e.code = 'EngineHP' THEN p.attributeid END) AS hp,
        MAX(CASE WHEN e.code = 'EngineCylinders' THEN p.attributeid END) AS cylinders,
        MAX(CASE WHEN e.code = 'EngineConfiguration' THEN ec.name END) AS config
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    LEFT JOIN vpic.engineconfiguration ec ON e.code = 'EngineConfiguration'
        AND p.attributeid ~ '^[0-9]+$' AND ec.id = CAST(p.attributeid AS INT)
    WHERE e.code IN ('DisplacementL', 'EngineHP', 'EngineCylinders', 'EngineConfiguration')
    GROUP BY sd.schema_id, p.keys -- Group by VIN pattern position
),
trans_data AS (
    SELECT DISTINCT
        sd.schema_id,
        -- Aggregate style the same way as speeds so both land on one row;
        -- grouping by t.name would split style and speeds into separate rows
        MAX(CASE WHEN e.code = 'TransmissionStyle' THEN t.name END) AS style,
        MAX(CASE WHEN e.code = 'TransmissionSpeeds' THEN p.attributeid END) AS speeds
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    LEFT JOIN vpic.transmission t ON e.code = 'TransmissionStyle'
        AND p.attributeid ~ '^[0-9]+$' AND t.id = CAST(p.attributeid AS INT)
    WHERE e.code IN ('TransmissionStyle', 'TransmissionSpeeds')
    GROUP BY sd.schema_id, p.keys -- Group by VIN pattern position
)
SELECT ...
Allowed Makes (48 from VehAPI)
ALLOWED_MAKES = [
    'Acura', 'Audi', 'Bentley', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
    'Chrysler', 'Dodge', 'Ferrari', 'FIAT', 'Ford', 'Genesis', 'GMC',
    'Honda', 'Hummer', 'Hyundai', 'INFINITI', 'Isuzu', 'Jaguar', 'Jeep',
    'Kia', 'Lamborghini', 'Lexus', 'Lincoln', 'Lotus', 'Maserati', 'Mazda',
    'McLaren', 'Mercedes-Benz', 'Mercury', 'MINI', 'Mitsubishi', 'Nissan',
    'Oldsmobile', 'Plymouth', 'Polestar', 'Pontiac', 'Porsche', 'RAM',
    'Rivian', 'Saab', 'Scion', 'smart', 'Subaru', 'Tesla', 'Toyota',
    'Volkswagen', 'Volvo'
]
Note: Some makes may have different names in vPIC (case variations, abbreviations).
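Since the differences listed in the per-make tables are case-only (Fiat/FIAT, Ram/RAM, RIVIAN/Rivian, Infiniti/INFINITI), one way to handle them is a case-insensitive lookup built from the allowed list. A minimal sketch; the helper names `build_make_lookup` and `normalize_make` are hypothetical, and abbreviations would still need explicit entries:

```python
from typing import Dict, Iterable, Optional

# Illustrative subset of the allowed list above.
ALLOWED_MAKES_SAMPLE = ["Acura", "FIAT", "INFINITI", "RAM", "Rivian", "smart"]


def build_make_lookup(allowed_makes: Iterable[str]) -> Dict[str, str]:
    """Map lower-cased make names to their VehAPI display spelling."""
    return {name.lower(): name for name in allowed_makes}


def normalize_make(vpic_name: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the VehAPI spelling for a vPIC make name, or None if not allowed."""
    return lookup.get(vpic_name.strip().lower())


lookup = build_make_lookup(ALLOWED_MAKES_SAMPLE)
print(normalize_make("RIVIAN", lookup))  # Rivian
print(normalize_make("Fiat", lookup))    # FIAT
```

Returning None for unknown names doubles as the allowed-makes filter, so rows for other manufacturers can be dropped at the same step.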
Implementation Steps
Phase 1: Rewrite vpic_extract.py
File: vpic_extract.py
Core extraction query (uses wmi_make junction table):
WITH base AS (
    SELECT DISTINCT
        m.name AS make_name,
        vs.id AS schema_id,
        vs.name AS schema_name,
        generate_series(
            GREATEST(wvs.yearfrom, 2022),
            -- Cast to INT so both generate_series arguments share a type
            COALESCE(wvs.yearto, EXTRACT(YEAR FROM NOW())::INT + 2)
        )::INT AS year
    FROM vpic.make m
    JOIN vpic.wmi_make wm ON wm.makeid = m.id
    JOIN vpic.wmi w ON w.id = wm.wmiid
    JOIN vpic.wmi_vinschema wvs ON w.id = wvs.wmiid
    JOIN vpic.vinschema vs ON wvs.vinschemaid = vs.id
    WHERE LOWER(m.name) IN ({allowed_makes})
      AND (wvs.yearfrom >= 2022 OR wvs.yearto >= 2022)
)
SELECT ...
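The `{allowed_makes}` placeholder is safest to fill with bind-parameter markers rather than quoted literals. A minimal sketch, assuming a database driver that uses the `%s` parameter style (psycopg2 does); the helper name `build_make_filter` is hypothetical:

```python
from typing import Iterable, List, Tuple


def build_make_filter(allowed_makes: Iterable[str]) -> Tuple[str, List[str]]:
    """Return ("%s, %s, ...", [lower-cased makes]) for a LOWER(m.name) IN (...) filter."""
    # Lower-case to match LOWER(m.name) in the query; the set removes duplicates.
    params = sorted({name.lower() for name in allowed_makes})
    placeholders = ", ".join(["%s"] * len(params))
    return placeholders, params


placeholders, params = build_make_filter(["Acura", "FIAT", "Rivian"])
# Assumed usage with a %s-style driver:
#   sql = QUERY_TEMPLATE.format(allowed_makes=placeholders)
#   cursor.execute(sql, params)
```

Only the markers are formatted into the SQL text; the make names travel separately as parameters, so no manual quoting is needed.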
Key functions to implement:
- extract_model_from_schema_name(schema_name) - Parse "Acura MDX Schema..." → "MDX"
- get_schema_patterns(schema_id) - Get all pattern data for a schema
- format_engine_display(disp, hp, cyl, config) - Format as "3.5L 290 hp V6"
- format_trans_display(style, speeds) - Format as "10-Speed Automatic"
- generate_trans_records(has_data, style, speeds) - Return 1 or 2 records
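As one possible shape for the first helper, the sketch below strips a known make prefix and everything from the word "Schema" onward. The real schema-name layout is not fully specified in this plan, so the parsing rule, the optional `makes` parameter, and the `KNOWN_MAKES` subset are all assumptions to verify against actual vPIC schema names:

```python
import re
from typing import Optional, Sequence

# Illustrative subset; the real script would pass the full allowed list.
KNOWN_MAKES = ("Acura", "Honda", "Mercedes-Benz")


def extract_model_from_schema_name(schema_name: str,
                                   makes: Sequence[str] = KNOWN_MAKES) -> Optional[str]:
    """Return the model part of a schema name, or None if none can be found."""
    # Keep only the text before the word "Schema" (case-insensitive).
    head = re.split(r"\bschema\b", schema_name, maxsplit=1,
                    flags=re.IGNORECASE)[0].strip()
    lowered = head.lower()
    # Try longer make names first so a longer make is not shadowed by a shorter one.
    for make in sorted(makes, key=len, reverse=True):
        prefix = make.lower()
        if lowered == prefix:
            return None  # Name contains a make but no model.
        if lowered.startswith(prefix + " "):
            head = head[len(make):].strip()
            break
    return head or None


print(extract_model_from_schema_name("Acura MDX Schema"))  # MDX
```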
Make name normalization:
MAKE_MAPPING = {
    'INFINITI': 'INFINITI',  # VehAPI uses all-caps
    'FIAT': 'FIAT',
    'RAM': 'RAM',
    'RIVIAN': 'Rivian',  # vPIC uses all-caps, normalize
    # ... etc
}
Phase 2: Test Extraction
Test with validated VINs:
source .venv/bin/activate
python3 vpic_extract.py --test-vin 5J8YE1H05SL018611 # Acura MDX
python3 vpic_extract.py --test-vin 5TFJA5DB4SX327537 # Toyota Tundra
python3 vpic_extract.py --test-vin 3GTUUFEL6PG140748 # GMC Sierra
Phase 3: Full Extraction
python3 vpic_extract.py --min-year 2022 --output-dir snapshots/vpic-2025-12
Phase 4: Merge & Import
# Merge vPIC with existing VehAPI data
sqlite3 snapshots/merged/snapshot.sqlite "
CREATE TABLE pairs(...);
ATTACH 'snapshots/vehicle-drop-down.sqlite' AS db1;
ATTACH 'snapshots/vpic-2025-12/snapshot.sqlite' AS db2;
INSERT OR IGNORE INTO pairs SELECT * FROM db1.pairs WHERE year < 2022;
INSERT OR IGNORE INTO pairs SELECT * FROM db2.pairs;
"
# Generate SQL and import
python3 etl_generate_sql.py --snapshot-path snapshots/merged/snapshot.sqlite
./import_data.sh
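The merge semantics can be checked in isolation before running them on real snapshots. The sketch below uses an in-memory SQLite database and a hypothetical `pairs` schema (the real column list is elided in this plan); it shows that `INSERT OR IGNORE` only removes duplicates when the table carries a UNIQUE or PRIMARY KEY constraint. The sample rows are illustrative placeholders, not real data:

```python
import sqlite3


def merge_pairs(vehapi_rows, vpic_rows, cutoff_year=2022):
    """Keep VehAPI rows before cutoff_year, then add vPIC rows, ignoring duplicates."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema: without the UNIQUE constraint, OR IGNORE drops nothing.
    con.execute(
        "CREATE TABLE pairs ("
        " year INTEGER, make TEXT, model TEXT, trim_name TEXT,"
        " engine_display TEXT, trans_display TEXT,"
        " UNIQUE (year, make, model, trim_name, engine_display, trans_display))"
    )
    insert = "INSERT OR IGNORE INTO pairs VALUES (?, ?, ?, ?, ?, ?)"
    con.executemany(insert, [r for r in vehapi_rows if r[0] < cutoff_year])
    con.executemany(insert, vpic_rows)
    rows = con.execute("SELECT * FROM pairs ORDER BY year, make, model").fetchall()
    con.close()
    return rows


# Illustrative placeholder rows.
vehapi = [
    (2021, "Acura", "MDX", "Base", "engine-a", "trans-a"),
    (2022, "Acura", "MDX", "Base", "engine-a", "trans-a"),  # dropped: not < 2022
]
vpic = [
    (2025, "Acura", "MDX", "SH-AWD A-Spec", "3.5L 290 hp V6", "10-Speed Automatic"),
    (2025, "Acura", "MDX", "SH-AWD A-Spec", "3.5L 290 hp V6", "10-Speed Automatic"),
]
merged = merge_pairs(vehapi, vpic)
```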
Files to Modify
| File | Changes |
|---|---|
| vpic_extract.py | Complete rewrite: VIN schema extraction, dual-record trans logic |
| README.md | Already updated with workflow |
Success Criteria
- Extract all 41 makes with 2022+ VIN schemas
- ~2,500-5,000 unique vehicle configurations (Year/Make/Model/Trim/Engine)
- Transmission: Use vPIC data where available (7 makes), dual-record elsewhere
- Output format matches VehAPI: "3.5L 290 hp V6" / "10-Speed Automatic"
- Merge preserves 2015-2021 VehAPI data
- QA validation passes after import
Make Analysis Status (All Families Validated)
| Family | Makes | Status | Trans Data | Strategy |
|---|---|---|---|---|
| Honda/Acura | Acura, Honda | VALIDATED | YES (93-97%) | Use vPIC trans data |
| Toyota/Lexus | Toyota, Lexus | VALIDATED | PARTIAL (Toyota 23%, Lexus 0%) | Dual-record for Lexus |
| Nissan/Infiniti | Nissan, Infiniti, Mitsubishi | VALIDATED | LOW (5%) | Dual-record |
| GM | Chevrolet, GMC, Buick, Cadillac | VALIDATED | LOW (0-7%) | Dual-record |
| Stellantis | Chrysler, Dodge, Jeep, Ram, Fiat | VALIDATED | NONE (0%) | Dual-record |
| Ford | Ford, Lincoln | VALIDATED | NONE (0%) | Dual-record |
| VW Group | Volkswagen, Audi, Porsche, Bentley, Lamborghini | VALIDATED | MIXED (0-84%) | VW/Audi use vPIC; others dual-record |
| BMW | BMW, MINI | VALIDATED | NONE (0%) | Dual-record |
| Mercedes | Mercedes-Benz, smart | VALIDATED | YES (52%) | Use vPIC trans data |
| Hyundai/Kia/Genesis | Hyundai, Kia, Genesis | VALIDATED | NONE (0%) | Dual-record |
| Subaru | Subaru | VALIDATED | YES (64%) | Use vPIC trans data |
| Mazda | Mazda | VALIDATED | LOW (11%) | Dual-record |
| Volvo | Volvo, Polestar | VALIDATED | LOW (3%/0%) | Dual-record |
| Exotics | Ferrari, Maserati, Jaguar, Lotus, McLaren | VALIDATED | MIXED | Per-make handling |
| EV | Tesla, Rivian | VALIDATED | NONE (0%) | Dual-record (though EVs don't have "manual") |
Special Cases
- Electric Vehicles (Tesla, Rivian, Polestar): Don't have manual transmissions
  - Still create dual-record for consistency with dropdown
  - User can select "Automatic" (single-speed EV)
- Luxury Exotics (Ferrari, Lamborghini, etc.): Mix of automated manual/DCT
  - Dual-record covers all options
CRITICAL FINDING: Transmission Data Availability
Most manufacturers do NOT encode transmission info in VINs.
VIN Decode Validation Results (13 Families)
| Family | VIN | Make | Model | Year | Trim | Engine | Trans |
|---|---|---|---|---|---|---|---|
| Honda/Acura | 5J8YE1H05SL018611 | ACURA | MDX | 2025 | SH-AWD A-Spec | 3.5L V6 290hp | 10-Spd Auto |
| Honda/Acura | 2HGFE4F88SH315466 | HONDA | Civic | 2025 | Sport Hybrid | 2.0L I4 141hp | e-CVT |
| Toyota/Lexus | 5TFJA5DB4SX327537 | TOYOTA | Tundra | 2025 | Limited | 3.4L V6 389hp | 10-Spd Auto |
| Nissan/Infiniti | 5N1AL1FW9TC332353 | INFINITI | QX60 | 2026 | Luxe | 2.0L (no cyl/hp) | MISSING |
| GM | 3GTUUFEL6PG140748 | GMC | Sierra | 2023 | AT4X | 6.2L V8 (no hp) | MISSING |
| Stellantis | 1C4HJXEG7PW506480 | JEEP | Wrangler | 2023 | Sahara | 3.6L V6 285hp | MISSING |
| Ford | 1FTFW4L59SFC03038 | FORD | F-150 | 2025 | Tremor | 5.0L V8 (no hp) | MISSING |
| VW Group | WVWEB7CD9RW229116 | VOLKSWAGEN | Golf R | 2024 | MISSING | 2.0L 4cyl 315hp | Auto (no spd) |
| BMW | 5YM13ET06R9S31554 | BMW | X5 | 2024 | X5 M Competition | 4.4L 8cyl 617hp | MISSING |
| Mercedes | W1KAF4HB1SR287126 | MERCEDES-BENZ | C-Class | 2025 | C300 4MATIC | 2.0L I4 255hp | 9-Spd Auto |
| Hyundai/Kia | 5XYRLDJC0SG336002 | KIA | Sorento | 2025 | S | 2.5L 4cyl 191hp | MISSING |
| Subaru | JF1VBAF67P9806852 | SUBARU | WRX | 2023 | Premium | 2.4L 4cyl 271hp | 6-Spd Manual |
| Mazda | JM3KFBCL3R0522361 | MAZDA | CX-5 | 2024 | Preferred Pkg | 2.5L I4 187hp | 6-Spd Auto |
| Volvo | YV4M12RJ9S1094167 | VOLVO | XC60 | 2025 | Core | 2.0L 4cyl 247hp | 8-Spd Auto |
Transmission Data Coverage in vPIC Schemas
| Coverage | Makes | Trans Schemas / Total |
|---|---|---|
| HIGH (>40%) | Honda, Acura, Subaru, Audi, VW, Mercedes, Jaguar | Honda 225/233, Acura 42/45, Subaru 47/74, Audi 46/55, VW 47/132, Mercedes 13/25, Jaguar 17/17 |
| LOW (<10%) | Chevrolet, Cadillac, Nissan, Infiniti, Mazda, Volvo | Chevrolet 4/164, Cadillac 7/43, Nissan 4/82, Infiniti 4/74, Mazda 4/36, Volvo 2/72 |
| NONE (0%) | GMC, Buick, Ford, Lincoln, Jeep, Dodge, Chrysler, Ram, Fiat, BMW, MINI, Porsche, Hyundai, Kia, Genesis, Lexus, Tesla, Rivian, Polestar | 0% |
Makes WITHOUT Transmission Data (22 of 41 makes = 54%)
- ALL Stellantis: Chrysler, Dodge, Jeep, Ram, Fiat
- ALL Ford: Ford, Lincoln
- ALL Korean: Hyundai, Kia, Genesis
- ALL BMW Group: BMW, MINI
- GM (partial): GMC, Buick (Chevy/Cadillac have minimal)
- Others: Lexus, Porsche, Bentley, Lamborghini, Tesla, Rivian, Polestar
Extraction Strategy (SELECTED)
Dual-Record Strategy for Missing Transmission Data
When transmission data is NOT available from vPIC:
- Create TWO records for each vehicle configuration
  - One with trans_display = "Automatic", trans_canon = "automatic"
  - One with trans_display = "Manual", trans_canon = "manual"
This ensures:
- All transmission options available in dropdown for user selection
- User can select the correct transmission type
- No false "Unknown" values that break filtering
Implementation Logic
def generate_trans_records(has_trans_data: bool, trans_style: str, trans_speeds: str):
    if has_trans_data:
        # Use actual vPIC data
        return [(format_trans_display(trans_style, trans_speeds),
                 canonicalize_trans(trans_style))]
    else:
        # Generate both options
        return [
            ("Automatic", "automatic"),
            ("Manual", "manual")
        ]
Expected Output Growth
For makes without trans data, record count approximately doubles:
- GMC Sierra AT4X + 6.2L V8 → 2 records (Auto + Manual)
- Ford F-150 Tremor + 5.0L V8 → 2 records (Auto + Manual)
This is acceptable as it provides complete dropdown coverage.
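The growth described above can be illustrated with a self-contained sketch that expands one configuration into its output rows. The helper `expand_with_transmissions`, the dict field names, and the simple `canonicalize_trans` rule (styles starting with "manual" map to "manual", everything else to "automatic") are assumptions for illustration only:

```python
from typing import Dict, List, Optional, Tuple


def canonicalize_trans(style: str) -> str:
    # Assumed rule: only styles that start with "manual" count as manual.
    return "manual" if style.strip().lower().startswith("manual") else "automatic"


def expand_with_transmissions(config: Dict[str, str],
                              trans: Optional[Tuple[str, Optional[str]]] = None
                              ) -> List[Dict[str, str]]:
    """Return one row when vPIC has (style, speeds), else Automatic and Manual rows."""
    if trans:
        style, speeds = trans
        display = f"{speeds}-Speed {style}" if speeds else style
        options = [(display, canonicalize_trans(style))]
    else:
        options = [("Automatic", "automatic"), ("Manual", "manual")]
    return [{**config, "trans_display": d, "trans_canon": c} for d, c in options]


sierra = {"make": "GMC", "model": "Sierra", "trim": "AT4X", "engine_display": "6.2L V8"}
wrx = {"make": "Subaru", "model": "WRX", "trim": "Premium", "engine_display": "2.4L 271 hp 4cyl"}

print(len(expand_with_transmissions(sierra)))                 # 2
print(len(expand_with_transmissions(wrx, ("Manual", "6"))))   # 1
```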
Validated Extraction Examples
Acura MDX 2025 (VIN: 5J8YE1H05SL018611)
- vPIC: Make=ACURA, Model=MDX, Trim=SH-AWD A-Spec, Engine=3.5L V6 290hp, Trans=10-Speed Automatic
- Output: 3.5L 290 hp V6|10-Speed Automatic
Honda Civic 2025 (VIN: 2HGFE4F88SH315466)
- vPIC: Make=HONDA, Model=Civic, Trim=Sport Hybrid / Sport Touring Hybrid, Engine=2L I4 141hp, Trans=e-CVT
- Output: 2.0L 141 hp I4|Electronic Continuously Variable (e-CVT)
Toyota Tundra 2025 (VIN: 5TFJA5DB4SX327537)
- vPIC: Make=TOYOTA, Model=Tundra, Trim=Limited, Engine=3.4L V6 389hp, Trans=10-Speed Automatic
- Output: 3.4L 389 hp V6|10-Speed Automatic
Mercedes C-Class 2025 (VIN: W1KAF4HB1SR287126)
- vPIC: Make=MERCEDES-BENZ, Model=C-Class, Trim=C300 4MATIC, Engine=2.0L I4 255hp, Trans=9-Speed Automatic
- Output: 2.0L 255 hp I4|9-Speed Automatic
Subaru WRX 2023 (VIN: JF1VBAF67P9806852)
- vPIC: Make=SUBARU, Model=WRX, Trim=Premium, Engine=2.4L 4cyl 271hp, Trans=6-Speed Manual
- Output: 2.4L 271 hp 4cyl|6-Speed Manual
Mazda CX-5 2024 (VIN: JM3KFBCL3R0522361)
- vPIC: Make=MAZDA, Model=CX-5, Trim=Preferred Package, Engine=2.5L I4 187hp, Trans=6-Speed Automatic
- Output: 2.5L 187 hp I4|6-Speed Automatic
Volvo XC60 2025 (VIN: YV4M12RJ9S1094167)
- vPIC: Make=VOLVO, Model=XC60, Trim=Core, Engine=2.0L 4cyl 247hp, Trans=8-Speed Automatic
- Output: 2.0L 247 hp 4cyl|8-Speed Automatic