# vPIC ETL Implementation Plan v2

## Overview

Extract vehicle dropdown data from NHTSA vPIC database for MY2022+ to supplement existing VehAPI data. This revised plan uses a make-specific extraction approach with proper VIN schema parsing.

## Key Changes from v1

1. **Limit to VehAPI makes only** - Only extract the 48 makes that exist in VehAPI data
2. **VIN schema-based extraction** - Extract directly from VIN patterns, not defs_model
3. **Proper field formatting** - Match VehAPI display string formats
4. **Make-specific logic** - Handle different manufacturers' data patterns

## Critical Discovery: WMI Linkage

**Must use `wmi_make` junction table (many-to-many), NOT `wmi.makeid` (one-to-many):**

```sql
-- CORRECT: via wmi_make (finds all makes including Toyota, Hyundai, etc.)
FROM vpic.make m
JOIN vpic.wmi_make wm ON wm.makeid = m.id
JOIN vpic.wmi w ON w.id = wm.wmiid

-- WRONG: via wmi.makeid (misses many major brands)
FROM vpic.make m
JOIN vpic.wmi w ON w.makeid = m.id
```

---

## Make Availability Summary

| Status | Count | Makes |
|--------|-------|-------|
| **Available (2022+ schemas)** | 46 | See table below |
| **No 2022+ data** | 2 | Hummer (discontinued 2010), Scion (discontinued 2016) |

---

## Per-Make Analysis

### Group 1: Japanese Manufacturers (Honda/Acura, Toyota/Lexus, Nissan/Infiniti)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Acura | Acura | Acura | 48 | Ready |
| Honda | Honda | Honda | 238 | Ready |
| Lexus | Lexus | Lexus | 90 | Ready |
| Toyota | Toyota | Toyota | 152 | Ready |
| Infiniti | INFINITI | Infiniti | 76 | Ready |
| Nissan | Nissan | Nissan | 85 | Ready |
| Mazda | Mazda | Mazda | 37 | Ready |
| Mitsubishi | Mitsubishi | Mitsubishi | 11 | Ready |
| Subaru | Subaru | Subaru | 75 | Ready |
| Isuzu | Isuzu | Isuzu | 11 | Ready |

### Group 2: Korean Manufacturers (Hyundai/Kia/Genesis)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Genesis | Genesis | Genesis | 74 | Ready |
| Hyundai | Hyundai | Hyundai | 177 | Ready |
| Kia | Kia | Kia | 72 | Ready |

### Group 3: American - GM (Chevrolet, GMC, Buick, Cadillac)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Buick | Buick | Buick | 20 | Ready |
| Cadillac | Cadillac | Cadillac | 50 | Ready |
| Chevrolet | Chevrolet | Chevrolet | 185 | Ready |
| GMC | GMC | GMC | 107 | Ready |
| Oldsmobile | Oldsmobile | Oldsmobile | 1 | Limited |
| Pontiac | Pontiac | Pontiac | 5 | Limited (2022-2024) |

### Group 4: American - Stellantis (Chrysler, Dodge, Jeep, Ram, Fiat)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Chrysler | Chrysler | Chrysler | 81 | Ready |
| Dodge | Dodge | Dodge | 86 | Ready |
| FIAT | FIAT | Fiat | 91 | Ready (case diff) |
| Jeep | Jeep | Jeep | 81 | Ready |
| RAM | RAM | Ram | 81 | Ready (case diff) |
| Plymouth | Plymouth | Plymouth | 4 | Limited |

### Group 5: American - Ford

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Ford | Ford | Ford | 108 | Ready |
| Lincoln | Lincoln | Lincoln | 21 | Ready |
| Mercury | Mercury | Mercury | 0 | No data (discontinued 2011) |

### Group 6: American - EV Startups

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Polestar | Polestar | Polestar | 12 | Ready |
| Rivian | Rivian | RIVIAN | 10 | Ready (case diff) |
| Tesla | Tesla | Tesla | 14 | Ready |

### Group 7: German Manufacturers

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Audi | Audi | Audi | 55 | Ready |
| BMW | BMW | BMW | 61 | Ready |
| Mercedes-Benz | Mercedes-Benz | Mercedes-Benz | 39 | Ready |
| MINI | MINI | MINI | 10 | Ready |
| Porsche | Porsche | Porsche | 23 | Ready |
| smart | smart | smart | 5 | Ready |
| Volkswagen | Volkswagen | Volkswagen | 134 | Ready |

### Group 8: European Luxury

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
|------|-------------|-----------|-----------------|--------|
| Bentley | Bentley | Bentley | 48 | Ready |
| Ferrari | Ferrari | Ferrari | 9 | Ready |
| Jaguar | Jaguar | Jaguar | 17 | Ready |
| Lamborghini | Lamborghini | Lamborghini | 10 | Ready |
| Lotus | Lotus | Lotus | 5 | Ready |
| Maserati | Maserati | Maserati | 19 | Ready |
| McLaren | McLaren | McLaren | 4 | Ready |
| Volvo | Volvo | Volvo | 80 | Ready |

### Group 9: Discontinued (No 2022+ Data)

| Make | VehAPI Name | Reason | Action |
|------|-------------|--------|--------|
| Hummer | Hummer | Discontinued 2010 (new EV under GMC) | Skip - use existing VehAPI |
| Scion | Scion | Discontinued 2016 | Skip - use existing VehAPI |
| Saab | Saab | Discontinued 2012 | Limited schemas (9) |
| Mercury | Mercury | Discontinued 2011 | No schemas |

---

## Extraction Architecture

### Data Flow

```
vPIC VIN Schemas → Pattern Extraction → Format Transformation → SQLite Pairs
                          ↓
                   Filter by:
                   - 48 VehAPI makes only
                   - Year >= 2022
                   - Vehicle types (exclude motorcycles, trailers, buses)
```

### Core Query Strategy

For each allowed make:

1. Find WMIs linked to that make
2. Get VIN schemas for years 2022+
3. Extract from patterns:
   - Model (from schema name or pattern)
   - Trim (Element: Trim)
   - Displacement (Element: DisplacementL)
   - Horsepower (Element: EngineHP)
   - Cylinders (Element: EngineCylinders)
   - Engine Config (Element: EngineConfiguration)
   - Transmission Style (Element: TransmissionStyle)
   - Transmission Speeds (Element: TransmissionSpeeds)

---

## Acura Extraction Template

This pattern applies to Honda/Acura and similar well-structured manufacturers.

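
As a rough illustration of step 3 in the core query strategy, the sketch below groups raw pattern rows for one schema by element code. The `(element_code, value)` row shape and the names `ELEMENT_FIELDS` and `collect_schema_values` are assumptions made for this example; they are not part of the existing `vpic_extract.py`.

```python
from collections import defaultdict

# Element codes named in the core query strategy, mapped to
# hypothetical output field names (an assumption for this sketch).
ELEMENT_FIELDS = {
    "Trim": "trim",
    "DisplacementL": "displacement",
    "EngineHP": "hp",
    "EngineCylinders": "cylinders",
    "EngineConfiguration": "engine_config",
    "TransmissionStyle": "trans_style",
    "TransmissionSpeeds": "trans_speeds",
}


def collect_schema_values(pattern_rows):
    """Group (element_code, value) rows into {field: sorted unique values}."""
    grouped = defaultdict(set)
    for code, value in pattern_rows:
        field = ELEMENT_FIELDS.get(code)
        if field is None or value is None:
            continue  # skip elements the dropdown does not use
        grouped[field].add(str(value).strip())
    return {field: sorted(values) for field, values in grouped.items()}


rows = [
    ("Trim", "Technology"),
    ("DisplacementL", "3.5"),
    ("DisplacementL", "3.5"),   # duplicates collapse into one value
    ("EngineHP", "290"),
    ("BodyClass", "SUV"),       # not in ELEMENT_FIELDS, so dropped
]
print(collect_schema_values(rows))
```

Keeping the element list in one mapping means a new element only needs one extra entry rather than another `CASE` branch in each query.
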
### Sample VIN Schema: Acura MDX 2025 (schema_id: 26929)

| Element | Code | Values |
|---------|------|--------|
| Trim | Trim | MDX, Technology, SH-AWD, SH-AWD Technology, SH-AWD A-Spec, SH-AWD Advance, SH-AWD A-Spec Advance, SH-AWD TYPE S ADVANCE |
| Displacement | DisplacementL | 3.5, 3.0 |
| Horsepower | EngineHP | 290, 355 |
| Cylinders | EngineCylinders | 6 |
| Engine Config | EngineConfiguration | V-Shaped |
| Trans Style | TransmissionStyle | Automatic |
| Trans Speeds | TransmissionSpeeds | 10 |

### Output Format

**Engine Display** (match VehAPI):

```
{DisplacementL}L {EngineHP} hp V{EngineCylinders} → "3.5L 290 hp V6"
```

**Transmission Display** (match VehAPI):

```
{TransmissionSpeeds}-Speed {TransmissionStyle} → "10-Speed Automatic"
```

### Extraction SQL Template

```sql
WITH schema_data AS (
    SELECT DISTINCT
        vs.id AS schema_id,
        vs.name AS schema_name,
        wvs.yearfrom,
        COALESCE(wvs.yearto, 2027) AS yearto,
        m.name AS make_name
    FROM vpic.make m
    -- Link makes through the wmi_make junction table, not wmi.makeid
    JOIN vpic.wmi_make wm ON wm.makeid = m.id
    JOIN vpic.wmi w ON w.id = wm.wmiid
    JOIN vpic.wmi_vinschema wvs ON w.id = wvs.wmiid
    JOIN vpic.vinschema vs ON wvs.vinschemaid = vs.id
    WHERE LOWER(m.name) IN ('acura', 'honda', ...)  -- VehAPI makes
      -- Parentheses keep the OR from bypassing the make filter
      AND (wvs.yearfrom >= 2022 OR wvs.yearto >= 2022)
),
trim_data AS (
    SELECT DISTINCT
        sd.schema_id,
        p.attributeid AS trim
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    WHERE e.code = 'Trim'
),
engine_data AS (
    SELECT DISTINCT
        sd.schema_id,
        MAX(CASE WHEN e.code = 'DisplacementL' THEN p.attributeid END) AS displacement,
        MAX(CASE WHEN e.code = 'EngineHP' THEN p.attributeid END) AS hp,
        MAX(CASE WHEN e.code = 'EngineCylinders' THEN p.attributeid END) AS cylinders,
        MAX(CASE WHEN e.code = 'EngineConfiguration' THEN ec.name END) AS config
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    LEFT JOIN vpic.engineconfiguration ec
        ON e.code = 'EngineConfiguration'
        AND p.attributeid ~ '^[0-9]+$'
        AND ec.id = CAST(p.attributeid AS INT)
    WHERE e.code IN ('DisplacementL', 'EngineHP', 'EngineCylinders', 'EngineConfiguration')
    GROUP BY sd.schema_id, p.keys  -- Group by VIN pattern position
),
trans_data AS (
    SELECT DISTINCT
        sd.schema_id,
        MAX(CASE WHEN e.code = 'TransmissionStyle' THEN t.name END) AS style,
        MAX(CASE WHEN e.code = 'TransmissionSpeeds' THEN p.attributeid END) AS speeds
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    LEFT JOIN vpic.transmission t
        ON e.code = 'TransmissionStyle'
        AND p.attributeid ~ '^[0-9]+$'
        AND t.id = CAST(p.attributeid AS INT)
    WHERE e.code IN ('TransmissionStyle', 'TransmissionSpeeds')
    -- Group by VIN pattern so style and speeds land on the same row
    GROUP BY sd.schema_id, p.keys
)
SELECT ...
```

---

## Allowed Makes (48 from VehAPI)

```python
ALLOWED_MAKES = [
    'Acura', 'Audi', 'Bentley', 'BMW', 'Buick', 'Cadillac', 'Chevrolet', 'Chrysler',
    'Dodge', 'Ferrari', 'FIAT', 'Ford', 'Genesis', 'GMC', 'Honda', 'Hummer',
    'Hyundai', 'INFINITI', 'Isuzu', 'Jaguar', 'Jeep', 'Kia', 'Lamborghini', 'Lexus',
    'Lincoln', 'Lotus', 'Maserati', 'Mazda', 'McLaren', 'Mercedes-Benz', 'Mercury', 'MINI',
    'Mitsubishi', 'Nissan', 'Oldsmobile', 'Plymouth', 'Polestar', 'Pontiac', 'Porsche', 'RAM',
    'Rivian', 'Saab', 'Scion', 'smart', 'Subaru', 'Tesla', 'Toyota', 'Volkswagen',
    'Volvo'
]
```

Note: Some makes may have different names in vPIC (case variations, abbreviations).

---

## Implementation Steps

### Phase 1: Rewrite vpic_extract.py

**File:** `vpic_extract.py`

Core extraction query (uses wmi_make junction table):

```sql
WITH base AS (
    SELECT DISTINCT
        m.name AS make_name,
        vs.id AS schema_id,
        vs.name AS schema_name,
        generate_series(
            GREATEST(wvs.yearfrom, 2022),
            -- Cast so both bounds are integers regardless of EXTRACT's return type
            COALESCE(wvs.yearto, EXTRACT(YEAR FROM NOW())::INT + 2)
        )::INT AS year
    FROM vpic.make m
    JOIN vpic.wmi_make wm ON wm.makeid = m.id
    JOIN vpic.wmi w ON w.id = wm.wmiid
    JOIN vpic.wmi_vinschema wvs ON w.id = wvs.wmiid
    JOIN vpic.vinschema vs ON wvs.vinschemaid = vs.id
    WHERE LOWER(m.name) IN ({allowed_makes})
      AND (wvs.yearfrom >= 2022 OR wvs.yearto >= 2022)
)
SELECT ...
```

**Key functions to implement:**

1. `extract_model_from_schema_name(schema_name)` - Parse "Acura MDX Schema..." → "MDX"
2. `get_schema_patterns(schema_id)` - Get all pattern data for a schema
3. `format_engine_display(disp, hp, cyl, config)` - Format as "3.5L 290 hp V6"
4. `format_trans_display(style, speeds)` - Format as "10-Speed Automatic"
5. `generate_trans_records(has_data, style, speeds)` - Return 1 or 2 records

**Make name normalization:**

```python
MAKE_MAPPING = {
    'INFINITI': 'INFINITI',  # VehAPI uses all-caps
    'FIAT': 'FIAT',
    'RAM': 'RAM',
    'RIVIAN': 'Rivian',  # vPIC uses all-caps, normalize
    # ...
    # etc
}
```

### Phase 2: Test Extraction

Test with validated VINs:

```bash
source .venv/bin/activate
python3 vpic_extract.py --test-vin 5J8YE1H05SL018611  # Acura MDX
python3 vpic_extract.py --test-vin 5TFJA5DB4SX327537  # Toyota Tundra
python3 vpic_extract.py --test-vin 3GTUUFEL6PG140748  # GMC Sierra
```

### Phase 3: Full Extraction

```bash
python3 vpic_extract.py --min-year 2022 --output-dir snapshots/vpic-2025-12
```

### Phase 4: Merge & Import

```bash
# Merge vPIC with existing VehAPI data
sqlite3 snapshots/merged/snapshot.sqlite "
CREATE TABLE pairs(...);
ATTACH 'snapshots/vehicle-drop-down.sqlite' AS db1;
ATTACH 'snapshots/vpic-2025-12/snapshot.sqlite' AS db2;
INSERT OR IGNORE INTO pairs SELECT * FROM db1.pairs WHERE year < 2022;
INSERT OR IGNORE INTO pairs SELECT * FROM db2.pairs;
"

# Generate SQL and import
python3 etl_generate_sql.py --snapshot-path snapshots/merged/snapshot.sqlite
./import_data.sh
```

---

## Files to Modify

| File | Changes |
|------|---------|
| `vpic_extract.py` | Complete rewrite: VIN schema extraction, dual-record trans logic |
| `README.md` | Already updated with workflow |

---

## Success Criteria

1. Extract all 41 makes with 2022+ VIN schemas
2. ~2,500-5,000 unique vehicle configurations (Year/Make/Model/Trim/Engine)
3. Transmission: Use vPIC data where available (7 makes), dual-record elsewhere
4. Output format matches VehAPI: "3.5L 290 hp V6" / "10-Speed Automatic"
5. Merge preserves 2015-2021 VehAPI data
6. QA validation passes after import

---

## Make Analysis Status (All Families Validated)

| Family | Makes | Status | Trans Data | Strategy |
|--------|-------|--------|------------|----------|
| Honda/Acura | Acura, Honda | VALIDATED | YES (93-97%) | Use vPIC trans data |
| Toyota/Lexus | Toyota, Lexus | VALIDATED | PARTIAL (Toyota 23%, Lexus 0%) | Dual-record for Lexus |
| Nissan/Infiniti | Nissan, Infiniti, Mitsubishi | VALIDATED | LOW (5%) | Dual-record |
| GM | Chevrolet, GMC, Buick, Cadillac | VALIDATED | LOW (0-7%) | Dual-record |
| Stellantis | Chrysler, Dodge, Jeep, Ram, Fiat | VALIDATED | NONE (0%) | Dual-record |
| Ford | Ford, Lincoln | VALIDATED | NONE (0%) | Dual-record |
| VW Group | Volkswagen, Audi, Porsche, Bentley, Lamborghini | VALIDATED | MIXED (0-84%) | VW/Audi use vPIC; others dual-record |
| BMW | BMW, MINI | VALIDATED | NONE (0%) | Dual-record |
| Mercedes | Mercedes-Benz, smart | VALIDATED | YES (52%) | Use vPIC trans data |
| Hyundai/Kia/Genesis | Hyundai, Kia, Genesis | VALIDATED | NONE (0%) | Dual-record |
| Subaru | Subaru | VALIDATED | YES (64%) | Use vPIC trans data |
| Mazda | Mazda | VALIDATED | LOW (11%) | Dual-record |
| Volvo | Volvo, Polestar | VALIDATED | LOW (3%/0%) | Dual-record |
| Exotics | Ferrari, Maserati, Jaguar, Lotus, McLaren | VALIDATED | MIXED | Per-make handling |
| EV | Tesla, Rivian | VALIDATED | NONE (0%) | Dual-record (though EVs don't have "manual") |

### Special Cases

1. **Electric Vehicles** (Tesla, Rivian, Polestar): Don't have manual transmissions
   - Still create dual-record for consistency with dropdown
   - User can select "Automatic" (single-speed EV)
2. **Luxury Exotics** (Ferrari, Lamborghini, etc.): Mix of automated manual/DCT
   - Dual-record covers all options

---

## CRITICAL FINDING: Transmission Data Availability

**Most manufacturers do NOT encode transmission info in VINs.**

### VIN Decode Validation Results (13 Families)

| Family | VIN | Make | Model | Year | Trim | Engine | Trans |
|--------|-----|------|-------|------|------|--------|-------|
| Honda/Acura | 5J8YE1H05SL018611 | ACURA | MDX | 2025 | SH-AWD A-Spec | 3.5L V6 290hp | 10-Spd Auto |
| Honda/Acura | 2HGFE4F88SH315466 | HONDA | Civic | 2025 | Sport Hybrid | 2.0L I4 141hp | e-CVT |
| Toyota/Lexus | 5TFJA5DB4SX327537 | TOYOTA | Tundra | 2025 | Limited | 3.4L V6 389hp | 10-Spd Auto |
| Nissan/Infiniti | 5N1AL1FW9TC332353 | INFINITI | QX60 | 2026 | Luxe | 2.0L (no cyl/hp) | **MISSING** |
| GM | 3GTUUFEL6PG140748 | GMC | Sierra | 2023 | AT4X | 6.2L V8 (no hp) | **MISSING** |
| Stellantis | 1C4HJXEG7PW506480 | JEEP | Wrangler | 2023 | Sahara | 3.6L V6 285hp | **MISSING** |
| Ford | 1FTFW4L59SFC03038 | FORD | F-150 | 2025 | Tremor | 5.0L V8 (no hp) | **MISSING** |
| VW Group | WVWEB7CD9RW229116 | VOLKSWAGEN | Golf R | 2024 | **MISSING** | 2.0L 4cyl 315hp | Auto (no spd) |
| BMW | 5YM13ET06R9S31554 | BMW | X5 | 2024 | X5 M Competition | 4.4L 8cyl 617hp | **MISSING** |
| Mercedes | W1KAF4HB1SR287126 | MERCEDES-BENZ | C-Class | 2025 | C300 4MATIC | 2.0L I4 255hp | 9-Spd Auto |
| Hyundai/Kia | 5XYRLDJC0SG336002 | KIA | Sorento | 2025 | S | 2.5L 4cyl 191hp | **MISSING** |
| Subaru | JF1VBAF67P9806852 | SUBARU | WRX | 2023 | Premium | 2.4L 4cyl 271hp | 6-Spd Manual |
| Mazda | JM3KFBCL3R0522361 | MAZDA | CX-5 | 2024 | Preferred Pkg | 2.5L I4 187hp | 6-Spd Auto |
| Volvo | YV4M12RJ9S1094167 | VOLVO | XC60 | 2025 | Core | 2.0L 4cyl 247hp | 8-Spd Auto |

### Transmission Data Coverage in vPIC Schemas

| Coverage | Makes | Trans Schemas / Total |
|----------|-------|----------------------|
| **HIGH (>40%)** | Honda, Acura, Subaru, Audi, VW, Mercedes, Jaguar | 225/233, 42/45, 47/74, 46/55, 47/132, 13/25, 17/17 |
| **LOW (<10%)** | Chevrolet, Cadillac, Nissan, Infiniti, Mazda, Volvo | 4/164, 7/43, 4/82, 4/74, 4/36, 2/72 |
| **NONE (0%)** | GMC, Buick, Ford, Lincoln, Jeep, Dodge, Chrysler, Ram, Fiat, BMW, MINI, Porsche, Hyundai, Kia, Genesis, Lexus, Tesla, Rivian, Polestar | 0% |

### Makes WITHOUT Transmission Data (22 of 41 makes = 54%)

- **ALL Stellantis**: Chrysler, Dodge, Jeep, Ram, Fiat
- **ALL Ford**: Ford, Lincoln
- **ALL Korean**: Hyundai, Kia, Genesis
- **ALL BMW Group**: BMW, MINI
- **GM (partial)**: GMC, Buick (Chevy/Cadillac have minimal)
- **Others**: Lexus, Porsche, Bentley, Lamborghini, Tesla, Rivian, Polestar

---

## Extraction Strategy (SELECTED)

### Dual-Record Strategy for Missing Transmission Data

When transmission data is NOT available from vPIC:

- **Create TWO records** for each vehicle configuration
- One with `trans_display = "Automatic"`, `trans_canon = "automatic"`
- One with `trans_display = "Manual"`, `trans_canon = "manual"`

This ensures:

- All transmission options available in dropdown for user selection
- User can select the correct transmission type
- No false "Unknown" values that break filtering

### Implementation Logic

```python
def generate_trans_records(has_trans_data: bool, trans_style: str, trans_speeds: str):
    if has_trans_data:
        # Use actual vPIC data
        return [(format_trans_display(trans_style, trans_speeds),
                 canonicalize_trans(trans_style))]
    else:
        # Generate both options
        return [
            ("Automatic", "automatic"),
            ("Manual", "manual")
        ]
```

### Expected Output Growth

For makes without trans data, record count approximately doubles:

- GMC Sierra AT4X + 6.2L V8 → 2 records (Auto + Manual)
- Ford F-150 Tremor + 5.0L V8 → 2 records (Auto + Manual)

This is acceptable as it provides complete dropdown coverage.

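
The record growth described above can be checked with a small self-contained sketch. The plan names `format_trans_display` and `canonicalize_trans` but does not define them, so the versions here are simplified assumptions, and `expand_configs` along with its tuple layout is hypothetical.

```python
def format_trans_display(style, speeds):
    """Assumed format: '10-Speed Automatic', or just the style if speeds is missing."""
    return f"{speeds}-Speed {style}" if speeds else style


def canonicalize_trans(style):
    """Assumed rule: anything mentioning 'manual' is manual, everything else automatic."""
    return "manual" if "manual" in style.lower() else "automatic"


def generate_trans_records(has_trans_data, trans_style, trans_speeds):
    if has_trans_data:
        # Use actual vPIC data
        return [(format_trans_display(trans_style, trans_speeds),
                 canonicalize_trans(trans_style))]
    # No vPIC transmission data: emit both dropdown options
    return [("Automatic", "automatic"), ("Manual", "manual")]


def expand_configs(configs):
    """Attach transmission records to (year, make, model, trim, engine, style, speeds)."""
    out = []
    for year, make, model, trim, engine, style, speeds in configs:
        for display, canon in generate_trans_records(bool(style), style, speeds):
            out.append((year, make, model, trim, engine, display, canon))
    return out


configs = [
    (2025, "Acura", "MDX", "SH-AWD A-Spec", "3.5L 290 hp V6", "Automatic", "10"),
    (2023, "GMC", "Sierra", "AT4X", "6.2L V8", None, None),  # no trans data in vPIC
]
for record in expand_configs(configs):
    print(record)
```

With these two inputs the Acura configuration stays a single record while the GMC configuration becomes two, which is the doubling the estimate above relies on.
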

---

## Validated Extraction Examples

### Acura MDX 2025 (VIN: 5J8YE1H05SL018611)

- **vPIC**: Make=ACURA, Model=MDX, Trim=SH-AWD A-Spec, Engine=3.5L V6 290hp, Trans=10-Speed Automatic
- **Output**: `3.5L 290 hp V6` | `10-Speed Automatic`

### Honda Civic 2025 (VIN: 2HGFE4F88SH315466)

- **vPIC**: Make=HONDA, Model=Civic, Trim=Sport Hybrid / Sport Touring Hybrid, Engine=2L I4 141hp, Trans=e-CVT
- **Output**: `2.0L 141 hp I4` | `Electronic Continuously Variable (e-CVT)`

### Toyota Tundra 2025 (VIN: 5TFJA5DB4SX327537)

- **vPIC**: Make=TOYOTA, Model=Tundra, Trim=Limited, Engine=3.4L V6 389hp, Trans=10-Speed Automatic
- **Output**: `3.4L 389 hp V6` | `10-Speed Automatic`

### Mercedes C-Class 2025 (VIN: W1KAF4HB1SR287126)

- **vPIC**: Make=MERCEDES-BENZ, Model=C-Class, Trim=C300 4MATIC, Engine=2.0L I4 255hp, Trans=9-Speed Automatic
- **Output**: `2.0L 255 hp I4` | `9-Speed Automatic`

### Subaru WRX 2023 (VIN: JF1VBAF67P9806852)

- **vPIC**: Make=SUBARU, Model=WRX, Trim=Premium, Engine=2.4L 4cyl 271hp, Trans=6-Speed Manual
- **Output**: `2.4L 271 hp 4cyl` | `6-Speed Manual`

### Mazda CX-5 2024 (VIN: JM3KFBCL3R0522361)

- **vPIC**: Make=MAZDA, Model=CX-5, Trim=Preferred Package, Engine=2.5L I4 187hp, Trans=6-Speed Automatic
- **Output**: `2.5L 187 hp I4` | `6-Speed Automatic`

### Volvo XC60 2025 (VIN: YV4M12RJ9S1094167)

- **vPIC**: Make=VOLVO, Model=XC60, Trim=Core, Engine=2.0L 4cyl 247hp, Trans=8-Speed Automatic
- **Output**: `2.0L 247 hp 4cyl` | `8-Speed Automatic`
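
The engine strings in these examples follow a consistent shape, which the sketch below tries to reproduce. It is one possible `format_engine_display`, not the final implementation: the `CONFIG_PREFIX` mapping (including the `"In-Line"` key) and the fallback to `4cyl` when no configuration is known are assumptions inferred from the outputs above, and inputs are assumed to be single numeric values.

```python
# Assumed mapping from vPIC engine configuration names to the display prefix.
CONFIG_PREFIX = {"V-Shaped": "V", "In-Line": "I"}


def format_engine_display(disp, hp, cyl, config):
    """Build a VehAPI-style engine string such as '3.5L 290 hp V6'."""
    parts = []
    if disp not in (None, ""):
        parts.append(f"{float(disp):.1f}L")        # "2" becomes "2.0L"
    if hp not in (None, ""):
        parts.append(f"{int(float(hp))} hp")
    if cyl not in (None, ""):
        count = int(float(cyl))
        prefix = CONFIG_PREFIX.get(config or "")
        # Fall back to "4cyl" style when the configuration is unknown
        parts.append(f"{prefix}{count}" if prefix else f"{count}cyl")
    return " ".join(parts)


print(format_engine_display("3.5", "290", "6", "V-Shaped"))
print(format_engine_display("2", "141", "4", "In-Line"))
print(format_engine_display("2.4", "271", "4", None))
```

Missing fields are simply skipped, so a decode with no horsepower (as in the GMC and Ford results) would still yield a usable string such as `6.2L V8`.
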