motovaultpro/data/vehicle-etl/logical-plotting-hartmanis.md
2025-12-24 17:20:11 -06:00

vPIC ETL Implementation Plan v2

Overview

Extract vehicle dropdown data from the NHTSA vPIC database for model years 2022 and later (MY2022+) to supplement the existing VehAPI data. This revised plan extracts data make by make and parses VIN schemas directly.

Key Changes from v1

  1. Limit to VehAPI makes only - Only extract the 48 makes that exist in VehAPI data
  2. VIN schema-based extraction - Extract directly from VIN patterns, not defs_model
  3. Proper field formatting - Match VehAPI display string formats
  4. Make-specific logic - Handle different manufacturers' data patterns

Critical Discovery: WMI Linkage

Link makes to WMIs through the wmi_make junction table (many-to-many), NOT through wmi.makeid (one-to-many):

-- CORRECT: via wmi_make (finds all makes including Toyota, Hyundai, etc.)
FROM vpic.make m
JOIN vpic.wmi_make wm ON wm.makeid = m.id
JOIN vpic.wmi w ON w.id = wm.wmiid

-- WRONG: via wmi.makeid (misses many major brands)
FROM vpic.make m
JOIN vpic.wmi w ON w.makeid = m.id
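
The difference between the two joins can be reproduced on toy data. The sketch below is a minimal illustration that uses an in-memory SQLite database instead of the PostgreSQL vpic schema. The rows, the WMI codes, and the helper names makes_via_junction and makes_via_direct are invented for the example and are not taken from vPIC.

```python
import sqlite3

# Toy data only: table and column names follow the queries above, but the
# rows are invented to show the join behavior, not copied from vPIC.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE make (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wmi (id INTEGER PRIMARY KEY, wmi TEXT, makeid INTEGER);
CREATE TABLE wmi_make (wmiid INTEGER, makeid INTEGER);
INSERT INTO make VALUES (1, 'Toyota'), (2, 'Lexus');
-- WMI 10 is shared by two makes, so wmi.makeid cannot describe it.
INSERT INTO wmi VALUES (10, 'XX1', NULL), (11, 'XX2', 1);
INSERT INTO wmi_make VALUES (10, 1), (10, 2), (11, 1);
""")


def makes_via_junction(conn):
    """Makes reachable through the many-to-many wmi_make table."""
    rows = conn.execute("""
        SELECT DISTINCT m.name
        FROM make m
        JOIN wmi_make wm ON wm.makeid = m.id
        JOIN wmi w ON w.id = wm.wmiid
        ORDER BY m.name
    """)
    return [r[0] for r in rows]


def makes_via_direct(conn):
    """Makes reachable through the one-to-many wmi.makeid column."""
    rows = conn.execute("""
        SELECT DISTINCT m.name
        FROM make m
        JOIN wmi w ON w.makeid = m.id
        ORDER BY m.name
    """)
    return [r[0] for r in rows]


print(makes_via_junction(conn))
print(makes_via_direct(conn))
```

In this toy setup the direct join drops the make that only appears in wmi_make, which is the failure mode described above.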

Make Availability Summary

| Status | Count | Makes |
| --- | --- | --- |
| Available (2022+ schemas) | 46 | See per-make tables below |
| No 2022+ data | 2 | Hummer (discontinued 2010), Scion (discontinued 2016) |

Per-Make Analysis

Group 1: Japanese Manufacturers (Honda/Acura, Toyota/Lexus, Nissan/Infiniti)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Acura | Acura | Acura | 48 | Ready |
| Honda | Honda | Honda | 238 | Ready |
| Lexus | Lexus | Lexus | 90 | Ready |
| Toyota | Toyota | Toyota | 152 | Ready |
| Infiniti | INFINITI | Infiniti | 76 | Ready |
| Nissan | Nissan | Nissan | 85 | Ready |
| Mazda | Mazda | Mazda | 37 | Ready |
| Mitsubishi | Mitsubishi | Mitsubishi | 11 | Ready |
| Subaru | Subaru | Subaru | 75 | Ready |
| Isuzu | Isuzu | Isuzu | 11 | Ready |

Group 2: Korean Manufacturers (Hyundai/Kia/Genesis)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Genesis | Genesis | Genesis | 74 | Ready |
| Hyundai | Hyundai | Hyundai | 177 | Ready |
| Kia | Kia | Kia | 72 | Ready |

Group 3: American - GM (Chevrolet, GMC, Buick, Cadillac)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Buick | Buick | Buick | 20 | Ready |
| Cadillac | Cadillac | Cadillac | 50 | Ready |
| Chevrolet | Chevrolet | Chevrolet | 185 | Ready |
| GMC | GMC | GMC | 107 | Ready |
| Oldsmobile | Oldsmobile | Oldsmobile | 1 | Limited |
| Pontiac | Pontiac | Pontiac | 5 | Limited (2022-2024) |

Group 4: American - Stellantis (Chrysler, Dodge, Jeep, Ram, Fiat)

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Chrysler | Chrysler | Chrysler | 81 | Ready |
| Dodge | Dodge | Dodge | 86 | Ready |
| FIAT | FIAT | Fiat | 91 | Ready (case diff) |
| Jeep | Jeep | Jeep | 81 | Ready |
| RAM | RAM | Ram | 81 | Ready (case diff) |
| Plymouth | Plymouth | Plymouth | 4 | Limited |

Group 5: American - Ford

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Ford | Ford | Ford | 108 | Ready |
| Lincoln | Lincoln | Lincoln | 21 | Ready |
| Mercury | Mercury | Mercury | 0 | No data (discontinued 2011) |

Group 6: American - EV Startups

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Polestar | Polestar | Polestar | 12 | Ready |
| Rivian | Rivian | RIVIAN | 10 | Ready (case diff) |
| Tesla | Tesla | Tesla | 14 | Ready |

Group 7: German Manufacturers

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Audi | Audi | Audi | 55 | Ready |
| BMW | BMW | BMW | 61 | Ready |
| Mercedes-Benz | Mercedes-Benz | Mercedes-Benz | 39 | Ready |
| MINI | MINI | MINI | 10 | Ready |
| Porsche | Porsche | Porsche | 23 | Ready |
| smart | smart | smart | 5 | Ready |
| Volkswagen | Volkswagen | Volkswagen | 134 | Ready |

Group 8: European Luxury

| Make | VehAPI Name | vPIC Name | Schemas (2022+) | Status |
| --- | --- | --- | --- | --- |
| Bentley | Bentley | Bentley | 48 | Ready |
| Ferrari | Ferrari | Ferrari | 9 | Ready |
| Jaguar | Jaguar | Jaguar | 17 | Ready |
| Lamborghini | Lamborghini | Lamborghini | 10 | Ready |
| Lotus | Lotus | Lotus | 5 | Ready |
| Maserati | Maserati | Maserati | 19 | Ready |
| McLaren | McLaren | McLaren | 4 | Ready |
| Volvo | Volvo | Volvo | 80 | Ready |

Group 9: Discontinued (No or Limited 2022+ Data)

| Make | VehAPI Name | Reason | Action |
| --- | --- | --- | --- |
| Hummer | Hummer | Discontinued 2010 (new EV under GMC) | Skip - use existing VehAPI |
| Scion | Scion | Discontinued 2016 | Skip - use existing VehAPI |
| Saab | Saab | Discontinued 2012 | Limited schemas (9) |
| Mercury | Mercury | Discontinued 2011 | No schemas |

Extraction Architecture

Data Flow

vPIC VIN Schemas → Pattern Extraction → Format Transformation → SQLite Pairs
         ↓
    Filter by:
    - 48 VehAPI makes only
    - Year >= 2022
    - Vehicle types (exclude motorcycles, trailers, buses)

Core Query Strategy

For each allowed make:

  1. Find WMIs linked to that make
  2. Get VIN schemas for years 2022+
  3. Extract from patterns:
    • Model (from schema name or pattern)
    • Trim (Element: Trim)
    • Displacement (Element: DisplacementL)
    • Horsepower (Element: EngineHP)
    • Cylinders (Element: EngineCylinders)
    • Engine Config (Element: EngineConfiguration)
    • Transmission Style (Element: TransmissionStyle)
    • Transmission Speeds (Element: TransmissionSpeeds)

Acura Extraction Template

This pattern applies to Honda/Acura and similar well-structured manufacturers.

Sample VIN Schema: Acura MDX 2025 (schema_id: 26929)

| Element | Code | Values |
| --- | --- | --- |
| Trim | Trim | MDX, Technology, SH-AWD, SH-AWD Technology, SH-AWD A-Spec, SH-AWD Advance, SH-AWD A-Spec Advance, SH-AWD TYPE S ADVANCE |
| Displacement | DisplacementL | 3.5, 3.0 |
| Horsepower | EngineHP | 290, 355 |
| Cylinders | EngineCylinders | 6 |
| Engine Config | EngineConfiguration | V-Shaped |
| Trans Style | TransmissionStyle | Automatic |
| Trans Speeds | TransmissionSpeeds | 10 |

Output Format

Engine Display (match VehAPI):

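{DisplacementL}L {EngineHP} hp {ConfigPrefix}{EngineCylinders}
→ "3.5L 290 hp V6"  (ConfigPrefix is "V" for V-Shaped; inline engines render as "I4", and an unknown configuration renders as "4cyl")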

Transmission Display (match VehAPI):

{TransmissionSpeeds}-Speed {TransmissionStyle}
→ "10-Speed Automatic"
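
The two display formats can be sketched as small Python helpers. This is a hedged illustration, not the final vpic_extract.py code: the function names come from this plan, but the CONFIG_PREFIX table, the "In-Line" configuration name, and the "4cyl" fallback for an unknown configuration are assumptions inferred from the example outputs in this document.

```python
# Assumed mapping from vPIC engine configuration names to a display prefix.
# "V-Shaped" appears in the sample schema; "In-Line" is an assumption.
CONFIG_PREFIX = {"V-Shaped": "V", "In-Line": "I"}


def format_engine_display(disp, hp, cyl, config=None):
    """Build an engine string such as '3.5L 290 hp V6' from raw vPIC values."""
    parts = []
    if disp:
        parts.append(f"{float(disp):.1f}L")   # "2" -> "2.0L"
    if hp:
        parts.append(f"{int(float(hp))} hp")
    if cyl:
        prefix = CONFIG_PREFIX.get(config or "")
        # Fall back to "4cyl" style when the configuration is unknown.
        parts.append(f"{prefix}{cyl}" if prefix else f"{cyl}cyl")
    return " ".join(parts)


def format_trans_display(style, speeds):
    """Build a transmission string such as '10-Speed Automatic'."""
    if style and speeds:
        return f"{speeds}-Speed {style}"
    return style or ""


print(format_engine_display("3.5", "290", "6", "V-Shaped"))
print(format_trans_display("Automatic", "10"))
```

Values are accepted as strings because vPIC pattern attributes are stored as text in the queries in this plan.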

Extraction SQL Template

WITH schema_data AS (
    SELECT DISTINCT
        vs.id AS schema_id,
        vs.name AS schema_name,
        wvs.yearfrom,
        COALESCE(wvs.yearto, 2027) AS yearto,
        m.name AS make_name
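    FROM vpic.make m
    JOIN vpic.wmi_make wm ON wm.makeid = m.id  -- junction table, not wmi.makeid
    JOIN vpic.wmi w ON w.id = wm.wmiid
    JOIN vpic.wmi_vinschema wvs ON w.id = wvs.wmiid
    JOIN vpic.vinschema vs ON wvs.vinschemaid = vs.id
    WHERE LOWER(m.name) IN ('acura', 'honda', ...)  -- VehAPI makes
      -- Parenthesized so the make filter applies to both year conditions
      AND (wvs.yearfrom >= 2022 OR wvs.yearto >= 2022)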
),
trim_data AS (
    SELECT DISTINCT sd.schema_id, p.attributeid AS trim
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    WHERE e.code = 'Trim'
),
engine_data AS (
    SELECT DISTINCT
        sd.schema_id,
        MAX(CASE WHEN e.code = 'DisplacementL' THEN p.attributeid END) AS displacement,
        MAX(CASE WHEN e.code = 'EngineHP' THEN p.attributeid END) AS hp,
        MAX(CASE WHEN e.code = 'EngineCylinders' THEN p.attributeid END) AS cylinders,
        MAX(CASE WHEN e.code = 'EngineConfiguration' THEN ec.name END) AS config
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    LEFT JOIN vpic.engineconfiguration ec ON e.code = 'EngineConfiguration'
        AND p.attributeid ~ '^[0-9]+$' AND ec.id = CAST(p.attributeid AS INT)
    WHERE e.code IN ('DisplacementL', 'EngineHP', 'EngineCylinders', 'EngineConfiguration')
    GROUP BY sd.schema_id, p.keys  -- Group by VIN pattern position
),
trans_data AS (
    SELECT DISTINCT
        sd.schema_id,
        -- Aggregate both columns so style and speeds land on the same row
        MAX(CASE WHEN e.code = 'TransmissionStyle' THEN t.name END) AS style,
        MAX(CASE WHEN e.code = 'TransmissionSpeeds' THEN p.attributeid END) AS speeds
    FROM schema_data sd
    JOIN vpic.pattern p ON p.vinschemaid = sd.schema_id
    JOIN vpic.element e ON p.elementid = e.id
    LEFT JOIN vpic.transmission t ON e.code = 'TransmissionStyle'
        AND p.attributeid ~ '^[0-9]+$' AND t.id = CAST(p.attributeid AS INT)
    WHERE e.code IN ('TransmissionStyle', 'TransmissionSpeeds')
    GROUP BY sd.schema_id, p.keys  -- Group by VIN pattern position
)
SELECT ...

Allowed Makes (48 from VehAPI)

ALLOWED_MAKES = [
    'Acura', 'Audi', 'Bentley', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
    'Chrysler', 'Dodge', 'Ferrari', 'FIAT', 'Ford', 'Genesis', 'GMC',
    'Honda', 'Hummer', 'Hyundai', 'INFINITI', 'Isuzu', 'Jaguar', 'Jeep',
    'Kia', 'Lamborghini', 'Lexus', 'Lincoln', 'Lotus', 'Maserati', 'Mazda',
    'McLaren', 'Mercedes-Benz', 'Mercury', 'MINI', 'Mitsubishi', 'Nissan',
    'Oldsmobile', 'Plymouth', 'Polestar', 'Pontiac', 'Porsche', 'RAM',
    'Rivian', 'Saab', 'Scion', 'smart', 'Subaru', 'Tesla', 'Toyota',
    'Volkswagen', 'Volvo'
]

Note: Some makes may have different names in vPIC (case variations, abbreviations).


Implementation Steps

Phase 1: Rewrite vpic_extract.py

File: vpic_extract.py

Core extraction query (uses wmi_make junction table):

WITH base AS (
    SELECT DISTINCT
        m.name AS make_name,
        vs.id AS schema_id,
        vs.name AS schema_name,
        generate_series(
            GREATEST(wvs.yearfrom, 2022),
            -- Cast to INT so both bounds are integers on every PostgreSQL version
            COALESCE(wvs.yearto, EXTRACT(YEAR FROM NOW())::INT + 2)
        )::INT AS year
    FROM vpic.make m
    JOIN vpic.wmi_make wm ON wm.makeid = m.id
    JOIN vpic.wmi w ON w.id = wm.wmiid
    JOIN vpic.wmi_vinschema wvs ON w.id = wvs.wmiid
    JOIN vpic.vinschema vs ON wvs.vinschemaid = vs.id
    WHERE LOWER(m.name) IN ({allowed_makes})
      AND (wvs.yearfrom >= 2022 OR wvs.yearto >= 2022)
)
SELECT ...
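
The year expansion performed by generate_series above can be mirrored in Python, which may be useful for unit-testing the range logic without a database. This is a sketch under the same rules as the query (clamp the start to 2022, treat a missing yearto as the current year plus two). The name expand_years and its today parameter are hypothetical.

```python
from datetime import date

MIN_YEAR = 2022


def expand_years(year_from, year_to, min_year=MIN_YEAR, today=None):
    """Return every model year a schema covers, starting no earlier than min_year.

    A missing year_to is treated as open-ended and capped at current year + 2,
    matching the COALESCE in the query above.
    """
    today = today or date.today()
    start = max(year_from, min_year)
    end = year_to if year_to is not None else today.year + 2
    # Like generate_series, this is empty when start > end.
    return list(range(start, end + 1))
```

Passing today explicitly keeps results reproducible, for example `expand_years(2024, None, today=date(2025, 12, 24))`.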

Key functions to implement:

  1. extract_model_from_schema_name(schema_name) - Parse "Acura MDX Schema..." → "MDX"
  2. get_schema_patterns(schema_id) - Get all pattern data for a schema
  3. format_engine_display(disp, hp, cyl, config) - Format as "3.5L 290 hp V6"
  4. format_trans_display(style, speeds) - Format as "10-Speed Automatic"
  5. generate_trans_records(has_data, style, speeds) - Return 1 or 2 records
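
As a starting point for function 1, the sketch below strips the make name and everything from the word "Schema" onward. The exact layout of vPIC schema names is not specified in this plan, so the parsing rule is an assumption, and the extra make_name parameter is a hypothetical addition to the signature listed above.

```python
import re


def extract_model_from_schema_name(schema_name, make_name=""):
    """Best-effort model extraction from a vPIC schema name.

    Assumes names look like '<Make> <Model> Schema ...'; returns None when
    nothing usable is left after stripping the make.
    """
    # Keep only the text before the word "Schema" (case-insensitive).
    head = re.split(r"\bschema\b", schema_name, maxsplit=1, flags=re.IGNORECASE)[0]
    head = head.strip()
    if make_name:
        if head.lower() == make_name.lower():
            return None  # name carried only the make, no model
        if head.lower().startswith(make_name.lower() + " "):
            head = head[len(make_name):].strip()
    return head or None
```

Schema names that do not follow the assumed layout should be logged and reviewed rather than trusted.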

Make name normalization:

# Keys are vPIC names, values are VehAPI display names
MAKE_MAPPING = {
    'Infiniti': 'INFINITI',  # VehAPI uses all-caps
    'Fiat': 'FIAT',
    'Ram': 'RAM',
    'RIVIAN': 'Rivian',      # vPIC uses all-caps, normalize
    # ... etc
}
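
Because the known differences are case-only, one alternative to a hand-maintained mapping is a case-insensitive lookup against the VehAPI names. The sketch below is an assumption about how that could look. normalize_make and VEHAPI_MAKES_SAMPLE are hypothetical names, and the sample list is a short subset of the allowed makes.

```python
# Short illustrative subset of the VehAPI make names.
VEHAPI_MAKES_SAMPLE = ["Acura", "FIAT", "INFINITI", "RAM", "Rivian", "smart"]


def normalize_make(vpic_name, allowed=VEHAPI_MAKES_SAMPLE):
    """Return the VehAPI spelling for a vPIC make name, or None if not allowed."""
    lookup = {name.casefold(): name for name in allowed}
    return lookup.get(vpic_name.strip().casefold())


print(normalize_make("RIVIAN"))
print(normalize_make("Infiniti"))
```

This handles case variations only; abbreviations or renamed makes would still need explicit mapping entries.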

Phase 2: Test Extraction

Test with validated VINs:

source .venv/bin/activate
python3 vpic_extract.py --test-vin 5J8YE1H05SL018611  # Acura MDX
python3 vpic_extract.py --test-vin 5TFJA5DB4SX327537  # Toyota Tundra
python3 vpic_extract.py --test-vin 3GTUUFEL6PG140748  # GMC Sierra
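
Before decoding a test VIN, it can help to confirm its WMI and model year by hand. The sketch below assumes the standard 17-character VIN layout (WMI in positions 1-3, model-year code in position 10) and the 2010-2039 year cycle. split_vin is a hypothetical helper and not part of vpic_extract.py.

```python
# Model-year codes in cycle order; I, O, Q, U, Z and 0 are never used.
YEAR_CODES = "ABCDEFGHJKLMNPRSTVWXY123456789"


def split_vin(vin, cycle_start=2010):
    """Return (wmi, model_year) for a 17-character VIN.

    cycle_start is the year that code 'A' maps to; 2010 is assumed here,
    which is only valid for model years 2010-2039.
    """
    vin = vin.strip().upper()
    if len(vin) != 17:
        raise ValueError(f"expected 17 characters, got {len(vin)}")
    wmi = vin[:3]
    year = cycle_start + YEAR_CODES.index(vin[9])  # ValueError on a bad code
    return wmi, year


print(split_vin("5J8YE1H05SL018611"))
```

For the three VINs above this gives model years 2025, 2025, and 2023, which matches the vehicles named in the comments.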

Phase 3: Full Extraction

python3 vpic_extract.py --min-year 2022 --output-dir snapshots/vpic-2025-12

Phase 4: Merge & Import

# Merge vPIC with existing VehAPI data
sqlite3 snapshots/merged/snapshot.sqlite "
CREATE TABLE pairs(...);
ATTACH 'snapshots/vehicle-drop-down.sqlite' AS db1;
ATTACH 'snapshots/vpic-2025-12/snapshot.sqlite' AS db2;
INSERT OR IGNORE INTO pairs SELECT * FROM db1.pairs WHERE year < 2022;
INSERT OR IGNORE INTO pairs SELECT * FROM db2.pairs;
"

# Generate SQL and import
python3 etl_generate_sql.py --snapshot-path snapshots/merged/snapshot.sqlite
./import_data.sh
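
The merge step can also be scripted in Python. The sketch below is a minimal illustration, not the project's merge tooling. The pairs columns are elided in this plan, so the schema in PAIRS_DDL is hypothetical, as is the merge_snapshots name. Note that INSERT OR IGNORE only skips rows that violate a constraint, so deduplication depends on a UNIQUE or PRIMARY KEY constraint like the one assumed here.

```python
import sqlite3

# Hypothetical schema: the real pairs columns are not shown in this plan.
PAIRS_DDL = """
CREATE TABLE IF NOT EXISTS pairs (
    year INTEGER, make TEXT, model TEXT, trim TEXT,
    engine_display TEXT, trans_display TEXT,
    UNIQUE (year, make, model, trim, engine_display, trans_display)
)
"""


def merge_snapshots(merged_path, vehapi_path, vpic_path, cutoff_year=2022):
    """Copy pre-cutoff VehAPI rows and all vPIC rows into the merged snapshot.

    Returns the number of rows in the merged pairs table.
    """
    conn = sqlite3.connect(merged_path)
    try:
        conn.execute(PAIRS_DDL)
        conn.execute("ATTACH DATABASE ? AS db1", (vehapi_path,))
        conn.execute("ATTACH DATABASE ? AS db2", (vpic_path,))
        conn.execute(
            "INSERT OR IGNORE INTO pairs SELECT * FROM db1.pairs WHERE year < ?",
            (cutoff_year,),
        )
        conn.execute("INSERT OR IGNORE INTO pairs SELECT * FROM db2.pairs")
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM pairs").fetchone()[0]
    finally:
        conn.close()
```

SELECT * requires the source and target tables to share the same column order, so all three snapshots are assumed to use the same pairs definition.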

Files to Modify

| File | Changes |
| --- | --- |
| vpic_extract.py | Complete rewrite: VIN schema extraction, dual-record trans logic |
| README.md | Already updated with workflow |

Success Criteria

  1. Extract all 41 makes with 2022+ VIN schemas
  2. ~2,500-5,000 unique vehicle configurations (Year/Make/Model/Trim/Engine)
  3. Transmission: Use vPIC data where available (7 makes), dual-record elsewhere
  4. Output format matches VehAPI: "3.5L 290 hp V6" / "10-Speed Automatic"
  5. Merge preserves 2015-2021 VehAPI data
  6. QA validation passes after import

Make Analysis Status (All Families Validated)

| Family | Makes | Status | Trans Data | Strategy |
| --- | --- | --- | --- | --- |
| Honda/Acura | Acura, Honda | VALIDATED | YES (93-97%) | Use vPIC trans data |
| Toyota/Lexus | Toyota, Lexus | VALIDATED | PARTIAL (Toyota 23%, Lexus 0%) | Dual-record for Lexus |
| Nissan/Infiniti | Nissan, Infiniti, Mitsubishi | VALIDATED | LOW (5%) | Dual-record |
| GM | Chevrolet, GMC, Buick, Cadillac | VALIDATED | LOW (0-7%) | Dual-record |
| Stellantis | Chrysler, Dodge, Jeep, Ram, Fiat | VALIDATED | NONE (0%) | Dual-record |
| Ford | Ford, Lincoln | VALIDATED | NONE (0%) | Dual-record |
| VW Group | Volkswagen, Audi, Porsche, Bentley, Lamborghini | VALIDATED | MIXED (0-84%) | VW/Audi use vPIC; others dual-record |
| BMW | BMW, MINI | VALIDATED | NONE (0%) | Dual-record |
| Mercedes | Mercedes-Benz, smart | VALIDATED | YES (52%) | Use vPIC trans data |
| Hyundai/Kia/Genesis | Hyundai, Kia, Genesis | VALIDATED | NONE (0%) | Dual-record |
| Subaru | Subaru | VALIDATED | YES (64%) | Use vPIC trans data |
| Mazda | Mazda | VALIDATED | LOW (11%) | Dual-record |
| Volvo | Volvo, Polestar | VALIDATED | LOW (3%/0%) | Dual-record |
| Exotics | Ferrari, Maserati, Jaguar, Lotus, McLaren | VALIDATED | MIXED | Per-make handling |
| EV | Tesla, Rivian | VALIDATED | NONE (0%) | Dual-record (though EVs don't have "manual") |

Special Cases

  1. Electric Vehicles (Tesla, Rivian, Polestar): Don't have manual transmissions

    • Still create dual-record for consistency with dropdown
    • User can select "Automatic" (single-speed EV)
  2. Luxury Exotics (Ferrari, Lamborghini, etc.): Mix of automated manual/DCT

    • Dual-record covers all options

CRITICAL FINDING: Transmission Data Availability

Most manufacturers do NOT encode transmission info in VINs.

VIN Decode Validation Results (13 Families)

| Family | VIN | Make | Model | Year | Trim | Engine | Trans |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Honda/Acura | 5J8YE1H05SL018611 | ACURA | MDX | 2025 | SH-AWD A-Spec | 3.5L V6 290hp | 10-Spd Auto |
| Honda/Acura | 2HGFE4F88SH315466 | HONDA | Civic | 2025 | Sport Hybrid | 2.0L I4 141hp | e-CVT |
| Toyota/Lexus | 5TFJA5DB4SX327537 | TOYOTA | Tundra | 2025 | Limited | 3.4L V6 389hp | 10-Spd Auto |
| Nissan/Infiniti | 5N1AL1FW9TC332353 | INFINITI | QX60 | 2026 | Luxe | 2.0L (no cyl/hp) | MISSING |
| GM | 3GTUUFEL6PG140748 | GMC | Sierra | 2023 | AT4X | 6.2L V8 (no hp) | MISSING |
| Stellantis | 1C4HJXEG7PW506480 | JEEP | Wrangler | 2023 | Sahara | 3.6L V6 285hp | MISSING |
| Ford | 1FTFW4L59SFC03038 | FORD | F-150 | 2025 | Tremor | 5.0L V8 (no hp) | MISSING |
| VW Group | WVWEB7CD9RW229116 | VOLKSWAGEN | Golf R | 2024 | MISSING | 2.0L 4cyl 315hp | Auto (no spd) |
| BMW | 5YM13ET06R9S31554 | BMW | X5 | 2024 | X5 M Competition | 4.4L 8cyl 617hp | MISSING |
| Mercedes | W1KAF4HB1SR287126 | MERCEDES-BENZ | C-Class | 2025 | C300 4MATIC | 2.0L I4 255hp | 9-Spd Auto |
| Hyundai/Kia | 5XYRLDJC0SG336002 | KIA | Sorento | 2025 | S | 2.5L 4cyl 191hp | MISSING |
| Subaru | JF1VBAF67P9806852 | SUBARU | WRX | 2023 | Premium | 2.4L 4cyl 271hp | 6-Spd Manual |
| Mazda | JM3KFBCL3R0522361 | MAZDA | CX-5 | 2024 | Preferred Pkg | 2.5L I4 187hp | 6-Spd Auto |
| Volvo | YV4M12RJ9S1094167 | VOLVO | XC60 | 2025 | Core | 2.0L 4cyl 247hp | 8-Spd Auto |

Transmission Data Coverage in vPIC Schemas

| Coverage | Makes (trans schemas / total schemas) |
| --- | --- |
| HIGH (>40%) | Honda (225/233), Acura (42/45), Subaru (47/74), Audi (46/55), VW (47/132), Mercedes (13/25), Jaguar (17/17) |
| LOW (<10%) | Chevrolet (4/164), Cadillac (7/43), Nissan (4/82), Infiniti (4/74), Mazda (4/36), Volvo (2/72) |
| NONE (0%) | GMC, Buick, Ford, Lincoln, Jeep, Dodge, Chrysler, Ram, Fiat, BMW, MINI, Porsche, Hyundai, Kia, Genesis, Lexus, Tesla, Rivian, Polestar |

Makes WITHOUT Transmission Data (22 of 41 makes = 54%)

  • ALL Stellantis: Chrysler, Dodge, Jeep, Ram, Fiat
  • ALL Ford: Ford, Lincoln
  • ALL Korean: Hyundai, Kia, Genesis
  • ALL BMW Group: BMW, MINI
  • GM (partial): GMC, Buick (Chevy/Cadillac have minimal)
  • Others: Lexus, Porsche, Bentley, Lamborghini, Tesla, Rivian, Polestar

Extraction Strategy (SELECTED)

Dual-Record Strategy for Missing Transmission Data

When transmission data is NOT available from vPIC:

  • Create TWO records for each vehicle configuration
  • One with trans_display = "Automatic", trans_canon = "automatic"
  • One with trans_display = "Manual", trans_canon = "manual"

This ensures:

  • All transmission options available in dropdown for user selection
  • User can select the correct transmission type
  • No false "Unknown" values that break filtering

Implementation Logic

def generate_trans_records(has_trans_data: bool, trans_style: str, trans_speeds: str):
    if has_trans_data:
        # Use actual vPIC data
        return [(format_trans_display(trans_style, trans_speeds),
                 canonicalize_trans(trans_style))]
    else:
        # Generate both options
        return [
            ("Automatic", "automatic"),
            ("Manual", "manual")
        ]
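
The function above relies on canonicalize_trans, which this plan does not define. One possible sketch follows. It assumes the only canonical values are "automatic" and "manual" (the two used by the dual-record strategy) and that CVT, dual-clutch, and automated-manual styles should all count as automatic. Both points are assumptions to confirm against the VehAPI data.

```python
def canonicalize_trans(trans_style):
    """Collapse a vPIC transmission style into 'automatic' or 'manual'."""
    style = (trans_style or "").strip().lower()
    # Assumption: only styles that start with "manual" are manual, so
    # "Automated Manual Transmission (AMT)" is treated as automatic.
    # An empty or missing style also falls through to "automatic".
    return "manual" if style.startswith("manual") else "automatic"


print(canonicalize_trans("Manual"))
print(canonicalize_trans("Electronic Continuously Variable (e-CVT)"))
```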

Expected Output Growth

For makes without trans data, each vehicle configuration produces two records, so the record count doubles:

  • GMC Sierra AT4X + 6.2L V8 → 2 records (Auto + Manual)
  • Ford F-150 Tremor + 5.0L V8 → 2 records (Auto + Manual)

This is acceptable as it provides complete dropdown coverage.
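
The growth can be checked with a small expansion sketch. The configuration dictionaries and the expand_with_transmissions name below are illustrative assumptions; the sample values reuse vehicles from this document.

```python
def expand_with_transmissions(configs):
    """Expand configurations into (model, trim, engine, trans) records.

    A configuration with no 'trans' value gets two records, one per option.
    """
    records = []
    for cfg in configs:
        trans = cfg.get("trans")  # None when vPIC has no transmission data
        options = [trans] if trans else ["Automatic", "Manual"]
        for option in options:
            records.append((cfg["model"], cfg["trim"], cfg["engine"], option))
    return records


configs = [
    {"model": "Sierra", "trim": "AT4X", "engine": "6.2L V8"},
    {"model": "F-150", "trim": "Tremor", "engine": "5.0L V8"},
    {"model": "MDX", "trim": "SH-AWD A-Spec", "engine": "3.5L 290 hp V6",
     "trans": "10-Speed Automatic"},
]
print(len(expand_with_transmissions(configs)))
```

Here the two configurations without transmission data contribute two records each, and the Acura configuration contributes one.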


Validated Extraction Examples

Acura MDX 2025 (VIN: 5J8YE1H05SL018611)

  • vPIC: Make=ACURA, Model=MDX, Trim=SH-AWD A-Spec, Engine=3.5L V6 290hp, Trans=10-Speed Automatic
  • Output: 3.5L 290 hp V6 | 10-Speed Automatic

Honda Civic 2025 (VIN: 2HGFE4F88SH315466)

  • vPIC: Make=HONDA, Model=Civic, Trim=Sport Hybrid / Sport Touring Hybrid, Engine=2L I4 141hp, Trans=e-CVT
  • Output: 2.0L 141 hp I4 | Electronic Continuously Variable (e-CVT)

Toyota Tundra 2025 (VIN: 5TFJA5DB4SX327537)

  • vPIC: Make=TOYOTA, Model=Tundra, Trim=Limited, Engine=3.4L V6 389hp, Trans=10-Speed Automatic
  • Output: 3.4L 389 hp V6 | 10-Speed Automatic

Mercedes C-Class 2025 (VIN: W1KAF4HB1SR287126)

  • vPIC: Make=MERCEDES-BENZ, Model=C-Class, Trim=C300 4MATIC, Engine=2.0L I4 255hp, Trans=9-Speed Automatic
  • Output: 2.0L 255 hp I4 | 9-Speed Automatic

Subaru WRX 2023 (VIN: JF1VBAF67P9806852)

  • vPIC: Make=SUBARU, Model=WRX, Trim=Premium, Engine=2.4L 4cyl 271hp, Trans=6-Speed Manual
  • Output: 2.4L 271 hp 4cyl | 6-Speed Manual

Mazda CX-5 2024 (VIN: JM3KFBCL3R0522361)

  • vPIC: Make=MAZDA, Model=CX-5, Trim=Preferred Package, Engine=2.5L I4 187hp, Trans=6-Speed Automatic
  • Output: 2.5L 187 hp I4 | 6-Speed Automatic

Volvo XC60 2025 (VIN: YV4M12RJ9S1094167)

  • vPIC: Make=VOLVO, Model=XC60, Trim=Core, Engine=2.0L 4cyl 247hp, Trans=8-Speed Automatic
  • Output: 2.0L 247 hp 4cyl | 8-Speed Automatic