motovaultpro/data/make-model-import/ETL_README.md
2025-11-10 11:20:31 -06:00

Automotive Vehicle Selection Database - ETL Documentation

Overview

This ETL pipeline builds a PostgreSQL database optimized for cascading dropdown vehicle selection: Year → Make → Model → Trim → Engine/Transmission.

Database Schema

Tables

  1. engines - Detailed engine specifications

    • Displacement, configuration, horsepower, torque
    • Fuel type, fuel system, aspiration
    • Full specs stored as JSONB
  2. transmissions - Transmission specifications

    • Type (Manual, Automatic, CVT, etc.)
    • Number of speeds
    • Drive type (FWD, RWD, AWD, 4WD)
  3. vehicle_options - Denormalized vehicle configurations

    • Year, Make, Model, Trim
    • Foreign keys to engines and transmissions
    • Optimized indexes for dropdown queries

Views

  • available_years - All distinct years
  • makes_by_year - Makes grouped by year
  • models_by_year_make - Models grouped by year/make
  • trims_by_year_make_model - Trims grouped by year/make/model
  • complete_vehicle_configs - Full vehicle details with engine/transmission

Functions

  • get_makes_for_year(year) - Returns makes for a specific year
  • get_models_for_year_make(year, make) - Returns models for year/make
  • get_trims_for_year_make_model(year, make, model) - Returns trims
  • get_options_for_vehicle(year, make, model, trim) - Returns engine/transmission options
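
As a rough sketch of how an application might walk the cascade from Python, the helpers below call the functions above one level at a time. They accept any DB-API cursor (for example one from psycopg2). The helper names, and the assumption that each function returns its display value in the first column, are illustrative and not taken from the migration.

```python
from typing import Any, List, Sequence


def fetch_first_column(cur: Any, sql: str, params: Sequence[Any]) -> List[Any]:
    """Run a parameterized query and return the first column of every row."""
    cur.execute(sql, tuple(params))
    return [row[0] for row in cur.fetchall()]


def makes_for_year(cur: Any, year: int) -> List[Any]:
    return fetch_first_column(
        cur, "SELECT * FROM get_makes_for_year(%s)", [year]
    )


def models_for_year_make(cur: Any, year: int, make: str) -> List[Any]:
    return fetch_first_column(
        cur, "SELECT * FROM get_models_for_year_make(%s, %s)", [year, make]
    )


def trims_for_year_make_model(
    cur: Any, year: int, make: str, model: str
) -> List[Any]:
    return fetch_first_column(
        cur,
        "SELECT * FROM get_trims_for_year_make_model(%s, %s, %s)",
        [year, make, model],
    )
```

Passing values as query parameters rather than formatting them into the SQL string keeps user-selected makes and models from being interpreted as SQL.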

Data Sources

Primary Source

makes-filter/*.json (57 makes)

  • Filtered manufacturer data
  • Year/model/trim/engine hierarchy
  • Engine specs as simple strings (e.g., "2.0L I4")

Detailed Specs

engines.json (30,066+ records)

  • Complete engine specifications
  • Performance data, fuel economy
  • Transmission details

automobiles.json (7,207 models)

  • Model descriptions
  • Used for hybrid backfill of recent years (2023-2025)

brands.json (124 brands)

  • Brand metadata
  • Used for brand name mapping

ETL Process

Step 1: Import Engine & Transmission Specs

  • Parse all records from engines.json
  • Extract detailed specifications
  • Populate the engines and transmissions tables
  • Build in-memory caches for fast lookups

Step 2: Process Makes-Filter Data

  • Read all 57 JSON files from makes-filter/
  • Extract year/make/model/trim/engine combinations
  • Match engine strings to detailed specs using displacement + configuration
  • Build vehicle_options records
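
The extraction in Step 2 might look roughly like the sketch below. The JSON layout it assumes (keys `make`, `years`, `models`, `trims`, `engines`, `name`) is hypothetical, chosen only to mirror the year/model/trim/engine hierarchy described above; the real makes-filter files may be shaped differently.

```python
from typing import Iterator, NamedTuple


class VehicleCombo(NamedTuple):
    year: int
    make: str
    model: str
    trim: str
    engine: str


def iter_combinations(make_doc: dict) -> Iterator[VehicleCombo]:
    """Flatten one (assumed) makes-filter document into flat combinations."""
    make = make_doc["make"]
    for year_entry in make_doc.get("years", []):
        year = int(year_entry["year"])
        for model in year_entry.get("models", []):
            for trim in model.get("trims", []):
                for engine in trim.get("engines", []):
                    yield VehicleCombo(
                        year, make, model["name"], trim["name"], engine
                    )


# Hypothetical sample document, for illustration only.
sample = {
    "make": "Ford",
    "years": [
        {
            "year": 2024,
            "models": [
                {
                    "name": "F-150",
                    "trims": [{"name": "XLT", "engines": ["2.7L V6", "5.0L V8"]}],
                }
            ],
        }
    ],
}
combos = list(iter_combinations(sample))
```

Each yielded tuple still carries the raw engine string, which is matched to a detailed engine record in a later step.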

Step 3: Hybrid Backfill

  • Check automobiles.json for recent years (2023-2025)
  • Add any missing year/make/model combinations
  • Only backfill for the 57 filtered makes
  • Limit to 3 engines per backfilled model
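
The backfill rules above can be sketched as a pure function. The input shapes here are assumptions: `existing` is a set of (year, make, model) tuples already produced by Step 2, and `candidates` are (year, make, model, engine) tuples derived from automobiles.json by some earlier, unspecified parsing step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

BACKFILL_YEARS = range(2023, 2026)  # 2023-2025 inclusive
MAX_ENGINES_PER_MODEL = 3

ModelKey = Tuple[int, str, str]


def plan_backfill(
    existing: Set[ModelKey],
    candidates: Iterable[Tuple[int, str, str, str]],
    filtered_makes: Set[str],
) -> Dict[ModelKey, List[str]]:
    """Return engines to add for each missing year/make/model combination."""
    engines_by_model: Dict[ModelKey, List[str]] = defaultdict(list)
    for year, make, model, engine in candidates:
        key = (year, make, model)
        if year not in BACKFILL_YEARS:
            continue  # only recent years are backfilled
        if make not in filtered_makes:
            continue  # only the filtered makes are backfilled
        if key in existing:
            continue  # already covered by the makes-filter data
        engines = engines_by_model[key]
        if engine not in engines and len(engines) < MAX_ENGINES_PER_MODEL:
            engines.append(engine)
    return dict(engines_by_model)
```

Keeping the plan separate from the database writes makes the filtering rules easy to test without a running PostgreSQL instance.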

Step 4: Insert Vehicle Options

  • Batch insert all vehicle_options records
  • Create indexes for optimal query performance
  • Generate views and functions
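
A minimal sketch of the batch insert, assuming a DB-API cursor and assuming the foreign-key columns are named `engine_id` and `transmission_id` (the real column names live in the migration and may differ):

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Tuple

# Column names engine_id / transmission_id are assumptions for illustration.
INSERT_SQL = (
    "INSERT INTO vehicle_options "
    "(year, make, model, trim, engine_id, transmission_id) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def batched(rows: Iterable[Tuple], size: int) -> Iterator[List[Tuple]]:
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_vehicle_options(
    cur: Any, rows: Iterable[Tuple], batch_size: int = 1000
) -> int:
    """Insert rows in batches and return the number of rows sent."""
    total = 0
    for chunk in batched(rows, batch_size):
        cur.executemany(INSERT_SQL, chunk)
        total += len(chunk)
    return total
```

With psycopg2, `psycopg2.extras.execute_values` is usually faster than `executemany` for large inserts; the chunking logic stays the same either way.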

Step 5: Validation

  • Count records in each table
  • Test dropdown cascade queries
  • Display sample data
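
The record counts in the validation step could be gathered with something like the following sketch. It assumes a DB-API cursor; the function names are illustrative and are not taken from etl_vehicle_data.py.

```python
from typing import Any, Dict, List, Sequence

TABLES = ("engines", "transmissions", "vehicle_options")


def count_rows(cur: Any, tables: Sequence[str] = TABLES) -> Dict[str, int]:
    """Return a row count for each table."""
    counts = {}
    for table in tables:
        # Table names come from the fixed tuple above, never from user input,
        # so interpolating them into the statement is safe here.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


def find_empty_tables(counts: Dict[str, int]) -> List[str]:
    """List the tables that ended up with no rows."""
    return [table for table, count in counts.items() if count == 0]
```

An empty table after the ETL usually points to a missing source file or a failed earlier step, so it is worth failing loudly when `find_empty_tables` returns anything.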

Running the ETL

Prerequisites

  • Docker container mvp-postgres running
  • Python 3 with psycopg2
  • JSON source files in project root

Quick Start

./run_migration.sh

Manual Steps

# 1. Run migration (-T disables TTY allocation so the SQL file can be redirected to stdin)
docker compose exec -T mvp-postgres psql -U postgres -d motovaultpro < migrations/001_create_vehicle_database.sql

# 2. Install Python dependencies
pip3 install psycopg2-binary

# 3. Run ETL script
python3 etl_vehicle_data.py

Query Examples

Get all available years

SELECT * FROM available_years;

Get makes for 2024

SELECT * FROM get_makes_for_year(2024);

Get models for 2024 Ford

SELECT * FROM get_models_for_year_make(2024, 'Ford');

Get trims for 2024 Ford F-150

SELECT * FROM get_trims_for_year_make_model(2024, 'Ford', 'F-150');

Get engine/transmission options for specific vehicle

SELECT * FROM get_options_for_vehicle(2024, 'Ford', 'F-150', 'XLT');

Complete vehicle configurations

SELECT * FROM complete_vehicle_configs
WHERE year = 2024 AND make = 'Tesla'
ORDER BY model, trim;

Performance Optimization

Indexes Created

  • idx_vehicle_year - Single column index on year
  • idx_vehicle_make - Single column index on make
  • idx_vehicle_model - Single column index on model
  • idx_vehicle_year_make - Composite index for year/make queries
  • idx_vehicle_year_make_model - Composite index for year/make/model queries
  • idx_vehicle_year_make_model_trim - Composite index for full cascade

Query Performance

With the indexes above, dropdown queries are expected to return in under 50 ms on datasets of the size listed under Expected Results.

Data Matching Logic

Engine Matching

The ETL links the simple engine strings from makes-filter to detailed specs by parsing each string, normalizing it, and looking it up in the engine cache:

  1. Parse engine string: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
  2. Normalize: Convert to uppercase, standardize format
  3. Match to cache: Look up in engine cache by (displacement, configuration)
  4. Handle variations: Account for I4/L4, V6/V-6, etc.

Configuration Equivalents

  • I4 = L4 = INLINE-4
  • V6 = V-6
  • V8 = V-8
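
The matching steps and equivalents above can be sketched as follows. The regular expression, the alias table, and the cache shape (a dict keyed by (displacement, configuration)) are assumptions made for illustration; the actual script may parse and store these differently.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Map each variant spelling to one canonical configuration code.
CONFIG_ALIASES = {"L4": "I4", "INLINE-4": "I4", "V-6": "V6", "V-8": "V8"}

# A displacement such as "2.0L" or "2 L", followed by a configuration token.
ENGINE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*L\s+([A-Z0-9-]+)")

EngineKey = Tuple[str, str]


def parse_engine(text: str) -> Optional[EngineKey]:
    """Return (displacement, configuration) in canonical form, or None."""
    match = ENGINE_PATTERN.search(text.upper())
    if match is None:
        return None
    displacement = f"{float(match.group(1)):.1f}L"
    config = match.group(2)
    return displacement, CONFIG_ALIASES.get(config, config)


def match_engine(text: str, cache: Dict[EngineKey, Any]) -> Optional[Any]:
    """Look up a simple engine string in the cache of detailed specs."""
    key = parse_engine(text)
    if key is None:
        return None
    return cache.get(key)
```

For example, `parse_engine("2.0L I4")` and `parse_engine("2l l4")` both yield `("2.0L", "I4")`, so either spelling reaches the same cache entry.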

Filtered Makes (57 Total)

American Brands (12)

Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, Ram

Luxury/Performance (13)

Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls-Royce, Tesla, Jaguar, Audi, BMW, Land Rover

Japanese (8)

Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota

European (9)

Alfa Romeo, Fiat, Mini, Saab, Saturn, Scion, Smart, Volkswagen, Volvo

Other (12)

Genesis, Geo, Hyundai, Kia, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac

The five groups above name 54 makes; the remaining three of the 57 are not listed in this section.

Troubleshooting

Container Not Running

docker compose up -d
docker compose ps

Database Connection Issues

Check connection parameters in etl_vehicle_data.py:

DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}
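
Before digging into credentials, it can help to confirm that anything is listening on the configured host and port at all. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper, not part of the ETL script, and it checks reachability only, not the username, password, or database name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open(DB_CONFIG['host'], DB_CONFIG['port'])` and getting `False` suggests the container is stopped or the port is not published, rather than a credentials problem.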

Missing JSON Files

Ensure these files exist in the project root:

  • engines.json
  • automobiles.json
  • brands.json
  • makes-filter/*.json (57 files)
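
A small pre-flight check along these lines can report every missing input at once. It is a sketch: `check_sources` is a hypothetical helper, and the expected count of 57 comes from the list above.

```python
from pathlib import Path
from typing import List, Union

REQUIRED_FILES = ("engines.json", "automobiles.json", "brands.json")
EXPECTED_MAKE_FILES = 57


def check_sources(root: Union[str, Path]) -> List[str]:
    """Return a list of human-readable problems; empty means all inputs exist."""
    root = Path(root)
    problems = [
        f"missing {name}"
        for name in REQUIRED_FILES
        if not (root / name).is_file()
    ]
    make_dir = root / "makes-filter"
    found = len(list(make_dir.glob("*.json"))) if make_dir.is_dir() else 0
    if found != EXPECTED_MAKE_FILES:
        problems.append(
            f"makes-filter/ has {found} JSON files, expected {EXPECTED_MAKE_FILES}"
        )
    return problems
```

Running `check_sources(".")` from the project root before starting the ETL turns a mid-run failure into an up-front list of what to fix.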

Python Dependencies

pip3 install psycopg2-binary

Expected Results

After successful ETL:

  • Engines: ~30,000 records
  • Transmissions: ~500-1000 unique combinations
  • Vehicle Options: ~50,000-100,000 configurations
  • Years: 10-15 distinct years
  • Makes: 57 manufacturers
  • Models: 1,000-2,000 unique models

Next Steps

  1. Create API endpoints for dropdown queries
  2. Add caching layer for frequently accessed queries
  3. Implement full-text search for models
  4. Add vehicle images and detailed specs display
  5. Create admin interface for data management