motovaultpro/data/make-model-import/ETL_README.md
2025-11-11 10:29:02 -06:00

Automotive Vehicle Selection Database - ETL Documentation

Overview

This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection: Year → Make → Model → Trim → Engine/Transmission.

Database Schema

Tables

  1. engines - Simplified engine specifications

    • id (Primary Key)
    • name (Display format: "V8 3.5L", "L4 2.0L Turbo", "V6 6.2L Supercharged")
  2. transmissions - Simplified transmission specifications

    • id (Primary Key)
    • type (Display format: "8-Speed Automatic", "6-Speed Manual", "CVT")
  3. vehicle_options - Denormalized vehicle configurations

    • Year, Make (Title Case: "Ford", "Acura", "Land Rover"), Model, Trim
    • Foreign keys to engines and transmissions
    • Optimized indexes for dropdown queries

Views

  • available_years - All distinct years
  • makes_by_year - Makes grouped by year
  • models_by_year_make - Models grouped by year/make
  • trims_by_year_make_model - Trims grouped by year/make/model
  • complete_vehicle_configs - Full vehicle details with engine/transmission

Functions

  • get_makes_for_year(year) - Returns makes for a specific year
  • get_models_for_year_make(year, make) - Returns models for year/make
  • get_trims_for_year_make_model(year, make, model) - Returns trims
  • get_options_for_vehicle(year, make, model, trim) - Returns engine/transmission options

Data Sources

Primary Source

makes-filter/*.json (57 makes)

  • Filtered manufacturer data
  • Year/model/trim/engine hierarchy
  • Engine specs as simple strings (e.g., "2.0L I4")

Detailed Specs

engines.json (30,066+ records)

  • Complete engine specifications
  • Performance data, fuel economy
  • Transmission details

automobiles.json (7,207 models)

  • Model descriptions
  • Used for hybrid backfill of recent years (2023-2025)

brands.json (124 brands)

  • Brand metadata
  • Used for brand name mapping

ETL Process

Step 1: Load Source Data

  • Load engines.json (30,066 records)
  • Load brands.json (124 brands)
  • Load automobiles.json (7,207 models)
  • Load all makes-filter/*.json files (57 files)

Step 2: Transform Brand Names

  • Convert ALL CAPS brand names to Title Case ("FORD" → "Ford")
  • Preserve acronyms (BMW, GMC, KIA remain uppercase)
  • Handle special cases (DeLorean, McLaren)

Step 3: Process Engine Specifications

  • Extract engine specs from engines.json
  • Create simplified display names (e.g., "V8 3.5L Turbo")
  • Normalize displacement (Cm3 → Liters) for matching
  • Build engine cache with (displacement, configuration) keys
  • Generate engines SQL with only id and name columns
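
The displacement normalization and cache construction in this step can be sketched as follows. This is a minimal illustration, not the code in etl_generate_sql.py: the record field names (`id`, `displacement`, `configuration`) and all three helper names are assumptions, since the layout of engines.json is not shown here.

```python
import re

# Matches values such as "3506 Cm3" (case-insensitive).
CM3_RE = re.compile(r"(\d+(?:\.\d+)?)\s*cm3", re.IGNORECASE)


def cm3_to_liters(text):
    """Convert a string such as '3506 Cm3' to '3.5L'; return None if it does not parse."""
    match = CM3_RE.search(text or "")
    if match is None:
        return None
    return f"{float(match.group(1)) / 1000:.1f}L"


def display_name(configuration, displacement, aspiration=None):
    """Format a simplified name such as 'V6 3.5L' or 'L4 2.0L Turbo'."""
    parts = [configuration, displacement]
    if aspiration:
        parts.append(aspiration)
    return " ".join(parts)


def build_engine_cache(records):
    """Map (displacement, configuration) to the first engine id seen for that pair."""
    cache = {}
    for record in records:
        # Field names are hypothetical; adjust to the real engines.json layout.
        displacement = cm3_to_liters(record.get("displacement"))
        configuration = record.get("configuration")
        if displacement and configuration:
            cache.setdefault((displacement, configuration), record["id"])
    return cache
```

Keeping the first id per key is one possible tie-breaking rule; the real pipeline may choose differently when several detailed engines share a displacement and configuration.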

Step 4: Process Transmission Specifications

  • Extract transmission specs from engines.json
  • Create simplified display names (e.g., "8-Speed Automatic")
  • Parse speed count and transmission type
  • Build transmission cache for linking
  • Generate transmissions SQL with only id and type columns
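
As a rough sketch of the parsing described above, the following turns a raw gearbox string into a display name and assigns ids through a small cache. The function names and the keyword rules ("cvt", "manual", "auto") are assumptions for illustration; the actual source strings may need more cases.

```python
import re

SPEED_RE = re.compile(r"(\d+)\s*-?\s*speed", re.IGNORECASE)


def parse_transmission(raw):
    """Return a display name such as '8-Speed Automatic', 'CVT', or None if unrecognized."""
    text = (raw or "").strip().lower()
    if "cvt" in text or "continuously variable" in text:
        return "CVT"
    if "manual" in text:
        kind = "Manual"
    elif "auto" in text:
        kind = "Automatic"
    else:
        return None
    speed = SPEED_RE.search(text)
    if speed:
        return f"{int(speed.group(1))}-Speed {kind}"
    return kind  # type is known but the speed count is not


def transmission_id(cache, name):
    """Return a stable id for a display name, assigning the next id on first sight."""
    if name not in cache:
        cache[name] = len(cache) + 1
    return cache[name]
```

The cache doubles as the contents of the transmissions table: each key is a `type` value and each value is its `id`.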

Step 5: Process Makes-Filter Data

  • Read all JSON files from makes-filter/
  • Extract year/make/model/trim/engine combinations
  • Match engine strings to detailed specs using displacement + configuration
  • Link transmissions to vehicle records (98.9% success rate)
  • Apply year filter (1980 and newer only)
  • Build vehicle_options records
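
The extraction and year filter might look like the sketch below. The nested shape of a makes-filter document (`make`, `years`, `models`, `trims`, `engines` keys) is a guess based on the year/model/trim/engine hierarchy mentioned under Data Sources, and `iter_vehicle_options` is a hypothetical name.

```python
MIN_YEAR = 1980  # year filter from this step


def iter_vehicle_options(make_data, min_year=MIN_YEAR):
    """Yield (year, make, model, trim, engine) tuples from one makes-filter document."""
    make = make_data["make"]
    for year_entry in make_data.get("years", []):
        year = int(year_entry["year"])
        if year < min_year:
            continue  # drop anything older than the cutoff
        for model in year_entry.get("models", []):
            for trim in model.get("trims", []):
                for engine in trim.get("engines", []):
                    yield (year, make, model["name"], trim["name"], engine)
```

Each yielded tuple would then go through engine matching and transmission linking before becoming a vehicle_options row.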

Step 6: Hybrid Backfill

  • Check automobiles.json for recent years (2023-2025)
  • Add any missing year/make/model combinations
  • Only backfill for filtered makes
  • Link transmissions for backfilled records
  • Limit to 3 engines per backfilled model
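
The backfill rules above can be expressed compactly. This is a sketch under assumptions: the automobiles.json record fields (`year`, `make`, `model`, `engines`) and the function name `backfill` are hypothetical, and transmission linking is left out.

```python
BACKFILL_YEARS = range(2023, 2026)  # 2023 through 2025
MAX_ENGINES = 3


def backfill(existing_keys, automobiles, filtered_makes):
    """Return (year, make, model, engine) rows for combinations missing from existing_keys."""
    seen = set(existing_keys)
    added = []
    for auto in automobiles:
        key = (auto["year"], auto["make"], auto["model"])
        if auto["year"] not in BACKFILL_YEARS:
            continue  # only recent years are backfilled
        if auto["make"] not in filtered_makes:
            continue  # only backfill for filtered makes
        if key in seen:
            continue  # already present, or already backfilled
        seen.add(key)
        for engine in auto.get("engines", [])[:MAX_ENGINES]:
            added.append(key + (engine,))
    return added
```

Tracking `seen` keeps a model that appears twice in automobiles.json from being backfilled twice.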

Step 7: Generate SQL Output

  • Write SQL files with proper escaping (newlines, quotes, special characters)
  • Convert empty strings to NULL for data integrity
  • Use batched inserts (1000 records per batch)
  • Output to output/ directory
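
One way to implement the escaping, NULL conversion, and batching is sketched below. The helper names are hypothetical, and replacing newlines with spaces is an assumption about how the ETL treats them. Doubling single quotes is the standard SQL escape; backslashes are left alone on the assumption that PostgreSQL's default `standard_conforming_strings = on` is in effect.

```python
BATCH_SIZE = 1000


def sql_literal(value):
    """Render a Python value as a SQL literal; None and empty strings become NULL."""
    if value is None or value == "":
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value).replace("'", "''")  # escape single quotes
    text = text.replace("\r", " ").replace("\n", " ")  # assumed newline handling
    return f"'{text}'"


def batched_inserts(table, columns, rows, batch_size=BATCH_SIZE):
    """Yield one multi-row INSERT statement per batch of rows."""
    column_list = ", ".join(columns)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ",\n".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({column_list}) VALUES\n{values};"
```

For example, `batched_inserts("engines", ["id", "name"], rows)` over 2,500 rows yields three statements: two with 1,000 rows and one with 500.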

Running the ETL

Prerequisites

  • Docker container mvp-postgres running
  • Python 3 (no additional dependencies required)
  • JSON source files in project root

Quick Start

# Step 1: Generate SQL files from JSON data
python3 etl_generate_sql.py

# Step 2: Import SQL files into database
./import_data.sh

What Gets Generated

  • output/01_engines.sql (~632KB, 30,066 records)
  • output/02_transmissions.sql (~21KB, 828 records)
  • output/03_vehicle_options.sql (~51MB, 1,122,644 records)

Query Examples

Get all available years

SELECT * FROM available_years;

Get makes for 2024

SELECT * FROM get_makes_for_year(2024);

Get models for 2025 Ford

SELECT * FROM get_models_for_year_make(2025, 'Ford');

Get trims for 2025 Ford F-150

SELECT * FROM get_trims_for_year_make_model(2025, 'Ford', 'f-150');

Get engine/transmission options for specific vehicle

SELECT * FROM get_options_for_vehicle(2025, 'Ford', 'f-150', 'XLT');

Complete vehicle configurations

SELECT * FROM complete_vehicle_configs
WHERE year = 2025 AND make = 'Ford' AND model = 'f-150'
LIMIT 10;

Performance Optimization

Indexes Created

  • idx_vehicle_year - Single column index on year
  • idx_vehicle_make - Single column index on make
  • idx_vehicle_model - Single column index on model
  • idx_vehicle_year_make - Composite index for year/make queries
  • idx_vehicle_year_make_model - Composite index for year/make/model queries
  • idx_vehicle_year_make_model_trim - Composite index for full cascade

Query Performance

With the indexes above, dropdown queries are expected to return in under 50 ms for typical datasets.

Data Matching Logic

Brand Name Transformation

  • Source data (brands.json) stores names in ALL CAPS: "FORD", "ACURA", "ALFA ROMEO"
  • ETL converts to Title Case: "Ford", "Acura", "Alfa Romeo"
  • Preserves acronyms: BMW, GMC, KIA, MINI, FIAT, RAM
  • Special cases: DeLorean, McLaren
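
A minimal sketch of this transformation follows. The acronym and special-case sets copy the lists above; the function name `normalize_brand` is hypothetical, and the real ETL may handle more cases (hyphenated names, for instance, are not treated specially here).

```python
ACRONYMS = {"BMW", "GMC", "KIA", "MINI", "FIAT", "RAM"}
SPECIAL_CASES = {"DELOREAN": "DeLorean", "MCLAREN": "McLaren"}


def normalize_brand(raw):
    """Convert an ALL CAPS brand name to its display form."""
    name = " ".join(raw.split()).upper()  # collapse whitespace, force a known casing
    if name in ACRONYMS:
        return name
    if name in SPECIAL_CASES:
        return SPECIAL_CASES[name]
    # capitalize() uppercases the first letter of each word and lowercases the rest
    return " ".join(word.capitalize() for word in name.split(" "))
```

Checking the acronym and special-case tables before falling back to Title Case keeps the exceptions in data rather than in branching logic, so adding a brand is a one-line change.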

Engine Matching

The ETL uses pattern matching to link simple engine strings from makes-filter to detailed specs:

  1. Parse engine string: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
  2. Normalize displacement: Convert Cm3 to Liters ("3506 Cm3" → "3.5L")
  3. Match to cache: Look up in engine cache by (displacement, configuration)
  4. Create display name: Format as "V8 3.5L", "L4 2.0L Turbo", etc.
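
Steps 1 and 3 can be illustrated with two small functions. The regular expressions and function names are assumptions made for this sketch, and the lookup assumes both sides already spell the configuration the same way.

```python
import re

# "2.0L", "5 L", "3.5l"
DISPLACEMENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*L\b", re.IGNORECASE)
# "I4", "L4", "V6", "V-8"
CONFIGURATION_RE = re.compile(r"\b([VIL])-?(\d{1,2})\b", re.IGNORECASE)


def parse_engine_string(text):
    """Return (displacement, configuration), e.g. ('2.0L', 'I4'), or None if either is missing."""
    displacement = DISPLACEMENT_RE.search(text)
    configuration = CONFIGURATION_RE.search(text)
    if displacement is None or configuration is None:
        return None
    liters = f"{float(displacement.group(1)):.1f}L"  # "2L" and "2.0L" both become "2.0L"
    layout = configuration.group(1).upper() + configuration.group(2)
    return (liters, layout)


def match_engine(text, cache):
    """Look up the parsed key in a {(displacement, configuration): engine_id} cache."""
    key = parse_engine_string(text)
    return cache.get(key) if key else None
```

Strings that carry no displacement or configuration, such as an electric drivetrain label, return None and are left unmatched.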

Transmission Linking

  • Transmission data is embedded in engines.json under "Transmission Specs"
  • Each engine record includes gearbox type (e.g., "6-Speed Manual")
  • ETL links transmissions to vehicle records based on engine match
  • Success rate: 98.9% (1,109,510 of 1,122,644 records)
  • Unlinked records: primarily electric vehicles without traditional transmissions

Configuration Equivalents

  • I4 = L4 = INLINE-4 = 4 Inline
  • V6 = V-6
  • V8 = V-8

Filtered Makes (53 Total)

All brand names are stored in Title Case format for user-friendly display.

American Brands (12)

Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, RAM

Luxury/Performance (13)

Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls Royce, Tesla, Jaguar, Audi, BMW, Land Rover

Japanese (8)

Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota

European (9)

Alfa Romeo, FIAT, MINI, Saab, Saturn, Scion, Smart, Volkswagen, Volvo

Other (12)

Genesis, Geo, Hyundai, KIA, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac

Troubleshooting

Container Not Running

docker compose up -d
docker compose ps

Database Connection Issues

Check connection parameters in etl_vehicle_data.py:

DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}

Missing JSON Files

Ensure these files exist in project root:

  • engines.json
  • automobiles.json
  • brands.json
  • makes-filter/*.json (57 files)
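
A quick pre-flight check for these inputs could look like this. It is a hypothetical helper, not part of the ETL scripts, and it only verifies that the files exist, not that they contain valid JSON.

```python
from pathlib import Path

REQUIRED_FILES = ("engines.json", "automobiles.json", "brands.json")


def missing_inputs(root="."):
    """Return the names of required inputs that are absent under root."""
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    filter_dir = root / "makes-filter"
    # At least one makes-filter/*.json file must be present.
    if not (filter_dir.is_dir() and any(filter_dir.glob("*.json"))):
        missing.append("makes-filter/*.json")
    return missing


if __name__ == "__main__":
    problems = missing_inputs()
    print("All inputs found" if not problems else f"Missing: {', '.join(problems)}")
```

Run it from the project root before `python3 etl_generate_sql.py`; an empty result means every expected input was found.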

Python Dependencies

pip3 install psycopg2-binary

Expected Results

After successful ETL:

  • Engines: 30,066 records
  • Transmissions: 828 records
  • Vehicle Options: 1,122,644 configurations
  • Years: 47 years (1980-2026)
  • Makes: 53 manufacturers
  • Models: 1,741 unique models
  • Transmission Linking: 98.9% success rate
  • Output Files: ~52MB total (632KB engines + 21KB transmissions + 51MB vehicles)

Next Steps

  1. Create API endpoints for dropdown queries
  2. Add caching layer for frequently accessed queries
  3. Implement full-text search for models
  4. Add vehicle images and detailed specs display
  5. Create admin interface for data management