# Automotive Vehicle Selection Database - ETL Documentation

## Overview

This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection:

**Year → Make → Model → Trim → Engine/Transmission**

## Database Schema

### Tables

1. **engines** - Detailed engine specifications
   - Displacement, configuration, horsepower, torque
   - Fuel type, fuel system, aspiration
   - Full specs stored as JSONB

2. **transmissions** - Transmission specifications
   - Type (Manual, Automatic, CVT, etc.)
   - Number of speeds
   - Drive type (FWD, RWD, AWD, 4WD)

3. **vehicle_options** - Denormalized vehicle configurations
   - Year, Make, Model, Trim
   - Foreign keys to engines and transmissions
   - Optimized indexes for dropdown queries

### Views

- `available_years` - All distinct years
- `makes_by_year` - Makes grouped by year
- `models_by_year_make` - Models grouped by year/make
- `trims_by_year_make_model` - Trims grouped by year/make/model
- `complete_vehicle_configs` - Full vehicle details with engine/transmission

### Functions

- `get_makes_for_year(year)` - Returns makes for a specific year
- `get_models_for_year_make(year, make)` - Returns models for year/make
- `get_trims_for_year_make_model(year, make, model)` - Returns trims
- `get_options_for_vehicle(year, make, model, trim)` - Returns engine/transmission options

## Data Sources

### Primary Source

**makes-filter/*.json** (57 makes)
- Filtered manufacturer data
- Year/model/trim/engine hierarchy
- Engine specs as simple strings (e.g., "2.0L I4")

### Detailed Specs

**engines.json** (30,066+ records)
- Complete engine specifications
- Performance data, fuel economy
- Transmission details

**automobiles.json** (7,207 models)
- Model descriptions
- Used for hybrid backfill of recent years (2023-2025)

**brands.json** (124 brands)
- Brand metadata
- Used for brand name mapping

## ETL Process

### Step 1: Import Engine & Transmission Specs

- Parse all records from `engines.json`
- Extract detailed specifications
- Create engines and transmissions tables
- Build in-memory caches for fast lookups

### Step 2: Process Makes-Filter Data

- Read all 57 JSON files from `makes-filter/`
- Extract year/make/model/trim/engine combinations
- Match engine strings to detailed specs using displacement + configuration
- Build vehicle_options records

### Step 3: Hybrid Backfill

- Check `automobiles.json` for recent years (2023-2025)
- Add any missing year/make/model combinations
- Only backfill for the 57 filtered makes
- Limit to 3 engines per backfilled model

### Step 4: Insert Vehicle Options

- Batch insert all vehicle_options records
- Create indexes for optimal query performance
- Generate views and functions

### Step 5: Validation

- Count records in each table
- Test dropdown cascade queries
- Display sample data

## Running the ETL

### Prerequisites

- Docker container `mvp-postgres` running
- Python 3 with psycopg2
- JSON source files in project root

### Quick Start

```bash
./run_migration.sh
```

### Manual Steps

```bash
# 1. Run migration
# -T disables TTY allocation so the SQL file can be read from stdin
docker compose exec -T mvp-postgres psql -U postgres -d motovaultpro < migrations/001_create_vehicle_database.sql

# 2. Install Python dependencies
pip3 install psycopg2-binary

# 3. Run ETL script
python3 etl_vehicle_data.py
```

## Query Examples

### Get all available years

```sql
SELECT * FROM available_years;
```

### Get makes for 2024

```sql
SELECT * FROM get_makes_for_year(2024);
```

### Get models for 2024 Ford

```sql
SELECT * FROM get_models_for_year_make(2024, 'Ford');
```

### Get trims for 2024 Ford F-150

```sql
SELECT * FROM get_trims_for_year_make_model(2024, 'Ford', 'F-150');
```

### Get engine/transmission options for specific vehicle

```sql
SELECT * FROM get_options_for_vehicle(2024, 'Ford', 'F-150', 'XLT');
```

### Complete vehicle configurations

```sql
SELECT * FROM complete_vehicle_configs
WHERE year = 2024 AND make = 'Tesla'
ORDER BY model, trim;
```

## Performance Optimization

### Indexes Created

- `idx_vehicle_year` - Single column index on year
- `idx_vehicle_make` - Single column index on make
- `idx_vehicle_model` - Single column index on model
- `idx_vehicle_year_make` - Composite index for year/make queries
- `idx_vehicle_year_make_model` - Composite index for year/make/model queries
- `idx_vehicle_year_make_model_trim` - Composite index for full cascade

### Query Performance

Dropdown queries are designed to return results in under 50 ms on typical datasets.

## Data Matching Logic

### Engine Matching

The ETL uses pattern matching to link the simple engine strings from makes-filter to the detailed specs:

1. **Parse engine string**: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
2. **Normalize**: Convert to uppercase, standardize format
3. **Match to cache**: Look up in engine cache by (displacement, configuration)
4. **Handle variations**: Account for I4/L4, V6/V-6, etc.
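
The four matching steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `etl_vehicle_data.py`: the names `CONFIG_ALIASES`, `ENGINE_PATTERN`, `parse_engine_string`, and `match_engine` are hypothetical, the cache is assumed to be a plain dict keyed by `(displacement, configuration)`, and the choice of `I4`, `V6`, and `V8` as canonical forms is an assumption based on the equivalents documented in this README.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Alias -> canonical configuration. The equivalents come from this README;
# picking I4/V6/V8 as the canonical spellings is an assumption.
CONFIG_ALIASES = {"L4": "I4", "INLINE-4": "I4", "V-6": "V6", "V-8": "V8"}

# Displacement in litres followed by a configuration token, e.g. "2.0L I4".
ENGINE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*L\s+([A-Z]+-?\d+)")

EngineKey = Tuple[str, str]


def parse_engine_string(raw: str) -> Optional[EngineKey]:
    """Return (displacement, configuration), or None if the string does not parse."""
    text = raw.strip().upper()  # normalize case before matching
    match = ENGINE_PATTERN.search(text)
    if match is None:
        return None
    litres, config = match.group(1), match.group(2)
    if "." not in litres:  # standardize "2L" to "2.0L" without rounding
        litres += ".0"
    return f"{litres}L", CONFIG_ALIASES.get(config, config)


def match_engine(raw: str, engine_cache: Dict[EngineKey, Any]) -> Optional[Any]:
    """Look up a simple engine string in a cache keyed by (displacement, configuration)."""
    key = parse_engine_string(raw)
    return None if key is None else engine_cache.get(key)
```

For example, with a hypothetical cache `{("2.0L", "I4"): 101}`, both `match_engine("2.0L I4", cache)` and `match_engine("2.0l inline-4", cache)` resolve to the same entry, while a string with no displacement, such as `"Electric"`, returns `None` and would need separate handling.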
### Configuration Equivalents

- `I4` = `L4` = `INLINE-4`
- `V6` = `V-6`
- `V8` = `V-8`

## Filtered Makes (57 Total)

The groups below name 54 of the 57 makes; the files in `makes-filter/` define the full set.

### American Brands (12)

Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, Ram

### Luxury/Performance (13)

Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls-Royce, Tesla, Jaguar, Audi, BMW, Land Rover

### Japanese (8)

Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota

### European (9)

Alfa Romeo, Fiat, Mini, Saab, Saturn, Scion, Smart, Volkswagen, Volvo

### Other (12)

Genesis, Geo, Hyundai, Kia, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac

## Troubleshooting

### Container Not Running

```bash
docker compose up -d
docker compose ps
```

### Database Connection Issues

Check connection parameters in `etl_vehicle_data.py`:

```python
DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}
```

### Missing JSON Files

Ensure these files exist in project root:

- `engines.json`
- `automobiles.json`
- `brands.json`
- `makes-filter/*.json` (57 files)

### Python Dependencies

```bash
pip3 install psycopg2-binary
```

## Expected Results

After successful ETL:

- **Engines**: ~30,000 records
- **Transmissions**: ~500-1,000 unique combinations
- **Vehicle Options**: ~50,000-100,000 configurations
- **Years**: 10-15 distinct years
- **Makes**: 57 manufacturers
- **Models**: 1,000-2,000 unique models

## Next Steps

1. Create API endpoints for dropdown queries
2. Add caching layer for frequently accessed queries
3. Implement full-text search for models
4. Add vehicle images and detailed specs display
5. Create admin interface for data management
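
As a possible starting point for the first item, the sketch below wraps the SQL functions listed under Functions in one small helper that an API endpoint could call. It is an illustration under assumptions rather than part of the ETL: `CASCADE_FUNCTIONS`, `build_cascade_query`, `fetch_dropdown`, and the level labels are hypothetical names, and the cursor is assumed to be any DB-API cursor that accepts `%s` placeholders, as psycopg2 cursors do.

```python
from typing import Any, List, Sequence, Tuple

# SQL function names are taken from the Functions section of this README;
# the level labels and helper names are hypothetical.
CASCADE_FUNCTIONS = {
    "makes": ("get_makes_for_year", 1),
    "models": ("get_models_for_year_make", 2),
    "trims": ("get_trims_for_year_make_model", 3),
    "options": ("get_options_for_vehicle", 4),
}


def build_cascade_query(level: str, args: Sequence[Any]) -> Tuple[str, Tuple[Any, ...]]:
    """Return a parameterized SELECT and its parameters for one dropdown level."""
    if level not in CASCADE_FUNCTIONS:
        raise ValueError(f"unknown dropdown level: {level!r}")
    function_name, arity = CASCADE_FUNCTIONS[level]
    if len(args) != arity:
        raise ValueError(f"{level} expects {arity} argument(s), got {len(args)}")
    # Only placeholders go into the SQL text; values are passed separately
    # so the driver handles quoting.
    placeholders = ", ".join(["%s"] * arity)
    return f"SELECT * FROM {function_name}({placeholders})", tuple(args)


def fetch_dropdown(cursor: Any, level: str, *args: Any) -> List[Any]:
    """Run the query for one dropdown level on an open cursor and return all rows."""
    sql, params = build_cascade_query(level, args)
    cursor.execute(sql, params)
    return cursor.fetchall()
```

With a psycopg2 connection opened from the `DB_CONFIG` values shown under Troubleshooting, a call such as `fetch_dropdown(conn.cursor(), "models", 2024, "Ford")` would fetch the rows for the Model dropdown. Keeping the function names in one table makes it harder for an endpoint to build SQL from user-supplied strings.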