# Automotive Vehicle Selection Database - ETL Documentation

## Overview

This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection:

**Year → Make → Model → Trim → Engine/Transmission**

## Database Schema

### Tables

1. **engines** - Simplified engine specifications
   - id (Primary Key)
   - name (Display format: "V8 3.5L", "L4 2.0L Turbo", "V6 6.2L Supercharged")

2. **transmissions** - Simplified transmission specifications
   - id (Primary Key)
   - type (Display format: "8-Speed Automatic", "6-Speed Manual", "CVT")

3. **vehicle_options** - Denormalized vehicle configurations
   - Year, Make (Title Case: "Ford", "Acura", "Land Rover"), Model, Trim
   - Foreign keys to engines and transmissions
   - Optimized indexes for dropdown queries

### Views

- `available_years` - All distinct years
- `makes_by_year` - Makes grouped by year
- `models_by_year_make` - Models grouped by year/make
- `trims_by_year_make_model` - Trims grouped by year/make/model
- `complete_vehicle_configs` - Full vehicle details with engine/transmission

### Functions

- `get_makes_for_year(year)` - Returns makes for a specific year
- `get_models_for_year_make(year, make)` - Returns models for year/make
- `get_trims_for_year_make_model(year, make, model)` - Returns trims
- `get_options_for_vehicle(year, make, model, trim)` - Returns engine/transmission options

## Data Sources

### Primary Source

**makes-filter/*.json** (57 makes)

- Filtered manufacturer data
- Year/model/trim/engine hierarchy
- Engine specs as simple strings (e.g., "2.0L I4")

### Detailed Specs

**engines.json** (30,066+ records)

- Complete engine specifications
- Performance data, fuel economy
- Transmission details

**automobiles.json** (7,207 models)

- Model descriptions
- Used for hybrid backfill of recent years (2023-2025)

**brands.json** (124 brands)

- Brand metadata
- Used for brand name mapping

## ETL Process

### Step 1: Load Source Data

- Load `engines.json` (30,066 records)
- Load `brands.json` (124 brands)
- Load `automobiles.json` (7,207 models)
- Load all `makes-filter/*.json` files (55 files)

### Step 2: Transform Brand Names

- Convert ALL CAPS brand names to Title Case ("FORD" → "Ford")
- Preserve acronyms (BMW, GMC, KIA remain uppercase)
- Handle special cases (DeLorean, McLaren)

### Step 3: Process Engine Specifications

- Extract engine specs from engines.json
- Create simplified display names (e.g., "V8 3.5L Turbo")
- Normalize displacement (Cm3 → Liters) for matching
- Build engine cache with (displacement, configuration) keys
- Generate engines SQL with only id and name columns

### Step 4: Process Transmission Specifications

- Extract transmission specs from engines.json
- Create simplified display names (e.g., "8-Speed Automatic")
- Parse speed count and transmission type
- Build transmission cache for linking
- Generate transmissions SQL with only id and type columns

### Step 5: Process Makes-Filter Data

- Read all JSON files from `makes-filter/`
- Extract year/make/model/trim/engine combinations
- Match engine strings to detailed specs using displacement + configuration
- Link transmissions to vehicle records (98.9% success rate)
- Apply year filter (1980 and newer only)
- Build vehicle_options records

### Step 6: Hybrid Backfill

- Check `automobiles.json` for recent years (2023-2025)
- Add any missing year/make/model combinations
- Only backfill for filtered makes
- Link transmissions for backfilled records
- Limit to 3 engines per backfilled model

### Step 7: Generate SQL Output

- Write SQL files with proper escaping (newlines, quotes, special characters)
- Convert empty strings to NULL for data integrity
- Use batched inserts (1000 records per batch)
- Output to `output/` directory

## Running the ETL

### Prerequisites

- Docker container `mvp-postgres` running
- Python 3 (no additional dependencies required)
- JSON source files in project root

### Quick Start

```bash
# Step 1: Generate SQL files from JSON data
python3 etl_generate_sql.py

# Step 2: Import SQL files into database
./import_data.sh
```

### What Gets Generated

- `output/01_engines.sql` (~632KB, 30,066 records)
- `output/02_transmissions.sql` (~21KB, 828 records)
- `output/03_vehicle_options.sql` (~51MB, 1,122,644 records)

## Query Examples

### Get all available years

```sql
SELECT * FROM available_years;
```

### Get makes for 2024

```sql
SELECT * FROM get_makes_for_year(2024);
```

### Get models for 2025 Ford

```sql
SELECT * FROM get_models_for_year_make(2025, 'Ford');
```

### Get trims for 2025 Ford F-150

```sql
SELECT * FROM get_trims_for_year_make_model(2025, 'Ford', 'f-150');
```

### Get engine/transmission options for specific vehicle

```sql
SELECT * FROM get_options_for_vehicle(2025, 'Ford', 'f-150', 'XLT');
```

### Complete vehicle configurations

```sql
SELECT *
FROM complete_vehicle_configs
WHERE year = 2025
  AND make = 'Ford'
  AND model = 'f-150'
LIMIT 10;
```

## Performance Optimization

### Indexes Created

- `idx_vehicle_year` - Single column index on year
- `idx_vehicle_make` - Single column index on make
- `idx_vehicle_model` - Single column index on model
- `idx_vehicle_year_make` - Composite index for year/make queries
- `idx_vehicle_year_make_model` - Composite index for year/make/model queries
- `idx_vehicle_year_make_model_trim` - Composite index for full cascade

### Query Performance

Dropdown queries are optimized to return results in < 50ms for typical datasets.

## Data Matching Logic

### Brand Name Transformation

- Source data (brands.json) stores names in ALL CAPS: "FORD", "ACURA", "ALFA ROMEO"
- ETL converts to Title Case: "Ford", "Acura", "Alfa Romeo"
- Preserves acronyms: BMW, GMC, KIA, MINI, FIAT, RAM
- Special cases: DeLorean, McLaren

### Engine Matching

The ETL uses intelligent pattern matching to link simple engine strings from makes-filter to detailed specs:

1. **Parse engine string**: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
2. **Normalize displacement**: Convert Cm3 to Liters ("3506 Cm3" → "3.5L")
3. **Match to cache**: Look up in engine cache by (displacement, configuration)
4. **Create display name**: Format as "V8 3.5L", "L4 2.0L Turbo", etc.

### Transmission Linking

- Transmission data is embedded in engines.json under "Transmission Specs"
- Each engine record includes gearbox type (e.g., "6-Speed Manual")
- ETL links transmissions to vehicle records based on engine match
- Success rate: 98.9% (1,109,510 of 1,122,644 records)
- Unlinked records: primarily electric vehicles without traditional transmissions

### Configuration Equivalents

- `I4` = `L4` = `INLINE-4` = `4 Inline`
- `V6` = `V-6`
- `V8` = `V-8`

## Filtered Makes (53 Total)

All brand names are stored in Title Case format for user-friendly display.

### American Brands (12)

Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, RAM

### Luxury/Performance (13)

Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls Royce, Tesla, Jaguar, Audi, BMW, Land Rover

### Japanese (8)

Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota

### European (9)

Alfa Romeo, FIAT, MINI, Saab, Saturn, Scion, Smart, Volkswagen, Volvo

### Other (11)

Genesis, Geo, Hyundai, KIA, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac

## Troubleshooting

### Container Not Running

```bash
docker compose up -d
docker compose ps
```

### Database Connection Issues

Check connection parameters in `etl_vehicle_data.py`:

```python
DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}
```

### Missing JSON Files

Ensure these files exist in project root:

- `engines.json`
- `automobiles.json`
- `brands.json`
- `makes-filter/*.json` (57 files)

### Python Dependencies

```bash
pip3 install psycopg2-binary
```

## Expected Results

After successful ETL:

- **Engines**: 30,066 records
- **Transmissions**: 828 records
- **Vehicle Options**: 1,122,644 configurations
- **Years**: 47 years (1980-2026)
- **Makes**: 53 manufacturers
- **Models**: 1,741 unique models
- **Transmission Linking**: 98.9% success rate
- **Output Files**: ~52MB total (632KB engines + 21KB transmissions + 51MB vehicles)

## Next Steps

1. Create API endpoints for dropdown queries
2. Add caching layer for frequently accessed queries
3. Implement full-text search for models
4. Add vehicle images and detailed specs display
5. Create admin interface for data management
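
## Appendix: Matching Logic Sketch

The rules under "Data Matching Logic" (brand name transformation, displacement normalization, and configuration equivalents) can be illustrated with a short Python sketch. This is not the code in `etl_generate_sql.py`. The function names (`normalize_brand`, `normalize_displacement`, `normalize_configuration`, `engine_key`), the regular expressions, and the choice of `I4` as the canonical inline form are all assumptions made for the example; only the lookup values come from the lists in this document.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup tables, filled from the lists in this document.
ACRONYMS = {"BMW", "GMC", "KIA", "MINI", "FIAT", "RAM"}
SPECIAL_CASES = {"DELOREAN": "DeLorean", "MCLAREN": "McLaren"}


def normalize_brand(name: str) -> str:
    """Convert an ALL CAPS brand name to its display form."""
    upper = name.strip().upper()
    if upper in ACRONYMS:
        return upper
    if upper in SPECIAL_CASES:
        return SPECIAL_CASES[upper]
    return " ".join(word.capitalize() for word in upper.split())


def normalize_displacement(text: str) -> Optional[str]:
    """Return displacement in liters with one decimal, e.g. '3.5L'."""
    cm3 = re.search(r"(\d+)\s*cm3", text, re.IGNORECASE)
    if cm3:
        return f"{int(cm3.group(1)) / 1000:.1f}L"
    liters = re.search(r"(\d+(?:\.\d+)?)\s*L\b", text, re.IGNORECASE)
    if liters:
        return f"{float(liters.group(1)):.1f}L"
    return None


def normalize_configuration(text: str) -> Optional[str]:
    """Collapse equivalent spellings: I4/L4/INLINE-4/'4 Inline' -> 'I4', V-6 -> 'V6'."""
    upper = text.upper()
    match = re.search(r"\b(?:INLINE-?|[IL])(\d+)\b", upper)
    if match:
        return f"I{match.group(1)}"
    match = re.search(r"\b(\d+)\s+INLINE\b", upper)
    if match:
        return f"I{match.group(1)}"
    match = re.search(r"\bV-?(\d+)\b", upper)
    if match:
        return f"V{match.group(1)}"
    return None


def engine_key(engine: str) -> Tuple[Optional[str], Optional[str]]:
    """Build the (displacement, configuration) key used for cache lookups."""
    return (normalize_displacement(engine), normalize_configuration(engine))


# Illustrative cache with a made-up engine id; real ids come from the ETL.
engine_cache = {engine_key("1998 Cm3 L4"): 101}
matched_id = engine_cache.get(engine_key("2.0L I4"))
print(normalize_brand("ALFA ROMEO"), engine_key("3506 Cm3 V-8"), matched_id)
```

Keying the cache on the normalized pair means that a makes-filter string such as "2.0L I4" and a detailed spec recorded as "1998 Cm3" with an "L4" layout resolve to the same entry, which is the behavior the matching steps describe.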