Automotive Vehicle Selection Database - ETL Documentation
Overview
This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection: Year → Make → Model → Trim → Engine/Transmission
Database Schema
Tables
- engines - Detailed engine specifications
  - Displacement, configuration, horsepower, torque
  - Fuel type, fuel system, aspiration
  - Full specs stored as JSONB
- transmissions - Transmission specifications
  - Type (Manual, Automatic, CVT, etc.)
  - Number of speeds
  - Drive type (FWD, RWD, AWD, 4WD)
- vehicle_options - Denormalized vehicle configurations
  - Year, Make, Model, Trim
  - Foreign keys to engines and transmissions
  - Optimized indexes for dropdown queries
Views
- available_years - All distinct years
- makes_by_year - Makes grouped by year
- models_by_year_make - Models grouped by year/make
- trims_by_year_make_model - Trims grouped by year/make/model
- complete_vehicle_configs - Full vehicle details with engine/transmission
Functions
- get_makes_for_year(year) - Returns makes for a specific year
- get_models_for_year_make(year, make) - Returns models for year/make
- get_trims_for_year_make_model(year, make, model) - Returns trims
- get_options_for_vehicle(year, make, model, trim) - Returns engine/transmission options
Data Sources
Primary Source
makes-filter/*.json (57 makes)
- Filtered manufacturer data
- Year/model/trim/engine hierarchy
- Engine specs as simple strings (e.g., "2.0L I4")
Detailed Specs
engines.json (30,066+ records)
- Complete engine specifications
- Performance data, fuel economy
- Transmission details
automobiles.json (7,207 models)
- Model descriptions
- Used for hybrid backfill of recent years (2023-2025)
brands.json (124 brands)
- Brand metadata
- Used for brand name mapping
ETL Process
Step 1: Import Engine & Transmission Specs
- Parse all records from engines.json
- Extract detailed specifications
- Create engines and transmissions tables
- Build in-memory caches for fast lookups
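The in-memory cache from Step 1 could look roughly like the sketch below. It is not the actual code in etl_vehicle_data.py: the field names `id`, `displacement`, and `configuration` are assumptions about the shape of engines.json, and the inline sample records are made up for illustration.

```python
import json
from collections import defaultdict


def build_engine_cache(records):
    """Index engine records by a normalized (displacement, configuration) key."""
    cache = defaultdict(list)
    for record in records:
        # Field names are hypothetical; adjust them to the real engines.json layout.
        displacement = str(record.get("displacement", "")).strip().upper()
        configuration = str(record.get("configuration", "")).strip().upper()
        if displacement and configuration:
            cache[(displacement, configuration)].append(record)
    return dict(cache)


# Made-up sample standing in for engines.json.
sample_json = """
[
  {"id": 1, "displacement": "2.0L", "configuration": "I4"},
  {"id": 2, "displacement": "2.0l", "configuration": "i4"},
  {"id": 3, "displacement": "5.0L", "configuration": "V8"}
]
"""
engine_cache = build_engine_cache(json.loads(sample_json))
```

Keeping a list per key lets several detailed records share one (displacement, configuration) pair, which is likely given how coarse the key is.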
Step 2: Process Makes-Filter Data
- Read all 57 JSON files from makes-filter/
- Extract year/make/model/trim/engine combinations
- Match engine strings to detailed specs using displacement + configuration
- Build vehicle_options records
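As a rough illustration of Step 2, the sketch below flattens one make file into year/make/model/trim/engine tuples. The nested layout (`make`, `years`, `models`, `trims`, `engines`, `name`) is an assumption, since the real structure of the makes-filter files is not shown here, and the sample data is invented.

```python
def iter_vehicle_rows(make_document):
    """Yield (year, make, model, trim, engine_string) tuples from one make file."""
    make = make_document["make"]  # hypothetical key names throughout
    for year_entry in make_document.get("years", []):
        for model_entry in year_entry.get("models", []):
            for trim_entry in model_entry.get("trims", []):
                for engine_string in trim_entry.get("engines", []):
                    yield (
                        year_entry["year"],
                        make,
                        model_entry["name"],
                        trim_entry["name"],
                        engine_string,
                    )


# Invented sample in the assumed layout.
sample_make = {
    "make": "Ford",
    "years": [
        {
            "year": 2024,
            "models": [
                {
                    "name": "F-150",
                    "trims": [{"name": "XLT", "engines": ["2.7L V6", "5.0L V8"]}],
                }
            ],
        }
    ],
}
rows = list(iter_vehicle_rows(sample_make))
```

Each tuple can then be matched against the engine cache before it becomes a vehicle_options record.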
Step 3: Hybrid Backfill
- Check automobiles.json for recent years (2023-2025)
- Add any missing year/make/model combinations
- Only backfill for the 57 filtered makes
- Limit to 3 engines per backfilled model
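The backfill rules above can be sketched as a single filter. This is an illustration under assumptions, not the script's actual logic: the candidate record keys (`year`, `make`, `model`, `engines`) and the function name are hypothetical.

```python
def backfill_rows(existing_keys, candidates, allowed_makes,
                  years=range(2023, 2026), max_engines=3):
    """Return (year, make, model, engine) rows for combinations not yet present."""
    added = []
    for candidate in candidates:
        key = (candidate["year"], candidate["make"], candidate["model"])
        if candidate["year"] not in years:
            continue  # only recent years are backfilled
        if candidate["make"] not in allowed_makes:
            continue  # only the filtered makes are backfilled
        if key in existing_keys:
            continue  # already covered by makes-filter data
        # Cap the number of engines attached to each backfilled model.
        for engine in candidate.get("engines", [])[:max_engines]:
            added.append(key + (engine,))
    return added
```

For example, a 2025 model with four candidate engines that is missing from the primary source would contribute exactly three rows.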
Step 4: Insert Vehicle Options
- Batch insert all vehicle_options records
- Create indexes for optimal query performance
- Generate views and functions
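A minimal sketch of the batch insert in Step 4, assuming a DB-API cursor such as the one psycopg2 provides. The column names `engine_id` and `transmission_id`, the batch size, and the helper names are assumptions; only the standard `cursor.executemany` call is used.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_vehicle_options(cursor, rows, batch_size=1000, placeholder="%s"):
    """Insert rows in batches and return the number of rows sent."""
    # Column names are assumed; the real table definition lives in the migration.
    columns = ("year", "make", "model", "trim", "engine_id", "transmission_id")
    sql = "INSERT INTO vehicle_options ({}) VALUES ({})".format(
        ", ".join(columns), ", ".join([placeholder] * len(columns))
    )
    total = 0
    for batch in batched(rows, batch_size):
        cursor.executemany(sql, batch)
        total += len(batch)
    return total
```

With psycopg2 the default `%s` placeholder applies, and `psycopg2.extras.execute_values` is a commonly used faster alternative to `executemany` for large inserts. Creating the indexes after the bulk load, as the step order suggests, avoids maintaining them row by row during the insert.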
Step 5: Validation
- Count records in each table
- Test dropdown cascade queries
- Display sample data
Running the ETL
Prerequisites
- Docker container mvp-postgres running
- Python 3 with psycopg2
- JSON source files in project root
Quick Start
./run_migration.sh
Manual Steps
# 1. Run migration
docker compose exec -T mvp-postgres psql -U postgres -d motovaultpro < migrations/001_create_vehicle_database.sql
# 2. Install Python dependencies
pip3 install psycopg2-binary
# 3. Run ETL script
python3 etl_vehicle_data.py
Query Examples
Get all available years
SELECT * FROM available_years;
Get makes for 2024
SELECT * FROM get_makes_for_year(2024);
Get models for 2024 Ford
SELECT * FROM get_models_for_year_make(2024, 'Ford');
Get trims for 2024 Ford F-150
SELECT * FROM get_trims_for_year_make_model(2024, 'Ford', 'F-150');
Get engine/transmission options for specific vehicle
SELECT * FROM get_options_for_vehicle(2024, 'Ford', 'F-150', 'XLT');
Complete vehicle configurations
SELECT * FROM complete_vehicle_configs
WHERE year = 2024 AND make = 'Tesla'
ORDER BY model, trim;
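From application code, the cascade can be driven by how many selections the user has made so far. The helper below is a hypothetical sketch: it only builds a parameterized query and its arguments for the functions documented above, leaving execution to whatever database cursor the caller uses (the `%s` placeholders follow psycopg2's style).

```python
CASCADE_FUNCTIONS = {
    1: "get_makes_for_year",
    2: "get_models_for_year_make",
    3: "get_trims_for_year_make_model",
    4: "get_options_for_vehicle",
}


def next_dropdown_query(*selection):
    """Return (sql, params) for the dropdown that follows the current selection."""
    if not selection:
        # Nothing chosen yet: the first dropdown lists the available years.
        return "SELECT * FROM available_years", ()
    if len(selection) not in CASCADE_FUNCTIONS:
        raise ValueError("selection must be year, make, model, trim (at most 4 values)")
    placeholders = ", ".join(["%s"] * len(selection))
    sql = f"SELECT * FROM {CASCADE_FUNCTIONS[len(selection)]}({placeholders})"
    return sql, tuple(selection)
```

A caller would pass the result straight to the cursor, for example `cursor.execute(*next_dropdown_query(2024, 'Ford'))`, so user input is always bound as a parameter rather than interpolated into the SQL string.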
Performance Optimization
Indexes Created
- idx_vehicle_year - Single column index on year
- idx_vehicle_make - Single column index on make
- idx_vehicle_model - Single column index on model
- idx_vehicle_year_make - Composite index for year/make queries
- idx_vehicle_year_make_model - Composite index for year/make/model queries
- idx_vehicle_year_make_model_trim - Composite index for full cascade
Query Performance
Dropdown queries are optimized to return results in < 50ms for typical datasets.
Data Matching Logic
Engine Matching
The ETL uses pattern matching to link the simple engine strings from makes-filter to the detailed specs from engines.json:
- Parse engine string: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
- Normalize: Convert to uppercase, standardize format
- Match to cache: Look up in engine cache by (displacement, configuration)
- Handle variations: Account for I4/L4, V6/V-6, etc.
Configuration Equivalents
- I4 = L4 = INLINE-4
- V6 = V-6
- V8 = V-8
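The parse and normalize steps can be sketched as follows. The regular expression and the function name are illustrative assumptions rather than the script's actual implementation; the alias table mirrors the equivalents listed above.

```python
import re

# Map alternate spellings onto one canonical configuration.
CONFIG_ALIASES = {"L4": "I4", "INLINE-4": "I4", "V-6": "V6", "V-8": "V8"}

# Displacement in litres followed by a configuration token, e.g. "2.0L I4".
ENGINE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*L\b\s+([A-Z][A-Z0-9-]*)")


def parse_engine_string(text):
    """Return a normalized (displacement, configuration) key, or None if unparseable."""
    match = ENGINE_PATTERN.search(text.strip().upper())
    if match is None:
        return None
    displacement = f"{float(match.group(1)):.1f}L"  # "2L" and "2.0L" both become "2.0L"
    configuration = CONFIG_ALIASES.get(match.group(2), match.group(2))
    return displacement, configuration
```

The returned tuple is meant to be used directly as the lookup key into the engine cache. Strings that carry no displacement, such as those for electric drivetrains, return None and need separate handling.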
Filtered Makes (57 Total)
American Brands (12)
Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, Ram
Luxury/Performance (13)
Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls-Royce, Tesla, Jaguar, Audi, BMW, Land Rover
Japanese (8)
Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota
European (9)
Alfa Romeo, Fiat, Mini, Saab, Saturn, Scion, Smart, Volkswagen, Volvo
Other (12)
Genesis, Geo, Hyundai, Kia, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac
Troubleshooting
Container Not Running
docker compose up -d
docker compose ps
Database Connection Issues
Check connection parameters in etl_vehicle_data.py:
DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}
Missing JSON Files
Ensure these files exist in project root:
- engines.json
- automobiles.json
- brands.json
- makes-filter/*.json (57 files)
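A quick preflight check can report which of these sources are missing before the ETL starts. This is a hypothetical helper, not part of the existing script; the function name and message format are assumptions.

```python
from pathlib import Path

REQUIRED_FILES = ("engines.json", "automobiles.json", "brands.json")


def find_missing_sources(project_root, expected_make_files=57):
    """Return a list of problems with the JSON sources; empty means all present."""
    root = Path(project_root)
    problems = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # Globbing a directory that does not exist simply yields no files.
    make_files = list((root / "makes-filter").glob("*.json"))
    if len(make_files) != expected_make_files:
        problems.append(
            f"makes-filter/*.json: found {len(make_files)}, expected {expected_make_files}"
        )
    return problems
```

Calling `find_missing_sources(".")` from the project root and aborting when the list is non-empty gives a clearer error than a failure partway through the import.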
Python Dependencies
pip3 install psycopg2-binary
Expected Results
After successful ETL:
- Engines: ~30,000 records
- Transmissions: ~500-1000 unique combinations
- Vehicle Options: ~50,000-100,000 configurations
- Years: 10-15 distinct years
- Makes: 57 manufacturers
- Models: 1,000-2,000 unique models
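These expectations can be turned into an automated sanity check after a run. The sketch below is an assumption-laden example: the tolerance around "~30,000" engines is chosen arbitrarily, and how the counts are collected (for instance with `SELECT COUNT(*)` per table) is left to the caller.

```python
EXPECTED_RANGES = {
    "engines": (25_000, 35_000),  # "~30,000"; the tolerance is an assumption
    "transmissions": (500, 1_000),
    "vehicle_options": (50_000, 100_000),
    "years": (10, 15),
    "makes": (57, 57),
    "models": (1_000, 2_000),
}


def check_counts(counts):
    """Return warnings for counts that are missing or outside the expected range."""
    warnings = []
    for name, (low, high) in EXPECTED_RANGES.items():
        value = counts.get(name)
        if value is None:
            warnings.append(f"{name}: no count provided")
        elif not low <= value <= high:
            warnings.append(f"{name}: {value} outside expected range {low}-{high}")
    return warnings
```

An empty list means every count fell inside its range; anything else points at the table worth inspecting first.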
Next Steps
- Create API endpoints for dropdown queries
- Add caching layer for frequently accessed queries
- Implement full-text search for models
- Add vehicle images and detailed specs display
- Create admin interface for data management