Automotive Vehicle Selection Database - ETL Documentation
Overview
This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection: Year → Make → Model → Trim → Engine/Transmission
Database Schema
Tables
- engines - Simplified engine specifications
  - id (Primary Key)
  - name (Display format: "V8 3.5L", "L4 2.0L Turbo", "V6 6.2L Supercharged")
- transmissions - Simplified transmission specifications
  - id (Primary Key)
  - type (Display format: "8-Speed Automatic", "6-Speed Manual", "CVT")
- vehicle_options - Denormalized vehicle configurations
  - Year, Make (Title Case: "Ford", "Acura", "Land Rover"), Model, Trim
  - Foreign keys to engines and transmissions
  - Optimized indexes for dropdown queries
Views
- available_years - All distinct years
- makes_by_year - Makes grouped by year
- models_by_year_make - Models grouped by year/make
- trims_by_year_make_model - Trims grouped by year/make/model
- complete_vehicle_configs - Full vehicle details with engine/transmission
Functions
- get_makes_for_year(year) - Returns makes for a specific year
- get_models_for_year_make(year, make) - Returns models for year/make
- get_trims_for_year_make_model(year, make, model) - Returns trims
- get_options_for_vehicle(year, make, model, trim) - Returns engine/transmission options
Data Sources
Primary Source
makes-filter/*.json (57 makes)
- Filtered manufacturer data
- Year/model/trim/engine hierarchy
- Engine specs as simple strings (e.g., "2.0L I4")
Detailed Specs
engines.json (30,066+ records)
- Complete engine specifications
- Performance data, fuel economy
- Transmission details
automobiles.json (7,207 models)
- Model descriptions
- Used for hybrid backfill of recent years (2023-2025)
brands.json (124 brands)
- Brand metadata
- Used for brand name mapping
ETL Process
Step 1: Load Source Data
- Load engines.json (30,066 records)
- Load brands.json (124 brands)
- Load automobiles.json (7,207 models)
- Load all makes-filter/*.json files (55 files)
Step 2: Transform Brand Names
- Convert ALL CAPS brand names to Title Case ("FORD" → "Ford")
- Preserve acronyms (BMW, GMC, KIA remain uppercase)
- Handle special cases (DeLorean, McLaren)
Step 3: Process Engine Specifications
- Extract engine specs from engines.json
- Create simplified display names (e.g., "V8 3.5L Turbo")
- Normalize displacement (Cm3 → Liters) for matching
- Build engine cache with (displacement, configuration) keys
- Generate engines SQL with only id and name columns
Step 4: Process Transmission Specifications
- Extract transmission specs from engines.json
- Create simplified display names (e.g., "8-Speed Automatic")
- Parse speed count and transmission type
- Build transmission cache for linking
- Generate transmissions SQL with only id and type columns
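The parsing in Step 4 could look roughly like the sketch below. The raw gearbox strings in engines.json are not shown in this document, so the input formats handled here ("8-speed automatic", "6 Speed Manual", "CVT") and both function names are assumptions for illustration, not the ETL's actual code.

```python
import re
from typing import Optional, Tuple


def parse_transmission(raw: str) -> Tuple[Optional[int], str]:
    """Split a gearbox string into (speed count, transmission type)."""
    text = raw.strip()
    # Accept "8-speed", "8 speed" and "8speed" (assumed input variants).
    speeds_match = re.search(r"(\d+)\s*-?\s*speed", text, re.IGNORECASE)
    speeds = int(speeds_match.group(1)) if speeds_match else None

    lowered = text.lower()
    if "cvt" in lowered or "continuously variable" in lowered:
        kind = "CVT"
    elif "manual" in lowered:
        kind = "Manual"
    elif "auto" in lowered:
        kind = "Automatic"
    else:
        kind = "Unknown"
    return speeds, kind


def transmission_display_name(raw: str) -> str:
    """Format a gearbox string as the simplified display name."""
    speeds, kind = parse_transmission(raw)
    if kind == "CVT" or speeds is None:
        return kind
    return f"{speeds}-Speed {kind}"
```

Deduplicating on the display name is one plausible way to build the transmission cache used for linking.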
Step 5: Process Makes-Filter Data
- Read all JSON files from makes-filter/
- Extract year/make/model/trim/engine combinations
- Match engine strings to detailed specs using displacement + configuration
- Link transmissions to vehicle records (98.9% success rate)
- Apply year filter (1980 and newer only)
- Build vehicle_options records
Step 6: Hybrid Backfill
- Check automobiles.json for recent years (2023-2025)
- Add any missing year/make/model combinations
- Only backfill for filtered makes
- Link transmissions for backfilled records
- Limit to 3 engines per backfilled model
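A minimal sketch of the backfill rules above, assuming each candidate from automobiles.json has already been reduced to a dict with `year`, `make`, `model` and `engines` keys. That record shape and the function name are hypothetical, and transmission linking is left out.

```python
from typing import Dict, Iterable, List, Set, Tuple

BACKFILL_YEARS = range(2023, 2026)  # 2023-2025 inclusive
MAX_ENGINES_PER_MODEL = 3

VehicleKey = Tuple[int, str, str]  # (year, make, model)


def backfill_recent_models(
    existing: Set[VehicleKey],
    candidates: Iterable[Dict],
    filtered_makes: Set[str],
) -> List[Dict]:
    """Return new records for recent year/make/model combinations not yet present."""
    seen = set(existing)  # copy so the caller's set is not mutated
    added: List[Dict] = []
    for candidate in candidates:
        key = (candidate["year"], candidate["make"], candidate["model"])
        if candidate["year"] not in BACKFILL_YEARS:
            continue  # only recent years are backfilled
        if candidate["make"] not in filtered_makes:
            continue  # only backfill for filtered makes
        if key in seen:
            continue  # combination already covered
        seen.add(key)
        # Cap the number of engines added per backfilled model.
        for engine in candidate["engines"][:MAX_ENGINES_PER_MODEL]:
            added.append(
                {"year": key[0], "make": key[1], "model": key[2], "engine": engine}
            )
    return added
```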
Step 7: Generate SQL Output
- Write SQL files with proper escaping (newlines, quotes, special characters)
- Convert empty strings to NULL for data integrity
- Use batched inserts (1000 records per batch)
- Output to output/ directory
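The escaping, NULL conversion and batching described in Step 7 can be sketched as follows. The helper names are illustrative, and the use of PostgreSQL escape-string literals (`E'...'`) is one possible choice for this sketch, not necessarily what the ETL script emits.

```python
from typing import Iterator, Sequence


def sql_literal(value) -> str:
    """Render a Python value as a PostgreSQL literal."""
    if value is None or value == "":
        return "NULL"  # empty strings become NULL
    if isinstance(value, (int, float)):
        return str(value)
    text = (
        str(value)
        .replace("\\", "\\\\")  # backslash first, so later escapes are not doubled
        .replace("'", "''")     # single quotes are doubled
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f"E'{text}'"


def batched_inserts(
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence],
    batch_size: int = 1000,
) -> Iterator[str]:
    """Yield one multi-row INSERT statement per batch of rows."""
    column_list = ", ".join(columns)
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        values = ",\n".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in chunk
        )
        yield f"INSERT INTO {table} ({column_list}) VALUES\n{values};"
```

With 1000 rows per statement, the 30,066 engine records would produce 31 INSERT statements.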
Running the ETL
Prerequisites
- Docker container mvp-postgres running
- Python 3 (no additional dependencies required)
- JSON source files in project root
Quick Start
# Step 1: Generate SQL files from JSON data
python3 etl_generate_sql.py
# Step 2: Import SQL files into database
./import_data.sh
What Gets Generated
- output/01_engines.sql (~632KB, 30,066 records)
- output/02_transmissions.sql (~21KB, 828 records)
- output/03_vehicle_options.sql (~51MB, 1,122,644 records)
Query Examples
Get all available years
SELECT * FROM available_years;
Get makes for 2024
SELECT * FROM get_makes_for_year(2024);
Get models for 2025 Ford
SELECT * FROM get_models_for_year_make(2025, 'Ford');
Get trims for 2025 Ford F-150
SELECT * FROM get_trims_for_year_make_model(2025, 'Ford', 'f-150');
Get engine/transmission options for specific vehicle
SELECT * FROM get_options_for_vehicle(2025, 'Ford', 'f-150', 'XLT');
Complete vehicle configurations
SELECT * FROM complete_vehicle_configs
WHERE year = 2025 AND make = 'Ford' AND model = 'f-150'
LIMIT 10;
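From application code, the cascade reduces to picking the next query based on what the user has selected so far. The helper below is a hypothetical sketch that only builds the SQL text and parameters. The `%s` placeholders assume a psycopg2-style DB-API driver; the views and functions are the ones documented above.

```python
from typing import Optional, Tuple


def next_dropdown_query(
    year: Optional[int] = None,
    make: Optional[str] = None,
    model: Optional[str] = None,
    trim: Optional[str] = None,
) -> Tuple[str, tuple]:
    """Return (sql, params) for the next dropdown level in the cascade."""
    if year is None:
        return "SELECT * FROM available_years", ()
    if make is None:
        return "SELECT * FROM get_makes_for_year(%s)", (year,)
    if model is None:
        return "SELECT * FROM get_models_for_year_make(%s, %s)", (year, make)
    if trim is None:
        return (
            "SELECT * FROM get_trims_for_year_make_model(%s, %s, %s)",
            (year, make, model),
        )
    return (
        "SELECT * FROM get_options_for_vehicle(%s, %s, %s, %s)",
        (year, make, model, trim),
    )
```

With an open cursor, `cursor.execute(*next_dropdown_query(2025, "Ford"))` would run the models query with the values passed as bound parameters rather than interpolated into the SQL string.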
Performance Optimization
Indexes Created
- idx_vehicle_year - Single column index on year
- idx_vehicle_make - Single column index on make
- idx_vehicle_model - Single column index on model
- idx_vehicle_year_make - Composite index for year/make queries
- idx_vehicle_year_make_model - Composite index for year/make/model queries
- idx_vehicle_year_make_model_trim - Composite index for full cascade
Query Performance
Dropdown queries are designed to return results in under 50 ms for typical datasets.
Data Matching Logic
Brand Name Transformation
- Source data (brands.json) stores names in ALL CAPS: "FORD", "ACURA", "ALFA ROMEO"
- ETL converts to Title Case: "Ford", "Acura", "Alfa Romeo"
- Preserves acronyms: BMW, GMC, KIA, MINI, FIAT, RAM
- Special cases: DeLorean, McLaren
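The rules above can be sketched as a small lookup-then-title-case function. The acronym and special-case lists come from this section; the function name and structure are illustrative only.

```python
# Brands that stay fully uppercase, and brands with irregular capitalization.
ACRONYMS = {"BMW", "GMC", "KIA", "MINI", "FIAT", "RAM"}
SPECIAL_CASES = {"DELOREAN": "DeLorean", "MCLAREN": "McLaren"}


def format_brand_name(raw: str) -> str:
    """Convert an ALL CAPS brand name to its display form."""
    name = raw.strip().upper()
    if name in ACRONYMS:
        return name
    if name in SPECIAL_CASES:
        return SPECIAL_CASES[name]
    # Title-case each space-separated word: "ALFA ROMEO" -> "Alfa Romeo"
    return " ".join(word.capitalize() for word in name.split())
```

Hyphenated names would need extra handling in this sketch, since only space-separated words are capitalized.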
Engine Matching
The ETL links the simple engine strings from makes-filter to the detailed specs in engines.json by matching on displacement and configuration:
- Parse engine string: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
- Normalize displacement: Convert Cm3 to Liters ("3506 Cm3" → "3.5L")
- Match to cache: Look up in engine cache by (displacement, configuration)
- Create display name: Format as "V8 3.5L", "L4 2.0L Turbo", etc.
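A rough sketch of the parse, normalize and lookup steps, assuming the cache is a plain dict keyed by `(displacement, configuration)` string pairs. The regular expressions, function names and cache value type are assumptions, and configuration aliases such as L4 vs I4 are not resolved here.

```python
import re
from typing import Dict, Optional, Tuple

EngineKey = Tuple[str, str]  # e.g. ("2.0L", "I4")


def cm3_to_liters(text: str) -> Optional[str]:
    """Convert a string such as "3506 Cm3" to "3.5L"."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*cm3", text, re.IGNORECASE)
    if match is None:
        return None
    return f"{float(match.group(1)) / 1000:.1f}L"


def parse_engine_string(text: str) -> Optional[EngineKey]:
    """Extract (displacement, configuration) from a string such as "2.0L I4"."""
    disp = re.search(r"(\d+(?:\.\d+)?)\s*L\b", text, re.IGNORECASE)
    config = re.search(r"\b([ILV])-?(\d{1,2})\b", text, re.IGNORECASE)
    if disp is None or config is None:
        return None
    displacement = f"{float(disp.group(1)):.1f}L"
    configuration = f"{config.group(1).upper()}{config.group(2)}"
    return displacement, configuration


def match_engine(text: str, cache: Dict[EngineKey, int]) -> Optional[int]:
    """Look up an engine id in the cache, or None when nothing matches."""
    key = parse_engine_string(text)
    return cache.get(key) if key is not None else None
```

Keying the cache on formatted strings rather than floats keeps "3506 Cm3" and "3.5L" comparable without floating-point equality checks.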
Transmission Linking
- Transmission data is embedded in engines.json under "Transmission Specs"
- Each engine record includes gearbox type (e.g., "6-Speed Manual")
- ETL links transmissions to vehicle records based on engine match
- Success rate: 98.9% (1,109,510 of 1,122,644 records)
- Unlinked records: primarily electric vehicles without traditional transmissions
Configuration Equivalents
- I4 = L4 = INLINE-4 = 4 Inline
- V6 = V-6
- V8 = V-8
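One way to apply these equivalents is an alias table that maps every spelling to a single canonical form before matching. The choice of I4, V6 and V8 as the canonical spellings and the function name are assumptions made for this sketch.

```python
import re

# Every known spelling maps to one canonical configuration code.
CONFIG_ALIASES = {
    "I4": "I4", "L4": "I4", "INLINE-4": "I4", "4 INLINE": "I4",
    "V6": "V6", "V-6": "V6",
    "V8": "V8", "V-8": "V8",
}


def normalize_configuration(raw: str) -> str:
    """Return the canonical configuration, or the cleaned input if unknown."""
    # Uppercase and collapse whitespace so "4  inline" matches "4 INLINE".
    key = re.sub(r"\s+", " ", raw.strip().upper())
    return CONFIG_ALIASES.get(key, key)
```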
Filtered Makes (53 Total)
All brand names are stored in Title Case format for user-friendly display.
American Brands (12)
Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, RAM
Luxury/Performance (13)
Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls Royce, Tesla, Jaguar, Audi, BMW, Land Rover
Japanese (8)
Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota
European (9)
Alfa Romeo, FIAT, MINI, Saab, Saturn, Scion, Smart, Volkswagen, Volvo
Other (11)
Genesis, Geo, Hyundai, KIA, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac
Troubleshooting
Container Not Running
docker compose up -d
docker compose ps
Database Connection Issues
Check connection parameters in etl_vehicle_data.py:
DB_CONFIG = {
'host': 'localhost',
'database': 'motovaultpro',
'user': 'postgres',
'password': 'postgres',
'port': 5432
}
Missing JSON Files
Ensure these files exist in project root:
- engines.json
- automobiles.json
- brands.json
- makes-filter/*.json (57 files)
Python Dependencies
pip3 install psycopg2-binary
Expected Results
After successful ETL:
- Engines: 30,066 records
- Transmissions: 828 records
- Vehicle Options: 1,122,644 configurations
- Years: 47 years (1980-2026)
- Makes: 53 manufacturers
- Models: 1,741 unique models
- Transmission Linking: 98.9% success rate
- Output Files: ~52MB total (632KB engines + 21KB transmissions + 51MB vehicles)
Next Steps
- Create API endpoints for dropdown queries
- Add caching layer for frequently accessed queries
- Implement full-text search for models
- Add vehicle images and detailed specs display
- Create admin interface for data management