# Automotive Vehicle Selection Database - ETL Documentation

## Overview

This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection:
**Year → Make → Model → Trim → Engine/Transmission**

## Database Schema

### Tables

1. **engines** - Detailed engine specifications
   - Displacement, configuration, horsepower, torque
   - Fuel type, fuel system, aspiration
   - Full specs stored as JSONB

2. **transmissions** - Transmission specifications
   - Type (Manual, Automatic, CVT, etc.)
   - Number of speeds
   - Drive type (FWD, RWD, AWD, 4WD)

3. **vehicle_options** - Denormalized vehicle configurations
   - Year, Make, Model, Trim
   - Foreign keys to engines and transmissions
   - Optimized indexes for dropdown queries

### Views

- `available_years` - All distinct years
- `makes_by_year` - Makes grouped by year
- `models_by_year_make` - Models grouped by year/make
- `trims_by_year_make_model` - Trims grouped by year/make/model
- `complete_vehicle_configs` - Full vehicle details with engine/transmission

### Functions

- `get_makes_for_year(year)` - Returns makes for a specific year
- `get_models_for_year_make(year, make)` - Returns models for year/make
- `get_trims_for_year_make_model(year, make, model)` - Returns trims
- `get_options_for_vehicle(year, make, model, trim)` - Returns engine/transmission options

## Data Sources

### Primary Source
**`makes-filter/*.json`** (57 makes)
- Filtered manufacturer data
- Year/model/trim/engine hierarchy
- Engine specs as simple strings (e.g., "2.0L I4")

### Detailed Specs
**`engines.json`** (30,066+ records)
- Complete engine specifications
- Performance data, fuel economy
- Transmission details

**`automobiles.json`** (7,207 models)
- Model descriptions
- Used for hybrid backfill of recent years (2023-2025)

**`brands.json`** (124 brands)
- Brand metadata
- Used for brand name mapping

## ETL Process

### Step 1: Import Engine & Transmission Specs
- Parse all records from `engines.json`
- Extract detailed specifications
- Create engines and transmissions tables
- Build in-memory caches for fast lookups

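The in-memory cache from Step 1 could look roughly like the sketch below. It is an illustration only: the row keys `id`, `displacement`, and `configuration` are assumed names, not confirmed fields of `engines.json` or the `engines` table.

```python
from typing import Dict, Iterable, List, Tuple

# Cache key: (displacement, configuration), e.g. ("2.0L", "I4").
EngineKey = Tuple[str, str]


def build_engine_cache(rows: Iterable[dict]) -> Dict[EngineKey, List[int]]:
    """Group engine ids by (displacement, configuration) for constant-time lookups.

    The dict keys read from each row are hypothetical; adjust them to the
    actual columns produced by the import step.
    """
    cache: Dict[EngineKey, List[int]] = {}
    for row in rows:
        key = (row["displacement"].upper(), row["configuration"].upper())
        cache.setdefault(key, []).append(row["id"])
    return cache
```

Each key maps to a list because several detailed engine records may share the same displacement and configuration.
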
### Step 2: Process Makes-Filter Data
- Read all 57 JSON files from `makes-filter/`
- Extract year/make/model/trim/engine combinations
- Match engine strings to detailed specs using displacement + configuration
- Build vehicle_options records

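Extracting the combinations in Step 2 amounts to flattening a nested hierarchy. The sketch below assumes a hypothetical layout (a list of year entries, each holding `models` with `name`, `trims`, and `engines` keys); the real structure of the `makes-filter/` files may differ.

```python
from typing import Iterator, List, Tuple

# (year, make, model, trim, engine_string)
Combo = Tuple[int, str, str, str, str]


def iter_combinations(make: str, years: List[dict]) -> Iterator[Combo]:
    """Yield one tuple per year/model/trim/engine combination.

    The nested keys ("year", "models", "name", "trims", "engines") are
    assumptions made for this illustration.
    """
    for year_entry in years:
        for model in year_entry["models"]:
            for trim in model["trims"]:
                for engine in model["engines"]:
                    yield (year_entry["year"], make, model["name"], trim, engine)
```

Collecting the output in a `set` is one simple way to drop duplicate combinations before building `vehicle_options` records.
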
### Step 3: Hybrid Backfill
- Check `automobiles.json` for recent years (2023-2025)
- Add any missing year/make/model combinations
- Only backfill for the 57 filtered makes
- Limit to 3 engines per backfilled model

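The backfill rules in Step 3 can be expressed as a small filter. This is a sketch under assumptions: the candidate dict keys and the function name `plan_backfill` are hypothetical, and the real script may order or select engines differently.

```python
from typing import Iterable, List, Set, Tuple

BACKFILL_YEARS = range(2023, 2026)  # 2023-2025 inclusive
MAX_ENGINES_PER_MODEL = 3

ModelKey = Tuple[int, str, str]  # (year, make, model)


def plan_backfill(
    candidates: Iterable[dict],
    existing: Set[ModelKey],
    filtered_makes: Set[str],
) -> List[dict]:
    """Return the candidates that should be backfilled, with engines capped.

    Each candidate is assumed to carry "year", "make", "model", and "engines".
    """
    planned = []
    for candidate in candidates:
        key = (candidate["year"], candidate["make"], candidate["model"])
        if candidate["year"] not in BACKFILL_YEARS:
            continue  # outside the recent-year window
        if candidate["make"] not in filtered_makes:
            continue  # not one of the filtered makes
        if key in existing:
            continue  # already covered by the makes-filter data
        capped = candidate["engines"][:MAX_ENGINES_PER_MODEL]
        planned.append({**candidate, "engines": capped})
    return planned
```
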
### Step 4: Insert Vehicle Options
- Batch insert all vehicle_options records
- Create indexes for optimal query performance
- Generate views and functions

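A batched insert for Step 4 might be structured as below. The column list is an assumption based on the schema description, and `cur` is any DB-API cursor (for example one from psycopg2); this is not necessarily how `etl_vehicle_data.py` does it.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence

# Column names are assumed from the schema description, not taken from the migration.
INSERT_SQL = (
    "INSERT INTO vehicle_options "
    "(year, make, model, trim, engine_id, transmission_id) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def batched(rows: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_vehicle_options(cur, rows: Iterable[Sequence], batch_size: int = 1000) -> int:
    """Insert rows in batches through a DB-API cursor and return the row count."""
    total = 0
    for batch in batched(rows, batch_size):
        cur.executemany(INSERT_SQL, batch)
        total += len(batch)
    return total
```

With psycopg2, `psycopg2.extras.execute_values` is a commonly used alternative to `executemany` when insert speed matters.
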
### Step 5: Validation
- Count records in each table
- Test dropdown cascade queries
- Display sample data

## Running the ETL

### Prerequisites
- Docker container `mvp-postgres` running
- Python 3 with psycopg2
- JSON source files in project root

### Quick Start
```bash
./run_migration.sh
```

### Manual Steps
```bash
# 1. Run migration (-T disables TTY allocation so the SQL file can be redirected to stdin)
docker compose exec -T mvp-postgres psql -U postgres -d motovaultpro < migrations/001_create_vehicle_database.sql

# 2. Install Python dependencies
pip3 install psycopg2-binary

# 3. Run ETL script
python3 etl_vehicle_data.py
```

## Query Examples

### Get all available years
```sql
SELECT * FROM available_years;
```

### Get makes for 2024
```sql
SELECT * FROM get_makes_for_year(2024);
```

### Get models for 2024 Ford
```sql
SELECT * FROM get_models_for_year_make(2024, 'Ford');
```

### Get trims for 2024 Ford F-150
```sql
SELECT * FROM get_trims_for_year_make_model(2024, 'Ford', 'F-150');
```

### Get engine/transmission options for specific vehicle
```sql
SELECT * FROM get_options_for_vehicle(2024, 'Ford', 'F-150', 'XLT');
```

### Complete vehicle configurations
```sql
SELECT * FROM complete_vehicle_configs
WHERE year = 2024 AND make = 'Tesla'
ORDER BY model, trim;
```

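From application code, the same cascade queries would normally be issued with bound parameters rather than literals. The sketch below shows one way to do that with any DB-API cursor (such as a psycopg2 cursor); the `CASCADE_QUERIES` mapping and `fetch_level` helper are hypothetical names, not part of the project.

```python
from typing import Any, List, Tuple

# One parameterized statement per dropdown level, using the functions above.
CASCADE_QUERIES = {
    "makes": "SELECT * FROM get_makes_for_year(%s)",
    "models": "SELECT * FROM get_models_for_year_make(%s, %s)",
    "trims": "SELECT * FROM get_trims_for_year_make_model(%s, %s, %s)",
    "options": "SELECT * FROM get_options_for_vehicle(%s, %s, %s, %s)",
}


def fetch_level(cur, level: str, *params: Any) -> List[Tuple]:
    """Run the query for one dropdown level and return all rows.

    Raises KeyError for an unknown level. Values are passed as bound
    parameters, so user input is never interpolated into the SQL text.
    """
    cur.execute(CASCADE_QUERIES[level], params)
    return cur.fetchall()
```

For example, `fetch_level(cur, "trims", 2024, "Ford", "F-150")` would correspond to the trims query shown above.
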
## Performance Optimization

### Indexes Created
- `idx_vehicle_year` - Single column index on year
- `idx_vehicle_make` - Single column index on make
- `idx_vehicle_model` - Single column index on model
- `idx_vehicle_year_make` - Composite index for year/make queries
- `idx_vehicle_year_make_model` - Composite index for year/make/model queries
- `idx_vehicle_year_make_model_trim` - Composite index for full cascade

### Query Performance
With these indexes in place, dropdown queries are expected to return results in under 50 ms for typical datasets.

## Data Matching Logic

### Engine Matching
The ETL uses pattern matching to link the simple engine strings from makes-filter to the detailed specs:

1. **Parse engine string**: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
2. **Normalize**: Convert to uppercase, standardize format
3. **Match to cache**: Look up in engine cache by (displacement, configuration)
4. **Handle variations**: Account for I4/L4, V6/V-6, etc.

### Configuration Equivalents
- `I4` = `L4` = `INLINE-4`
- `V6` = `V-6`
- `V8` = `V-8`

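The parse and normalize steps above can be sketched as follows. The regular expression and the `CONFIG_ALIASES` table are assumptions built from the examples in this section; the actual matching code in `etl_vehicle_data.py` may handle more formats.

```python
import re
from typing import Optional, Tuple

# Hypothetical alias table built from the equivalents listed above.
CONFIG_ALIASES = {
    "L4": "I4",
    "INLINE-4": "I4",
    "V-6": "V6",
    "V-8": "V8",
}

# A displacement such as "2.0L", then a configuration token such as "I4" or "V-6".
ENGINE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*L\s+([A-Z]+-?\d+)")


def parse_engine_string(raw: str) -> Optional[Tuple[str, str]]:
    """Return a normalized (displacement, configuration) key, or None if unparseable."""
    match = ENGINE_PATTERN.search(raw.upper())
    if match is None:
        return None
    # Standardize displacement to one decimal place, e.g. "5L" becomes "5.0L".
    displacement = f"{float(match.group(1)):.1f}L"
    config = match.group(2)
    return displacement, CONFIG_ALIASES.get(config, config)
```

The returned tuple can then be used directly as the lookup key into the engine cache; strings that do not parse (returning `None`) need separate handling.
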
## Filtered Makes (57 Total)

The group counts sum to 57; the lists below name 54 of them, so the files in `makes-filter/` remain the authoritative list.

### American Brands (12)
Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, Ram

### Luxury/Performance (13)
Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls-Royce, Tesla, Jaguar, Audi, BMW, Land Rover

### Japanese (7)
Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota

### European (13)
Alfa Romeo, Fiat, Mini, Saab, Saturn, Scion, Smart, Volkswagen, Volvo

### Other (12)
Genesis, Geo, Hyundai, Kia, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac

## Troubleshooting

### Container Not Running
```bash
docker compose up -d
docker compose ps
```

### Database Connection Issues
Check connection parameters in `etl_vehicle_data.py`:
```python
DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}
```

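If the hard-coded values do not match your environment, one option is to let environment variables override them. The sketch below is a suggestion, not the script's current behavior, and the `MVP_DB_*` variable names are invented for this example.

```python
import os
from typing import Mapping, Optional


def load_db_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build connection parameters, falling back to the documented defaults.

    The MVP_DB_* environment variable names are hypothetical.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("MVP_DB_HOST", "localhost"),
        "database": env.get("MVP_DB_NAME", "motovaultpro"),
        "user": env.get("MVP_DB_USER", "postgres"),
        "password": env.get("MVP_DB_PASSWORD", "postgres"),
        "port": int(env.get("MVP_DB_PORT", "5432")),
    }
```

The resulting dict has the same keys as `DB_CONFIG`, so it could be passed to `psycopg2.connect(**config)` in the same way.
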
### Missing JSON Files
Ensure these files exist in project root:
- `engines.json`
- `automobiles.json`
- `brands.json`
- `makes-filter/*.json` (57 files)

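A quick preflight check for the files listed above could look like this. It is a minimal sketch; the helper name `find_missing_inputs` is hypothetical and not part of the ETL script.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = ("engines.json", "automobiles.json", "brands.json")
EXPECTED_MAKE_FILES = 57


def find_missing_inputs(root: Path) -> List[str]:
    """Return a list of problems with the source files; empty means all present."""
    problems = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    make_dir = root / "makes-filter"
    make_files = sorted(make_dir.glob("*.json")) if make_dir.is_dir() else []
    if len(make_files) != EXPECTED_MAKE_FILES:
        problems.append(
            f"makes-filter/*.json: found {len(make_files)}, expected {EXPECTED_MAKE_FILES}"
        )
    return problems
```

Running `find_missing_inputs(Path("."))` from the project root before the ETL gives a readable list of what is missing instead of a mid-run failure.
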
### Python Dependencies
```bash
pip3 install psycopg2-binary
```

## Expected Results

After successful ETL:
- **Engines**: ~30,000 records
- **Transmissions**: ~500-1,000 unique combinations
- **Vehicle Options**: ~50,000-100,000 configurations
- **Years**: 10-15 distinct years
- **Makes**: 57 manufacturers
- **Models**: 1,000-2,000 unique models

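These ranges can be turned into an automated sanity check once the counts have been queried. The sketch below is illustrative: the dictionary keys are assumed names, and engines are left out because the figure above is approximate rather than a range.

```python
from typing import Dict, Mapping, Optional

# (low, high) bounds copied from the list above; the key names are hypothetical.
EXPECTED_RANGES = {
    "transmissions": (500, 1_000),
    "vehicle_options": (50_000, 100_000),
    "years": (10, 15),
    "makes": (57, 57),
    "models": (1_000, 2_000),
}


def out_of_range(counts: Mapping[str, int]) -> Dict[str, Optional[int]]:
    """Return the counts that are missing or fall outside the expected bounds."""
    problems: Dict[str, Optional[int]] = {}
    for name, (low, high) in EXPECTED_RANGES.items():
        value = counts.get(name)
        if value is None or not low <= value <= high:
            problems[name] = value
    return problems
```

An empty result means every count is within the expected bounds; anything else is worth investigating before relying on the data.
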
## Next Steps

1. Create API endpoints for dropdown queries
2. Add caching layer for frequently accessed queries
3. Implement full-text search for models
4. Add vehicle images and detailed specs display
5. Create admin interface for data management