# Automotive Vehicle Selection Database - ETL Documentation
## Overview
This ETL pipeline creates a PostgreSQL database optimized for cascading dropdown vehicle selection:
**Year → Make → Model → Trim → Engine/Transmission**
## Database Schema
### Tables
1. **engines** - Simplified engine specifications
   - id (Primary Key)
   - name (Display format: "V8 3.5L", "L4 2.0L Turbo", "V6 6.2L Supercharged")
2. **transmissions** - Simplified transmission specifications
   - id (Primary Key)
   - type (Display format: "8-Speed Automatic", "6-Speed Manual", "CVT")
3. **vehicle_options** - Denormalized vehicle configurations
   - Year, Make (Title Case: "Ford", "Acura", "Land Rover"), Model, Trim
   - Foreign keys to engines and transmissions
   - Optimized indexes for dropdown queries
### Views
- `available_years` - All distinct years
- `makes_by_year` - Makes grouped by year
- `models_by_year_make` - Models grouped by year/make
- `trims_by_year_make_model` - Trims grouped by year/make/model
- `complete_vehicle_configs` - Full vehicle details with engine/transmission
### Functions
- `get_makes_for_year(year)` - Returns makes for a specific year
- `get_models_for_year_make(year, make)` - Returns models for year/make
- `get_trims_for_year_make_model(year, make, model)` - Returns trims
- `get_options_for_vehicle(year, make, model, trim)` - Returns engine/transmission options
## Data Sources
### Primary Source
**makes-filter/*.json** (57 makes)
- Filtered manufacturer data
- Year/model/trim/engine hierarchy
- Engine specs as simple strings (e.g., "2.0L I4")
### Detailed Specs
**engines.json** (30,066+ records)
- Complete engine specifications
- Performance data, fuel economy
- Transmission details
**automobiles.json** (7,207 models)
- Model descriptions
- Used for hybrid backfill of recent years (2023-2025)
**brands.json** (124 brands)
- Brand metadata
- Used for brand name mapping
## ETL Process
### Step 1: Load Source Data
- Load `engines.json` (30,066 records)
- Load `brands.json` (124 brands)
- Load `automobiles.json` (7,207 models)
- Load all `makes-filter/*.json` files (55 files)
### Step 2: Transform Brand Names
- Convert ALL CAPS brand names to Title Case ("FORD" → "Ford")
- Preserve acronyms (BMW, GMC, KIA remain uppercase)
- Handle special cases (DeLorean, McLaren)
### Step 3: Process Engine Specifications
- Extract engine specs from engines.json
- Create simplified display names (e.g., "V8 3.5L Turbo")
- Normalize displacement (Cm3 → Liters) for matching
- Build engine cache with (displacement, configuration) keys
- Generate engines SQL with only id and name columns
### Step 4: Process Transmission Specifications
- Extract transmission specs from engines.json
- Create simplified display names (e.g., "8-Speed Automatic")
- Parse speed count and transmission type
- Build transmission cache for linking
- Generate transmissions SQL with only id and type columns
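
The parsing in this step can be sketched roughly as follows. This is a minimal illustration, not the code in the ETL script: the raw gearbox strings in `engines.json` are assumed to look like "8-speed automatic" or "CVT", and `parse_transmission` and `transmission_display` are hypothetical names.

```python
import re
from typing import Optional, Tuple


def parse_transmission(raw: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (speed count, type) parsed from a raw gearbox string.

    Assumption: raw values resemble "8-speed automatic", "6 Speed Manual", "CVT".
    """
    lowered = raw.strip().lower()
    speeds_match = re.search(r"(\d{1,2})\s*-?\s*speed", lowered)
    speeds = int(speeds_match.group(1)) if speeds_match else None
    if "cvt" in lowered or "continuously variable" in lowered:
        kind = "CVT"
    elif "manual" in lowered:
        kind = "Manual"
    elif "auto" in lowered:
        kind = "Automatic"
    else:
        kind = None
    return speeds, kind


def transmission_display(raw: str) -> Optional[str]:
    """Build the simplified display name, or None if the type is unknown."""
    speeds, kind = parse_transmission(raw)
    if kind is None:
        return None
    if kind == "CVT" or speeds is None:
        return kind
    return f"{speeds}-Speed {kind}"
```

Returning `None` for unrecognized strings lets the caller leave the transmission unlinked instead of guessing.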
### Step 5: Process Makes-Filter Data
- Read all JSON files from `makes-filter/`
- Extract year/make/model/trim/engine combinations
- Match engine strings to detailed specs using displacement + configuration
- Link transmissions to vehicle records (98.9% success rate)
- Apply year filter (1980 and newer only)
- Build vehicle_options records
### Step 6: Hybrid Backfill
- Check `automobiles.json` for recent years (2023-2025)
- Add any missing year/make/model combinations
- Only backfill for filtered makes
- Link transmissions for backfilled records
- Limit to 3 engines per backfilled model
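
The backfill rules above could look something like this sketch. The layout of `automobiles.json` is not described here, so the candidates are assumed to be already flattened into `(year, make, model, engines)` tuples; `backfill_recent_models` is a hypothetical name, and transmission linking is left out.

```python
from typing import Dict, Iterable, List, Set, Tuple

ModelKey = Tuple[int, str, str]  # (year, make, model)


def backfill_recent_models(
    existing: Set[ModelKey],
    candidates: Iterable[Tuple[int, str, str, List[str]]],
    allowed_makes: Set[str],
    years: range = range(2023, 2026),  # 2023-2025 inclusive
    max_engines: int = 3,
) -> Dict[ModelKey, List[str]]:
    """Pick year/make/model combinations missing from the primary source."""
    added: Dict[ModelKey, List[str]] = {}
    for year, make, model, engines in candidates:
        key = (year, make, model)
        if year not in years or make not in allowed_makes:
            continue  # outside the backfill window or not a filtered make
        if key in existing or key in added:
            continue  # already covered
        added[key] = list(engines)[:max_engines]  # cap engines per model
    return added
```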
### Step 7: Generate SQL Output
- Write SQL files with proper escaping (newlines, quotes, special characters)
- Convert empty strings to NULL for data integrity
- Use batched inserts (1000 records per batch)
- Output to `output/` directory
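
As a rough illustration of this step, the sketch below escapes values, maps empty strings to NULL, and groups rows into multi-row `INSERT` statements. It is an assumption about how the output could be produced, not the script's actual code: `sql_literal` and `batched_inserts` are hypothetical names, and collapsing newlines into spaces is one possible way to handle them.

```python
from typing import Iterable, Iterator, List, Sequence, Union

Value = Union[str, int, None]


def sql_literal(value: Value) -> str:
    """Render one value as a SQL literal; empty strings become NULL."""
    if value is None:
        return "NULL"
    if isinstance(value, int):
        return str(value)
    cleaned = " ".join(value.split())  # collapse newlines, tabs, repeated spaces
    if not cleaned:
        return "NULL"
    return "'" + cleaned.replace("'", "''") + "'"  # double single quotes


def batched_inserts(
    table: str,
    columns: Sequence[str],
    rows: Iterable[Sequence[Value]],
    batch_size: int = 1000,
) -> Iterator[str]:
    """Yield one multi-row INSERT statement per batch of rows."""
    header = f"INSERT INTO {table} ({', '.join(columns)}) VALUES"
    batch: List[str] = []
    for row in rows:
        batch.append("(" + ", ".join(sql_literal(v) for v in row) + ")")
        if len(batch) == batch_size:
            yield header + "\n" + ",\n".join(batch) + ";"
            batch = []
    if batch:  # flush the final partial batch
        yield header + "\n" + ",\n".join(batch) + ";"
```

Batching keeps each statement to a bounded size while avoiding the overhead of one `INSERT` per row.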
## Running the ETL
### Prerequisites
- Docker container `mvp-postgres` running
- Python 3 (no additional dependencies required)
- JSON source files in project root
### Quick Start
```bash
# Step 1: Generate SQL files from JSON data
python3 etl_generate_sql.py
# Step 2: Import SQL files into database
./import_data.sh
```
### What Gets Generated
- `output/01_engines.sql` (~632KB, 30,066 records)
- `output/02_transmissions.sql` (~21KB, 828 records)
- `output/03_vehicle_options.sql` (~51MB, 1,122,644 records)
## Query Examples
### Get all available years
```sql
SELECT * FROM available_years;
```
### Get makes for 2024
```sql
SELECT * FROM get_makes_for_year(2024);
```
### Get models for 2025 Ford
```sql
SELECT * FROM get_models_for_year_make(2025, 'Ford');
```
### Get trims for 2025 Ford F-150
```sql
SELECT * FROM get_trims_for_year_make_model(2025, 'Ford', 'f-150');
```
### Get engine/transmission options for specific vehicle
```sql
SELECT * FROM get_options_for_vehicle(2025, 'Ford', 'f-150', 'XLT');
```
### Complete vehicle configurations
```sql
SELECT * FROM complete_vehicle_configs
WHERE year = 2025 AND make = 'Ford' AND model = 'f-150'
LIMIT 10;
```
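
Application code can drive the cascade by choosing the next query from whatever the user has selected so far. The sketch below only builds the SQL text and parameters; it assumes a driver that uses `%s` placeholders (as psycopg2 does), and `next_dropdown_query` is a hypothetical helper, not part of this project.

```python
from typing import Optional, Tuple


def next_dropdown_query(
    year: Optional[int] = None,
    make: Optional[str] = None,
    model: Optional[str] = None,
    trim: Optional[str] = None,
) -> Tuple[str, tuple]:
    """Return (sql, params) for the next dropdown in the cascade."""
    if year is None:
        return "SELECT * FROM available_years", ()
    if make is None:
        return "SELECT * FROM get_makes_for_year(%s)", (year,)
    if model is None:
        return "SELECT * FROM get_models_for_year_make(%s, %s)", (year, make)
    if trim is None:
        return (
            "SELECT * FROM get_trims_for_year_make_model(%s, %s, %s)",
            (year, make, model),
        )
    return (
        "SELECT * FROM get_options_for_vehicle(%s, %s, %s, %s)",
        (year, make, model, trim),
    )
```

Passing the selections as parameters rather than formatting them into the SQL string avoids quoting problems with model names that contain apostrophes.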
## Performance Optimization
### Indexes Created
- `idx_vehicle_year` - Single column index on year
- `idx_vehicle_make` - Single column index on make
- `idx_vehicle_model` - Single column index on model
- `idx_vehicle_year_make` - Composite index for year/make queries
- `idx_vehicle_year_make_model` - Composite index for year/make/model queries
- `idx_vehicle_year_make_model_trim` - Composite index for full cascade
### Query Performance
With the indexes above, dropdown queries are expected to return in under 50 ms on typical datasets.
## Data Matching Logic
### Brand Name Transformation
- Source data (brands.json) stores names in ALL CAPS: "FORD", "ACURA", "ALFA ROMEO"
- ETL converts to Title Case: "Ford", "Acura", "Alfa Romeo"
- Preserves acronyms: BMW, GMC, KIA, MINI, FIAT, RAM
- Special cases: DeLorean, McLaren
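
The rules above can be sketched as a small function. The acronym and special-case lists are taken from this section; `normalize_brand` is a hypothetical name, and the real script may handle more cases (hyphenated names, for instance, are not treated specially here).

```python
# Lists taken from this README; the actual ETL may contain more entries.
ACRONYMS = {"BMW", "GMC", "KIA", "MINI", "FIAT", "RAM"}
SPECIAL_CASES = {"DELOREAN": "DeLorean", "MCLAREN": "McLaren"}


def normalize_brand(raw: str) -> str:
    """Convert an ALL CAPS brand name to its display form."""
    name = " ".join(raw.split()).upper()  # trim and collapse whitespace
    if name in ACRONYMS:
        return name
    if name in SPECIAL_CASES:
        return SPECIAL_CASES[name]
    # Title-case each space-separated word: "ALFA ROMEO" -> "Alfa Romeo"
    return " ".join(word.capitalize() for word in name.split(" "))
```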
### Engine Matching
The ETL uses pattern matching to link the simple engine strings from makes-filter to the detailed specs in engines.json:
1. **Parse engine string**: Extract displacement (e.g., "2.0L") and configuration (e.g., "I4")
2. **Normalize displacement**: Convert Cm3 to Liters ("3506 Cm3" → "3.5L")
3. **Match to cache**: Look up in engine cache by (displacement, configuration)
4. **Create display name**: Format as "V8 3.5L", "L4 2.0L Turbo", etc.
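
The four steps above might be sketched as follows. This is an illustration under assumptions, not the script's implementation: all function names are hypothetical, the cache is assumed to be a plain dict keyed by `(displacement, configuration)`, and only the `I`/`L`/`V` layouts are recognized.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

EngineKey = Tuple[str, str]  # (displacement, configuration), e.g. ("2.0L", "L4")


def cm3_to_liters(text: str) -> Optional[str]:
    """Convert a string such as "3506 Cm3" to "3.5L"."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*cm3", text, re.IGNORECASE)
    if match is None:
        return None
    return f"{float(match.group(1)) / 1000:.1f}L"


def parse_engine_string(text: str) -> Optional[EngineKey]:
    """Extract a cache key from a simple string such as "2.0L I4"."""
    disp = re.search(r"(\d+(?:\.\d+)?)\s*L\b", text, re.IGNORECASE)
    conf = re.search(r"\b([ILV])-?(\d{1,2})\b", text, re.IGNORECASE)
    if disp is None or conf is None:
        return None
    layout = conf.group(1).upper().replace("I", "L")  # treat I4 as L4
    return (f"{float(disp.group(1)):.1f}L", f"{layout}{conf.group(2)}")


def build_engine_cache(
    records: Iterable[Tuple[int, str, str]]
) -> Dict[EngineKey, int]:
    """Map (displacement, configuration) to the first matching engine id."""
    cache: Dict[EngineKey, int] = {}
    for engine_id, displacement_cm3, configuration in records:
        liters = cm3_to_liters(displacement_cm3)
        if liters is not None:
            cache.setdefault((liters, configuration), engine_id)
    return cache


def display_name(key: EngineKey, aspiration: str = "") -> str:
    """Format as "<configuration> <displacement> [aspiration]"."""
    displacement, configuration = key
    return " ".join(p for p in (configuration, displacement, aspiration) if p)
```

A lookup is then `cache.get(parse_engine_string("2.0L I4"))`, which returns `None` when either parsing or matching fails.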
### Transmission Linking
- Transmission data is embedded in engines.json under "Transmission Specs"
- Each engine record includes gearbox type (e.g., "6-Speed Manual")
- ETL links transmissions to vehicle records based on engine match
- Success rate: 98.9% (1,109,510 of 1,122,644 records)
- Unlinked records: primarily electric vehicles without traditional transmissions
### Configuration Equivalents
- `I4` = `L4` = `INLINE-4` = `4 Inline`
- `V6` = `V-6`
- `V8` = `V-8`
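
One way to apply these equivalents is to reduce every spelling to a single canonical form before lookup. In the sketch below the canonical inline form is assumed to be `L<n>`, since the display names in this document use "L4"; `normalize_configuration` is a hypothetical name.

```python
import re
from typing import Optional


def normalize_configuration(raw: str) -> Optional[str]:
    """Map equivalent spellings to one canonical form, e.g. "INLINE-4" -> "L4"."""
    text = raw.strip().upper()
    # Count-first spelling: "4 Inline"
    match = re.fullmatch(r"(\d{1,2})\s*INLINE", text)
    if match:
        return f"L{match.group(1)}"
    # Layout-first spellings: "I4", "L4", "INLINE-4", "V6", "V-6"
    match = re.fullmatch(r"(INLINE|I|L|V)[\s-]*(\d{1,2})", text)
    if match is None:
        return None  # unrecognized layout; caller decides how to handle it
    layout = "V" if match.group(1) == "V" else "L"
    return f"{layout}{match.group(2)}"
```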
## Filtered Makes (53 Total)
All brand names are stored in Title Case format for user-friendly display.
### American Brands (12)
Acura, Buick, Cadillac, Chevrolet, Chrysler, Dodge, Ford, GMC, Hummer, Jeep, Lincoln, RAM
### Luxury/Performance (13)
Aston Martin, Bentley, Ferrari, Lamborghini, Maserati, McLaren, Porsche, Rolls Royce, Tesla, Jaguar, Audi, BMW, Land Rover
### Japanese (8)
Honda, Infiniti, Lexus, Mazda, Mitsubishi, Nissan, Subaru, Toyota
### European (9)
Alfa Romeo, FIAT, MINI, Saab, Saturn, Scion, Smart, Volkswagen, Volvo
### Other (12)
Genesis, Geo, Hyundai, KIA, Lucid, Polestar, Rivian, Lotus, Mercury, Oldsmobile, Plymouth, Pontiac
## Troubleshooting
### Container Not Running
```bash
docker compose up -d
docker compose ps
```
### Database Connection Issues
Check connection parameters in `etl_vehicle_data.py`:
```python
DB_CONFIG = {
    'host': 'localhost',
    'database': 'motovaultpro',
    'user': 'postgres',
    'password': 'postgres',
    'port': 5432
}
```
### Missing JSON Files
Ensure these files exist in project root:
- `engines.json`
- `automobiles.json`
- `brands.json`
- `makes-filter/*.json` (57 files)
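
A quick pre-flight check for the files listed above could look like this sketch. `missing_sources` is a hypothetical helper, and it only verifies that at least one file exists under `makes-filter/`, not the exact count.

```python
from pathlib import Path
from typing import List

REQUIRED_FILES = ("engines.json", "automobiles.json", "brands.json")


def missing_sources(root: Path) -> List[str]:
    """Return the required source files that are absent under `root`."""
    problems = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    makes_dir = root / "makes-filter"
    # Require the directory and at least one JSON file inside it
    if not (makes_dir.is_dir() and any(makes_dir.glob("*.json"))):
        problems.append("makes-filter/*.json")
    return problems
```

Calling `missing_sources(Path("."))` from the project root returns an empty list when everything is in place.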
### Python Dependencies
```bash
pip3 install psycopg2-binary
```
## Expected Results
After successful ETL:
- **Engines**: 30,066 records
- **Transmissions**: 828 records
- **Vehicle Options**: 1,122,644 configurations
- **Years**: 47 years (1980-2026)
- **Makes**: 53 manufacturers
- **Models**: 1,741 unique models
- **Transmission Linking**: 98.9% success rate
- **Output Files**: ~52MB total (632KB engines + 21KB transmissions + 51MB vehicles)
## Next Steps
1. Create API endpoints for dropdown queries
2. Add caching layer for frequently accessed queries
3. Implement full-text search for models
4. Add vehicle images and detailed specs display
5. Create admin interface for data management