Initial Commit

Eric Gullickson
2025-09-17 16:09:15 -05:00
parent 0cdb9803de
commit a052040e3a
373 changed files with 437090 additions and 6773 deletions

@@ -0,0 +1,203 @@
# Analysis Findings - JSON Vehicle Data
## Data Source Overview
- **Location**: `mvp-platform-services/vehicles/etl/sources/makes/`
- **File Count**: 55 JSON files
- **File Naming**: Lowercase with underscores (e.g., `alfa_romeo.json`, `land_rover.json`)
- **Data Structure**: Hierarchical vehicle data by make
## JSON File Structure Analysis
### Standard Structure
```json
{
  "[make_name]": [
    {
      "year": "2024",
      "models": [
        {
          "name": "model_name",
          "engines": [
            "2.0L I4",
            "3.5L V6 TURBO"
          ],
          "submodels": [
            "Base",
            "Premium",
            "Limited"
          ]
        }
      ]
    }
  ]
}
```
### Key Data Points
1. **Make Level**: Root key matches filename (lowercase)
2. **Year Level**: Array of yearly data
3. **Model Level**: Array of models per year
4. **Engines**: Array of engine specifications
5. **Submodels**: Array of trim levels
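
The hierarchy above can be walked with a small generator. This is a minimal sketch, assuming every file follows the standard structure; `iter_model_years` is a hypothetical helper name, and the sample document reuses the placeholder values from the structure example with `alfa_romeo` as the root key.

```python
import json
from typing import Iterator


def iter_model_years(make_doc: dict) -> Iterator[tuple[str, int, str, list[str], list[str]]]:
    """Yield (make_key, year, model_name, engines, submodels) for each model-year."""
    for make_key, year_entries in make_doc.items():
        for year_entry in year_entries:
            year = int(year_entry["year"])  # years are stored as strings, e.g. "2024"
            for model in year_entry.get("models", []):
                yield (
                    make_key,
                    year,
                    model["name"],
                    model.get("engines", []),
                    model.get("submodels", []),
                )


# Placeholder document mirroring the standard structure shown above.
sample = json.loads("""
{"alfa_romeo": [{"year": "2024", "models": [
    {"name": "model_name",
     "engines": ["2.0L I4", "3.5L V6 TURBO"],
     "submodels": ["Base", "Premium", "Limited"]}
]}]}
""")
rows = list(iter_model_years(sample))
```

Flattening to one tuple per model-year keeps the later loading steps independent of the nesting depth in the source files.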
## Make Name Analysis
### File Naming vs Display Name Issues
| Filename | Required Display Name | Issue |
|----------|---------------------|--------|
| `alfa_romeo.json` | "Alfa Romeo" | Underscore → space, title case |
| `land_rover.json` | "Land Rover" | Underscore → space, title case |
| `rolls_royce.json` | "Rolls Royce" | Underscore → space, title case |
| `chevrolet.json` | "Chevrolet" | Direct match |
| `bmw.json` | "BMW" | Uppercase required |
### Make Name Normalization Rules
1. **Replace underscores** with spaces
2. **Title case** each word
3. **Special cases**: BMW, GMC (all caps)
4. **Validation**: Cross-reference with `sources/makes.json`
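
Rules 1 to 3 can be sketched as a single function. This is an illustrative sketch, not the ETL's actual code: `display_name_from_filename` and `UPPERCASE_MAKES` are hypothetical names, and the all-caps set contains only the two makes listed above. Rule 4 (cross-referencing `sources/makes.json`) is left out because that file's layout is not described here.

```python
from pathlib import Path

# Makes whose display name is all caps (only BMW and GMC are named in the rules).
UPPERCASE_MAKES = {"bmw", "gmc"}


def display_name_from_filename(filename: str) -> str:
    """Turn a source filename such as 'alfa_romeo.json' into a display name."""
    stem = Path(filename).stem.lower()
    if stem in UPPERCASE_MAKES:
        return stem.upper()
    # Underscores become spaces and each word is title-cased.
    return " ".join(word.capitalize() for word in stem.split("_"))
```

For example, `display_name_from_filename("land_rover.json")` gives `"Land Rover"` and `display_name_from_filename("bmw.json")` gives `"BMW"`.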
## Engine Specification Analysis
### Discovered Engine Patterns
From analysis of Nissan, Toyota, Ford, Subaru, and Porsche files:
#### Standard Format: `{displacement}L {config}{cylinders}`
- `"2.0L I4"` - 2.0 liter, Inline 4-cylinder
- `"3.5L V6"` - 3.5 liter, V6 configuration
- `"2.4L H4"` - 2.4 liter, Horizontal (Boxer) 4-cylinder
#### Configuration Types Found
- **I** = Inline (most common)
- **V** = V-configuration
- **H** = Horizontal/Boxer (Subaru, Porsche)
- **L** = Appears in some engine strings; **MUST BE TREATED AS INLINE** and normalized to I (L3 → I3)
### Engine Modifier Patterns
#### Hybrid Classifications
- `"PLUG-IN HYBRID EV- (PHEV)"` - Plug-in hybrid electric vehicle
- `"FULL HYBRID EV- (FHEV)"` - Full hybrid electric vehicle
- `"HYBRID"` - General hybrid designation
#### Fuel Type Modifiers
- `"FLEX"` - Flex-fuel capability (e.g., `"5.6L V8 FLEX"`)
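- `"ELECTRIC"` - Electric designation; it also appears on combustion-engine strings such as `"1.8L I4 ELECTRIC"`, where it marks a hybrid
- `"TURBO"` - Turbocharged; an aspiration modifier rather than a fuel type (less common in current data)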
#### Example Engine Strings
```
"2.5L I4 FULL HYBRID EV- (FHEV)"
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)" // L3 → I3
"5.6L V8 FLEX"
"2.4L H4" // Subaru Boxer
"1.8L I4 ELECTRIC"
```
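
The standard format and the L→I rule can be expressed as one regular expression. The sketch below assumes engine strings are uppercase and start with `{displacement}L {config}{cylinders}`; `ENGINE_RE`, `ParsedEngine`, and `parse_engine` are hypothetical names, and everything after the cylinder count is kept as an unparsed modifier string.

```python
import re
from dataclasses import dataclass
from typing import Optional

# {displacement}L {config}{cylinders} followed by optional free-text modifiers.
ENGINE_RE = re.compile(
    r"^(?P<displacement>\d+(?:\.\d+)?)L\s+"
    r"(?P<config>[IVHL])(?P<cylinders>\d+)"
    r"(?:\s+(?P<modifiers>.+))?$"
)


@dataclass
class ParsedEngine:
    displacement_l: float
    configuration: str
    cylinders: int
    modifiers: str


def parse_engine(raw: str) -> Optional[ParsedEngine]:
    """Parse a raw engine string; return None when it does not fit the format."""
    match = ENGINE_RE.match(raw.strip())
    if match is None:
        return None
    config = match["config"]
    if config == "L":  # L must be treated as Inline (L3 -> I3)
        config = "I"
    return ParsedEngine(
        displacement_l=float(match["displacement"]),
        configuration=config,
        cylinders=int(match["cylinders"]),
        modifiers=match["modifiers"] or "",
    )
```

Returning `None` instead of raising lets the caller log unparseable strings and continue with the rest of the file.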
## Special Cases Analysis
### Electric Vehicle Handling
**Tesla Example** (`tesla.json`):
```json
{
  "name": "3",
  "engines": [], // Empty array
  "submodels": ["Long Range AWD", "Performance"]
}
```
**Lucid Example** (`lucid.json`):
```json
{
  "name": "air",
  "engines": [], // Empty array
  "submodels": []
}
```
#### Electric Vehicle Requirements
- **Empty engines arrays** are common for pure electric vehicles
- **Must create default engine**: `"Electric Motor"` with appropriate specs
- **Fuel type**: `"Electric"`
- **Configuration**: `null` or `"Electric"`
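
A minimal sketch of the default-engine rule follows. It assumes an empty `engines` array always means a pure electric vehicle, which holds for the Tesla and Lucid examples but is not guaranteed for every file. `DEFAULT_ELECTRIC_ENGINE` and `engines_or_default` are hypothetical names, and the configuration is set to `None` (the `null` option above).

```python
# Default record for models whose "engines" array is empty.
DEFAULT_ELECTRIC_ENGINE = {
    "name": "Electric Motor",
    "fuel_type": "Electric",
    "configuration": None,
}


def engines_or_default(engines: list[str]) -> list[dict]:
    """Return engine records, substituting the electric default for an empty list."""
    if not engines:
        return [dict(DEFAULT_ELECTRIC_ENGINE)]  # copy so callers cannot mutate the default
    # Non-empty lists pass through; fuel type and configuration are filled in by parsing.
    return [{"name": raw, "fuel_type": None, "configuration": None} for raw in engines]
```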
### Hybrid Vehicle Patterns
In the Toyota file, hybrid designations appear at both the engine and submodel levels:
- **Engine level**: `"1.8L I4 ELECTRIC"`
- **Submodel level**: `"Hybrid LE"`, `"Hybrid XSE"`
## Data Quality Issues Found
### Missing Engine Data
- **Tesla models**: Consistently empty engines arrays
- **Lucid models**: Empty engines arrays
- **Some Nissan models**: Empty engines for electric variants
### Inconsistent Submodel Data
- **Mix of trim levels and descriptors**
- **Some technical specifications** in submodel names
- **Inconsistent naming patterns** across makes
### Engine Specification Inconsistencies
- **L-configuration usage**: Should be normalized to I (Inline)
- **Mixed hybrid notation**: Sometimes in engine string, sometimes separate
- **Abbreviation variations**: EV- vs EV, FHEV vs FULL HYBRID
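
One way to absorb the abbreviation variations is to map every hybrid spelling to a single label before loading. This is a sketch under the assumption that the variants are limited to those listed in this document; `hybrid_kind` and its return labels are hypothetical.

```python
from typing import Optional


def hybrid_kind(engine: str) -> Optional[str]:
    """Return 'PHEV', 'FHEV', 'HYBRID', or None for a raw engine string."""
    text = engine.upper()
    # Check the most specific designations first so "HYBRID" does not shadow them.
    if "PHEV" in text or "PLUG-IN HYBRID" in text:
        return "PHEV"
    if "FHEV" in text or "FULL HYBRID" in text:
        return "FHEV"
    if "HYBRID" in text:
        return "HYBRID"
    return None
```

Because the check is substring-based, both `EV-` and `EV` spellings resolve to the same label.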
## Database Mapping Strategy
### Make Mapping
```
Filename: "alfa_romeo.json" → Database: "Alfa Romeo"
```
### Model Mapping
```
JSON models.name → vehicles.model.name
```
### Engine Mapping
```
JSON engines[i] → vehicles.engine.name (with parsing, for each entry i)
Engine parsing → displacement_l, cylinders, fuel_type, aspiration
```
### Trim Mapping
```
JSON submodels[i] → vehicles.trim.name (for each entry i)
```
## Data Volume Estimates
### File Size Analysis
- **Largest files**: `toyota.json` (~748KB), `volkswagen.json` (~738KB)
- **Smallest files**: `lucid.json` (~176B), `rivian.json` (~177B)
- **Average file size**: ~150KB
### Record Estimates (Based on Sample Analysis)
- **Makes**: 55 (one per file)
- **Models per make**: 5-50 (highly variable)
- **Years per model**: 10-15 on average
- **Trims per model-year**: 3-10 on average
- **Engines**: 500-1000 unique engines total
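
Multiplying the ranges above gives rough row-count bounds. This is back-of-the-envelope arithmetic only: it treats each range as a hard minimum and maximum, so the upper bound likely overstates the real volume. All constant names are hypothetical.

```python
# Figures copied from the estimates above; each range is treated as (min, max).
MAKES = 55
MODELS_PER_MAKE = (5, 50)
YEARS_PER_MODEL = (10, 15)
TRIMS_PER_MODEL_YEAR = (3, 10)
AVERAGE_FILE_KB = 150


def bounds(*ranges: tuple[int, int]) -> tuple[int, int]:
    """Multiply the low ends together and the high ends together."""
    low = high = 1
    for range_low, range_high in ranges:
        low *= range_low
        high *= range_high
    return low, high


makes = (MAKES, MAKES)
model_rows = bounds(makes, MODELS_PER_MAKE)                         # (275, 2750)
model_year_rows = bounds(makes, MODELS_PER_MAKE, YEARS_PER_MODEL)   # (2750, 41250)
trim_rows = bounds(makes, MODELS_PER_MAKE, YEARS_PER_MODEL, TRIMS_PER_MODEL_YEAR)  # (8250, 412500)
total_source_kb = MAKES * AVERAGE_FILE_KB                           # 8250 KB, roughly 8 MB
```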
## Processing Recommendations
### Order of Operations
1. **Load makes** - Create make records with normalized names
2. **Load models** - Associate with correct make_id
3. **Load model_years** - Create year availability
4. **Parse and load engines** - Handle L→I normalization
5. **Load trims** - Associate with model_year_id
6. **Create trim_engine relationships**
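
The load order can be sketched with in-memory tables that hand out surrogate IDs in dependency order. This is a sketch under two assumptions that the source data does not confirm: every trim of a model-year is paired with every engine of that model-year, and raw engine strings are used as keys without normalization. `build_tables` and the table names are hypothetical.

```python
def build_tables(make_name: str, year_entries: list[dict]) -> dict[str, dict]:
    """Populate make, model, model_year, engine, trim, trim_engine in that order."""
    tables: dict[str, dict] = {
        name: {} for name in ("make", "model", "model_year", "engine", "trim", "trim_engine")
    }

    def get_or_create(table: str, key: tuple) -> int:
        rows = tables[table]
        if key not in rows:
            rows[key] = len(rows) + 1  # next surrogate ID for this table
        return rows[key]

    make_id = get_or_create("make", (make_name,))
    for entry in year_entries:
        year = int(entry["year"])
        for model in entry.get("models", []):
            model_id = get_or_create("model", (make_id, model["name"]))
            model_year_id = get_or_create("model_year", (model_id, year))
            engine_ids = [get_or_create("engine", (raw,)) for raw in model.get("engines", [])]
            for trim_name in model.get("submodels", []):
                trim_id = get_or_create("trim", (model_year_id, trim_name))
                # Assumption: each trim is available with each listed engine.
                for engine_id in engine_ids:
                    get_or_create("trim_engine", (trim_id, engine_id))
    return tables
```

Each parent ID exists before any child row references it, which is the property the ordering above is meant to guarantee.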
### Error Handling Requirements
- **Handle empty engines arrays** (electric vehicles)
- **Validate engine parsing** (log unparseable engines)
- **Handle duplicate records** (upsert strategy)
- **Report data quality issues** (missing data, parsing failures)
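
The first, second, and fourth requirements can be combined in a small audit step that counts issues instead of aborting the load. This is a sketch only: `audit_engines`, the logger name, and the counter keys are hypothetical, and the pattern accepts just the standard `{displacement}L {config}{cylinders}` format with optional trailing modifiers. Upsert handling is not shown because it depends on the target database.

```python
import logging
import re
from collections import Counter

logger = logging.getLogger("etl.quality")  # hypothetical logger name

# Standard format: displacement, configuration letter, cylinder count, optional modifiers.
ENGINE_RE = re.compile(r"^\d+(?:\.\d+)?L\s+[IVHL]\d+(?:\s+.+)?$")


def audit_engines(make: str, model: str, engines: list[str], report: Counter) -> list[str]:
    """Return the parseable engine strings and record issues in `report`."""
    if not engines:
        report["empty_engines"] += 1
        logger.info("No engines listed for %s %s", make, model)
        return []
    parseable = []
    for raw in engines:
        if ENGINE_RE.match(raw):
            parseable.append(raw)
        else:
            report["unparseable_engines"] += 1
            logger.warning("Unparseable engine %r for %s %s", raw, make, model)
    return parseable
```

A single `Counter` shared across all files gives the totals needed for the data quality report at the end of a run.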
## Validation Strategy
- **Cross-reference makes** with existing `sources/makes.json`
- **Validate engine parsing** with regex patterns
- **Check referential integrity** during loading
- **Report statistics** per make (models, engines, trims loaded)
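
Two of these checks can be sketched without knowing the database layer. The sketch assumes the known make names have already been read from `sources/makes.json` into a set (that file's layout is not described here, so loading it is omitted); `find_unknown_makes` and `make_statistics` are hypothetical names.

```python
def find_unknown_makes(display_names: list[str], known_makes: set[str]) -> list[str]:
    """Return normalized display names that have no exact match among known makes."""
    return sorted(set(display_names) - known_makes)


def make_statistics(year_entries: list[dict]) -> dict[str, int]:
    """Count distinct models, engine strings, and per-year trims for one make."""
    models, engines, trims = set(), set(), set()
    for entry in year_entries:
        for model in entry.get("models", []):
            models.add(model["name"])
            engines.update(model.get("engines", []))
            # A trim is counted once per model and year.
            trims.update(
                (model["name"], entry["year"], trim) for trim in model.get("submodels", [])
            )
    return {"models": len(models), "engines": len(engines), "trims": len(trims)}
```

The exact-match comparison is deliberate: a name such as `"Bmw"` is reported as unknown, which surfaces a missing special case in the normalization rules.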