# Analysis Findings - JSON Vehicle Data

## Data Source Overview
- **Location**: `mvp-platform-services/vehicles/etl/sources/makes/`
- **File Count**: 55 JSON files
- **File Naming**: Lowercase with underscores (e.g., `alfa_romeo.json`, `land_rover.json`)
- **Data Structure**: Hierarchical vehicle data by make

## JSON File Structure Analysis

### Standard Structure
```json
{
  "[make_name]": [
    {
      "year": "2024",
      "models": [
        {
          "name": "model_name",
          "engines": [
            "2.0L I4",
            "3.5L V6 TURBO"
          ],
          "submodels": [
            "Base",
            "Premium",
            "Limited"
          ]
        }
      ]
    }
  ]
}
```

### Key Data Points
1. **Make Level**: Root key matches filename (lowercase)
2. **Year Level**: Array of yearly data
3. **Model Level**: Array of models per year
4. **Engines**: Array of engine specifications
5. **Submodels**: Array of trim levels
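
The hierarchy above can be walked with a few nested loops. The following is a minimal sketch, not the project's ETL code: `iter_model_years` is a hypothetical helper, and the `subaru`/`outback` sample is made-up data that only mirrors the standard structure.

```python
import json
from typing import Iterator

def iter_model_years(make_doc: dict) -> Iterator[tuple]:
    """Yield (make_key, year, model_name, engines, submodels) per model-year."""
    for make_key, year_entries in make_doc.items():   # make level (root key)
        for entry in year_entries:                     # year level
            year = int(entry["year"])                  # years are stored as strings
            for model in entry.get("models", []):      # model level
                yield (
                    make_key,
                    year,
                    model["name"],
                    model.get("engines", []),
                    model.get("submodels", []),
                )

# Hypothetical sample that follows the standard structure.
sample = json.loads("""
{
  "subaru": [
    {"year": "2024", "models": [
      {"name": "outback", "engines": ["2.4L H4"], "submodels": ["Base", "Limited"]}
    ]}
  ]
}
""")

rows = list(iter_model_years(sample))
print(rows)
```

Flattening to one tuple per model-year keeps the later loading steps independent of the nesting depth.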

## Make Name Analysis

### File Naming vs Display Name Issues
| Filename | Required Display Name | Issue |
|----------|---------------------|--------|
| `alfa_romeo.json` | "Alfa Romeo" | Underscore → space, title case |
| `land_rover.json` | "Land Rover" | Underscore → space, title case |
| `rolls_royce.json` | "Rolls Royce" | Underscore → space, title case |
| `chevrolet.json` | "Chevrolet" | Direct match |
| `bmw.json` | "BMW" | Uppercase required |

### Make Name Normalization Rules
1. **Replace underscores** with spaces
2. **Title case** each word
3. **Special cases**: BMW, GMC (all caps)
4. **Validation**: Cross-reference with `sources/makes.json`
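
A minimal sketch of these rules follows. The function names and the `SPECIAL_CASES` table are illustrative assumptions, and because the layout of `sources/makes.json` is not described here, the validation step assumes it has already been reduced to a collection of display names.

```python
# Makes whose display name is all caps (rule 3). Extend as more are found.
SPECIAL_CASES = {"bmw": "BMW", "gmc": "GMC"}

def display_name_from_filename(filename: str) -> str:
    """Turn a source filename such as 'alfa_romeo.json' into a display name."""
    stem = filename.removesuffix(".json")  # requires Python 3.9+
    if stem in SPECIAL_CASES:
        return SPECIAL_CASES[stem]
    # Rules 1 and 2: underscores become spaces, each word is title-cased.
    return " ".join(word.capitalize() for word in stem.split("_"))

def unknown_makes(names, known_makes):
    """Rule 4: return normalized names that are absent from the reference list."""
    return sorted(set(names) - set(known_makes))

print(display_name_from_filename("land_rover.json"))
print(display_name_from_filename("bmw.json"))
```

Keeping the special cases in a lookup table means a newly discovered exception is a one-line change rather than a new branch.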

## Engine Specification Analysis

### Discovered Engine Patterns
From analysis of Nissan, Toyota, Ford, Subaru, and Porsche files:

#### Standard Format: `{displacement}L {config}{cylinders}`
- `"2.0L I4"` - 2.0 liter, Inline 4-cylinder
- `"3.5L V6"` - 3.5 liter, V-configuration 6-cylinder
- `"2.4L H4"` - 2.4 liter, Horizontal (Boxer) 4-cylinder

#### Configuration Types Found
- **I** = Inline (most common)
- **V** = V-configuration
- **H** = Horizontal/Boxer (Subaru, Porsche)
- **L** = **Must be treated as Inline** and normalized to I (L3 → I3)

### Engine Modifier Patterns

#### Hybrid Classifications
- `"PLUG-IN HYBRID EV- (PHEV)"` - Plug-in hybrid electric vehicle
- `"FULL HYBRID EV- (FHEV)"` - Full hybrid electric vehicle
- `"HYBRID"` - General hybrid designation

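#### Fuel Type and Aspiration Modifiers
- `"FLEX"` - Flex-fuel capability (e.g., `"5.6L V8 FLEX"`)
- `"ELECTRIC"` - Electric designation; in the Toyota data it also appears appended to a combustion spec on hybrids (e.g., `"1.8L I4 ELECTRIC"`)
- `"TURBO"` - Turbocharged (aspiration rather than fuel type; less common in current data)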

#### Example Engine Strings
```
"2.5L I4 FULL HYBRID EV- (FHEV)"
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)"  // L3 → I3
"5.6L V8 FLEX"
"2.4L H4"  // Subaru Boxer
"1.8L I4 ELECTRIC"
```
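
The patterns above suggest a regex-based parser. The sketch below is one possible approach under stated assumptions: `ParsedEngine`, `parse_engine`, and the `KNOWN_MODIFIERS` list are hypothetical names, and the modifier list covers only the strings documented in this section.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# {displacement}L {config}{cylinders}, optionally followed by modifier text.
ENGINE_RE = re.compile(
    r"^(?P<disp>\d+(?:\.\d+)?)L\s+(?P<config>[IVHL])(?P<cyl>\d+)(?:\s+(?P<rest>.+))?$"
)

# Longer phrases come first so "HYBRID" does not shadow "PLUG-IN HYBRID".
KNOWN_MODIFIERS = ("PLUG-IN HYBRID", "FULL HYBRID", "HYBRID", "FLEX", "ELECTRIC", "TURBO")

@dataclass
class ParsedEngine:
    displacement_l: float
    configuration: str
    cylinders: int
    modifiers: list = field(default_factory=list)

def parse_engine(raw: str) -> Optional[ParsedEngine]:
    """Parse an engine string; return None when it does not fit the standard format."""
    match = ENGINE_RE.match(raw.strip())
    if match is None:
        return None
    config = match.group("config")
    if config == "L":          # L must be treated as Inline (L3 -> I3)
        config = "I"
    rest = (match.group("rest") or "").upper()
    modifiers = []
    for modifier in KNOWN_MODIFIERS:
        if modifier in rest:
            modifiers.append(modifier)
            rest = rest.replace(modifier, "")  # avoid double-counting "HYBRID"
    return ParsedEngine(float(match.group("disp")), config, int(match.group("cyl")), modifiers)

print(parse_engine("1.5L L3 PLUG-IN HYBRID EV- (PHEV)"))
print(parse_engine("5.6L V8 FLEX"))
```

Returning `None` for non-matching strings leaves the decision about logging or rejecting them to the caller.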

## Special Cases Analysis

### Electric Vehicle Handling
**Tesla Example** (`tesla.json`), with an empty `engines` array:
```json
{
  "name": "3",
  "engines": [],
  "submodels": ["Long Range AWD", "Performance"]
}
```

**Lucid Example** (`lucid.json`), with empty `engines` and `submodels` arrays:
```json
{
  "name": "air",
  "engines": [],
  "submodels": []
}
```

#### Electric Vehicle Requirements
- **Empty engines arrays** are common for pure electric vehicles
- **Must create default engine**: `"Electric Motor"` with appropriate specs
- **Fuel type**: `"Electric"`
- **Configuration**: `null` or `"Electric"`
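
One way to satisfy these requirements is to substitute a default record whenever the `engines` array is empty. This is a sketch only: `resolve_engines` and the record keys are assumptions, and it treats every empty array as an electric vehicle, which may misclassify non-electric models whose engine data is simply missing.

```python
# Default record for models with no listed engines (assumed to be pure electric).
DEFAULT_ELECTRIC_ENGINE = {
    "name": "Electric Motor",
    "fuel_type": "Electric",
    "configuration": None,
}

def resolve_engines(engines: list) -> list:
    """Return engine records, falling back to the electric default when empty."""
    if not engines:
        return [dict(DEFAULT_ELECTRIC_ENGINE)]  # copy so callers cannot mutate the default
    # Non-empty arrays keep their raw strings; parsing happens in a later step.
    return [{"name": raw} for raw in engines]

print(resolve_engines([]))
print(resolve_engines(["2.4L H4"]))
```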

### Hybrid Vehicle Patterns
In the Toyota data, hybrid designations appear at both the engine and submodel levels:
- **Engine level**: `"1.8L I4 ELECTRIC"`
- **Submodel level**: `"Hybrid LE"`, `"Hybrid XSE"`

## Data Quality Issues Found

### Missing Engine Data
- **Tesla models**: Consistently empty engines arrays
- **Lucid models**: Empty engines arrays  
- **Some Nissan models**: Empty engines for electric variants

### Inconsistent Submodel Data
- **Mix of trim levels and descriptors**
- **Some technical specifications** in submodel names
- **Inconsistent naming patterns** across makes

### Engine Specification Inconsistencies
- **L-configuration usage**: Should be normalized to I (Inline)
- **Mixed hybrid notation**: Sometimes in the engine string, sometimes only in the submodel name
- **Abbreviation variations**: EV- vs EV, FHEV vs FULL HYBRID

## Database Mapping Strategy

### Make Mapping
```
Filename: "alfa_romeo.json" → Database: "Alfa Romeo"
```

### Model Mapping  
```
JSON models[i].name → vehicles.model.name
```

### Engine Mapping
```
JSON engines[i] → vehicles.engine.name (with parsing)
Engine parsing → displacement_l, cylinders, fuel_type, aspiration
```

### Trim Mapping
```
JSON submodels[i] → vehicles.trim.name
```

## Data Volume Estimates

### File Size Analysis
- **Largest files**: `toyota.json` (~748KB), `volkswagen.json` (~738KB)
- **Smallest files**: `lucid.json` (~176B), `rivian.json` (~177B)
- **Average file size**: ~150KB

### Record Estimates (Based on Sample Analysis)
- **Makes**: 55 (one per file)
- **Models per make**: 5-50 (highly variable)
- **Years per model**: 10-15 on average
- **Trims per model-year**: 3-10 on average
- **Engines**: 500-1000 unique engines total
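
Multiplying these ranges gives loose lower and upper bounds on row counts. The sketch below only does that arithmetic; `bounds` is a hypothetical helper, and because the inputs are sample-based averages and ranges, the results are order-of-magnitude guides rather than predictions.

```python
MAKES = 55
MODELS_PER_MAKE = (5, 50)
YEARS_PER_MODEL = (10, 15)
TRIMS_PER_MODEL_YEAR = (3, 10)

def bounds(*ranges, base=1):
    """Multiply the low ends and the high ends of each (low, high) range."""
    low = high = base
    for range_low, range_high in ranges:
        low *= range_low
        high *= range_high
    return low, high

model_rows = bounds(MODELS_PER_MAKE, base=MAKES)
model_year_rows = bounds(MODELS_PER_MAKE, YEARS_PER_MODEL, base=MAKES)
trim_rows = bounds(MODELS_PER_MAKE, YEARS_PER_MODEL, TRIMS_PER_MODEL_YEAR, base=MAKES)

print(model_rows)       # (275, 2750)
print(model_year_rows)  # (2750, 41250)
print(trim_rows)        # (8250, 412500)
```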

## Processing Recommendations

### Order of Operations
1. **Load makes** - Create make records with normalized names
2. **Load models** - Associate with correct make_id
3. **Load model_years** - Create year availability
4. **Parse and load engines** - Handle L→I normalization
5. **Load trims** - Associate with model_year_id
6. **Create trim_engine relationships**
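
The dependency order above can be illustrated with an in-memory stand-in for the database. Everything in this sketch is an assumption made for illustration: the `Table` class, the table names, and in particular the choice to link every trim to every engine of its model-year, since the source JSON lists engines and submodels side by side without pairing them.

```python
import re
from itertools import count

class Table:
    """Tiny in-memory stand-in for a table with get-or-create semantics."""
    def __init__(self):
        self._ids = {}
        self._next_id = count(1)

    def get_or_create(self, key):
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]

    def __len__(self):
        return len(self._ids)

TABLE_NAMES = ("make", "model", "model_year", "engine", "trim", "trim_engine")

def normalize_engine(raw: str) -> str:
    """Rewrite an L configuration to I, e.g. '1.5L L3' -> '1.5L I3'."""
    return re.sub(r"\bL(\d)", r"I\1", raw)

def load_make(display_name, year_entries, tables):
    make_id = tables["make"].get_or_create(display_name)                      # 1
    for entry in year_entries:
        year = int(entry["year"])
        for model in entry.get("models", []):
            model_id = tables["model"].get_or_create((make_id, model["name"]))       # 2
            model_year_id = tables["model_year"].get_or_create((model_id, year))     # 3
            engine_ids = [tables["engine"].get_or_create(normalize_engine(raw))      # 4
                          for raw in model.get("engines", [])]
            for trim in model.get("submodels", []):
                trim_id = tables["trim"].get_or_create((model_year_id, trim))        # 5
                for engine_id in engine_ids:                                         # 6
                    tables["trim_engine"].get_or_create((trim_id, engine_id))

tables = {name: Table() for name in TABLE_NAMES}
load_make("Example Make", [{"year": "2024", "models": [
    {"name": "example_model", "engines": ["1.5L L3", "1.5L I3"], "submodels": ["Base", "Limited"]}
]}], tables)
counts = {name: len(table) for name, table in tables.items()}
print(counts)
```

The sketch interleaves the six steps per record rather than running them as separate passes; either works as long as each parent row exists before its children. In the hypothetical sample, the two engine strings collapse into one engine row after L→I normalization.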

### Error Handling Requirements
- **Handle empty engines arrays** (electric vehicles)
- **Validate engine parsing** (log unparseable engines)  
- **Handle duplicate records** (upsert strategy)
- **Report data quality issues** (missing data, parsing failures)
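
A lightweight way to cover the logging and reporting requirements is to tally issues while scanning each model. This sketch assumes a hypothetical `check_model` helper, a simplified engine regex, and illustrative issue labels; the upsert strategy depends on the target database and is not shown.

```python
import logging
import re
from collections import Counter

logger = logging.getLogger("vehicle_etl")  # logger name is an assumption

# Simplified check for the standard "{displacement}L {config}{cylinders}" prefix.
ENGINE_RE = re.compile(r"^\d+(?:\.\d+)?L\s+[IVHL]\d+\b")

def check_model(make: str, year: int, model: dict, issues: Counter) -> None:
    """Tally data quality issues for one model entry and log unparseable engines."""
    engines = model.get("engines", [])
    if not engines:
        issues["empty_engines"] += 1
    for raw in engines:
        if not ENGINE_RE.match(raw):
            issues["unparseable_engine"] += 1
            logger.warning("Unparseable engine %r for %s %s %s",
                           raw, year, make, model.get("name"))
    if not model.get("submodels"):
        issues["empty_submodels"] += 1

issues = Counter()
check_model("lucid", 2024, {"name": "air", "engines": [], "submodels": []}, issues)
check_model("example", 2024, {"name": "m", "engines": ["??"], "submodels": ["Base"]}, issues)
print(dict(issues))
```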

## Validation Strategy
- **Cross-reference makes** with existing `sources/makes.json`
- **Validate engine parsing** with regex patterns
- **Check referential integrity** during loading
- **Report statistics** per make (models, engines, trims loaded)
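
Per-make statistics can be gathered directly from a parsed source file and compared with what was loaded. This is a sketch under assumptions: `make_statistics` is a hypothetical helper, the sample data is made up, and it counts distinct values in the source rather than rows in the database.

```python
def make_statistics(make_doc: dict) -> dict:
    """Count distinct models, engine strings, and trim names per make."""
    stats = {}
    for make_key, year_entries in make_doc.items():
        models, engines, trims = set(), set(), set()
        for entry in year_entries:
            for model in entry.get("models", []):
                models.add(model["name"])
                engines.update(model.get("engines", []))
                trims.update(model.get("submodels", []))
        stats[make_key] = {
            "models": len(models),
            "engines": len(engines),
            "trims": len(trims),
        }
    return stats

# Hypothetical two-year sample for a single make.
sample = {"example": [
    {"year": "2023", "models": [{"name": "a", "engines": ["2.0L I4"], "submodels": ["Base"]}]},
    {"year": "2024", "models": [{"name": "a", "engines": ["2.0L I4", "3.5L V6"],
                                 "submodels": ["Base", "Limited"]}]},
]}
report = make_statistics(sample)
print(report)
```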