# Analysis Findings - JSON Vehicle Data

## Data Source Overview
- **Location**: `mvp-platform-services/vehicles/etl/sources/makes/`
- **File Count**: 55 JSON files
- **File Naming**: Lowercase with underscores (e.g., `alfa_romeo.json`, `land_rover.json`)
- **Data Structure**: Hierarchical vehicle data by make

## JSON File Structure Analysis

### Standard Structure
```json
{
  "[make_name]": [
    {
      "year": "2024",
      "models": [
        {
          "name": "model_name",
          "engines": [
            "2.0L I4",
            "3.5L V6 TURBO"
          ],
          "submodels": [
            "Base",
            "Premium",
            "Limited"
          ]
        }
      ]
    }
  ]
}
```

### Key Data Points
1. **Make Level**: Root key matches filename (lowercase)
2. **Year Level**: Array of yearly data
3. **Model Level**: Array of models per year
4. **Engines**: Array of engine specifications
5. **Submodels**: Array of trim levels

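
The hierarchy above can be walked with a small generator. This is a minimal sketch, not the project's loader: `ModelYear` and `iter_model_years` are hypothetical names, and the sample document reuses the placeholder values from the structure shown earlier.

```python
import json
from typing import Iterator, List, NamedTuple


class ModelYear(NamedTuple):
    """One flattened (make, year, model) entry. Hypothetical helper type."""
    make_key: str
    year: str
    model: str
    engines: List[str]
    submodels: List[str]


def iter_model_years(document: dict) -> Iterator[ModelYear]:
    """Yield one record per model per year from a parsed make file."""
    for make_key, years in document.items():          # make level (root key)
        for year_entry in years:                      # year level
            for model in year_entry.get("models", []):  # model level
                yield ModelYear(
                    make_key=make_key,
                    year=year_entry["year"],
                    model=model["name"],
                    engines=list(model.get("engines", [])),
                    submodels=list(model.get("submodels", [])),
                )


# Placeholder data shaped like the standard structure.
sample = json.loads("""
{
  "alfa_romeo": [
    {
      "year": "2024",
      "models": [
        {
          "name": "model_name",
          "engines": ["2.0L I4", "3.5L V6 TURBO"],
          "submodels": ["Base", "Premium", "Limited"]
        }
      ]
    }
  ]
}
""")

records = list(iter_model_years(sample))
```

Flattening early keeps later steps (engine parsing, trim loading) independent of the nesting in the source files.
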
## Make Name Analysis

### File Naming vs Display Name Issues
| Filename | Required Display Name | Issue |
|----------|-----------------------|-------|
| `alfa_romeo.json` | "Alfa Romeo" | Underscore → space, title case |
| `land_rover.json` | "Land Rover" | Underscore → space, title case |
| `rolls_royce.json` | "Rolls Royce" | Underscore → space, title case |
| `chevrolet.json` | "Chevrolet" | Direct match |
| `bmw.json` | "BMW" | Uppercase required |

### Make Name Normalization Rules
1. **Replace underscores** with spaces
2. **Title case** each word
3. **Special cases**: BMW, GMC (all caps)
4. **Validation**: Cross-reference with `sources/makes.json`

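
Rules 1 to 3 can be sketched as a single function. The name `display_name_from_filename` and the `ALL_CAPS_MAKES` set are illustrative assumptions; only BMW and GMC are named as special cases in these notes, so the set would need extending if other makes turn out to require it.

```python
from pathlib import Path

# Assumed special-case list: only the two makes named in the rules.
ALL_CAPS_MAKES = {"bmw", "gmc"}


def display_name_from_filename(filename: str) -> str:
    """Turn 'alfa_romeo.json' into 'Alfa Romeo' and 'bmw.json' into 'BMW'."""
    stem = Path(filename).stem.lower()   # drop the directory and '.json'
    words = stem.split("_")              # rule 1: underscores become spaces
    return " ".join(
        word.upper() if word in ALL_CAPS_MAKES else word.capitalize()
        for word in words                # rules 2 and 3
    )


names = [
    display_name_from_filename(f)
    for f in ("alfa_romeo.json", "land_rover.json", "chevrolet.json", "bmw.json")
]
```

Rule 4 (validation against `sources/makes.json`) is left out here because the layout of that file is not described in these notes.
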
## Engine Specification Analysis

### Discovered Engine Patterns
From analysis of Nissan, Toyota, Ford, Subaru, and Porsche files:

#### Standard Format: `{displacement}L {config}{cylinders}`
- `"2.0L I4"` - 2.0 liter, Inline 4-cylinder
- `"3.5L V6"` - 3.5 liter, V6 configuration
- `"2.4L H4"` - 2.4 liter, Horizontal (Boxer) 4-cylinder

#### Configuration Types Found
- **I** = Inline (most common)
- **V** = V-configuration
- **H** = Horizontal/Boxer (Subaru, Porsche)
- **L** = **MUST BE TREATED AS INLINE** (L3 → I3)

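
A regular expression is one way to parse the standard format and apply the L → I rule. The sketch below is an assumption about how parsing could look, not the actual ETL code; `EngineSpec`, `ENGINE_RE`, and `parse_engine` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# {displacement}L {config}{cylinders}, followed by optional modifier text.
ENGINE_RE = re.compile(
    r"^(?P<displacement>\d+(?:\.\d+)?)L\s+"
    r"(?P<config>[IVHL])(?P<cylinders>\d+)\b"
    r"(?P<rest>.*)$"
)


class EngineSpec(NamedTuple):
    displacement_l: float
    configuration: str
    cylinders: int
    modifiers: str  # unparsed trailing text, e.g. "FLEX"


def parse_engine(raw: str) -> Optional[EngineSpec]:
    """Parse an engine string; return None when it does not fit the format."""
    match = ENGINE_RE.match(raw.strip().upper())
    if match is None:
        return None
    config = match["config"]
    if config == "L":  # L-configuration is treated as inline
        config = "I"
    return EngineSpec(
        displacement_l=float(match["displacement"]),
        configuration=config,
        cylinders=int(match["cylinders"]),
        modifiers=match["rest"].strip(),
    )


boxer = parse_engine("2.4L H4")
three_cyl = parse_engine("1.5L L3 PLUG-IN HYBRID EV- (PHEV)")
```

Returning `None` instead of raising lets the caller decide whether an unparseable string should be logged, skipped, or stored verbatim.
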
### Engine Modifier Patterns

#### Hybrid Classifications
- `"PLUG-IN HYBRID EV- (PHEV)"` - Plug-in hybrid electric vehicle
- `"FULL HYBRID EV- (FHEV)"` - Full hybrid electric vehicle
- `"HYBRID"` - General hybrid designation

#### Fuel Type Modifiers
- `"FLEX"` - Flex-fuel capability (e.g., `"5.6L V8 FLEX"`)
- `"ELECTRIC"` - Pure electric motor
- `"TURBO"` - Turbocharged (less common in current data)

#### Example Engine Strings
```
"2.5L I4 FULL HYBRID EV- (FHEV)"
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)"  // L3 → I3
"5.6L V8 FLEX"
"2.4L H4"  // Subaru Boxer
"1.8L I4 ELECTRIC"
```

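
The modifiers listed above could be mapped to `fuel_type` and `aspiration` values with simple substring checks. This is a sketch under assumptions: the function name `classify_modifiers` and the output labels (`"Plug-in Hybrid"`, `"Flex Fuel"`, and so on) are illustrative, and `None` is returned where the string carries no modifier rather than guessing a default.

```python
from typing import Dict, Optional


def classify_modifiers(raw: str) -> Dict[str, Optional[str]]:
    """Derive fuel type and aspiration from modifier keywords in an engine string."""
    text = raw.upper()

    # Most specific hybrid forms first: every hybrid string contains "HYBRID".
    if "PHEV" in text or "PLUG-IN HYBRID" in text:
        fuel_type: Optional[str] = "Plug-in Hybrid"
    elif "FHEV" in text or "FULL HYBRID" in text:
        fuel_type = "Full Hybrid"
    elif "HYBRID" in text:
        fuel_type = "Hybrid"
    elif "ELECTRIC" in text:
        fuel_type = "Electric"
    elif "FLEX" in text:
        fuel_type = "Flex Fuel"
    else:
        fuel_type = None  # no modifier present; the default is left to the caller

    aspiration = "Turbocharged" if "TURBO" in text else None
    return {"fuel_type": fuel_type, "aspiration": aspiration}


examples = {
    s: classify_modifiers(s)
    for s in (
        "2.5L I4 FULL HYBRID EV- (FHEV)",
        "1.5L L3 PLUG-IN HYBRID EV- (PHEV)",
        "5.6L V8 FLEX",
        "2.4L H4",
        "3.5L V6 TURBO",
    )
}
```

Checking the abbreviation and the long form together (`PHEV` or `PLUG-IN HYBRID`) keeps the classification stable across the notation variants seen in the data.
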
## Special Cases Analysis

### Electric Vehicle Handling
**Tesla Example** (`tesla.json`):
```json
{
  "name": "3",
  "engines": [],  // Empty array
  "submodels": ["Long Range AWD", "Performance"]
}
```

**Lucid Example** (`lucid.json`):
```json
{
  "name": "air",
  "engines": [],  // Empty array
  "submodels": []
}
```

#### Electric Vehicle Requirements
- **Empty engines arrays** are common for pure electric vehicles
- **Must create default engine**: `"Electric Motor"` with appropriate specs
- **Fuel type**: `"Electric"`
- **Configuration**: `null` or `"Electric"`

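
One possible way to meet these requirements is to substitute a default record whenever the engines array is empty. In this sketch `DEFAULT_ELECTRIC_ENGINE` and `engines_or_default` are hypothetical names, and the record carries only the fields named in the list above; `configuration` uses `None` (the `null` option).

```python
from typing import Dict, List, Optional

# Default record for models whose engines array is empty (assumed field names).
DEFAULT_ELECTRIC_ENGINE: Dict[str, Optional[str]] = {
    "name": "Electric Motor",
    "fuel_type": "Electric",
    "configuration": None,
}


def engines_or_default(model: dict) -> List[Dict[str, Optional[str]]]:
    """Return the model's engines, or the default electric engine if none are listed."""
    engines = model.get("engines") or []
    if engines:
        return [{"name": name} for name in engines]
    # Copy so callers cannot mutate the shared default.
    return [dict(DEFAULT_ELECTRIC_ENGINE)]


tesla_model = {"name": "3", "engines": [], "submodels": ["Long Range AWD", "Performance"]}
tesla_engines = engines_or_default(tesla_model)
```

The substitution treats every empty array as an electric vehicle, which matches the Tesla and Lucid files but would mislabel a combustion model whose engine data is simply missing.
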
### Hybrid Vehicle Patterns
From Toyota analysis, hybrid information appears in both engines and submodels:
- **Engine level**: `"1.8L I4 ELECTRIC"`
- **Submodel level**: `"Hybrid LE"`, `"Hybrid XSE"`

## Data Quality Issues Found

### Missing Engine Data
- **Tesla models**: Consistently empty engines arrays
- **Lucid models**: Empty engines arrays
- **Some Nissan models**: Empty engines for electric variants

### Inconsistent Submodel Data
- **Mix of trim levels and descriptors**
- **Some technical specifications** in submodel names
- **Inconsistent naming patterns** across makes

### Engine Specification Inconsistencies
- **L-configuration usage**: Should be normalized to I (Inline)
- **Mixed hybrid notation**: Sometimes in engine string, sometimes separate
- **Abbreviation variations**: EV- vs EV, FHEV vs FULL HYBRID

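
Some of these issues can be detected mechanically before loading. The checker below is a rough sketch: `find_quality_issues` is a hypothetical helper, it covers only the engine-related issues (empty arrays, L-configuration, the `EV-` variant), and the message wording is illustrative.

```python
import re
from typing import List

# Matches an L-configuration token such as "L3", but not the "L" in "1.5L".
L_CONFIG_RE = re.compile(r"\bL\d+\b")


def find_quality_issues(model: dict) -> List[str]:
    """Return human-readable notes on engine data issues for one model entry."""
    issues: List[str] = []
    engines = model.get("engines") or []
    if not engines:
        issues.append("empty engines array")
    for engine in engines:
        if L_CONFIG_RE.search(engine):
            issues.append(f"L-configuration in {engine!r}")
        if "EV-" in engine:
            issues.append(f"'EV-' notation variant in {engine!r}")
    return issues


lucid_issues = find_quality_issues({"name": "air", "engines": [], "submodels": []})
phev_issues = find_quality_issues(
    {"name": "model_name", "engines": ["1.5L L3 PLUG-IN HYBRID EV- (PHEV)"]}
)
```

Submodel inconsistencies are not covered because the notes do not define what a well-formed trim name looks like.
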
## Database Mapping Strategy

### Make Mapping
```
Filename: "alfa_romeo.json" → Database: "Alfa Romeo"
```

### Model Mapping
```
JSON models.name → vehicles.model.name
```

### Engine Mapping
```
JSON engines[0] → vehicles.engine.name (with parsing)
Engine parsing → displacement_l, cylinders, fuel_type, aspiration
```

### Trim Mapping
```
JSON submodels[0] → vehicles.trim.name
```

## Data Volume Estimates

### File Size Analysis
- **Largest files**: `toyota.json` (~748KB), `volkswagen.json` (~738KB)
- **Smallest files**: `lucid.json` (~176B), `rivian.json` (~177B)
- **Average file size**: ~150KB

### Record Estimates (Based on Sample Analysis)
- **Makes**: 55 (one per file)
- **Models per make**: 5-50 (highly variable)
- **Years per model**: 10-15 years average
- **Trims per model-year**: 3-10 average
- **Engines**: 500-1000 unique engines total

## Processing Recommendations

### Order of Operations
1. **Load makes** - Create make records with normalized names
2. **Load models** - Associate with correct make_id
3. **Load model_years** - Create year availability
4. **Parse and load engines** - Handle L→I normalization
5. **Load trims** - Associate with model_year_id
6. **Create trim_engine relationships**

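
The dependency order above can be illustrated against an in-memory SQLite database. Everything here is an assumption made for the sketch: the table and column names, the `get_or_create` and `load_make` helpers, and in particular the choice to link every engine of a model-year to every trim, since the JSON does not say which trims pair with which engines.

```python
import re
import sqlite3

# Hypothetical schema; the real database layout is not given in these notes.
SCHEMA = """
CREATE TABLE makes (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE models (id INTEGER PRIMARY KEY, make_id INTEGER NOT NULL,
                     name TEXT NOT NULL, UNIQUE (make_id, name));
CREATE TABLE model_years (id INTEGER PRIMARY KEY, model_id INTEGER NOT NULL,
                          year INTEGER NOT NULL, UNIQUE (model_id, year));
CREATE TABLE engines (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE trims (id INTEGER PRIMARY KEY, model_year_id INTEGER NOT NULL,
                    name TEXT NOT NULL, UNIQUE (model_year_id, name));
CREATE TABLE trim_engines (trim_id INTEGER NOT NULL, engine_id INTEGER NOT NULL,
                           PRIMARY KEY (trim_id, engine_id));
"""


def get_or_create(conn: sqlite3.Connection, table: str, keys: dict) -> int:
    """Insert the row if it is missing, then return its id (duplicate-safe)."""
    columns = ", ".join(keys)
    marks = ", ".join("?" for _ in keys)
    values = tuple(keys.values())
    conn.execute(f"INSERT OR IGNORE INTO {table} ({columns}) VALUES ({marks})", values)
    where = " AND ".join(f"{column} = ?" for column in keys)
    return conn.execute(f"SELECT id FROM {table} WHERE {where}", values).fetchone()[0]


def load_make(conn: sqlite3.Connection, display_name: str, years: list) -> None:
    make_id = get_or_create(conn, "makes", {"name": display_name})            # step 1
    for entry in years:
        for model in entry.get("models", []):
            model_id = get_or_create(
                conn, "models", {"make_id": make_id, "name": model["name"]})  # step 2
            model_year_id = get_or_create(
                conn, "model_years",
                {"model_id": model_id, "year": int(entry["year"])})           # step 3
            engine_ids = [
                get_or_create(conn, "engines",
                              {"name": re.sub(r"\bL(\d+)\b", r"I\1", name)})   # step 4
                for name in model.get("engines", [])
            ]
            for trim in model.get("submodels", []):
                trim_id = get_or_create(
                    conn, "trims",
                    {"model_year_id": model_year_id, "name": trim})           # step 5
                for engine_id in engine_ids:                                  # step 6
                    conn.execute("INSERT OR IGNORE INTO trim_engines VALUES (?, ?)",
                                 (trim_id, engine_id))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample_years = [{"year": "2024", "models": [{
    "name": "model_name",
    "engines": ["2.0L I4", "1.5L L3"],
    "submodels": ["Base", "Premium"],
}]}]
load_make(conn, "Alfa Romeo", sample_years)
load_make(conn, "Alfa Romeo", sample_years)  # re-running adds no duplicate rows
```

The sketch processes one record at a time rather than one table at a time, but each parent row is still created before any row that references it.
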
### Error Handling Requirements
- **Handle empty engines arrays** (electric vehicles)
- **Validate engine parsing** (log unparseable engines)
- **Handle duplicate records** (upsert strategy)
- **Report data quality issues** (missing data, parsing failures)

## Validation Strategy
- **Cross-reference makes** with existing `sources/makes.json`
- **Validate engine parsing** with regex patterns
- **Check referential integrity** during loading
- **Report statistics** per make (models, engines, trims loaded)
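
The cross-reference and statistics steps might look like the sketch below. The helper names are hypothetical, and because the layout of `sources/makes.json` is not described here, the known makes are passed in as a plain list of display names, which is an assumption.

```python
from typing import Dict, Iterable, List


def make_statistics(document: dict) -> Dict[str, Dict[str, int]]:
    """Count distinct models, engine strings, and trim names per make."""
    stats: Dict[str, Dict[str, int]] = {}
    for make_key, years in document.items():
        models, engines, trims = set(), set(), set()
        for entry in years:
            for model in entry.get("models", []):
                models.add(model["name"])
                engines.update(model.get("engines", []))
                trims.update(model.get("submodels", []))
        stats[make_key] = {
            "models": len(models),
            "engines": len(engines),
            "trims": len(trims),
        }
    return stats


def unknown_makes(display_names: Iterable[str], known_makes: Iterable[str]) -> List[str]:
    """Return display names absent from the reference list (case-insensitive)."""
    known = {name.casefold() for name in known_makes}
    return sorted(name for name in display_names if name.casefold() not in known)


sample_doc = {"tesla": [{"year": "2024", "models": [
    {"name": "3", "engines": [], "submodels": ["Long Range AWD", "Performance"]},
]}]}
tesla_stats = make_statistics(sample_doc)
missing = unknown_makes(["Alfa Romeo", "Tesla"], ["alfa romeo", "BMW"])
```

Counting distinct values rather than raw entries keeps the numbers comparable with the unique-record estimates given earlier.
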