Analysis Findings - JSON Vehicle Data
Data Source Overview
- Location: mvp-platform-services/vehicles/etl/sources/makes/
- File Count: 55 JSON files
- File Naming: Lowercase with underscores (e.g., alfa_romeo.json, land_rover.json)
- Data Structure: Hierarchical vehicle data by make
JSON File Structure Analysis
Standard Structure
{
  "[make_name]": [
    {
      "year": "2024",
      "models": [
        {
          "name": "model_name",
          "engines": [
            "2.0L I4",
            "3.5L V6 TURBO"
          ],
          "submodels": [
            "Base",
            "Premium",
            "Limited"
          ]
        }
      ]
    }
  ]
}
Key Data Points
- Make Level: Root key matches filename (lowercase)
- Year Level: Array of yearly data
- Model Level: Array of models per year
- Engines: Array of engine specifications
- Submodels: Array of trim levels
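A minimal sketch of walking this hierarchy in Python, assuming every file follows the standard structure above. The helper name `iter_model_rows` and the inline sample (which reuses the placeholder names from the structure example) are illustrative assumptions, not part of the existing ETL code.

```python
import json
from typing import Iterator

# Inline sample mirroring the standard structure; names are placeholders.
SAMPLE = """
{
  "example_make": [
    {
      "year": "2024",
      "models": [
        {
          "name": "model_name",
          "engines": ["2.0L I4", "3.5L V6 TURBO"],
          "submodels": ["Base", "Premium", "Limited"]
        }
      ]
    }
  ]
}
"""


def iter_model_rows(doc: dict) -> Iterator[tuple]:
    """Yield (make_key, year, model_name, engines, submodels) per model."""
    for make_key, year_entries in doc.items():      # make level (root key)
        for entry in year_entries:                  # year level
            year = int(entry["year"])               # years are stored as strings
            for model in entry.get("models", []):   # model level
                yield (
                    make_key,
                    year,
                    model["name"],
                    model.get("engines", []),
                    model.get("submodels", []),
                )


rows = list(iter_model_rows(json.loads(SAMPLE)))
```

Flattening to tuples first keeps the later loading steps independent of the nesting in the source files.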
Make Name Analysis
File Naming vs Display Name Issues
| Filename | Required Display Name | Issue |
|---|---|---|
| alfa_romeo.json | "Alfa Romeo" | Underscore → space, title case |
| land_rover.json | "Land Rover" | Underscore → space, title case |
| rolls_royce.json | "Rolls Royce" | Underscore → space, title case |
| chevrolet.json | "Chevrolet" | Direct match |
| bmw.json | "BMW" | Uppercase required |
Make Name Normalization Rules
- Replace underscores with spaces
- Title case each word
- Special cases: BMW, GMC (all caps)
- Validation: Cross-reference with sources/makes.json
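The first three rules could be sketched as follows. The function name `display_name_from_filename` and the `UPPERCASE_MAKES` set are hypothetical; the set contains only the two special cases named above and may need extending. The cross-reference against sources/makes.json is left out because its format is not described here.

```python
from pathlib import Path

# Makes whose display name is all caps (only the special cases listed above).
UPPERCASE_MAKES = {"bmw", "gmc"}


def display_name_from_filename(filename: str) -> str:
    """Turn a make filename such as 'alfa_romeo.json' into a display name."""
    stem = Path(filename).stem.lower()
    if stem in UPPERCASE_MAKES:
        return stem.upper()
    # Replace underscores with spaces and title-case each word.
    return " ".join(word.capitalize() for word in stem.split("_"))
```

For example, `display_name_from_filename("land_rover.json")` gives `"Land Rover"` and `display_name_from_filename("bmw.json")` gives `"BMW"`.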
Engine Specification Analysis
Discovered Engine Patterns
From analysis of Nissan, Toyota, Ford, Subaru, and Porsche files:
Standard Format: {displacement}L {config}{cylinders}
- "2.0L I4" - 2.0 liter, Inline 4-cylinder
- "3.5L V6" - 3.5 liter, V6 configuration
- "2.4L H4" - 2.4 liter, Horizontal (Boxer) 4-cylinder
Configuration Types Found
- I = Inline (most common)
- V = V-configuration
- H = Horizontal/Boxer (Subaru, Porsche)
- L = MUST BE TREATED AS INLINE (L3 → I3)
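One way to parse the standard format and apply the L → I rule is a single regular expression. This is a sketch under the assumption that the base specification always leads the string; `ENGINE_RE`, `EngineSpec`, and `parse_engine_base` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# {displacement}L {config}{cylinders}, e.g. "2.0L I4"; trailing modifiers are ignored.
ENGINE_RE = re.compile(
    r"^(?P<displacement>\d+(?:\.\d+)?)L\s+(?P<config>[IVHL])(?P<cylinders>\d+)\b"
)


class EngineSpec(NamedTuple):
    displacement_l: float
    config: str
    cylinders: int


def parse_engine_base(text: str) -> Optional[EngineSpec]:
    """Parse the leading displacement/config/cylinders; None if it does not match."""
    match = ENGINE_RE.match(text.strip())
    if match is None:
        return None
    config = match.group("config")
    if config == "L":
        config = "I"  # L is treated as Inline (L3 -> I3)
    return EngineSpec(
        float(match.group("displacement")),
        config,
        int(match.group("cylinders")),
    )
```

Returning `None` rather than raising lets the caller log unparseable strings and continue with the rest of the file.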
Engine Modifier Patterns
Hybrid Classifications
- "PLUG-IN HYBRID EV- (PHEV)" - Plug-in hybrid electric vehicle
- "FULL HYBRID EV- (FHEV)" - Full hybrid electric vehicle
- "HYBRID" - General hybrid designation
Fuel Type Modifiers
- "FLEX" - Flex-fuel capability (e.g., "5.6L V8 FLEX")
- "ELECTRIC" - Pure electric motor
- "TURBO" - Turbocharged (less common in current data)
Example Engine Strings
"2.5L I4 FULL HYBRID EV- (FHEV)"
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)" // L3 → I3
"5.6L V8 FLEX"
"2.4L H4" // Subaru Boxer
"1.8L I4 ELECTRIC"
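The modifiers in these strings could be classified with simple keyword checks. This is a sketch only: the function name `classify_modifiers` and the returned keys and labels ("PHEV", "FHEV", "HYBRID") are assumptions, not an existing schema.

```python
from typing import Optional


def classify_modifiers(engine: str) -> dict:
    """Extract hybrid class and fuel/aspiration flags from an engine string."""
    text = engine.upper()

    # Most specific hybrid class first; plain "HYBRID" is the fallback.
    hybrid: Optional[str]
    if "PHEV" in text or "PLUG-IN HYBRID" in text:
        hybrid = "PHEV"
    elif "FHEV" in text or "FULL HYBRID" in text:
        hybrid = "FHEV"
    elif "HYBRID" in text:
        hybrid = "HYBRID"
    else:
        hybrid = None

    # Whole-token matching for the remaining modifiers.
    tokens = set(text.replace("(", " ").replace(")", " ").split())
    return {
        "hybrid": hybrid,
        "flex_fuel": "FLEX" in tokens,
        "electric": "ELECTRIC" in tokens,
        "turbo": "TURBO" in tokens,
    }
```

Checking both the abbreviation and the spelled-out phrase tolerates strings that carry only one of the two notations.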
Special Cases Analysis
Electric Vehicle Handling
Tesla Example (tesla.json):
{
  "name": "3",
  "engines": [], // Empty array
  "submodels": ["Long Range AWD", "Performance"]
}
Lucid Example (lucid.json):
{
  "name": "air",
  "engines": [], // Empty array
  "submodels": []
}
Electric Vehicle Requirements
- Empty engines arrays are common for pure electric vehicles
- Must create default engine: "Electric Motor" with appropriate specs
- Fuel type: "Electric"
- Configuration: null or "Electric"
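A sketch of the default-engine fallback, assuming an empty engines array always indicates a pure electric model. The helper name `engines_or_default` and the record keys are hypothetical; the values come from the requirements above, with the configuration set to null.

```python
# Default record for models with no engines listed (values from the requirements above).
DEFAULT_ELECTRIC_ENGINE = {
    "name": "Electric Motor",
    "fuel_type": "Electric",
    "configuration": None,
}


def engines_or_default(model: dict) -> list:
    """Return engine records for a model, substituting the electric default if empty."""
    names = model.get("engines") or []
    if not names:
        # Copy so callers cannot mutate the shared default.
        return [dict(DEFAULT_ELECTRIC_ENGINE)]
    return [{"name": name} for name in names]


tesla_model = {"name": "3", "engines": [], "submodels": ["Long Range AWD", "Performance"]}
tesla_engines = engines_or_default(tesla_model)
```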
Hybrid Vehicle Patterns
From Toyota analysis - hybrid appears in both engines and submodels:
- Engine level: "1.8L I4 ELECTRIC"
- Submodel level: "Hybrid LE", "Hybrid XSE"
Data Quality Issues Found
Missing Engine Data
- Tesla models: Consistently empty engines arrays
- Lucid models: Empty engines arrays
- Some Nissan models: Empty engines for electric variants
Inconsistent Submodel Data
- Mix of trim levels and descriptors
- Some technical specifications in submodel names
- Inconsistent naming patterns across makes
Engine Specification Inconsistencies
- L-configuration usage: Should be normalized to I (Inline)
- Mixed hybrid notation: Sometimes in engine string, sometimes separate
- Abbreviation variations: EV- vs EV, FHEV vs FULL HYBRID
Database Mapping Strategy
Make Mapping
Filename: "alfa_romeo.json" → Database: "Alfa Romeo"
Model Mapping
JSON models.name → vehicles.model.name
Engine Mapping
JSON engines[0] → vehicles.engine.name (with parsing)
Engine parsing → displacement_l, cylinders, fuel_type, aspiration
Trim Mapping
JSON submodels[0] → vehicles.trim.name
Data Volume Estimates
File Size Analysis
- Largest files: toyota.json (~748KB), volkswagen.json (~738KB)
- Smallest files: lucid.json (~176B), rivian.json (~177B)
- Average file size: ~150KB
Record Estimates (Based on Sample Analysis)
- Makes: 55 (one per file)
- Models per make: 5-50 (highly variable)
- Years per model: 10-15 years average
- Trims per model-year: 3-10 average
- Engines: 500-1000 unique engines total
Processing Recommendations
Order of Operations
- Load makes - Create make records with normalized names
- Load models - Associate with correct make_id
- Load model_years - Create year availability
- Parse and load engines - Handle L→I normalization
- Load trims - Associate with model_year_id
- Create trim_engine relationships
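The ordering above can be illustrated with an in-memory sketch in which dictionaries stand in for database tables. Everything here is an assumption for illustration: the table and function names are hypothetical, and pairing every trim with every engine of the same model-year is a guess, since the source files do not say which trims use which engines.

```python
import re
from itertools import count

UPPERCASE_MAKES = {"bmw", "gmc"}  # special cases named in this document


def normalize_make(key: str) -> str:
    return key.upper() if key in UPPERCASE_MAKES else key.replace("_", " ").title()


def normalize_engine(name: str) -> str:
    # "1.5L L3 ..." -> "1.5L I3 ..."; the "L" in "1.5L" follows a digit, so \b skips it.
    return re.sub(r"\bL(\d)", r"I\1", name)


def load_make_file(doc: dict) -> dict:
    """Load one parsed make file into in-memory tables, in dependency order."""
    tables = {t: {} for t in ("make", "model", "model_year", "engine", "trim")}
    tables["trim_engine"] = set()
    next_id = count(1)

    def upsert(table: str, key) -> int:
        # Reuse the id of an existing natural key instead of inserting a duplicate.
        rows = tables[table]
        if key not in rows:
            rows[key] = next(next_id)
        return rows[key]

    flat = [
        (normalize_make(make_key), int(entry["year"]), model)
        for make_key, entries in doc.items()
        for entry in entries
        for model in entry.get("models", [])
    ]

    def model_id(make, model):
        return tables["model"][(tables["make"][make], model["name"])]

    for make_key in doc:                                   # 1. makes
        upsert("make", normalize_make(make_key))
    for make, _, model in flat:                            # 2. models
        upsert("model", (tables["make"][make], model["name"]))
    for make, year, model in flat:                         # 3. model_years
        upsert("model_year", (model_id(make, model), year))
    for _, _, model in flat:                               # 4. engines (L -> I)
        for engine in model.get("engines", []):
            upsert("engine", normalize_engine(engine))
    for make, year, model in flat:                         # 5. trims
        my_id = tables["model_year"][(model_id(make, model), year)]
        for trim in model.get("submodels", []):
            upsert("trim", (my_id, trim))
    for make, year, model in flat:                         # 6. trim_engine
        my_id = tables["model_year"][(model_id(make, model), year)]
        for trim in model.get("submodels", []):
            for engine in model.get("engines", []):
                tables["trim_engine"].add(
                    (tables["trim"][(my_id, trim)],
                     tables["engine"][normalize_engine(engine)])
                )
    return tables


SAMPLE_DOC = {"example_make": [{"year": "2024", "models": [{
    "name": "model_name",
    "engines": ["2.0L I4", "1.5L L3 PLUG-IN HYBRID EV- (PHEV)"],
    "submodels": ["Base", "Premium"],
}]}]}
tables = load_make_file(SAMPLE_DOC)
```

Running each step as its own pass means every foreign key (make_id, model_year_id, and so on) already exists by the time a dependent row is created.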
Error Handling Requirements
- Handle empty engines arrays (electric vehicles)
- Validate engine parsing (log unparseable engines)
- Handle duplicate records (upsert strategy)
- Report data quality issues (missing data, parsing failures)
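These requirements could be combined as in the sketch below, which uses an in-memory SQLite database as a stand-in for the real one. The `engine` table, the logger name, and `load_engines` are hypothetical, and `INSERT OR IGNORE` is used as the simplest duplicate-safe insert; a real upsert might update existing rows instead.

```python
import logging
import re
import sqlite3

logger = logging.getLogger("vehicle_etl")  # hypothetical logger name

# Leading {displacement}L {config}{cylinders}; anything else counts as unparseable.
ENGINE_RE = re.compile(r"^\d+(?:\.\d+)?L\s+[IVHL]\d+\b")


def load_engines(conn: sqlite3.Connection, models: list) -> list:
    """Insert parseable engines once each; return a list of data quality issues."""
    issues = []
    for model in models:
        engines = model.get("engines") or []
        if not engines:
            # Empty array: typically an electric vehicle, reported rather than failed.
            issues.append(("empty_engines", model["name"]))
            continue
        for name in engines:
            if not ENGINE_RE.match(name):
                logger.warning("Unparseable engine %r on model %r", name, model["name"])
                issues.append(("unparseable_engine", name))
                continue
            # Duplicate names are skipped thanks to the UNIQUE constraint.
            conn.execute("INSERT OR IGNORE INTO engine (name) VALUES (?)", (name,))
    conn.commit()
    return issues


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engine (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")
sample_models = [
    {"name": "3", "engines": []},
    {"name": "model_name", "engines": ["2.0L I4", "2.0L I4", "???"]},
]
issues = load_engines(conn, sample_models)
engine_count = conn.execute("SELECT COUNT(*) FROM engine").fetchone()[0]
```

Collecting issues in a list rather than raising keeps a single bad string from aborting the load of an entire make file.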
Validation Strategy
- Cross-reference makes with existing sources/makes.json
- Validate engine parsing with regex patterns
- Check referential integrity during loading
- Report statistics per make (models, engines, trims loaded)
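The cross-reference and per-make statistics might look like the sketch below. Because the format of sources/makes.json is not described here, the known makes are passed in as a plain list of display names; `unknown_makes` and `make_stats` are hypothetical helper names.

```python
def unknown_makes(display_names, known_makes) -> list:
    """Return display names with no case-insensitive match in the known list."""
    known = {name.casefold() for name in known_makes}
    return sorted(name for name in display_names if name.casefold() not in known)


def make_stats(doc: dict) -> dict:
    """Count distinct models, engines, and trims per make in one parsed file."""
    stats = {}
    for make_key, entries in doc.items():
        models, engines, trims = set(), set(), set()
        for entry in entries:
            for model in entry.get("models", []):
                models.add(model["name"])
                engines.update(model.get("engines", []))
                trims.update(model.get("submodels", []))
        stats[make_key] = {
            "models": len(models),
            "engines": len(engines),
            "trims": len(trims),
        }
    return stats
```

Counting distinct values rather than raw occurrences keeps the figures comparable with the unique-record estimates given earlier.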