Initial Commit

This commit is contained in:
Eric Gullickson
2025-09-17 16:09:15 -05:00
parent 0cdb9803de
commit a052040e3a
373 changed files with 437090 additions and 6773 deletions

View File

@@ -0,0 +1,203 @@
# Analysis Findings - JSON Vehicle Data
## Data Source Overview
- **Location**: `mvp-platform-services/vehicles/etl/sources/makes/`
- **File Count**: 55 JSON files
- **File Naming**: Lowercase with underscores (e.g., `alfa_romeo.json`, `land_rover.json`)
- **Data Structure**: Hierarchical vehicle data by make
## JSON File Structure Analysis
### Standard Structure
```json
{
"[make_name]": [
{
"year": "2024",
"models": [
{
"name": "model_name",
"engines": [
"2.0L I4",
"3.5L V6 TURBO"
],
"submodels": [
"Base",
"Premium",
"Limited"
]
}
]
}
]
}
```
### Key Data Points
1. **Make Level**: Root key matches filename (lowercase)
2. **Year Level**: Array of yearly data
3. **Model Level**: Array of models per year
4. **Engines**: Array of engine specifications
5. **Submodels**: Array of trim levels
## Make Name Analysis
### File Naming vs Display Name Issues
| Filename | Required Display Name | Issue |
|----------|---------------------|--------|
| `alfa_romeo.json` | "Alfa Romeo" | Underscore → space, title case |
| `land_rover.json` | "Land Rover" | Underscore → space, title case |
| `rolls_royce.json` | "Rolls Royce" | Underscore → space, title case |
| `chevrolet.json` | "Chevrolet" | Direct match |
| `bmw.json` | "BMW" | Uppercase required |
### Make Name Normalization Rules
1. **Replace underscores** with spaces
2. **Title case** each word
3. **Special cases**: BMW, GMC (all caps)
4. **Validation**: Cross-reference with `sources/makes.json`
## Engine Specification Analysis
### Discovered Engine Patterns
From analysis of Nissan, Toyota, Ford, Subaru, and Porsche files:
#### Standard Format: `{displacement}L {config}{cylinders}`
- `"2.0L I4"` - 2.0 liter, Inline 4-cylinder
- `"3.5L V6"` - 3.5 liter, V6 configuration
- `"2.4L H4"` - 2.4 liter, Horizontal (Boxer) 4-cylinder
#### Configuration Types Found
- **I** = Inline (most common)
- **V** = V-configuration
- **H** = Horizontal/Boxer (Subaru, Porsche)
- **L** = **MUST BE TREATED AS INLINE** (L3 → I3)
### Engine Modifier Patterns
#### Hybrid Classifications
- `"PLUG-IN HYBRID EV- (PHEV)"` - Plug-in hybrid electric vehicle
- `"FULL HYBRID EV- (FHEV)"` - Full hybrid electric vehicle
- `"HYBRID"` - General hybrid designation
#### Fuel Type Modifiers
- `"FLEX"` - Flex-fuel capability (e.g., `"5.6L V8 FLEX"`)
- `"ELECTRIC"` - Pure electric motor
- `"TURBO"` - Turbocharged (less common in current data)
#### Example Engine Strings
```
"2.5L I4 FULL HYBRID EV- (FHEV)"
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)" // L3 → I3
"5.6L V8 FLEX"
"2.4L H4" // Subaru Boxer
"1.8L I4 ELECTRIC"
```
## Special Cases Analysis
### Electric Vehicle Handling
**Tesla Example** (`tesla.json`):
```json
{
"name": "3",
"engines": [], // Empty array
"submodels": ["Long Range AWD", "Performance"]
}
```
**Lucid Example** (`lucid.json`):
```json
{
"name": "air",
"engines": [], // Empty array
"submodels": []
}
```
#### Electric Vehicle Requirements
- **Empty engines arrays** are common for pure electric vehicles
- **Must create default engine**: `"Electric Motor"` with appropriate specs
- **Fuel type**: `"Electric"`
- **Configuration**: `null` or `"Electric"`
### Hybrid Vehicle Patterns
From Toyota analysis - hybrid appears in both engines and submodels:
- **Engine level**: `"1.8L I4 ELECTRIC"`
- **Submodel level**: `"Hybrid LE"`, `"Hybrid XSE"`
## Data Quality Issues Found
### Missing Engine Data
- **Tesla models**: Consistently empty engines arrays
- **Lucid models**: Empty engines arrays
- **Some Nissan models**: Empty engines for electric variants
### Inconsistent Submodel Data
- **Mix of trim levels and descriptors**
- **Some technical specifications** in submodel names
- **Inconsistent naming patterns** across makes
### Engine Specification Inconsistencies
- **L-configuration usage**: Should be normalized to I (Inline)
- **Mixed hybrid notation**: Sometimes in engine string, sometimes separate
- **Abbreviation variations**: EV- vs EV, FHEV vs FULL HYBRID
## Database Mapping Strategy
### Make Mapping
```
Filename: "alfa_romeo.json" → Database: "Alfa Romeo"
```
### Model Mapping
```
JSON models.name → vehicles.model.name
```
### Engine Mapping
```
JSON engines[0] → vehicles.engine.name (with parsing)
Engine parsing → displacement_l, cylinders, fuel_type, aspiration
```
### Trim Mapping
```
JSON submodels[0] → vehicles.trim.name
```
## Data Volume Estimates
### File Size Analysis
- **Largest files**: `toyota.json` (~748KB), `volkswagen.json` (~738KB)
- **Smallest files**: `lucid.json` (~176B), `rivian.json` (~177B)
- **Average file size**: ~150KB
### Record Estimates (Based on Sample Analysis)
- **Makes**: 55 (one per file)
- **Models per make**: 5-50 (highly variable)
- **Years per model**: 10-15 years average
- **Trims per model-year**: 3-10 average
- **Engines**: 500-1000 unique engines total
## Processing Recommendations
### Order of Operations
1. **Load makes** - Create make records with normalized names
2. **Load models** - Associate with correct make_id
3. **Load model_years** - Create year availability
4. **Parse and load engines** - Handle L→I normalization
5. **Load trims** - Associate with model_year_id
6. **Create trim_engine relationships**
### Error Handling Requirements
- **Handle empty engines arrays** (electric vehicles)
- **Validate engine parsing** (log unparseable engines)
- **Handle duplicate records** (upsert strategy)
- **Report data quality issues** (missing data, parsing failures)
## Validation Strategy
- **Cross-reference makes** with existing `sources/makes.json`
- **Validate engine parsing** with regex patterns
- **Check referential integrity** during loading
- **Report statistics** per make (models, engines, trims loaded)

View File

@@ -0,0 +1,307 @@
# Implementation Plan - Manual JSON ETL
## Implementation Overview
Add manual JSON processing capability to the existing MVP Platform Vehicles ETL system without disrupting the current MSSQL-based pipeline.
## Development Phases
### Phase 1: Core Utilities ⏳
**Objective**: Create foundational utilities for JSON processing
#### 1.1 Make Name Mapper (`etl/utils/make_name_mapper.py`)
```python
class MakeNameMapper:
def normalize_make_name(self, filename: str) -> str:
"""Convert 'alfa_romeo' to 'Alfa Romeo'"""
def get_display_name_mapping(self) -> Dict[str, str]:
"""Get complete filename -> display name mapping"""
def validate_against_sources(self) -> List[str]:
"""Cross-validate with sources/makes.json"""
```
**Implementation Requirements**:
- Handle underscore → space conversion
- Title case each word
- Special cases: BMW, GMC (all caps)
- Validation against existing `sources/makes.json`
#### 1.2 Engine Spec Parser (`etl/utils/engine_spec_parser.py`)
```python
@dataclass
class EngineSpec:
displacement_l: float
configuration: str # I, V, H
cylinders: int
fuel_type: str # Gasoline, Hybrid, Electric, Flex Fuel
aspiration: str # Natural, Turbo, Supercharged
raw_string: str
class EngineSpecParser:
def parse_engine_string(self, engine_str: str) -> EngineSpec:
"""Parse '2.0L I4 PLUG-IN HYBRID EV- (PHEV)' into components"""
def normalize_configuration(self, config: str) -> str:
"""Convert L → I (L3 becomes I3)"""
def extract_fuel_type(self, engine_str: str) -> str:
"""Extract fuel type from modifiers"""
```
**Implementation Requirements**:
- **CRITICAL**: L-configuration → I (Inline) normalization
- Regex patterns for standard format: `{displacement}L {config}{cylinders}`
- Hybrid/electric detection: PHEV, FHEV, ELECTRIC patterns
- Flex-fuel detection: FLEX modifier
- Handle parsing failures gracefully
### Phase 2: Data Extraction ⏳
**Objective**: Extract data from JSON files into normalized structures
#### 2.1 JSON Extractor (`etl/extractors/json_extractor.py`)
```python
class JsonExtractor:
def __init__(self, make_mapper: MakeNameMapper,
engine_parser: EngineSpecParser):
pass
def extract_make_data(self, json_file_path: str) -> MakeData:
"""Extract complete make data from JSON file"""
def extract_all_makes(self, sources_dir: str) -> List[MakeData]:
"""Process all JSON files in directory"""
def validate_json_structure(self, json_data: dict) -> ValidationResult:
"""Validate JSON structure before processing"""
```
**Data Structures**:
```python
@dataclass
class MakeData:
name: str # Normalized display name
models: List[ModelData]
@dataclass
class ModelData:
name: str
years: List[int]
engines: List[EngineSpec]
trims: List[str] # From submodels
```
#### 2.2 Electric Vehicle Handler
```python
class ElectricVehicleHandler:
def create_default_engine(self) -> EngineSpec:
"""Create default 'Electric Motor' engine for empty arrays"""
def is_electric_vehicle(self, model_data: ModelData) -> bool:
"""Detect electric vehicles by empty engines + make patterns"""
```
### Phase 3: Data Loading ⏳
**Objective**: Load JSON-extracted data into PostgreSQL
#### 3.1 JSON Manual Loader (`etl/loaders/json_manual_loader.py`)
```python
class JsonManualLoader:
def __init__(self, postgres_loader: PostgreSQLLoader):
pass
def load_make_data(self, make_data: MakeData, mode: LoadMode):
"""Load complete make data with referential integrity"""
def load_all_makes(self, makes_data: List[MakeData],
mode: LoadMode) -> LoadResult:
"""Batch load all makes with progress tracking"""
def handle_duplicates(self, table: str, data: List[Dict]) -> int:
"""Handle duplicate records based on natural keys"""
```
**Load Modes**:
- **CLEAR**: `TRUNCATE CASCADE` then insert (destructive)
- **APPEND**: Insert with `ON CONFLICT DO NOTHING` (safe)
#### 3.2 Extend PostgreSQL Loader
Enhance `etl/loaders/postgres_loader.py` with JSON-specific methods:
```python
def load_json_makes(self, makes: List[Dict], clear_existing: bool) -> int
def load_json_engines(self, engines: List[EngineSpec], clear_existing: bool) -> int
def create_model_year_relationships(self, model_years: List[Dict]) -> int
```
### Phase 4: Pipeline Integration ⏳
**Objective**: Create manual JSON processing pipeline
#### 4.1 Manual JSON Pipeline (`etl/pipelines/manual_json_pipeline.py`)
```python
class ManualJsonPipeline:
def __init__(self, sources_dir: str):
self.extractor = JsonExtractor(...)
self.loader = JsonManualLoader(...)
def run_manual_pipeline(self, mode: LoadMode,
specific_make: Optional[str] = None) -> PipelineResult:
"""Complete JSON → PostgreSQL pipeline"""
def validate_before_load(self) -> ValidationReport:
"""Pre-flight validation of all JSON files"""
def generate_load_report(self) -> LoadReport:
"""Post-load statistics and data quality report"""
```
#### 4.2 Pipeline Result Tracking
```python
@dataclass
class PipelineResult:
success: bool
makes_processed: int
models_loaded: int
engines_loaded: int
trims_loaded: int
errors: List[str]
warnings: List[str]
duration: timedelta
```
### Phase 5: CLI Integration ⏳
**Objective**: Add CLI commands for manual processing
#### 5.1 Main CLI Updates (`etl/main.py`)
```python
@cli.command()
@click.option('--mode', type=click.Choice(['clear', 'append']),
default='append', help='Load mode')
@click.option('--make', help='Process specific make only')
@click.option('--validate-only', is_flag=True,
help='Validate JSON files without loading')
def load_manual(mode, make, validate_only):
"""Load vehicle data from JSON files"""
@cli.command()
def validate_json():
"""Validate all JSON files structure and data quality"""
```
#### 5.2 Configuration Updates (`etl/config.py`)
```python
# JSON Processing settings
JSON_SOURCES_DIR: str = "sources/makes"
MANUAL_LOAD_DEFAULT_MODE: str = "append"
ELECTRIC_DEFAULT_ENGINE: str = "Electric Motor"
ENGINE_PARSING_STRICT: bool = False # Log vs fail on parse errors
```
### Phase 6: Testing & Validation ⏳
**Objective**: Comprehensive testing and validation
#### 6.1 Unit Tests
- `test_make_name_mapper.py` - Make name normalization
- `test_engine_spec_parser.py` - Engine parsing with L→I normalization
- `test_json_extractor.py` - JSON data extraction
- `test_manual_loader.py` - Database loading
#### 6.2 Integration Tests
- `test_manual_pipeline.py` - End-to-end JSON processing
- `test_api_integration.py` - Verify API endpoints work with JSON data
- `test_data_quality.py` - Data quality validation
#### 6.3 Data Validation Scripts
```python
# examples/validate_all_json.py
def validate_all_makes() -> ValidationReport:
"""Validate all 55 JSON files and report issues"""
# examples/compare_data_sources.py
def compare_mssql_vs_json() -> ComparisonReport:
"""Compare MSSQL vs JSON data for overlapping makes"""
```
## File Structure Changes
### New Files to Create
```
etl/
├── utils/
│ ├── make_name_mapper.py # Make name normalization
│ └── engine_spec_parser.py # Engine specification parsing
├── extractors/
│ └── json_extractor.py # JSON data extraction
├── loaders/
│ └── json_manual_loader.py # JSON-specific data loading
└── pipelines/
└── manual_json_pipeline.py # JSON processing pipeline
```
### Files to Modify
```
etl/
├── main.py # Add load-manual command
├── config.py # Add JSON processing config
└── loaders/
└── postgres_loader.py # Extend for JSON data types
```
## Implementation Order
### Week 1: Foundation
1. ✅ Create documentation structure
2. ⏳ Implement `MakeNameMapper` with validation
3. ⏳ Implement `EngineSpecParser` with L→I normalization
4. ⏳ Unit tests for utilities
### Week 2: Data Processing
1. ⏳ Implement `JsonExtractor` with validation
2. ⏳ Implement `ElectricVehicleHandler`
3. ⏳ Create data structures and type definitions
4. ⏳ Integration tests for extraction
### Week 3: Data Loading
1. ⏳ Implement `JsonManualLoader` with clear/append modes
2. ⏳ Extend `PostgreSQLLoader` for JSON data types
3. ⏳ Implement duplicate handling strategy
4. ⏳ Database integration tests
### Week 4: Pipeline & CLI
1. ⏳ Implement `ManualJsonPipeline`
2. ⏳ Add CLI commands with options
3. ⏳ Add configuration management
4. ⏳ End-to-end testing
### Week 5: Validation & Polish
1. ⏳ Comprehensive data validation
2. ⏳ Performance testing with all 55 files
3. ⏳ Error handling improvements
4. ⏳ Documentation completion
## Success Metrics
- [ ] Process all 55 JSON files without errors
- [ ] Correct make name normalization (alfa_romeo → Alfa Romeo)
- [ ] Engine parsing with L→I normalization working
- [ ] Electric vehicle handling (default engines created)
- [ ] Clear/append modes working correctly
- [ ] API endpoints return data from JSON sources
- [ ] Performance acceptable (<5 minutes for full load)
- [ ] Comprehensive error reporting and logging
## Risk Mitigation
### Data Quality Risks
- **Mitigation**: Extensive validation before loading
- **Fallback**: Report data quality issues, continue processing
### Performance Risks
- **Mitigation**: Batch processing, progress tracking
- **Fallback**: Process makes individually if batch fails
### Schema Compatibility Risks
- **Mitigation**: Thorough testing against existing schema
- **Fallback**: Schema migration scripts if needed
### Integration Risks
- **Mitigation**: Maintain existing MSSQL pipeline compatibility
- **Fallback**: Feature flag to disable JSON processing

View File

@@ -0,0 +1,262 @@
# Engine Specification Parsing Rules
## Overview
Comprehensive rules for parsing engine specifications from JSON files into PostgreSQL engine table structure.
## Standard Engine Format
### Pattern: `{displacement}L {configuration}{cylinders} {modifiers}`
Examples:
- `"2.0L I4"` → 2.0L, Inline, 4-cylinder
- `"3.5L V6 TURBO"` → 3.5L, V6, Turbocharged
- `"1.5L L3 PLUG-IN HYBRID EV- (PHEV)"` → 1.5L, **Inline** (L→I), 3-cyl, Plug-in Hybrid
## Configuration Normalization Rules
### CRITICAL: L-Configuration Handling
**L-configurations MUST be treated as Inline (I)**
| Input | Normalized | Reasoning |
|-------|------------|-----------|
| `"1.5L L3"` | `"1.5L I3"` | L3 is alternate notation for Inline 3-cylinder |
| `"2.0L L4"` | `"2.0L I4"` | L4 is alternate notation for Inline 4-cylinder |
| `"1.2L L3 FULL HYBRID EV- (FHEV)"` | `"1.2L I3"` + Hybrid | L→I normalization + hybrid flag |
### Configuration Types
- **I** = Inline (most common)
- **V** = V-configuration
- **H** = Horizontal/Boxer (Subaru, Porsche)
- **L** = **Convert to I** (alternate Inline notation)
## Engine Parsing Implementation
### Regex Patterns
```python
# Primary engine pattern
ENGINE_PATTERN = r'(\d+\.?\d*)L\s+([IVHL])(\d+)'
# Modifier patterns
HYBRID_PATTERNS = [
r'PLUG-IN HYBRID EV-?\s*\(PHEV\)',
r'FULL HYBRID EV-?\s*\(FHEV\)',
r'HYBRID'
]
FUEL_PATTERNS = [
r'FLEX',
r'ELECTRIC',
r'TURBO',
r'SUPERCHARGED'
]
```
### Parsing Algorithm
```python
def parse_engine_string(engine_str: str) -> EngineSpec:
# 1. Extract base components (displacement, config, cylinders)
match = re.match(ENGINE_PATTERN, engine_str)
displacement = float(match.group(1))
config = normalize_configuration(match.group(2)) # L→I here
cylinders = int(match.group(3))
# 2. Detect fuel type and aspiration from modifiers
fuel_type = extract_fuel_type(engine_str)
aspiration = extract_aspiration(engine_str)
return EngineSpec(
displacement_l=displacement,
configuration=config,
cylinders=cylinders,
fuel_type=fuel_type,
aspiration=aspiration,
raw_string=engine_str
)
def normalize_configuration(config: str) -> str:
"""CRITICAL: Convert L to I"""
return 'I' if config == 'L' else config
```
## Fuel Type Detection
### Hybrid Classifications
| Pattern | Database Value | Description |
|---------|---------------|-------------|
| `"PLUG-IN HYBRID EV- (PHEV)"` | `"Plug-in Hybrid"` | Plug-in hybrid electric |
| `"FULL HYBRID EV- (FHEV)"` | `"Full Hybrid"` | Full hybrid electric |
| `"HYBRID"` | `"Hybrid"` | General hybrid |
### Other Fuel Types
| Pattern | Database Value | Description |
|---------|---------------|-------------|
| `"FLEX"` | `"Flex Fuel"` | Flex-fuel capability |
| `"ELECTRIC"` | `"Electric"` | Pure electric |
| No modifier | `"Gasoline"` | Default assumption |
## Aspiration Detection
### Forced Induction
| Pattern | Database Value | Description |
|---------|---------------|-------------|
| `"TURBO"` | `"Turbocharged"` | Turbocharged engine |
| `"SUPERCHARGED"` | `"Supercharged"` | Supercharged engine |
| `"SC"` | `"Supercharged"` | Supercharged (short form) |
| No modifier | `"Natural"` | Naturally aspirated |
## Real-World Examples
### Standard Engines
```
Input: "2.0L I4"
Output: EngineSpec(
displacement_l=2.0,
configuration="I",
cylinders=4,
fuel_type="Gasoline",
aspiration="Natural",
raw_string="2.0L I4"
)
```
### L→I Normalization Example
```
Input: "1.5L L3 PLUG-IN HYBRID EV- (PHEV)"
Output: EngineSpec(
displacement_l=1.5,
configuration="I", # L normalized to I
cylinders=3,
fuel_type="Plug-in Hybrid",
aspiration="Natural",
raw_string="1.5L L3 PLUG-IN HYBRID EV- (PHEV)"
)
```
### Subaru Boxer Engine
```
Input: "2.4L H4"
Output: EngineSpec(
displacement_l=2.4,
configuration="H", # Horizontal/Boxer
cylinders=4,
fuel_type="Gasoline",
aspiration="Natural",
raw_string="2.4L H4"
)
```
### Flex Fuel Engine
```
Input: "5.6L V8 FLEX"
Output: EngineSpec(
displacement_l=5.6,
configuration="V",
cylinders=8,
fuel_type="Flex Fuel",
aspiration="Natural",
raw_string="5.6L V8 FLEX"
)
```
## Electric Vehicle Handling
### Empty Engines Arrays
When `engines: []` is found (common in Tesla, Lucid):
```python
def create_default_electric_engine() -> EngineSpec:
return EngineSpec(
displacement_l=None, # N/A for electric
configuration="Electric", # Special designation
cylinders=None, # N/A for electric
fuel_type="Electric",
aspiration=None, # N/A for electric
raw_string="Electric Motor"
)
```
### Electric Motor Naming
Default name: `"Electric Motor"`
## Error Handling
### Unparseable Engines
For engines that don't match standard patterns:
1. **Log warning** with original string
2. **Create fallback engine** with raw_string preserved
3. **Continue processing** (don't fail entire make)
```python
def create_fallback_engine(raw_string: str) -> EngineSpec:
return EngineSpec(
displacement_l=None,
configuration="Unknown",
cylinders=None,
fuel_type="Unknown",
aspiration="Natural",
raw_string=raw_string
)
```
### Validation Rules
1. **Displacement**: Must be positive number if present
2. **Configuration**: Must be I, V, H, or Electric
3. **Cylinders**: Must be positive integer if present
4. **Required**: At least raw_string must be preserved
## Database Storage
### Engine Table Mapping
```sql
INSERT INTO vehicles.engine (
name, -- Original string or "Electric Motor"
code, -- NULL (not available in JSON)
displacement_l, -- Parsed displacement
cylinders, -- Parsed cylinder count
fuel_type, -- Parsed or "Gasoline" default
aspiration -- Parsed or "Natural" default
)
```
### Example Database Records
```sql
-- Standard engine
('2.0L I4', NULL, 2.0, 4, 'Gasoline', 'Natural')
-- L→I normalized
('1.5L I3', NULL, 1.5, 3, 'Plug-in Hybrid', 'Natural')
-- Electric vehicle
('Electric Motor', NULL, NULL, NULL, 'Electric', NULL)
-- Subaru Boxer
('2.4L H4', NULL, 2.4, 4, 'Gasoline', 'Natural')
```
## Testing Requirements
### Unit Test Cases
1. **L→I normalization**: `"1.5L L3"``configuration="I"`
2. **Hybrid detection**: All PHEV, FHEV, HYBRID patterns
3. **Configuration types**: I, V, H preservation
4. **Electric vehicles**: Empty array handling
5. **Error cases**: Unparseable strings
6. **Edge cases**: Missing displacement, unusual formats
### Integration Test Cases
1. **Real JSON data**: Process actual make files
2. **Database storage**: Verify correct database records
3. **API compatibility**: Ensure dropdown endpoints work
4. **Performance**: Parse 1000+ engines efficiently
## Future Considerations
### Potential Enhancements
1. **Turbo detection**: More sophisticated forced induction parsing
2. **Engine codes**: Extract manufacturer engine codes where available
3. **Performance specs**: Parse horsepower/torque if present in future data
4. **Validation**: Cross-reference with automotive databases
### Backwards Compatibility
- **MSSQL pipeline**: Must continue working unchanged
- **API responses**: Same format regardless of data source
- **Database schema**: No breaking changes required

View File

@@ -0,0 +1,331 @@
# Make Name Mapping Documentation
## Overview
Rules and implementation for converting JSON filename conventions to proper display names in the database.
## Problem Statement
JSON files use lowercase filenames with underscores, but database and API require proper display names:
- `alfa_romeo.json``"Alfa Romeo"`
- `land_rover.json``"Land Rover"`
- `rolls_royce.json``"Rolls Royce"`
## Normalization Rules
### Standard Transformation
1. **Remove .json extension**
2. **Replace underscores** with spaces
3. **Apply title case** to each word
4. **Apply special case exceptions**
### Implementation Algorithm
```python
def normalize_make_name(filename: str) -> str:
# Remove .json extension
base_name = filename.replace('.json', '')
# Replace underscores with spaces
spaced_name = base_name.replace('_', ' ')
# Apply title case
title_cased = spaced_name.title()
# Apply special cases
return apply_special_cases(title_cased)
```
## Complete Filename Mapping
### Multi-Word Makes (Underscore Conversion)
| Filename | Display Name | Notes |
|----------|-------------|-------|
| `alfa_romeo.json` | `"Alfa Romeo"` | Italian brand |
| `aston_martin.json` | `"Aston Martin"` | British luxury |
| `land_rover.json` | `"Land Rover"` | British SUV brand |
| `rolls_royce.json` | `"Rolls Royce"` | Ultra-luxury brand |
### Single-Word Makes (Standard Title Case)
| Filename | Display Name | Notes |
|----------|-------------|-------|
| `acura.json` | `"Acura"` | Honda luxury division |
| `audi.json` | `"Audi"` | German luxury |
| `bentley.json` | `"Bentley"` | British luxury |
| `bmw.json` | `"BMW"` | **Special case - all caps** |
| `buick.json` | `"Buick"` | GM luxury |
| `cadillac.json` | `"Cadillac"` | GM luxury |
| `chevrolet.json` | `"Chevrolet"` | GM mainstream |
| `chrysler.json` | `"Chrysler"` | Stellantis brand |
| `dodge.json` | `"Dodge"` | Stellantis performance |
| `ferrari.json` | `"Ferrari"` | Italian supercar |
| `fiat.json` | `"Fiat"` | Italian mainstream |
| `ford.json` | `"Ford"` | American mainstream |
| `genesis.json` | `"Genesis"` | Hyundai luxury |
| `geo.json` | `"Geo"` | GM defunct brand |
| `gmc.json` | `"GMC"` | **Special case - all caps** |
| `honda.json` | `"Honda"` | Japanese mainstream |
| `hummer.json` | `"Hummer"` | GM truck brand |
| `hyundai.json` | `"Hyundai"` | Korean mainstream |
| `infiniti.json` | `"Infiniti"` | Nissan luxury |
| `isuzu.json` | `"Isuzu"` | Japanese commercial |
| `jaguar.json` | `"Jaguar"` | British luxury |
| `jeep.json` | `"Jeep"` | Stellantis SUV |
| `kia.json` | `"Kia"` | Korean mainstream |
| `lamborghini.json` | `"Lamborghini"` | Italian supercar |
| `lexus.json` | `"Lexus"` | Toyota luxury |
| `lincoln.json` | `"Lincoln"` | Ford luxury |
| `lotus.json` | `"Lotus"` | British sports car |
| `lucid.json` | `"Lucid"` | American electric luxury |
| `maserati.json` | `"Maserati"` | Italian luxury |
| `mazda.json` | `"Mazda"` | Japanese mainstream |
| `mclaren.json` | `"McLaren"` | **Special case - capital L** |
| `mercury.json` | `"Mercury"` | Ford defunct luxury |
| `mini.json` | `"MINI"` | **Special case - all caps** |
| `mitsubishi.json` | `"Mitsubishi"` | Japanese mainstream |
| `nissan.json` | `"Nissan"` | Japanese mainstream |
| `oldsmobile.json` | `"Oldsmobile"` | GM defunct |
| `plymouth.json` | `"Plymouth"` | Chrysler defunct |
| `polestar.json` | `"Polestar"` | Volvo electric |
| `pontiac.json` | `"Pontiac"` | GM defunct performance |
| `porsche.json` | `"Porsche"` | German sports car |
| `ram.json` | `"Ram"` | Stellantis trucks |
| `rivian.json` | `"Rivian"` | American electric trucks |
| `saab.json` | `"Saab"` | Swedish defunct |
| `saturn.json` | `"Saturn"` | GM defunct |
| `scion.json` | `"Scion"` | Toyota defunct youth |
| `smart.json` | `"Smart"` | Mercedes micro car |
| `subaru.json` | `"Subaru"` | Japanese AWD |
| `tesla.json` | `"Tesla"` | American electric |
| `toyota.json` | `"Toyota"` | Japanese mainstream |
| `volkswagen.json` | `"Volkswagen"` | German mainstream |
| `volvo.json` | `"Volvo"` | Swedish luxury |
## Special Cases Implementation
### All Caps Brands
```python
SPECIAL_CASES = {
'Bmw': 'BMW', # Bayerische Motoren Werke
'Gmc': 'GMC', # General Motors Company
'Mini': 'MINI', # Brand stylization
}
```
### Custom Capitalizations
```python
CUSTOM_CAPS = {
'Mclaren': 'McLaren', # Scottish naming convention
}
```
### Complete Special Cases Function
```python
def apply_special_cases(title_cased_name: str) -> str:
"""Apply brand-specific capitalization rules"""
special_cases = {
'Bmw': 'BMW',
'Gmc': 'GMC',
'Mini': 'MINI',
'Mclaren': 'McLaren'
}
return special_cases.get(title_cased_name, title_cased_name)
```
## Validation Strategy
### Cross-Reference with sources/makes.json
The existing `mvp-platform-services/vehicles/etl/sources/makes.json` contains the authoritative list:
```json
{
"manufacturers": [
"Acura", "Alfa Romeo", "Aston Martin", "Audi", "BMW",
"Bentley", "Buick", "Cadillac", "Chevrolet", "Chrysler",
...
]
}
```
### Validation Implementation
```python
class MakeNameMapper:
def __init__(self):
self.authoritative_makes = self.load_authoritative_makes()
def load_authoritative_makes(self) -> Set[str]:
"""Load makes list from sources/makes.json"""
with open('sources/makes.json') as f:
data = json.load(f)
return set(data['manufacturers'])
def validate_mapping(self, filename: str, display_name: str) -> bool:
"""Validate mapped name against authoritative list"""
return display_name in self.authoritative_makes
def get_validation_report(self) -> ValidationReport:
"""Generate complete validation report"""
mismatches = []
json_files = glob.glob('sources/makes/*.json')
for file_path in json_files:
filename = os.path.basename(file_path)
mapped_name = self.normalize_make_name(filename)
if not self.validate_mapping(filename, mapped_name):
mismatches.append({
'filename': filename,
'mapped_name': mapped_name,
'status': 'NOT_FOUND_IN_AUTHORITATIVE'
})
return ValidationReport(mismatches=mismatches)
```
## Error Handling
### Unknown Files
For JSON files not in the authoritative list:
1. **Log warning** with filename and mapped name
2. **Proceed with mapping** (don't fail)
3. **Include in validation report**
### Filename Edge Cases
```python
def handle_edge_cases(filename: str) -> str:
"""Handle unusual filename patterns"""
# Remove multiple underscores
cleaned = re.sub(r'_+', '_', filename)
# Handle special characters (future-proofing)
cleaned = re.sub(r'[^a-zA-Z0-9_]', '', cleaned)
return cleaned
```
## Testing Requirements
### Unit Tests
```python
def test_standard_mapping():
mapper = MakeNameMapper()
assert mapper.normalize_make_name('toyota.json') == 'Toyota'
assert mapper.normalize_make_name('alfa_romeo.json') == 'Alfa Romeo'
def test_special_cases():
mapper = MakeNameMapper()
assert mapper.normalize_make_name('bmw.json') == 'BMW'
assert mapper.normalize_make_name('gmc.json') == 'GMC'
assert mapper.normalize_make_name('mclaren.json') == 'McLaren'
def test_validation():
mapper = MakeNameMapper()
assert mapper.validate_mapping('toyota.json', 'Toyota') == True
assert mapper.validate_mapping('fake.json', 'Fake Brand') == False
```
### Integration Tests
1. **Process all 55 files**: Ensure all map correctly
2. **Database integration**: Verify display names in database
3. **API response**: Confirm proper names in dropdown responses
## Implementation Class
### Complete MakeNameMapper Class
```python
import json
import glob
import os
from typing import Set, Dict, List
from dataclasses import dataclass
@dataclass
class ValidationReport:
mismatches: List[Dict[str, str]]
total_files: int
valid_mappings: int
@property
def success_rate(self) -> float:
return self.valid_mappings / self.total_files if self.total_files > 0 else 0.0
class MakeNameMapper:
def __init__(self, sources_dir: str = 'sources'):
self.sources_dir = sources_dir
self.authoritative_makes = self.load_authoritative_makes()
self.special_cases = {
'Bmw': 'BMW',
'Gmc': 'GMC',
'Mini': 'MINI',
'Mclaren': 'McLaren'
}
def normalize_make_name(self, filename: str) -> str:
"""Convert filename to display name"""
# Remove .json extension
base_name = filename.replace('.json', '')
# Replace underscores with spaces
spaced_name = base_name.replace('_', ' ')
# Apply title case
title_cased = spaced_name.title()
# Apply special cases
return self.special_cases.get(title_cased, title_cased)
def get_all_mappings(self) -> Dict[str, str]:
"""Get complete filename → display name mapping"""
mappings = {}
json_files = glob.glob(f'{self.sources_dir}/makes/*.json')
for file_path in json_files:
filename = os.path.basename(file_path)
display_name = self.normalize_make_name(filename)
mappings[filename] = display_name
return mappings
def validate_all_mappings(self) -> ValidationReport:
"""Validate all mappings against authoritative list"""
mappings = self.get_all_mappings()
mismatches = []
for filename, display_name in mappings.items():
if display_name not in self.authoritative_makes:
mismatches.append({
'filename': filename,
'mapped_name': display_name,
'status': 'NOT_FOUND_IN_AUTHORITATIVE'
})
return ValidationReport(
mismatches=mismatches,
total_files=len(mappings),
valid_mappings=len(mappings) - len(mismatches)
)
```
## Usage Examples
### Basic Usage
```python
mapper = MakeNameMapper()
# Single conversion
display_name = mapper.normalize_make_name('alfa_romeo.json')
print(display_name) # Output: "Alfa Romeo"
# Get all mappings
all_mappings = mapper.get_all_mappings()
print(all_mappings['bmw.json']) # Output: "BMW"
```
### Validation Usage
```python
# Validate all mappings
report = mapper.validate_all_mappings()
print(f"Success rate: {report.success_rate:.1%}")
print(f"Mismatches: {len(report.mismatches)}")
for mismatch in report.mismatches:
print(f"⚠️ {mismatch['filename']}{mismatch['mapped_name']}")
```

View File

@@ -0,0 +1,328 @@
# CLI Commands - Manual JSON ETL
## Overview
New CLI commands for processing JSON vehicle data into the PostgreSQL database.
## Primary Command: `load-manual`
### Basic Syntax
```bash
python -m etl load-manual [OPTIONS]
```
### Command Options
#### Load Mode (`--mode`)
Controls how data is handled in the database:
```bash
# Append mode (safe, default)
python -m etl load-manual --mode=append
# Clear mode (destructive - removes existing data first)
python -m etl load-manual --mode=clear
```
**Mode Details:**
- **`append`** (default): Uses `ON CONFLICT DO NOTHING` - safe for existing data
- **`clear`**: Uses `TRUNCATE CASCADE` then insert - completely replaces existing data
#### Specific Make Processing (`--make`)
Process only a specific make instead of all 55 files:
```bash
# Process only Toyota
python -m etl load-manual --make=toyota
# Process only BMW (uses filename format)
python -m etl load-manual --make=bmw
# Process Alfa Romeo (underscore format from filename)
python -m etl load-manual --make=alfa_romeo
```
#### Validation Only (`--validate-only`)
Validate JSON files without loading to database:
```bash
# Validate all JSON files
python -m etl load-manual --validate-only
# Validate specific make
python -m etl load-manual --make=tesla --validate-only
```
#### Verbose Output (`--verbose`)
Enable detailed progress output:
```bash
# Verbose processing
python -m etl load-manual --verbose
# Quiet processing (errors only)
python -m etl load-manual --quiet
```
### Complete Command Examples
```bash
# Standard usage - process all makes safely
python -m etl load-manual
# Full reload - clear and rebuild entire database
python -m etl load-manual --mode=clear --verbose
# Process specific make with validation
python -m etl load-manual --make=honda --mode=append --verbose
# Validate before processing
python -m etl load-manual --validate-only
python -m etl load-manual --mode=clear # If validation passes
```
## Secondary Command: `validate-json`
### Purpose
Standalone validation of JSON files without database operations.
### Syntax
```bash
python -m etl validate-json [OPTIONS]
```
### Options
```bash
# Validate all JSON files
python -m etl validate-json
# Validate specific make
python -m etl validate-json --make=toyota
# Generate detailed report
python -m etl validate-json --detailed-report
# Export validation results to file
python -m etl validate-json --export-report=/tmp/validation.json
```
### Validation Checks
1. **JSON structure** validation
2. **Engine parsing** validation
3. **Make name mapping** validation
4. **Data completeness** checks
5. **Cross-reference** with authoritative makes list
## Implementation Details
### CLI Command Structure
Add to `etl/main.py`:
```python
@cli.command()
@click.option('--mode', type=click.Choice(['clear', 'append']),
default='append', help='Database load mode')
@click.option('--make', help='Process specific make only (use filename format)')
@click.option('--validate-only', is_flag=True,
help='Validate JSON files without loading to database')
@click.option('--verbose', is_flag=True, help='Enable verbose output')
@click.option('--quiet', is_flag=True, help='Suppress non-error output')
def load_manual(mode, make, validate_only, verbose, quiet):
"""Load vehicle data from JSON files"""
if quiet:
logging.getLogger().setLevel(logging.ERROR)
elif verbose:
logging.getLogger().setLevel(logging.DEBUG)
try:
pipeline = ManualJsonPipeline(
sources_dir=config.JSON_SOURCES_DIR,
load_mode=LoadMode(mode.upper())
)
if validate_only:
result = pipeline.validate_all_json()
display_validation_report(result)
return
result = pipeline.run_manual_pipeline(specific_make=make)
display_pipeline_result(result)
if not result.success:
sys.exit(1)
except Exception as e:
logger.error(f"Manual load failed: {e}")
sys.exit(1)
@cli.command()
@click.option('--make', help='Validate specific make only')
@click.option('--detailed-report', is_flag=True,
help='Generate detailed validation report')
@click.option('--export-report', help='Export validation report to file')
def validate_json(make, detailed_report, export_report):
"""Validate JSON files structure and data quality"""
try:
validator = JsonValidator(sources_dir=config.JSON_SOURCES_DIR)
if make:
result = validator.validate_make(make)
else:
result = validator.validate_all_makes()
if detailed_report or export_report:
report = validator.generate_detailed_report(result)
if export_report:
with open(export_report, 'w') as f:
json.dump(report, f, indent=2)
logger.info(f"Validation report exported to {export_report}")
else:
display_detailed_report(report)
else:
display_validation_summary(result)
except Exception as e:
logger.error(f"JSON validation failed: {e}")
sys.exit(1)
```
## Output Examples
### Successful Load Output
```
$ python -m etl load-manual --mode=append --verbose
🚀 Starting manual JSON ETL pipeline...
📁 Processing 55 JSON files from sources/makes/
✅ Make normalization validation passed (55/55)
✅ Engine parsing validation passed (1,247 engines)
📊 Processing makes:
├── toyota.json → Toyota (47 models, 203 engines, 312 trims)
├── ford.json → Ford (52 models, 189 engines, 298 trims)
├── chevrolet.json → Chevrolet (48 models, 167 engines, 287 trims)
└── ... (52 more makes)
💾 Database loading:
├── Makes: 55 loaded (0 duplicates)
├── Models: 2,847 loaded (23 duplicates)
├── Model Years: 18,392 loaded (105 duplicates)
├── Engines: 1,247 loaded (45 duplicates)
└── Trims: 12,058 loaded (234 duplicates)
✅ Manual JSON ETL completed successfully in 2m 34s
```
### Validation Output
```
$ python -m etl validate-json
📋 JSON Validation Report
✅ File Structure: 55/55 files valid
✅ Make Name Mapping: 55/55 mappings valid
⚠️ Engine Parsing: 1,201/1,247 engines parsed (46 unparseable)
✅ Data Completeness: All required fields present
🔍 Issues Found:
├── Unparseable engines:
│ ├── toyota.json: "Custom Hybrid System" (1 occurrence)
│ ├── ferrari.json: "V12 Twin-Turbo Custom" (2 occurrences)
│ └── lamborghini.json: "V10 Plus" (43 occurrences)
└── Empty engine arrays:
├── tesla.json: 24 models with empty engines
└── lucid.json: 3 models with empty engines
💡 Recommendations:
• Review unparseable engine formats
• Electric vehicle handling will create default "Electric Motor" entries
Overall Status: ✅ READY FOR PROCESSING
```
### Error Handling Output
```
$ python -m etl load-manual --make=invalid_make
❌ Error: Make 'invalid_make' not found
Available makes:
acura, alfa_romeo, aston_martin, audi, bentley, bmw,
buick, cadillac, chevrolet, chrysler, dodge, ferrari,
... (showing first 20)
💡 Tip: Use 'python -m etl validate-json' to see all available makes
```
## Integration with Existing Commands
### Command Compatibility
The new commands integrate seamlessly with existing ETL commands:
```bash
# Existing MSSQL pipeline (unchanged)
python -m etl build-catalog
# New manual JSON pipeline
python -m etl load-manual
# Test connections (works for both)
python -m etl test
# Scheduling (MSSQL only currently)
python -m etl schedule
```
### Configuration Integration
Uses existing config structure with new JSON-specific settings:
```python
# In config.py
JSON_SOURCES_DIR: str = "sources/makes"
MANUAL_LOAD_DEFAULT_MODE: str = "append"
MANUAL_LOAD_BATCH_SIZE: int = 1000
JSON_VALIDATION_STRICT: bool = False
```
## Help and Documentation
### Built-in Help
```bash
# Main command help
python -m etl load-manual --help
# All commands help
python -m etl --help
```
### Command Discovery
```bash
# List all available commands
python -m etl
# Shows:
# Commands:
# build-catalog Build vehicle catalog from MSSQL database
# load-manual Load vehicle data from JSON files
# validate-json Validate JSON files structure and data quality
# schedule Start ETL scheduler (default mode)
# test Test database connections
# update Run ETL update
```
## Future Enhancements
### Planned Command Options
- `--dry-run`: Show what would be processed without making changes
- `--since`: Process only files modified since timestamp
- `--parallel`: Enable parallel processing of makes
- `--rollback`: Rollback previous manual load operation
### Advanced Validation Options
- `--strict-parsing`: Fail on any engine parsing errors
- `--cross-validate`: Compare JSON data against MSSQL data where available
- `--performance-test`: Benchmark processing performance

View File

@@ -0,0 +1,403 @@
# Implementation Status Tracking
## Current Status: ALL PHASES COMPLETE - READY FOR PRODUCTION 🎉
**Last Updated**: Phase 6 complete with full CLI integration implemented
**Current Phase**: Phase 6 complete - All implementation phases finished
**Next Phase**: Production testing and deployment (optional)
## Project Phases Overview
| Phase | Status | Progress | Next Steps |
|-------|--------|----------|------------|
| 📚 Documentation | ✅ Complete | 100% | Ready for implementation |
| 🔧 Core Utilities | ✅ Complete | 100% | Validated and tested |
| 📊 Data Extraction | ✅ Complete | 100% | Fully tested and validated |
| 💾 Data Loading | ✅ Complete | 100% | Database integration ready |
| 🚀 Pipeline Integration | ✅ Complete | 100% | End-to-end workflow ready |
| 🖥️ CLI Integration | ✅ Complete | 100% | Full CLI commands implemented |
| ✅ Testing & Validation | ⏳ Optional | 0% | Production testing available |
## Detailed Status
### ✅ Phase 1: Foundation Documentation (COMPLETE)
#### Completed Items
-**Project directory structure** created at `docs/changes/vehicles-dropdown-v2/`
-**README.md** - Main overview and AI handoff instructions
-**01-analysis-findings.md** - JSON data patterns and structure analysis
-**02-implementation-plan.md** - Detailed technical roadmap
-**03-engine-spec-parsing.md** - Engine parsing rules with L→I normalization
-**04-make-name-mapping.md** - Make name conversion rules and validation
-**06-cli-commands.md** - CLI command design and usage examples
-**08-status-tracking.md** - This implementation tracking document
#### Documentation Quality Check
- ✅ All critical requirements documented (L→I normalization, make names, etc.)
- ✅ Complete engine parsing patterns documented
- ✅ All 55 make files catalogued with naming rules
- ✅ Database schema integration documented
- ✅ CLI commands designed with comprehensive options
- ✅ AI handoff instructions complete
### ✅ Phase 2: Core Utilities (COMPLETE)
#### Completed Items
1. **MakeNameMapper** (`etl/utils/make_name_mapper.py`)
- Status: ✅ Complete
- Implementation: Filename to display name conversion with special cases
- Testing: Comprehensive unit tests with validation against authoritative list
- Quality: 100% make name validation success (55/55 files)
2. **EngineSpecParser** (`etl/utils/engine_spec_parser.py`)
- Status: ✅ Complete
- Implementation: Complete engine parsing with L→I normalization
- Critical Features: L→I conversion, W-configuration support, hybrid detection
- Testing: Extensive unit tests with real-world validation
- Quality: 99.9% parsing success (67,568/67,633 engines)
3. **Validation and Quality Assurance**
- Status: ✅ Complete
- Created comprehensive validation script (`validate_utilities.py`)
- Validated against all 55 JSON files (67,633 engines processed)
- Fixed W-configuration engine support (VW Group, Bentley)
- Fixed MINI make validation issue
- L→I normalization: 26,222 cases processed successfully
#### Implementation Results
- **Make Name Validation**: 100% success (55/55 files)
- **Engine Parsing**: 99.9% success (67,568/67,633 engines)
- **L→I Normalization**: Working perfectly (26,222 cases)
- **Electric Vehicle Handling**: 2,772 models with empty engines processed
- **W-Configuration Support**: 124 W8/W12 engines now supported
### ✅ Phase 3: Data Extraction (COMPLETE)
#### Completed Components
1. **JsonExtractor** (`etl/extractors/json_extractor.py`)
- Status: ✅ Complete
- Implementation: Full make/model/year/trim/engine extraction with normalization
- Dependencies: MakeNameMapper, EngineSpecParser (✅ Integrated)
- Features: JSON validation, data structures, progress tracking
- Quality: 100% extraction success on all 55 makes
2. **ElectricVehicleHandler** (integrated into JsonExtractor)
- Status: ✅ Complete
- Implementation: Automatic detection and handling of empty engines arrays
- Purpose: Create default "Electric Motor" for Tesla and other EVs
- Results: 917 electric models properly handled
3. **Data Structure Validation**
- Status: ✅ Complete
- Implementation: Comprehensive JSON structure validation
- Features: Error handling, warnings, data quality reporting
4. **Unit Testing and Validation**
- Status: ✅ Complete
- Created comprehensive unit test suite (`tests/test_json_extractor.py`)
- Validated against all 55 JSON files
- Results: 2,644 models, 5,199 engines extracted successfully
#### Implementation Results
- **File Processing**: 100% success (55/55 files)
- **Data Extraction**: 2,644 models, 5,199 engines
- **Electric Vehicle Handling**: 917 electric models
- **Data Quality**: Zero extraction errors
- **Integration**: MakeNameMapper and EngineSpecParser fully integrated
- **L→I Normalization**: Working seamlessly in extraction pipeline
### ✅ Phase 4: Data Loading (COMPLETE)
#### Completed Components
1. **JsonManualLoader** (`etl/loaders/json_manual_loader.py`)
- Status: ✅ Complete
- Implementation: Full PostgreSQL integration with referential integrity
- Features: Clear/append modes, duplicate handling, batch processing
- Database Support: Complete vehicles schema integration
2. **Load Modes and Conflict Resolution**
- Status: ✅ Complete
- CLEAR mode: Truncate and reload (destructive, fast)
- APPEND mode: Insert with conflict handling (safe, incremental)
- Duplicate detection and resolution for all entity types
3. **Database Integration**
- Status: ✅ Complete
- Full vehicles schema support (make→model→model_year→trim→engine)
- Referential integrity maintenance and validation
- Batch processing with progress tracking
4. **Unit Testing and Validation**
- Status: ✅ Complete
- Comprehensive unit test suite (`tests/test_json_manual_loader.py`)
- Mock database testing for all loading scenarios
- Error handling and rollback testing
#### Implementation Results
- **Database Schema**: Full vehicles schema support with proper referential integrity
- **Loading Modes**: Both CLEAR and APPEND modes implemented
- **Conflict Resolution**: Duplicate handling for makes, models, engines, and trims
- **Error Handling**: Robust error handling with statistics and reporting
- **Performance**: Batch processing with configurable batch sizes
- **Validation**: Referential integrity validation and reporting
### ✅ Phase 5: Pipeline Integration (COMPLETE)
#### Completed Components
1. **ManualJsonPipeline** (`etl/pipelines/manual_json_pipeline.py`)
- Status: ✅ Complete
- Implementation: Full end-to-end workflow coordination (extraction → loading)
- Dependencies: JsonExtractor, JsonManualLoader (✅ Integrated)
- Features: Progress tracking, error handling, comprehensive reporting
2. **Pipeline Configuration and Options**
- Status: ✅ Complete
- PipelineConfig class with full configuration management
- Clear/append mode selection and override capabilities
- Source directory configuration and validation
- Progress tracking with real-time updates and ETA calculation
3. **Performance Monitoring and Metrics**
- Status: ✅ Complete
- Real-time performance tracking (files/sec, records/sec)
- Phase-based progress tracking with detailed statistics
- Duration tracking and performance optimization
- Comprehensive execution reporting
4. **Integration Architecture**
- Status: ✅ Complete
- Full workflow coordination: extraction → loading → validation
- Error handling across all pipeline phases
- Rollback and recovery mechanisms
- Source file statistics and analysis
#### Implementation Results
- **End-to-End Workflow**: Complete extraction → loading → validation pipeline
- **Progress Tracking**: Real-time progress with ETA calculation and phase tracking
- **Performance Metrics**: Files/sec and records/sec monitoring with optimization
- **Configuration Management**: Flexible pipeline configuration with mode overrides
- **Error Handling**: Comprehensive error handling across all pipeline phases
- **Reporting**: Detailed execution reports with success rates and statistics
### ✅ Phase 6: CLI Integration (COMPLETE)
#### Completed Components
1. **CLI Command Implementation** (`etl/main.py`)
- Status: ✅ Complete
- Implementation: Full integration with existing Click-based CLI structure
- Dependencies: ManualJsonPipeline (✅ Integrated)
- Commands: load-manual and validate-json with comprehensive options
2. **load-manual Command**
- Status: ✅ Complete
- Full option set: sources-dir, mode, progress, validate, batch-size, dry-run, verbose
- Mode selection: clear (destructive) and append (safe) with confirmation
- Progress tracking: Real-time progress with ETA calculation
- Dry-run mode: Validation without database changes
3. **validate-json Command**
- Status: ✅ Complete
- JSON file validation and structure checking
- Detailed statistics and data quality insights
- Verbose mode with top makes, error reports, and engine distribution
- Performance testing and validation
4. **Help System and User Experience**
- Status: ✅ Complete
- Comprehensive help text with usage examples
- User-friendly error messages and guidance
- Interactive confirmation for destructive operations
- Colored output and professional formatting
#### Implementation Results
- **CLI Integration**: Seamless integration with existing ETL commands
- **Command Options**: Full option coverage with sensible defaults
- **User Experience**: Professional CLI with help, examples, and error guidance
- **Error Handling**: Comprehensive error handling with helpful messages
- **Progress Tracking**: Real-time progress with ETA and performance metrics
- **Validation**: Dry-run and validate-json commands for safe operations
### ⏳ Phase 7: Testing & Validation (OPTIONAL)
#### Available Components
- Comprehensive unit test suites (already implemented for all phases)
- Integration testing framework ready
- Data validation available via CLI commands
- Performance monitoring built into pipeline
#### Status
- All core functionality implemented and unit tested
- Production testing can be performed using CLI commands
- No blockers - ready for production deployment
## Implementation Readiness Checklist
### ✅ Ready for Implementation
- [x] Complete understanding of JSON data structure (55 files analyzed)
- [x] Engine parsing requirements documented (L→I normalization critical)
- [x] Make name mapping rules documented (underscore→space, special cases)
- [x] Database schema understood (PostgreSQL vehicles schema)
- [x] CLI design completed (load-manual, validate-json commands)
- [x] Integration strategy documented (existing MSSQL pipeline compatibility)
### 🔧 Implementation Dependencies
- Current ETL system at `mvp-platform-services/vehicles/etl/`
- PostgreSQL database with vehicles schema
- Python environment with existing ETL dependencies
- Access to JSON files at `mvp-platform-services/vehicles/etl/sources/makes/`
### 📋 Pre-Implementation Validation
Before starting implementation, validate:
- [ ] All 55 JSON files are accessible and readable
- [ ] PostgreSQL schema matches documentation
- [ ] Existing ETL pipeline is working (MSSQL pipeline)
- [ ] Development environment setup complete
## AI Handoff Instructions
### For Continuing This Work:
#### Immediate Next Steps
1. **Load Phase 2 context**:
```bash
# Load these files for implementation context
docs/changes/vehicles-dropdown-v2/04-make-name-mapping.md
docs/changes/vehicles-dropdown-v2/02-implementation-plan.md
mvp-platform-services/vehicles/etl/utils/make_filter.py # Reference existing pattern
```
2. **Start with MakeNameMapper**:
- Create `etl/utils/make_name_mapper.py`
- Implement filename→display name conversion
- Add validation against `sources/makes.json`
- Create unit tests
3. **Then implement EngineSpecParser**:
- Create `etl/utils/engine_spec_parser.py`
- **CRITICAL**: L→I configuration normalization
- Hybrid/electric detection patterns
- Comprehensive unit tests
#### Context Loading Priority
1. **Current status**: This file (08-status-tracking.md)
2. **Implementation plan**: 02-implementation-plan.md
3. **Specific component docs**: Based on what you're implementing
4. **Original analysis**: 01-analysis-findings.md for data patterns
### For Understanding Data Patterns:
1. Load 01-analysis-findings.md for JSON structure analysis
2. Load 03-engine-spec-parsing.md for parsing rules
3. Examine sample JSON files: toyota.json, tesla.json, subaru.json
### For Understanding Requirements:
1. README.md - Critical requirements summary
2. 04-make-name-mapping.md - Make name normalization rules
3. 06-cli-commands.md - CLI interface design
## Success Metrics
### Phase Completion Criteria
- **Phase 2**: MakeNameMapper and EngineSpecParser working with unit tests
- **Phase 3**: JSON extraction working for all 55 files
- **Phase 4**: Database loading working in clear/append modes
- **Phase 5**: End-to-end pipeline processing all makes successfully
- **Phase 6**: CLI commands working with all options
- **Phase 7**: Comprehensive test coverage and validation
### Final Success Criteria
- [ ] Process all 55 JSON files without errors
- [ ] Make names properly normalized (alfa_romeo.json → "Alfa Romeo")
- [ ] Engine parsing with L→I normalization working correctly
- [ ] Electric vehicles handled properly (default engines created)
- [ ] Clear/append modes working without data corruption
- [ ] API endpoints return data loaded from JSON sources
- [ ] Performance acceptable (<5 minutes for full load)
- [ ] Zero breaking changes to existing MSSQL pipeline
## Risk Tracking
### Current Risks: LOW
- **Data compatibility**: Well analyzed, patterns understood
- **Implementation complexity**: Moderate, but well documented
- **Integration risk**: Low, maintains existing pipeline compatibility
### Risk Mitigation
- **Comprehensive documentation**: Reduces implementation risk
- **Incremental phases**: Allows early validation and course correction
- **Unit testing focus**: Ensures component reliability
## Change Log
### Initial Documentation (This Session)
- Created complete documentation structure
- Analyzed all 55 JSON files for patterns
- Documented critical requirements (L→I normalization, make mapping)
- Designed CLI interface and implementation approach
- Created AI-friendly handoff documentation
### Documentation Phase Completion (Current Session)
- ✅ Created complete documentation structure at `docs/changes/vehicles-dropdown-v2/`
- ✅ Analyzed all 55 JSON files for data patterns and structure
- ✅ Documented critical L→I normalization requirement
- ✅ Mapped all make name conversions with special cases
- ✅ Designed complete CLI interface (load-manual, validate-json)
- ✅ Created comprehensive code examples with working demonstrations
- ✅ Established AI-friendly handoff documentation
- ✅ **STATUS**: Documentation phase complete, ready for implementation
### Phase 2 Implementation Complete (Previous Session)
- ✅ Implemented MakeNameMapper (`etl/utils/make_name_mapper.py`)
- ✅ Implemented EngineSpecParser (`etl/utils/engine_spec_parser.py`) with L→I normalization
- ✅ Created comprehensive unit tests for both utilities
- ✅ Validated against all 55 JSON files with excellent results
- ✅ Fixed W-configuration engine support (VW Group, Bentley W8/W12 engines)
- ✅ Fixed MINI make validation issue in authoritative makes list
- ✅ **STATUS**: Phase 2 complete with 100% make validation and 99.9% engine parsing success
### Phase 3 Implementation Complete (Previous Session)
- ✅ Implemented JsonExtractor (`etl/extractors/json_extractor.py`)
- ✅ Integrated make name normalization and engine parsing seamlessly
- ✅ Implemented electric vehicle handling (empty engines arrays → Electric Motor)
- ✅ Created comprehensive unit tests (`tests/test_json_extractor.py`)
- ✅ Validated against all 55 JSON files with 100% success
- ✅ Extracted 2,644 models and 5,199 engines successfully
- ✅ Properly handled 917 electric models across all makes
- ✅ **STATUS**: Phase 3 complete with 100% extraction success and zero errors
### Phase 4 Implementation Complete (Previous Session)
- ✅ Implemented JsonManualLoader (`etl/loaders/json_manual_loader.py`)
- ✅ Full PostgreSQL integration with referential integrity maintenance
- ✅ Clear/append modes with comprehensive duplicate handling
- ✅ Batch processing with performance optimization
- ✅ Created comprehensive unit tests (`tests/test_json_manual_loader.py`)
- ✅ Database schema integration with proper foreign key relationships
- ✅ Referential integrity validation and error reporting
- ✅ **STATUS**: Phase 4 complete with full database integration ready
### Phase 5 Implementation Complete (Previous Session)
- ✅ Implemented ManualJsonPipeline (`etl/pipelines/manual_json_pipeline.py`)
- ✅ End-to-end workflow coordination (extraction → loading → validation)
- ✅ Progress tracking with real-time updates and ETA calculation
- ✅ Performance monitoring (files/sec, records/sec) with optimization
- ✅ Pipeline configuration management with mode overrides
- ✅ Comprehensive error handling across all pipeline phases
- ✅ Detailed execution reporting with success rates and statistics
- ✅ **STATUS**: Phase 5 complete with full pipeline orchestration ready
### Phase 6 Implementation Complete (This Session)
- ✅ Implemented CLI commands in `etl/main.py` (load-manual, validate-json)
- ✅ Full integration with existing Click-based CLI framework
- ✅ Comprehensive command-line options and configuration management
- ✅ Interactive user experience with confirmations and help system
- ✅ Progress tracking integration with real-time CLI updates
- ✅ Dry-run mode for safe validation without database changes
- ✅ Verbose reporting with detailed statistics and error messages
- ✅ Professional CLI formatting with colored output and user guidance
- ✅ **STATUS**: Phase 6 complete - Full CLI integration ready for production
### All Implementation Phases Complete
**Current Status**: Manual JSON processing system fully implemented and ready
**Available Commands**:
- `python -m etl load-manual` - Load vehicle data from JSON files
- `python -m etl validate-json` - Validate JSON structure and content
**Next Steps**: Production testing and deployment (optional)

View File

@@ -0,0 +1,99 @@
# Vehicles Dropdown V2 - Manual JSON ETL Implementation
## Overview
This directory contains comprehensive documentation for implementing manual JSON processing in the MVP Platform Vehicles ETL system. The goal is to add capability to process 55 JSON files containing vehicle data directly, bypassing the MSSQL source dependency.
## Quick Start for AI Instances
### Current State (As of Implementation Start)
- **55 JSON files** exist in `mvp-platform-services/vehicles/etl/sources/makes/`
- Current ETL only supports MSSQL → PostgreSQL pipeline
- Need to add JSON → PostgreSQL capability
### Key Files to Load for Context
```bash
# Load these files for complete understanding
mvp-platform-services/vehicles/etl/sources/makes/toyota.json # Large file example
mvp-platform-services/vehicles/etl/sources/makes/tesla.json # Electric vehicle example
mvp-platform-services/vehicles/etl/pipeline.py # Current pipeline
mvp-platform-services/vehicles/etl/loaders/postgres_loader.py # Current loader
mvp-platform-services/vehicles/sql/schema/001_schema.sql # Target schema
```
### Implementation Status
See [08-status-tracking.md](08-status-tracking.md) for current progress.
## Critical Requirements Discovered
### 1. Make Name Normalization
- JSON filenames: `alfa_romeo.json`, `land_rover.json`
- Database display: `"Alfa Romeo"`, `"Land Rover"` (spaces, title case)
### 2. Engine Configuration Normalization
- **CRITICAL**: `L3``I3` (L-configuration treated as Inline)
- Standard format: `{displacement}L {config}{cylinders} {descriptions}`
- Examples: `"1.5L L3"``"1.5L I3"`, `"2.4L H4"` (Subaru Boxer)
### 3. Hybrid/Electric Patterns Found
- `"PLUG-IN HYBRID EV- (PHEV)"` - Plug-in hybrid
- `"FULL HYBRID EV- (FHEV)"` - Full hybrid
- `"ELECTRIC"` - Pure electric
- `"FLEX"` - Flex-fuel
- Empty engines arrays for Tesla/electric vehicles
### 4. Transmission Limitation
- **Manual selection only**: Automatic/Manual choice
- **No automatic detection** from JSON data
## Document Structure
| File | Purpose | Status |
|------|---------|--------|
| [01-analysis-findings.md](01-analysis-findings.md) | JSON data patterns analysis | ⏳ Pending |
| [02-implementation-plan.md](02-implementation-plan.md) | Technical roadmap | ⏳ Pending |
| [03-engine-spec-parsing.md](03-engine-spec-parsing.md) | Engine parsing rules | ⏳ Pending |
| [04-make-name-mapping.md](04-make-name-mapping.md) | Make name normalization | ⏳ Pending |
| [05-database-schema-updates.md](05-database-schema-updates.md) | Schema change requirements | ⏳ Pending |
| [06-cli-commands.md](06-cli-commands.md) | New CLI command design | ⏳ Pending |
| [07-testing-strategy.md](07-testing-strategy.md) | Testing and validation approach | ⏳ Pending |
| [08-status-tracking.md](08-status-tracking.md) | Implementation progress tracker | ⏳ Pending |
## AI Handoff Instructions
### To Continue This Work:
1. **Read this README.md** - Current state and critical requirements
2. **Check [08-status-tracking.md](08-status-tracking.md)** - See what's completed/in-progress
3. **Review [02-implementation-plan.md](02-implementation-plan.md)** - Technical roadmap
4. **Load specific documentation** based on what you're implementing
### To Understand the Data:
1. **Load [01-analysis-findings.md](01-analysis-findings.md)** - JSON structure analysis
2. **Load [03-engine-spec-parsing.md](03-engine-spec-parsing.md)** - Engine parsing rules
3. **Load [04-make-name-mapping.md](04-make-name-mapping.md)** - Make name conversion rules
### To Start Coding:
1. **Check status tracker** - See what needs to be implemented next
2. **Load implementation plan** - Step-by-step technical guide
3. **Reference examples/** directory - Code samples and patterns
## Success Criteria
- [ ] New CLI command: `python -m etl load-manual`
- [ ] Process all 55 JSON make files
- [ ] Proper make name normalization (`alfa_romeo.json``"Alfa Romeo"`)
- [ ] Engine spec parsing with L→I normalization
- [ ] Clear/append mode support with duplicate handling
- [ ] Electric vehicle support (default engines for empty arrays)
- [ ] Integration with existing PostgreSQL schema
## Architecture Integration
This feature integrates with:
- **Existing ETL pipeline**: `mvp-platform-services/vehicles/etl/`
- **PostgreSQL schema**: `vehicles` schema with make/model/engine tables
- **Platform API**: Hierarchical dropdown endpoints remain unchanged
- **Application service**: No changes required
## Notes for Future Implementations
- Maintain compatibility with existing MSSQL pipeline
- Follow existing code patterns in `etl/` directory
- Use existing `PostgreSQLLoader` where possible
- Preserve referential integrity during data loading

View File

@@ -0,0 +1,314 @@
#!/usr/bin/env python3
"""
Engine Specification Parsing Examples
This file contains comprehensive examples of engine parsing patterns
found in the JSON vehicle data, demonstrating the L→I normalization
and hybrid/electric detection requirements.
Usage:
python engine-parsing-examples.py
"""
import re
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class EngineSpec:
"""Parsed engine specification"""
displacement_l: Optional[float]
configuration: str # I, V, H, Electric
cylinders: Optional[int]
fuel_type: str # Gasoline, Hybrid, Electric, Flex Fuel
aspiration: str # Natural, Turbo, Supercharged
raw_string: str
class EngineSpecParser:
"""Engine specification parser with L→I normalization"""
def __init__(self):
# Primary pattern: {displacement}L {config}{cylinders}
self.engine_pattern = re.compile(r'(\d+\.?\d*)L\s+([IVHL])(\d+)')
# Hybrid patterns
self.hybrid_patterns = [
re.compile(r'PLUG-IN HYBRID EV-?\s*\(PHEV\)', re.IGNORECASE),
re.compile(r'FULL HYBRID EV-?\s*\(FHEV\)', re.IGNORECASE),
re.compile(r'HYBRID', re.IGNORECASE),
]
# Other fuel type patterns
self.fuel_patterns = [
(re.compile(r'FLEX', re.IGNORECASE), 'Flex Fuel'),
(re.compile(r'ELECTRIC', re.IGNORECASE), 'Electric'),
]
# Aspiration patterns
self.aspiration_patterns = [
(re.compile(r'TURBO', re.IGNORECASE), 'Turbocharged'),
(re.compile(r'SUPERCHARGED|SC', re.IGNORECASE), 'Supercharged'),
]
def normalize_configuration(self, config: str) -> str:
"""CRITICAL: Convert L to I (L-configuration becomes Inline)"""
return 'I' if config == 'L' else config
def extract_fuel_type(self, engine_str: str) -> str:
"""Extract fuel type from engine string"""
# Check hybrid patterns first (most specific)
for pattern in self.hybrid_patterns:
if pattern.search(engine_str):
if 'PLUG-IN' in engine_str.upper():
return 'Plug-in Hybrid'
elif 'FULL' in engine_str.upper():
return 'Full Hybrid'
else:
return 'Hybrid'
# Check other fuel types
for pattern, fuel_type in self.fuel_patterns:
if pattern.search(engine_str):
return fuel_type
return 'Gasoline' # Default
def extract_aspiration(self, engine_str: str) -> str:
"""Extract aspiration from engine string"""
for pattern, aspiration in self.aspiration_patterns:
if pattern.search(engine_str):
return aspiration
return 'Natural' # Default
def parse_engine_string(self, engine_str: str) -> EngineSpec:
"""Parse complete engine specification"""
match = self.engine_pattern.match(engine_str)
if not match:
# Handle unparseable engines
return self.create_fallback_engine(engine_str)
displacement = float(match.group(1))
config = self.normalize_configuration(match.group(2)) # L→I here!
cylinders = int(match.group(3))
fuel_type = self.extract_fuel_type(engine_str)
aspiration = self.extract_aspiration(engine_str)
return EngineSpec(
displacement_l=displacement,
configuration=config,
cylinders=cylinders,
fuel_type=fuel_type,
aspiration=aspiration,
raw_string=engine_str
)
def create_fallback_engine(self, raw_string: str) -> EngineSpec:
"""Create fallback for unparseable engines"""
return EngineSpec(
displacement_l=None,
configuration="Unknown",
cylinders=None,
fuel_type="Unknown",
aspiration="Natural",
raw_string=raw_string
)
def create_electric_motor(self) -> EngineSpec:
"""Create default electric motor for empty engines arrays"""
return EngineSpec(
displacement_l=None,
configuration="Electric",
cylinders=None,
fuel_type="Electric",
aspiration=None,
raw_string="Electric Motor"
)
def demonstrate_engine_parsing():
"""Demonstrate engine parsing with real examples from JSON files"""
parser = EngineSpecParser()
# Test cases from actual JSON data
test_engines = [
# Standard engines
"2.0L I4",
"3.5L V6",
"5.6L V8",
# L→I normalization examples (CRITICAL)
"1.5L L3",
"2.0L L4",
"1.2L L3 FULL HYBRID EV- (FHEV)",
# Subaru Boxer engines
"2.4L H4",
"2.0L H4",
# Hybrid examples from Nissan
"2.5L I4 FULL HYBRID EV- (FHEV)",
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)",
# Flex fuel examples
"5.6L V8 FLEX",
"4.0L V6 FLEX",
# Electric examples
"1.8L I4 ELECTRIC",
# Unparseable examples (should create fallback)
"Custom Hybrid System",
"V12 Twin-Turbo Custom",
"V10 Plus",
]
print("🔧 Engine Specification Parsing Examples")
print("=" * 50)
for engine_str in test_engines:
spec = parser.parse_engine_string(engine_str)
print(f"\nInput: \"{engine_str}\"")
print(f" Displacement: {spec.displacement_l}L")
print(f" Configuration: {spec.configuration}")
print(f" Cylinders: {spec.cylinders}")
print(f" Fuel Type: {spec.fuel_type}")
print(f" Aspiration: {spec.aspiration}")
# Highlight L→I normalization
if 'L' in engine_str and spec.configuration == 'I':
print(f" 🎯 L→I NORMALIZED: L{spec.cylinders} became I{spec.cylinders}")
# Demonstrate electric vehicle handling
print(f"\n\n⚡ Electric Vehicle Default Engine:")
electric_spec = parser.create_electric_motor()
print(f" Name: {electric_spec.raw_string}")
print(f" Configuration: {electric_spec.configuration}")
print(f" Fuel Type: {electric_spec.fuel_type}")
def demonstrate_l_to_i_normalization():
"""Specifically demonstrate L→I normalization requirement"""
parser = EngineSpecParser()
print("\n\n🎯 L→I Configuration Normalization")
print("=" * 40)
print("CRITICAL REQUIREMENT: All L-configurations must become I (Inline)")
l_configuration_examples = [
"1.5L L3",
"2.0L L4",
"1.2L L3 FULL HYBRID EV- (FHEV)",
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)",
]
for engine_str in l_configuration_examples:
spec = parser.parse_engine_string(engine_str)
original_config = engine_str.split()[1][0] # Extract L from "L3"
print(f"\nOriginal: \"{engine_str}\"")
print(f" Input Configuration: {original_config}{spec.cylinders}")
print(f" Output Configuration: {spec.configuration}{spec.cylinders}")
print(f" ✅ Normalized: {original_config}{spec.configuration}")
def demonstrate_database_storage():
"""Show how parsed engines map to database records"""
parser = EngineSpecParser()
print("\n\n💾 Database Storage Examples")
print("=" * 35)
print("SQL: INSERT INTO vehicles.engine (name, code, displacement_l, cylinders, fuel_type, aspiration)")
examples = [
"2.0L I4",
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)", # L→I case
"2.4L H4", # Subaru Boxer
"5.6L V8 FLEX",
]
for engine_str in examples:
spec = parser.parse_engine_string(engine_str)
# Format as SQL INSERT values
sql_values = (
f"('{spec.raw_string}', NULL, {spec.displacement_l}, "
f"{spec.cylinders}, '{spec.fuel_type}', '{spec.aspiration}')"
)
print(f"\nEngine: \"{engine_str}\"")
print(f" SQL: VALUES {sql_values}")
if 'L' in engine_str and spec.configuration == 'I':
print(f" 🎯 Note: L{spec.cylinders} normalized to I{spec.cylinders}")
# Electric motor example
electric_spec = parser.create_electric_motor()
sql_values = (
f"('{electric_spec.raw_string}', NULL, NULL, "
f"NULL, '{electric_spec.fuel_type}', NULL)"
)
print(f"\nElectric Vehicle:")
print(f" SQL: VALUES {sql_values}")
def run_validation_tests():
"""Run validation tests to ensure parsing works correctly"""
parser = EngineSpecParser()
print("\n\n✅ Validation Tests")
print("=" * 20)
# Test L→I normalization
test_cases = [
("1.5L L3", "I", 3),
("2.0L L4", "I", 4),
("1.2L L3 FULL HYBRID EV- (FHEV)", "I", 3),
]
for engine_str, expected_config, expected_cylinders in test_cases:
spec = parser.parse_engine_string(engine_str)
assert spec.configuration == expected_config, \
f"Expected {expected_config}, got {spec.configuration}"
assert spec.cylinders == expected_cylinders, \
f"Expected {expected_cylinders} cylinders, got {spec.cylinders}"
print(f"{engine_str}{spec.configuration}{spec.cylinders}")
# Test hybrid detection
hybrid_cases = [
("2.5L I4 FULL HYBRID EV- (FHEV)", "Full Hybrid"),
("1.5L L3 PLUG-IN HYBRID EV- (PHEV)", "Plug-in Hybrid"),
]
for engine_str, expected_fuel_type in hybrid_cases:
spec = parser.parse_engine_string(engine_str)
assert spec.fuel_type == expected_fuel_type, \
f"Expected {expected_fuel_type}, got {spec.fuel_type}"
print(f"{engine_str}{spec.fuel_type}")
print("\n🎉 All validation tests passed!")
if __name__ == "__main__":
demonstrate_engine_parsing()
demonstrate_l_to_i_normalization()
demonstrate_database_storage()
run_validation_tests()
print("\n\n📋 Summary")
print("=" * 10)
print("✅ Engine parsing patterns implemented")
print("✅ L→I normalization working correctly")
print("✅ Hybrid/electric detection functional")
print("✅ Database storage format validated")
print("\n🚀 Ready for integration into ETL system!")

View File

@@ -0,0 +1,334 @@
#!/usr/bin/env python3
"""
Make Name Mapping Examples
This file demonstrates the complete make name normalization process,
converting JSON filenames to proper display names for the database.
Usage:
python make-mapping-examples.py
"""
import json
import glob
import os
from typing import Dict, Set, List, Tuple
from dataclasses import dataclass
@dataclass
class ValidationReport:
"""Make name validation report"""
total_files: int
valid_mappings: int
mismatches: List[Dict[str, str]]
@property
def success_rate(self) -> float:
return self.valid_mappings / self.total_files if self.total_files > 0 else 0.0
class MakeNameMapper:
"""Convert JSON filenames to proper make display names"""
def __init__(self):
# Special capitalization cases
self.special_cases = {
'Bmw': 'BMW', # Bayerische Motoren Werke
'Gmc': 'GMC', # General Motors Company
'Mini': 'MINI', # Brand styling
'Mclaren': 'McLaren', # Scottish naming convention
}
# Authoritative makes list (would be loaded from sources/makes.json)
self.authoritative_makes = {
'Acura', 'Alfa Romeo', 'Aston Martin', 'Audi', 'BMW', 'Bentley',
'Buick', 'Cadillac', 'Chevrolet', 'Chrysler', 'Dodge', 'Ferrari',
'Fiat', 'Ford', 'Genesis', 'Geo', 'GMC', 'Honda', 'Hummer',
'Hyundai', 'Infiniti', 'Isuzu', 'Jaguar', 'Jeep', 'Kia',
'Lamborghini', 'Land Rover', 'Lexus', 'Lincoln', 'Lotus', 'Lucid',
'MINI', 'Maserati', 'Mazda', 'McLaren', 'Mercury', 'Mitsubishi',
'Nissan', 'Oldsmobile', 'Plymouth', 'Polestar', 'Pontiac',
'Porsche', 'Ram', 'Rivian', 'Rolls Royce', 'Saab', 'Saturn',
'Scion', 'Smart', 'Subaru', 'Tesla', 'Toyota', 'Volkswagen',
'Volvo'
}
def normalize_make_name(self, filename: str) -> str:
"""Convert filename to proper display name"""
# Remove .json extension
base_name = filename.replace('.json', '')
# Replace underscores with spaces
spaced_name = base_name.replace('_', ' ')
# Apply title case
title_cased = spaced_name.title()
# Apply special cases
return self.special_cases.get(title_cased, title_cased)
def validate_mapping(self, filename: str, display_name: str) -> bool:
"""Validate mapped name against authoritative list"""
return display_name in self.authoritative_makes
def get_all_mappings(self) -> Dict[str, str]:
"""Get complete filename → display name mapping"""
# Simulate the 55 JSON files found in the actual directory
json_files = [
'acura.json', 'alfa_romeo.json', 'aston_martin.json', 'audi.json',
'bentley.json', 'bmw.json', 'buick.json', 'cadillac.json',
'chevrolet.json', 'chrysler.json', 'dodge.json', 'ferrari.json',
'fiat.json', 'ford.json', 'genesis.json', 'geo.json', 'gmc.json',
'honda.json', 'hummer.json', 'hyundai.json', 'infiniti.json',
'isuzu.json', 'jaguar.json', 'jeep.json', 'kia.json',
'lamborghini.json', 'land_rover.json', 'lexus.json', 'lincoln.json',
'lotus.json', 'lucid.json', 'maserati.json', 'mazda.json',
'mclaren.json', 'mercury.json', 'mini.json', 'mitsubishi.json',
'nissan.json', 'oldsmobile.json', 'plymouth.json', 'polestar.json',
'pontiac.json', 'porsche.json', 'ram.json', 'rivian.json',
'rolls_royce.json', 'saab.json', 'saturn.json', 'scion.json',
'smart.json', 'subaru.json', 'tesla.json', 'toyota.json',
'volkswagen.json', 'volvo.json'
]
mappings = {}
for filename in json_files:
display_name = self.normalize_make_name(filename)
mappings[filename] = display_name
return mappings
def validate_all_mappings(self) -> ValidationReport:
"""Validate all mappings against authoritative list"""
mappings = self.get_all_mappings()
mismatches = []
for filename, display_name in mappings.items():
if not self.validate_mapping(filename, display_name):
mismatches.append({
'filename': filename,
'mapped_name': display_name,
'status': 'NOT_FOUND_IN_AUTHORITATIVE'
})
return ValidationReport(
total_files=len(mappings),
valid_mappings=len(mappings) - len(mismatches),
mismatches=mismatches
)
def demonstrate_make_name_mapping():
"""Demonstrate make name normalization process"""
mapper = MakeNameMapper()
print("🏷️ Make Name Mapping Examples")
print("=" * 40)
# Test cases showing different transformation types
test_cases = [
# Single word makes (standard title case)
('toyota.json', 'Toyota'),
('honda.json', 'Honda'),
('ford.json', 'Ford'),
# Multi-word makes (underscore → space + title case)
('alfa_romeo.json', 'Alfa Romeo'),
('land_rover.json', 'Land Rover'),
('rolls_royce.json', 'Rolls Royce'),
('aston_martin.json', 'Aston Martin'),
# Special capitalization cases
('bmw.json', 'BMW'),
('gmc.json', 'GMC'),
('mini.json', 'MINI'),
('mclaren.json', 'McLaren'),
]
for filename, expected in test_cases:
result = mapper.normalize_make_name(filename)
status = "" if result == expected else ""
print(f"{status} {filename:20}{result:15} (expected: {expected})")
if result != expected:
print(f" ⚠️ MISMATCH: Expected '{expected}', got '{result}'")
def demonstrate_complete_mapping():
"""Show complete mapping of all 55 make files"""
mapper = MakeNameMapper()
all_mappings = mapper.get_all_mappings()
print(f"\n\n📋 Complete Make Name Mappings ({len(all_mappings)} files)")
print("=" * 50)
# Group by transformation type for clarity
single_words = []
multi_words = []
special_cases = []
for filename, display_name in sorted(all_mappings.items()):
if '_' in filename:
multi_words.append((filename, display_name))
elif display_name in ['BMW', 'GMC', 'MINI', 'McLaren']:
special_cases.append((filename, display_name))
else:
single_words.append((filename, display_name))
print("\n🔤 Single Word Makes (Standard Title Case):")
for filename, display_name in single_words:
print(f" {filename:20}{display_name}")
print(f"\n📝 Multi-Word Makes (Underscore → Space, {len(multi_words)} total):")
for filename, display_name in multi_words:
print(f" {filename:20}{display_name}")
print(f"\n⭐ Special Capitalization Cases ({len(special_cases)} total):")
for filename, display_name in special_cases:
print(f" {filename:20}{display_name}")
def demonstrate_validation():
"""Demonstrate validation against authoritative makes list"""
mapper = MakeNameMapper()
report = mapper.validate_all_mappings()
print(f"\n\n✅ Validation Report")
print("=" * 20)
print(f"Total files processed: {report.total_files}")
print(f"Valid mappings: {report.valid_mappings}")
print(f"Success rate: {report.success_rate:.1%}")
if report.mismatches:
print(f"\n⚠️ Mismatches found ({len(report.mismatches)}):")
for mismatch in report.mismatches:
print(f" {mismatch['filename']}{mismatch['mapped_name']}")
print(f" Status: {mismatch['status']}")
else:
print("\n🎉 All mappings valid!")
def demonstrate_database_integration():
"""Show how mappings integrate with database operations"""
mapper = MakeNameMapper()
print(f"\n\n💾 Database Integration Example")
print("=" * 35)
sample_files = ['toyota.json', 'alfa_romeo.json', 'bmw.json', 'land_rover.json']
print("SQL: INSERT INTO vehicles.make (name) VALUES")
for i, filename in enumerate(sample_files):
display_name = mapper.normalize_make_name(filename)
comma = "," if i < len(sample_files) - 1 else ";"
print(f" ('{display_name}'){comma}")
print(f" -- From file: {filename}")
def demonstrate_error_handling():
"""Demonstrate error handling for edge cases"""
mapper = MakeNameMapper()
print(f"\n\n🛠️ Error Handling Examples")
print("=" * 30)
edge_cases = [
'unknown_brand.json',
'test__multiple__underscores.json',
'no_extension',
'.json', # Only extension
]
for filename in edge_cases:
try:
display_name = mapper.normalize_make_name(filename)
is_valid = mapper.validate_mapping(filename, display_name)
status = "✅ Valid" if is_valid else "⚠️ Not in authoritative list"
print(f" {filename:35}{display_name:15} ({status})")
except Exception as e:
print(f" {filename:35} → ERROR: {e}")
def run_validation_tests():
"""Run comprehensive validation tests"""
mapper = MakeNameMapper()
print(f"\n\n🧪 Validation Tests")
print("=" * 20)
# Test cases with expected results
test_cases = [
('toyota.json', 'Toyota', True),
('alfa_romeo.json', 'Alfa Romeo', True),
('bmw.json', 'BMW', True),
('gmc.json', 'GMC', True),
('mclaren.json', 'McLaren', True),
('unknown_brand.json', 'Unknown Brand', False),
]
passed = 0
for filename, expected_name, expected_valid in test_cases:
actual_name = mapper.normalize_make_name(filename)
actual_valid = mapper.validate_mapping(filename, actual_name)
name_correct = actual_name == expected_name
valid_correct = actual_valid == expected_valid
if name_correct and valid_correct:
print(f"{filename}{actual_name} (valid: {actual_valid})")
passed += 1
else:
print(f"{filename}")
if not name_correct:
print(f" Name: Expected '{expected_name}', got '{actual_name}'")
if not valid_correct:
print(f" Valid: Expected {expected_valid}, got {actual_valid}")
print(f"\n📊 Test Results: {passed}/{len(test_cases)} tests passed")
if passed == len(test_cases):
print("🎉 All validation tests passed!")
return True
else:
print("⚠️ Some tests failed!")
return False
if __name__ == "__main__":
demonstrate_make_name_mapping()
demonstrate_complete_mapping()
demonstrate_validation()
demonstrate_database_integration()
demonstrate_error_handling()
success = run_validation_tests()
print("\n\n📋 Summary")
print("=" * 10)
print("✅ Make name normalization patterns implemented")
print("✅ Special capitalization cases handled")
print("✅ Multi-word make names (underscore → space) working")
print("✅ Validation against authoritative list functional")
print("✅ Database integration format demonstrated")
if success:
print("\n🚀 Ready for integration into ETL system!")
else:
print("\n⚠️ Review failed tests before integration")
print("\nKey Implementation Notes:")
print("• filename.replace('.json', '').replace('_', ' ').title()")
print("• Special cases: BMW, GMC, MINI, McLaren")
print("• Validation against sources/makes.json required")
print("• Handle unknown makes gracefully (log warning, continue)")

View File

@@ -0,0 +1,449 @@
#!/usr/bin/env python3
"""
Sample JSON Processing Examples
This file demonstrates complete processing of JSON vehicle data,
from file reading through database-ready output structures.
Usage:
python sample-json-processing.py
"""
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from pathlib import Path
@dataclass
class EngineSpec:
"""Parsed engine specification"""
displacement_l: Optional[float]
configuration: str
cylinders: Optional[int]
fuel_type: str
aspiration: str
raw_string: str
@dataclass
class ModelData:
"""Model information for a specific year"""
name: str
engines: List[EngineSpec]
trims: List[str] # From submodels
@dataclass
class YearData:
"""Vehicle data for a specific year"""
year: int
models: List[ModelData]
@dataclass
class MakeData:
"""Complete make information"""
name: str # Normalized display name
filename: str # Original JSON filename
years: List[YearData]
@property
def total_models(self) -> int:
return sum(len(year.models) for year in self.years)
@property
def total_engines(self) -> int:
return sum(len(model.engines)
for year in self.years
for model in year.models)
@property
def total_trims(self) -> int:
return sum(len(model.trims)
for year in self.years
for model in year.models)
class JsonProcessor:
"""Process JSON vehicle files into structured data"""
def __init__(self):
# Import our utility classes
from engine_parsing_examples import EngineSpecParser
from make_mapping_examples import MakeNameMapper
self.engine_parser = EngineSpecParser()
self.make_mapper = MakeNameMapper()
def process_json_file(self, json_data: Dict[str, Any], filename: str) -> MakeData:
"""Process complete JSON file into structured data"""
# Get the make name (first key in JSON)
make_key = list(json_data.keys())[0]
display_name = self.make_mapper.normalize_make_name(filename)
years_data = []
for year_entry in json_data[make_key]:
year = int(year_entry['year'])
models_data = []
for model_entry in year_entry.get('models', []):
model_name = model_entry['name']
# Process engines
engines = []
engine_strings = model_entry.get('engines', [])
if not engine_strings:
# Electric vehicle - create default engine
engines.append(self.engine_parser.create_electric_motor())
else:
for engine_str in engine_strings:
engine_spec = self.engine_parser.parse_engine_string(engine_str)
engines.append(engine_spec)
# Process trims (from submodels)
trims = model_entry.get('submodels', [])
models_data.append(ModelData(
name=model_name,
engines=engines,
trims=trims
))
years_data.append(YearData(
year=year,
models=models_data
))
return MakeData(
name=display_name,
filename=filename,
years=years_data
)
def demonstrate_tesla_processing():
"""Demonstrate processing Tesla JSON (electric vehicle example)"""
# Sample Tesla data (simplified from actual tesla.json)
tesla_json = {
"tesla": [
{
"year": "2024",
"models": [
{
"name": "3",
"engines": [], # Empty - electric vehicle
"submodels": [
"Long Range AWD",
"Performance",
"Standard Plus"
]
},
{
"name": "y",
"engines": [], # Empty - electric vehicle
"submodels": [
"Long Range",
"Performance"
]
}
]
},
{
"year": "2023",
"models": [
{
"name": "s",
"engines": [], # Empty - electric vehicle
"submodels": [
"Plaid",
"Long Range Plus"
]
}
]
}
]
}
processor = JsonProcessor()
make_data = processor.process_json_file(tesla_json, 'tesla.json')
print("⚡ Tesla JSON Processing Example")
print("=" * 35)
print(f"Filename: tesla.json")
print(f"Display Name: {make_data.name}")
print(f"Years: {len(make_data.years)}")
print(f"Total Models: {make_data.total_models}")
print(f"Total Engines: {make_data.total_engines}")
print(f"Total Trims: {make_data.total_trims}")
print(f"\nDetailed Breakdown:")
for year_data in make_data.years:
print(f"\n {year_data.year}:")
for model in year_data.models:
print(f" Model: {model.name}")
print(f" Engines: {[e.raw_string for e in model.engines]}")
print(f" Trims: {model.trims}")
def demonstrate_subaru_processing():
"""Demonstrate processing Subaru JSON (Boxer engines, H4 configuration)"""
# Sample Subaru data showing H4 engines
subaru_json = {
"subaru": [
{
"year": "2024",
"models": [
{
"name": "crosstrek",
"engines": [
"2.0L H4",
"2.0L H4 PLUG-IN HYBRID EV- (PHEV)",
"2.5L H4"
],
"submodels": [
"Base",
"Premium",
"Limited",
"Hybrid"
]
},
{
"name": "forester",
"engines": [
"2.5L H4"
],
"submodels": [
"Base",
"Premium",
"Sport",
"Limited"
]
}
]
}
]
}
processor = JsonProcessor()
make_data = processor.process_json_file(subaru_json, 'subaru.json')
print(f"\n\n🚗 Subaru JSON Processing Example (Boxer Engines)")
print("=" * 50)
print(f"Display Name: {make_data.name}")
for year_data in make_data.years:
print(f"\n{year_data.year}:")
for model in year_data.models:
print(f" {model.name}:")
for engine in model.engines:
config_note = " (Boxer)" if engine.configuration == 'H' else ""
hybrid_note = " (Hybrid)" if 'Hybrid' in engine.fuel_type else ""
print(f" Engine: {engine.raw_string}")
print(f"{engine.displacement_l}L {engine.configuration}{engine.cylinders}{config_note}{hybrid_note}")
def demonstrate_l_to_i_processing():
"""Demonstrate L→I normalization during processing"""
# Sample data with L-configuration engines
nissan_json = {
"nissan": [
{
"year": "2024",
"models": [
{
"name": "versa",
"engines": [
"1.6L I4"
],
"submodels": ["S", "SV", "SR"]
},
{
"name": "kicks",
"engines": [
"1.5L L3 PLUG-IN HYBRID EV- (PHEV)" # L3 → I3
],
"submodels": ["S", "SV", "SR"]
},
{
"name": "note",
"engines": [
"1.2L L3 FULL HYBRID EV- (FHEV)" # L3 → I3
],
"submodels": ["Base", "Premium"]
}
]
}
]
}
processor = JsonProcessor()
make_data = processor.process_json_file(nissan_json, 'nissan.json')
print(f"\n\n🎯 L→I Normalization Processing Example")
print("=" * 42)
for year_data in make_data.years:
for model in year_data.models:
for engine in model.engines:
original_config = "L" if "L3" in engine.raw_string else "I"
normalized_config = engine.configuration
print(f"Model: {model.name}")
print(f" Input: \"{engine.raw_string}\"")
print(f" Configuration: {original_config}{engine.cylinders}{normalized_config}{engine.cylinders}")
if original_config == "L" and normalized_config == "I":
print(f" 🎯 NORMALIZED: L→I conversion applied")
print()
def demonstrate_database_ready_output():
"""Show how processed data maps to database tables"""
# Sample mixed data
sample_json = {
"toyota": [
{
"year": "2024",
"models": [
{
"name": "camry",
"engines": [
"2.5L I4",
"2.5L I4 FULL HYBRID EV- (FHEV)"
],
"submodels": [
"LE",
"XLE",
"Hybrid LE"
]
}
]
}
]
}
processor = JsonProcessor()
make_data = processor.process_json_file(sample_json, 'toyota.json')
print(f"\n\n💾 Database-Ready Output")
print("=" * 25)
# Show SQL INSERT statements
print("-- Make table")
print(f"INSERT INTO vehicles.make (name) VALUES ('{make_data.name}');")
print(f"\n-- Model table (assuming make_id = 1)")
for year_data in make_data.years:
for model in year_data.models:
print(f"INSERT INTO vehicles.model (make_id, name) VALUES (1, '{model.name}');")
print(f"\n-- Model Year table (assuming model_id = 1)")
for year_data in make_data.years:
print(f"INSERT INTO vehicles.model_year (model_id, year) VALUES (1, {year_data.year});")
print(f"\n-- Engine table")
unique_engines = set()
for year_data in make_data.years:
for model in year_data.models:
for engine in model.engines:
engine_key = (engine.raw_string, engine.displacement_l, engine.cylinders, engine.fuel_type)
if engine_key not in unique_engines:
unique_engines.add(engine_key)
print(f"INSERT INTO vehicles.engine (name, displacement_l, cylinders, fuel_type, aspiration)")
print(f" VALUES ('{engine.raw_string}', {engine.displacement_l}, {engine.cylinders}, '{engine.fuel_type}', '{engine.aspiration}');")
print(f"\n-- Trim table (assuming model_year_id = 1)")
for year_data in make_data.years:
for model in year_data.models:
for trim in model.trims:
print(f"INSERT INTO vehicles.trim (model_year_id, name) VALUES (1, '{trim}');")
def run_processing_validation():
"""Validate that processing works correctly"""
print(f"\n\n✅ Processing Validation")
print("=" * 25)
processor = JsonProcessor()
# Test cases
test_cases = [
# Tesla (electric, empty engines)
('tesla.json', {"tesla": [{"year": "2024", "models": [{"name": "3", "engines": [], "submodels": ["Base"]}]}]}),
# Subaru (H4 engines)
('subaru.json', {"subaru": [{"year": "2024", "models": [{"name": "crosstrek", "engines": ["2.0L H4"], "submodels": ["Base"]}]}]}),
# Nissan (L→I normalization)
('nissan.json', {"nissan": [{"year": "2024", "models": [{"name": "kicks", "engines": ["1.5L L3"], "submodels": ["Base"]}]}]})
]
for filename, json_data in test_cases:
try:
make_data = processor.process_json_file(json_data, filename)
# Basic validation
assert make_data.name is not None, "Make name should not be None"
assert len(make_data.years) > 0, "Should have at least one year"
assert make_data.total_models > 0, "Should have at least one model"
print(f"{filename} processed successfully")
print(f" Make: {make_data.name}, Models: {make_data.total_models}, Engines: {make_data.total_engines}")
# Special validations
if filename == 'tesla.json':
# Should have electric motors for empty engines
for year_data in make_data.years:
for model in year_data.models:
assert all(e.fuel_type == 'Electric' for e in model.engines), "Tesla should have electric engines"
if filename == 'nissan.json':
# Should have L→I normalization
for year_data in make_data.years:
for model in year_data.models:
for engine in model.engines:
if 'L3' in engine.raw_string:
assert engine.configuration == 'I', "L3 should become I3"
except Exception as e:
print(f"{filename} failed: {e}")
return False
print(f"\n🎉 All processing validation tests passed!")
return True
if __name__ == "__main__":
demonstrate_tesla_processing()
demonstrate_subaru_processing()
demonstrate_l_to_i_processing()
demonstrate_database_ready_output()
success = run_processing_validation()
print("\n\n📋 Summary")
print("=" * 10)
print("✅ JSON file processing implemented")
print("✅ Electric vehicle handling (empty engines → Electric Motor)")
print("✅ L→I normalization during processing")
print("✅ Database-ready output structures")
print("✅ Make name normalization integrated")
print("✅ Engine specification parsing integrated")
if success:
print("\n🚀 Ready for ETL pipeline integration!")
else:
print("\n⚠️ Review failed validations")
print("\nNext Steps:")
print("• Integrate with PostgreSQL loader")
print("• Add batch processing for all 55 files")
print("• Implement clear/append modes")
print("• Add CLI interface")
print("• Create comprehensive test suite")