# Implementation Plan - Manual JSON ETL

## Implementation Overview

Add manual JSON processing capability to the existing MVP Platform Vehicles ETL system without disrupting the current MSSQL-based pipeline.

## Development Phases

### Phase 1: Core Utilities ⏳

**Objective**: Create foundational utilities for JSON processing

#### 1.1 Make Name Mapper (`etl/utils/make_name_mapper.py`)

```python
class MakeNameMapper:
    def normalize_make_name(self, filename: str) -> str:
        """Convert 'alfa_romeo' to 'Alfa Romeo'"""

    def get_display_name_mapping(self) -> Dict[str, str]:
        """Get complete filename -> display name mapping"""

    def validate_against_sources(self) -> List[str]:
        """Cross-validate with sources/makes.json"""
```

**Implementation Requirements**:
- Handle underscore → space conversion
- Title-case each word
- Special cases: BMW, GMC (all caps)
- Validation against existing `sources/makes.json`

#### 1.2 Engine Spec Parser (`etl/utils/engine_spec_parser.py`)

```python
@dataclass
class EngineSpec:
    displacement_l: float
    configuration: str  # I, V, H
    cylinders: int
    fuel_type: str      # Gasoline, Hybrid, Electric, Flex Fuel
    aspiration: str     # Natural, Turbo, Supercharged
    raw_string: str

class EngineSpecParser:
    def parse_engine_string(self, engine_str: str) -> EngineSpec:
        """Parse '2.0L I4 PLUG-IN HYBRID EV- (PHEV)' into components"""

    def normalize_configuration(self, config: str) -> str:
        """Convert L → I (L3 becomes I3)"""

    def extract_fuel_type(self, engine_str: str) -> str:
        """Extract fuel type from modifiers"""
```

**Implementation Requirements**:
- **CRITICAL**: L-configuration → I (Inline) normalization
- Regex patterns for the standard format: `{displacement}L {config}{cylinders}`
- Hybrid/electric detection: PHEV, FHEV, ELECTRIC patterns
- Flex-fuel detection: FLEX modifier
- Handle parsing failures gracefully

### Phase 2: Data Extraction ⏳

**Objective**: Extract data from JSON files into normalized structures

#### 2.1 JSON Extractor (`etl/extractors/json_extractor.py`)

```python
class JsonExtractor:
    def __init__(self, make_mapper: MakeNameMapper, engine_parser: EngineSpecParser):
        pass

    def extract_make_data(self, json_file_path: str) -> MakeData:
        """Extract complete make data from JSON file"""

    def extract_all_makes(self, sources_dir: str) -> List[MakeData]:
        """Process all JSON files in directory"""

    def validate_json_structure(self, json_data: dict) -> ValidationResult:
        """Validate JSON structure before processing"""
```

**Data Structures**:

```python
@dataclass
class MakeData:
    name: str                # Normalized display name
    models: List[ModelData]

@dataclass
class ModelData:
    name: str
    years: List[int]
    engines: List[EngineSpec]
    trims: List[str]         # From submodels
```

#### 2.2 Electric Vehicle Handler

```python
class ElectricVehicleHandler:
    def create_default_engine(self) -> EngineSpec:
        """Create default 'Electric Motor' engine for empty arrays"""

    def is_electric_vehicle(self, model_data: ModelData) -> bool:
        """Detect electric vehicles by empty engines + make patterns"""
```

### Phase 3: Data Loading ⏳

**Objective**: Load JSON-extracted data into PostgreSQL

#### 3.1 JSON Manual Loader (`etl/loaders/json_manual_loader.py`)

```python
class JsonManualLoader:
    def __init__(self, postgres_loader: PostgreSQLLoader):
        pass

    def load_make_data(self, make_data: MakeData, mode: LoadMode):
        """Load complete make data with referential integrity"""

    def load_all_makes(self, makes_data: List[MakeData], mode: LoadMode) -> LoadResult:
        """Batch load all makes with progress tracking"""

    def handle_duplicates(self, table: str, data: List[Dict]) -> int:
        """Handle duplicate records based on natural keys"""
```

**Load Modes**:
- **CLEAR**: `TRUNCATE CASCADE` then insert (destructive)
- **APPEND**: Insert with `ON CONFLICT DO NOTHING` (safe)

#### 3.2 Extend PostgreSQL Loader

Enhance `etl/loaders/postgres_loader.py` with JSON-specific methods:

```python
def load_json_makes(self, makes: List[Dict], clear_existing: bool) -> int

def load_json_engines(self, engines: List[EngineSpec],
                      clear_existing: bool) -> int

def create_model_year_relationships(self, model_years: List[Dict]) -> int
```

### Phase 4: Pipeline Integration ⏳

**Objective**: Create manual JSON processing pipeline

#### 4.1 Manual JSON Pipeline (`etl/pipelines/manual_json_pipeline.py`)

```python
class ManualJsonPipeline:
    def __init__(self, sources_dir: str):
        self.extractor = JsonExtractor(...)
        self.loader = JsonManualLoader(...)

    def run_manual_pipeline(self, mode: LoadMode, specific_make: Optional[str] = None) -> PipelineResult:
        """Complete JSON → PostgreSQL pipeline"""

    def validate_before_load(self) -> ValidationReport:
        """Pre-flight validation of all JSON files"""

    def generate_load_report(self) -> LoadReport:
        """Post-load statistics and data quality report"""
```

#### 4.2 Pipeline Result Tracking

```python
@dataclass
class PipelineResult:
    success: bool
    makes_processed: int
    models_loaded: int
    engines_loaded: int
    trims_loaded: int
    errors: List[str]
    warnings: List[str]
    duration: timedelta
```

### Phase 5: CLI Integration ⏳

**Objective**: Add CLI commands for manual processing

#### 5.1 Main CLI Updates (`etl/main.py`)

```python
@cli.command()
@click.option('--mode', type=click.Choice(['clear', 'append']), default='append', help='Load mode')
@click.option('--make', help='Process specific make only')
@click.option('--validate-only', is_flag=True, help='Validate JSON files without loading')
def load_manual(mode, make, validate_only):
    """Load vehicle data from JSON files"""

@cli.command()
def validate_json():
    """Validate the structure and data quality of all JSON files"""
```

#### 5.2 Configuration Updates (`etl/config.py`)

```python
# JSON processing settings
JSON_SOURCES_DIR: str = "sources/makes"
MANUAL_LOAD_DEFAULT_MODE: str = "append"
ELECTRIC_DEFAULT_ENGINE: str = "Electric Motor"
ENGINE_PARSING_STRICT: bool = False  # Log vs. fail on parse errors
```

### Phase 6: Testing & Validation ⏳

**Objective**: Comprehensive testing and validation

#### 6.1 Unit Tests

- `test_make_name_mapper.py` -
  Make name normalization
- `test_engine_spec_parser.py` - Engine parsing with L→I normalization
- `test_json_extractor.py` - JSON data extraction
- `test_manual_loader.py` - Database loading

#### 6.2 Integration Tests

- `test_manual_pipeline.py` - End-to-end JSON processing
- `test_api_integration.py` - Verify API endpoints work with JSON data
- `test_data_quality.py` - Data quality validation

#### 6.3 Data Validation Scripts

```python
# examples/validate_all_json.py
def validate_all_makes() -> ValidationReport:
    """Validate all 55 JSON files and report issues"""

# examples/compare_data_sources.py
def compare_mssql_vs_json() -> ComparisonReport:
    """Compare MSSQL vs JSON data for overlapping makes"""
```

## File Structure Changes

### New Files to Create

```
etl/
├── utils/
│   ├── make_name_mapper.py      # Make name normalization
│   └── engine_spec_parser.py    # Engine specification parsing
├── extractors/
│   └── json_extractor.py        # JSON data extraction
├── loaders/
│   └── json_manual_loader.py    # JSON-specific data loading
└── pipelines/
    └── manual_json_pipeline.py  # JSON processing pipeline
```

### Files to Modify

```
etl/
├── main.py                      # Add load-manual command
├── config.py                    # Add JSON processing config
└── loaders/
    └── postgres_loader.py       # Extend for JSON data types
```

## Implementation Order

### Week 1: Foundation
1. ✅ Create documentation structure
2. ⏳ Implement `MakeNameMapper` with validation
3. ⏳ Implement `EngineSpecParser` with L→I normalization
4. ⏳ Unit tests for utilities

### Week 2: Data Processing
1. ⏳ Implement `JsonExtractor` with validation
2. ⏳ Implement `ElectricVehicleHandler`
3. ⏳ Create data structures and type definitions
4. ⏳ Integration tests for extraction

### Week 3: Data Loading
1. ⏳ Implement `JsonManualLoader` with clear/append modes
2. ⏳ Extend `PostgreSQLLoader` for JSON data types
3. ⏳ Implement duplicate-handling strategy
4. ⏳ Database integration tests

### Week 4: Pipeline & CLI
1. ⏳ Implement `ManualJsonPipeline`
2.
   ⏳ Add CLI commands with options
3. ⏳ Add configuration management
4. ⏳ End-to-end testing

### Week 5: Validation & Polish
1. ⏳ Comprehensive data validation
2. ⏳ Performance testing with all 55 files
3. ⏳ Error-handling improvements
4. ⏳ Documentation completion

## Success Metrics

- [ ] Process all 55 JSON files without errors
- [ ] Correct make name normalization (alfa_romeo → Alfa Romeo)
- [ ] Engine parsing with L→I normalization working
- [ ] Electric vehicle handling (default engines created)
- [ ] Clear/append modes working correctly
- [ ] API endpoints return data from JSON sources
- [ ] Acceptable performance (<5 minutes for a full load)
- [ ] Comprehensive error reporting and logging

## Risk Mitigation

### Data Quality Risks
- **Mitigation**: Extensive validation before loading
- **Fallback**: Report data quality issues, continue processing

### Performance Risks
- **Mitigation**: Batch processing, progress tracking
- **Fallback**: Process makes individually if batch fails

### Schema Compatibility Risks
- **Mitigation**: Thorough testing against existing schema
- **Fallback**: Schema migration scripts if needed

### Integration Risks
- **Mitigation**: Maintain existing MSSQL pipeline compatibility
- **Fallback**: Feature flag to disable JSON processing
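As a concrete reference for the Phase 1.1 normalization rules (underscore → space, title case, all-caps special cases), here is a minimal sketch. It assumes a standalone function rather than the planned `MakeNameMapper` class, and its special-case table covers only the two makes named in this plan:

```python
# Hypothetical sketch of the make-name normalization rules. Only BMW and GMC
# are named as special cases in this plan; the real mapper would also
# cross-validate its output against sources/makes.json.
SPECIAL_CASES = {"bmw": "BMW", "gmc": "GMC"}

def normalize_make_name(filename: str) -> str:
    """Convert a filename stem such as 'alfa_romeo' to 'Alfa Romeo'."""
    stem = filename.lower()
    if stem.endswith(".json"):      # accept either a bare stem or a full filename
        stem = stem[:-len(".json")]
    if stem in SPECIAL_CASES:       # all-caps makes bypass title-casing
        return SPECIAL_CASES[stem]
    # underscore -> space, then title-case each word
    return " ".join(word.capitalize() for word in stem.split("_"))
```

Keeping the special cases in a lookup table makes the BMW/GMC rule data rather than code, so new exceptions can be added without touching the logic.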
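The Phase 1.2 parsing rules can be sketched as follows. The regex and keyword lists are assumptions inferred from the examples in this plan, not the final implementation; a standalone function stands in for the planned `EngineSpecParser` class:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class EngineSpec:
    displacement_l: float
    configuration: str  # I, V, H
    cylinders: int
    fuel_type: str      # Gasoline, Hybrid, Electric, Flex Fuel
    aspiration: str     # Natural, Turbo, Supercharged
    raw_string: str

# Standard format from the plan: {displacement}L {config}{cylinders}, e.g. "2.0L I4"
ENGINE_RE = re.compile(
    r"(?P<disp>\d+(?:\.\d+)?)L\s+(?P<config>[IVLH])(?P<cyl>\d+)", re.IGNORECASE
)

def parse_engine_string(engine_str: str) -> Optional[EngineSpec]:
    match = ENGINE_RE.search(engine_str)
    if match is None:
        return None  # graceful failure: caller logs the string and moves on
    config = match.group("config").upper()
    if config == "L":  # CRITICAL normalization: L means inline, so L3 -> I3
        config = "I"
    upper = engine_str.upper()
    if any(tag in upper for tag in ("PHEV", "FHEV", "HYBRID")):
        fuel = "Hybrid"
    elif "ELECTRIC" in upper:
        fuel = "Electric"
    elif "FLEX" in upper:
        fuel = "Flex Fuel"
    else:
        fuel = "Gasoline"
    if "SUPERCHARG" in upper:
        aspiration = "Supercharged"
    elif "TURBO" in upper:
        aspiration = "Turbo"
    else:
        aspiration = "Natural"
    return EngineSpec(float(match.group("disp")), config,
                      int(match.group("cyl")), fuel, aspiration, engine_str)
```

Returning `None` on a non-matching string mirrors the "handle parsing failures gracefully" requirement and the `ENGINE_PARSING_STRICT: bool = False` config default.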
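To make the CLEAR/APPEND distinction from Phase 3.1 concrete, here is a sketch of how the two modes could translate to SQL. Table and column names are placeholders, and statement execution (psycopg2 or similar) is omitted so the mode logic stays visible:

```python
from enum import Enum

class LoadMode(Enum):
    CLEAR = "clear"    # TRUNCATE ... CASCADE first, then plain inserts (destructive)
    APPEND = "append"  # inserts guarded by ON CONFLICT DO NOTHING (safe)

def build_statements(table: str, columns: list, mode: LoadMode) -> list:
    """Return the SQL statements a loader would run for one batch."""
    placeholders = ", ".join(["%s"] * len(columns))
    insert = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    if mode is LoadMode.CLEAR:
        # Destructive: wipe the table (and dependents, via CASCADE) before loading.
        return [f"TRUNCATE TABLE {table} CASCADE", insert]
    # Safe: duplicate natural keys are silently skipped by PostgreSQL.
    return [insert + " ON CONFLICT DO NOTHING"]
```

Note that a bare `ON CONFLICT DO NOTHING` skips any conflicting row; the real loader would likely name the natural-key constraint explicitly (`ON CONFLICT (name) DO NOTHING`) per table.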