# Implementation Plan - Manual JSON ETL
## Implementation Overview
Add manual JSON processing capability to the existing MVP Platform Vehicles ETL system without disrupting the current MSSQL-based pipeline.
## Development Phases
### Phase 1: Core Utilities ⏳
**Objective**: Create foundational utilities for JSON processing
#### 1.1 Make Name Mapper (`etl/utils/make_name_mapper.py`)
```python
class MakeNameMapper:
    def normalize_make_name(self, filename: str) -> str:
        """Convert 'alfa_romeo' to 'Alfa Romeo'"""

    def get_display_name_mapping(self) -> Dict[str, str]:
        """Get complete filename -> display name mapping"""

    def validate_against_sources(self) -> List[str]:
        """Cross-validate with sources/makes.json"""
```
**Implementation Requirements**:
- Handle underscore → space conversion
- Title case each word
- Special cases: BMW, GMC (all caps)
- Validation against existing `sources/makes.json`
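The normalization rules above can be sketched as follows. This is a minimal standalone version, assuming the special-case set is exactly the one listed (BMW, GMC); the real mapper may carry more exceptions and the `sources/makes.json` cross-check.

```python
# Makes whose display names are all-caps; assumed set per the
# requirements above -- the production mapper may extend this.
ALL_CAPS_MAKES = {"bmw", "gmc"}

def normalize_make_name(filename: str) -> str:
    """Convert a filename stem like 'alfa_romeo' to 'Alfa Romeo'."""
    stem = filename.lower().removesuffix(".json")
    return " ".join(
        word.upper() if word in ALL_CAPS_MAKES else word.capitalize()
        for word in stem.split("_")
    )
```

For example, `normalize_make_name("alfa_romeo")` yields `"Alfa Romeo"` and `normalize_make_name("bmw.json")` yields `"BMW"`.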
#### 1.2 Engine Spec Parser (`etl/utils/engine_spec_parser.py`)
```python
@dataclass
class EngineSpec:
    displacement_l: float
    configuration: str  # I, V, H
    cylinders: int
    fuel_type: str      # Gasoline, Hybrid, Electric, Flex Fuel
    aspiration: str     # Natural, Turbo, Supercharged
    raw_string: str

class EngineSpecParser:
    def parse_engine_string(self, engine_str: str) -> EngineSpec:
        """Parse '2.0L I4 PLUG-IN HYBRID EV- (PHEV)' into components"""

    def normalize_configuration(self, config: str) -> str:
        """Convert L → I (L3 becomes I3)"""

    def extract_fuel_type(self, engine_str: str) -> str:
        """Extract fuel type from modifiers"""
```
**Implementation Requirements**:
- **CRITICAL**: L-configuration → I (Inline) normalization
- Regex patterns for standard format: `{displacement}L {config}{cylinders}`
- Hybrid/electric detection: PHEV, FHEV, ELECTRIC patterns
- Flex-fuel detection: FLEX modifier
- Handle parsing failures gracefully
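A rough sketch of the standard-format parse, including the critical L→I normalization. The regex and keyword checks are assumptions inferred from the example string above (`'2.0L I4 PLUG-IN HYBRID EV- (PHEV)'`); the real parser will need additional patterns, and returns a dict here rather than the `EngineSpec` dataclass for brevity.

```python
import re

# Assumed standard format: {displacement}L {config}{cylinders}
ENGINE_RE = re.compile(r"(?P<disp>\d+(?:\.\d+)?)L\s+(?P<config>[IVHL])(?P<cyl>\d+)")

def parse_engine_string(engine_str: str) -> dict:
    upper = engine_str.upper()
    m = ENGINE_RE.search(upper)
    if not m:
        # Graceful failure: keep the raw string for later inspection
        return {"raw_string": engine_str}

    config = m.group("config")
    if config == "L":  # CRITICAL: 'L3' means inline-3, so L → I
        config = "I"

    if "PHEV" in upper or "FHEV" in upper or "HYBRID" in upper:
        fuel = "Hybrid"
    elif "ELECTRIC" in upper:
        fuel = "Electric"
    elif "FLEX" in upper:
        fuel = "Flex Fuel"
    else:
        fuel = "Gasoline"

    if "TURBO" in upper:
        aspiration = "Turbo"
    elif "SUPERCHARG" in upper:
        aspiration = "Supercharged"
    else:
        aspiration = "Natural"

    return {
        "displacement_l": float(m.group("disp")),
        "configuration": config,
        "cylinders": int(m.group("cyl")),
        "fuel_type": fuel,
        "aspiration": aspiration,
        "raw_string": engine_str,
    }
```

So `'2.0L I4 PLUG-IN HYBRID EV- (PHEV)'` parses to a 2.0 L I4 hybrid, and `'1.5L L3 TURBO'` is normalized to an I3 turbo.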
### Phase 2: Data Extraction ⏳
**Objective**: Extract data from JSON files into normalized structures
#### 2.1 JSON Extractor (`etl/extractors/json_extractor.py`)
```python
class JsonExtractor:
    def __init__(self, make_mapper: MakeNameMapper,
                 engine_parser: EngineSpecParser):
        pass

    def extract_make_data(self, json_file_path: str) -> MakeData:
        """Extract complete make data from JSON file"""

    def extract_all_makes(self, sources_dir: str) -> List[MakeData]:
        """Process all JSON files in directory"""

    def validate_json_structure(self, json_data: dict) -> ValidationResult:
        """Validate JSON structure before processing"""
```
**Data Structures**:
```python
@dataclass
class MakeData:
    name: str                # Normalized display name
    models: List[ModelData]

@dataclass
class ModelData:
    name: str
    years: List[int]
    engines: List[EngineSpec]
    trims: List[str]         # From submodels
```
#### 2.2 Electric Vehicle Handler
```python
class ElectricVehicleHandler:
    def create_default_engine(self) -> EngineSpec:
        """Create default 'Electric Motor' engine for empty arrays"""

    def is_electric_vehicle(self, model_data: ModelData) -> bool:
        """Detect electric vehicles by empty engines + make patterns"""
```
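The empty-engines fallback could look like the sketch below. The field values for the default engine (zero displacement, `"E"` configuration) are placeholder assumptions; plain dicts stand in for `EngineSpec`/`ModelData` to keep the example self-contained.

```python
def default_electric_engine() -> dict:
    """Assumed shape of the 'Electric Motor' default engine record."""
    return {
        "displacement_l": 0.0,
        "configuration": "E",   # placeholder marker for electric motors
        "cylinders": 0,
        "fuel_type": "Electric",
        "aspiration": "Natural",
        "raw_string": "Electric Motor",
    }

def ensure_engines(model: dict) -> dict:
    """If a model's engines array is empty, attach the default engine."""
    if not model.get("engines"):
        model["engines"] = [default_electric_engine()]
    return model
```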
### Phase 3: Data Loading ⏳
**Objective**: Load JSON-extracted data into PostgreSQL
#### 3.1 JSON Manual Loader (`etl/loaders/json_manual_loader.py`)
```python
class JsonManualLoader:
    def __init__(self, postgres_loader: PostgreSQLLoader):
        pass

    def load_make_data(self, make_data: MakeData, mode: LoadMode):
        """Load complete make data with referential integrity"""

    def load_all_makes(self, makes_data: List[MakeData],
                       mode: LoadMode) -> LoadResult:
        """Batch load all makes with progress tracking"""

    def handle_duplicates(self, table: str, data: List[Dict]) -> int:
        """Handle duplicate records based on natural keys"""
```
**Load Modes**:
- **CLEAR**: `TRUNCATE CASCADE` then insert (destructive)
- **APPEND**: Insert with `ON CONFLICT DO NOTHING` (safe)
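For APPEND mode, `handle_duplicates` can also drop natural-key duplicates client-side before the `ON CONFLICT DO NOTHING` insert ever runs. A minimal sketch, assuming rows are plain dicts and the key columns are passed in by the caller:

```python
def dedupe_by_natural_key(rows: list, key_fields: list) -> list:
    """Return rows with later duplicates (by natural key) removed,
    preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

For example, deduplicating on `["make", "model"]` keeps one row per make/model pair; the database-side `ON CONFLICT` clause then remains as a safety net rather than the primary dedup mechanism.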
#### 3.2 Extend PostgreSQL Loader
Enhance `etl/loaders/postgres_loader.py` with JSON-specific methods:
```python
def load_json_makes(self, makes: List[Dict], clear_existing: bool) -> int
def load_json_engines(self, engines: List[EngineSpec], clear_existing: bool) -> int
def create_model_year_relationships(self, model_years: List[Dict]) -> int
```
### Phase 4: Pipeline Integration ⏳
**Objective**: Create manual JSON processing pipeline
#### 4.1 Manual JSON Pipeline (`etl/pipelines/manual_json_pipeline.py`)
```python
class ManualJsonPipeline:
    def __init__(self, sources_dir: str):
        self.extractor = JsonExtractor(...)
        self.loader = JsonManualLoader(...)

    def run_manual_pipeline(self, mode: LoadMode,
                            specific_make: Optional[str] = None) -> PipelineResult:
        """Complete JSON → PostgreSQL pipeline"""

    def validate_before_load(self) -> ValidationReport:
        """Pre-flight validation of all JSON files"""

    def generate_load_report(self) -> LoadReport:
        """Post-load statistics and data quality report"""
```
#### 4.2 Pipeline Result Tracking
```python
@dataclass
class PipelineResult:
    success: bool
    makes_processed: int
    models_loaded: int
    engines_loaded: int
    trims_loaded: int
    errors: List[str]
    warnings: List[str]
    duration: timedelta
```
### Phase 5: CLI Integration ⏳
**Objective**: Add CLI commands for manual processing
#### 5.1 Main CLI Updates (`etl/main.py`)
```python
@cli.command()
@click.option('--mode', type=click.Choice(['clear', 'append']),
              default='append', help='Load mode')
@click.option('--make', help='Process specific make only')
@click.option('--validate-only', is_flag=True,
              help='Validate JSON files without loading')
def load_manual(mode, make, validate_only):
    """Load vehicle data from JSON files"""

@cli.command()
def validate_json():
    """Validate all JSON files structure and data quality"""
```
#### 5.2 Configuration Updates (`etl/config.py`)
```python
# JSON Processing settings
JSON_SOURCES_DIR: str = "sources/makes"
MANUAL_LOAD_DEFAULT_MODE: str = "append"
ELECTRIC_DEFAULT_ENGINE: str = "Electric Motor"
ENGINE_PARSING_STRICT: bool = False # Log vs fail on parse errors
```
### Phase 6: Testing & Validation ⏳
**Objective**: Comprehensive testing and validation
#### 6.1 Unit Tests
- `test_make_name_mapper.py` - Make name normalization
- `test_engine_spec_parser.py` - Engine parsing with L→I normalization
- `test_json_extractor.py` - JSON data extraction
- `test_manual_loader.py` - Database loading
#### 6.2 Integration Tests
- `test_manual_pipeline.py` - End-to-end JSON processing
- `test_api_integration.py` - Verify API endpoints work with JSON data
- `test_data_quality.py` - Data quality validation
#### 6.3 Data Validation Scripts
```python
# examples/validate_all_json.py
def validate_all_makes() -> ValidationReport:
    """Validate all 55 JSON files and report issues"""

# examples/compare_data_sources.py
def compare_mssql_vs_json() -> ComparisonReport:
    """Compare MSSQL vs JSON data for overlapping makes"""
```
## File Structure Changes
### New Files to Create
```
etl/
├── utils/
│   ├── make_name_mapper.py      # Make name normalization
│   └── engine_spec_parser.py    # Engine specification parsing
├── extractors/
│   └── json_extractor.py        # JSON data extraction
├── loaders/
│   └── json_manual_loader.py    # JSON-specific data loading
└── pipelines/
    └── manual_json_pipeline.py  # JSON processing pipeline
```
### Files to Modify
```
etl/
├── main.py                      # Add load-manual command
├── config.py                    # Add JSON processing config
└── loaders/
    └── postgres_loader.py       # Extend for JSON data types
```
## Implementation Order
### Week 1: Foundation
1. ✅ Create documentation structure
2. ⏳ Implement `MakeNameMapper` with validation
3. ⏳ Implement `EngineSpecParser` with L→I normalization
4. ⏳ Unit tests for utilities
### Week 2: Data Processing
1. ⏳ Implement `JsonExtractor` with validation
2. ⏳ Implement `ElectricVehicleHandler`
3. ⏳ Create data structures and type definitions
4. ⏳ Integration tests for extraction
### Week 3: Data Loading
1. ⏳ Implement `JsonManualLoader` with clear/append modes
2. ⏳ Extend `PostgreSQLLoader` for JSON data types
3. ⏳ Implement duplicate handling strategy
4. ⏳ Database integration tests
### Week 4: Pipeline & CLI
1. ⏳ Implement `ManualJsonPipeline`
2. ⏳ Add CLI commands with options
3. ⏳ Add configuration management
4. ⏳ End-to-end testing
### Week 5: Validation & Polish
1. ⏳ Comprehensive data validation
2. ⏳ Performance testing with all 55 files
3. ⏳ Error handling improvements
4. ⏳ Documentation completion
## Success Metrics
- [ ] Process all 55 JSON files without errors
- [ ] Correct make name normalization (alfa_romeo → Alfa Romeo)
- [ ] Engine parsing with L→I normalization working
- [ ] Electric vehicle handling (default engines created)
- [ ] Clear/append modes working correctly
- [ ] API endpoints return data from JSON sources
- [ ] Performance acceptable (<5 minutes for full load)
- [ ] Comprehensive error reporting and logging
## Risk Mitigation
### Data Quality Risks
- **Mitigation**: Extensive validation before loading
- **Fallback**: Report data quality issues, continue processing
### Performance Risks
- **Mitigation**: Batch processing, progress tracking
- **Fallback**: Process makes individually if batch fails
### Schema Compatibility Risks
- **Mitigation**: Thorough testing against existing schema
- **Fallback**: Schema migration scripts if needed
### Integration Risks
- **Mitigation**: Maintain existing MSSQL pipeline compatibility
- **Fallback**: Feature flag to disable JSON processing