# Implementation Plan - Manual JSON ETL

## Implementation Overview

Add manual JSON processing capability to the existing MVP Platform Vehicles ETL system without disrupting the current MSSQL-based pipeline.

## Development Phases

### Phase 1: Core Utilities ⏳

**Objective**: Create foundational utilities for JSON processing

#### 1.1 Make Name Mapper (`etl/utils/make_name_mapper.py`)
```python
class MakeNameMapper:
    def normalize_make_name(self, filename: str) -> str:
        """Convert 'alfa_romeo' to 'Alfa Romeo'"""

    def get_display_name_mapping(self) -> Dict[str, str]:
        """Get the complete filename -> display name mapping"""

    def validate_against_sources(self) -> List[str]:
        """Cross-validate with sources/makes.json"""
```

**Implementation Requirements**:

- Handle underscore → space conversion
- Title-case each word
- Special cases: BMW, GMC (all caps)
- Validate against the existing `sources/makes.json`

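The rules above can be sketched as follows; the special-case table shown here is an illustrative assumption and would live inside `MakeNameMapper` in practice:

```python
# Hypothetical sketch of the normalization rules: underscore -> space,
# title-case each word, with an all-caps exception table.
SPECIAL_CASES = {"bmw": "BMW", "gmc": "GMC"}  # assumed; real set may differ

def normalize_make_name(filename: str) -> str:
    """Convert a filename stem like 'alfa_romeo' to 'Alfa Romeo'."""
    words = filename.lower().split("_")
    return " ".join(SPECIAL_CASES.get(w, w.title()) for w in words)
```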
#### 1.2 Engine Spec Parser (`etl/utils/engine_spec_parser.py`)

```python
@dataclass
class EngineSpec:
    displacement_l: float
    configuration: str  # I, V, H
    cylinders: int
    fuel_type: str      # Gasoline, Hybrid, Electric, Flex Fuel
    aspiration: str     # Natural, Turbo, Supercharged
    raw_string: str


class EngineSpecParser:
    def parse_engine_string(self, engine_str: str) -> EngineSpec:
        """Parse '2.0L I4 PLUG-IN HYBRID EV- (PHEV)' into components"""

    def normalize_configuration(self, config: str) -> str:
        """Convert L → I (L3 becomes I3)"""

    def extract_fuel_type(self, engine_str: str) -> str:
        """Extract fuel type from modifiers"""
```

**Implementation Requirements**:

- **CRITICAL**: L-configuration → I (Inline) normalization
- Regex patterns for the standard format: `{displacement}L {config}{cylinders}`
- Hybrid/electric detection: PHEV, FHEV, ELECTRIC patterns
- Flex-fuel detection: FLEX modifier
- Handle parsing failures gracefully

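A minimal sketch of the regex approach described above; the pattern and fuel-type keywords are assumptions, not the final parser:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class EngineSpec:
    displacement_l: float
    configuration: str
    cylinders: int
    fuel_type: str

# Assumed standard format: "{displacement}L {config}{cylinders}"; the class
# includes L so that L-configurations can be caught and normalized to I.
_ENGINE_RE = re.compile(r"(?P<disp>\d+\.\d+)L\s+(?P<config>[IVLH])(?P<cyl>\d+)")

def parse_engine_string(engine_str: str) -> Optional[EngineSpec]:
    upper = engine_str.upper()
    m = _ENGINE_RE.search(upper)
    if m is None:
        return None  # fail gracefully; the caller logs and continues
    config = m.group("config")
    if config == "L":  # CRITICAL: L-configuration -> I (Inline)
        config = "I"
    if "PHEV" in upper or "FHEV" in upper or "HYBRID" in upper:
        fuel = "Hybrid"
    elif "ELECTRIC" in upper:
        fuel = "Electric"
    elif "FLEX" in upper:
        fuel = "Flex Fuel"
    else:
        fuel = "Gasoline"
    return EngineSpec(float(m.group("disp")), config, int(m.group("cyl")), fuel)
```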
### Phase 2: Data Extraction ⏳

**Objective**: Extract data from JSON files into normalized structures

#### 2.1 JSON Extractor (`etl/extractors/json_extractor.py`)

```python
class JsonExtractor:
    def __init__(self, make_mapper: MakeNameMapper,
                 engine_parser: EngineSpecParser):
        pass

    def extract_make_data(self, json_file_path: str) -> MakeData:
        """Extract complete make data from a JSON file"""

    def extract_all_makes(self, sources_dir: str) -> List[MakeData]:
        """Process all JSON files in the directory"""

    def validate_json_structure(self, json_data: dict) -> ValidationResult:
        """Validate JSON structure before processing"""
```

**Data Structures**:

```python
@dataclass
class MakeData:
    name: str                 # Normalized display name
    models: List[ModelData]


@dataclass
class ModelData:
    name: str
    years: List[int]
    engines: List[EngineSpec]
    trims: List[str]          # From submodels
```
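To illustrate how these structures map onto a source file, here is a small extraction sketch; the JSON shape shown (a `models` array with `years`, `engines`, and `submodels` keys) is an assumption about the source format:

```python
import json
from dataclasses import dataclass

# Simplified stand-in for ModelData above (engines kept as raw strings here).
@dataclass
class ModelData:
    name: str
    years: list
    engines: list
    trims: list

# Hypothetical sample payload; real files live under sources/makes/.
sample = json.loads("""
{"models": [{"name": "Giulia",
             "years": [2022, 2023],
             "engines": ["2.0L I4"],
             "submodels": ["Ti", "Veloce"]}]}
""")

models = [
    ModelData(m["name"], m["years"], m["engines"], m.get("submodels", []))
    for m in sample["models"]
]
```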

#### 2.2 Electric Vehicle Handler

```python
class ElectricVehicleHandler:
    def create_default_engine(self) -> EngineSpec:
        """Create a default 'Electric Motor' engine for empty engine arrays"""

    def is_electric_vehicle(self, model_data: ModelData) -> bool:
        """Detect electric vehicles by empty engines + make patterns"""
```

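The detection heuristic can be sketched as follows; the make list is a hypothetical placeholder, not the project's actual pattern set:

```python
# Assumed heuristic: an empty engines array on a known-EV make signals an
# electric model that should receive the default 'Electric Motor' engine.
KNOWN_EV_MAKES = {"Tesla", "Rivian", "Polestar"}  # placeholder set

def is_electric_vehicle(make_name: str, engines: list) -> bool:
    return not engines and make_name in KNOWN_EV_MAKES
```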
### Phase 3: Data Loading ⏳

**Objective**: Load JSON-extracted data into PostgreSQL

#### 3.1 JSON Manual Loader (`etl/loaders/json_manual_loader.py`)

```python
class JsonManualLoader:
    def __init__(self, postgres_loader: PostgreSQLLoader):
        pass

    def load_make_data(self, make_data: MakeData, mode: LoadMode):
        """Load complete make data with referential integrity"""

    def load_all_makes(self, makes_data: List[MakeData],
                       mode: LoadMode) -> LoadResult:
        """Batch-load all makes with progress tracking"""

    def handle_duplicates(self, table: str, data: List[Dict]) -> int:
        """Handle duplicate records based on natural keys"""
```

**Load Modes**:

- **CLEAR**: `TRUNCATE ... CASCADE`, then insert (destructive)
- **APPEND**: Insert with `ON CONFLICT DO NOTHING` (safe)

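As a sketch of how the two modes differ at the SQL level — the statement builder and table/column names are illustrative, and execution would go through `PostgreSQLLoader` in practice:

```python
def build_insert_sql(table: str, columns: list, mode: str) -> str:
    """Build an INSERT statement; APPEND mode skips duplicate natural keys."""
    cols = ", ".join(columns)
    params = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({cols}) VALUES ({params})"
    if mode == "append":
        sql += " ON CONFLICT DO NOTHING"  # safe: existing rows are untouched
    return sql  # CLEAR mode truncates beforehand, then inserts plainly
```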
#### 3.2 Extend PostgreSQL Loader

Enhance `etl/loaders/postgres_loader.py` with JSON-specific methods:

```python
def load_json_makes(self, makes: List[Dict], clear_existing: bool) -> int
def load_json_engines(self, engines: List[EngineSpec], clear_existing: bool) -> int
def create_model_year_relationships(self, model_years: List[Dict]) -> int
```

### Phase 4: Pipeline Integration ⏳

**Objective**: Create the manual JSON processing pipeline

#### 4.1 Manual JSON Pipeline (`etl/pipelines/manual_json_pipeline.py`)

```python
class ManualJsonPipeline:
    def __init__(self, sources_dir: str):
        self.extractor = JsonExtractor(...)
        self.loader = JsonManualLoader(...)

    def run_manual_pipeline(self, mode: LoadMode,
                            specific_make: Optional[str] = None) -> PipelineResult:
        """Complete JSON → PostgreSQL pipeline"""

    def validate_before_load(self) -> ValidationReport:
        """Pre-flight validation of all JSON files"""

    def generate_load_report(self) -> LoadReport:
        """Post-load statistics and data-quality report"""
```

#### 4.2 Pipeline Result Tracking

```python
@dataclass
class PipelineResult:
    success: bool
    makes_processed: int
    models_loaded: int
    engines_loaded: int
    trims_loaded: int
    errors: List[str]
    warnings: List[str]
    duration: timedelta
```

### Phase 5: CLI Integration ⏳

**Objective**: Add CLI commands for manual processing

#### 5.1 Main CLI Updates (`etl/main.py`)

```python
@cli.command()
@click.option('--mode', type=click.Choice(['clear', 'append']),
              default='append', help='Load mode')
@click.option('--make', help='Process a specific make only')
@click.option('--validate-only', is_flag=True,
              help='Validate JSON files without loading')
def load_manual(mode, make, validate_only):
    """Load vehicle data from JSON files"""


@cli.command()
def validate_json():
    """Validate structure and data quality of all JSON files"""
```

#### 5.2 Configuration Updates (`etl/config.py`)

```python
# JSON processing settings
JSON_SOURCES_DIR: str = "sources/makes"
MANUAL_LOAD_DEFAULT_MODE: str = "append"
ELECTRIC_DEFAULT_ENGINE: str = "Electric Motor"
ENGINE_PARSING_STRICT: bool = False  # log vs. fail on parse errors
```

### Phase 6: Testing & Validation ⏳

**Objective**: Comprehensive testing and validation

#### 6.1 Unit Tests

- `test_make_name_mapper.py` - Make name normalization
- `test_engine_spec_parser.py` - Engine parsing with L→I normalization
- `test_json_extractor.py` - JSON data extraction
- `test_manual_loader.py` - Database loading

#### 6.2 Integration Tests

- `test_manual_pipeline.py` - End-to-end JSON processing
- `test_api_integration.py` - Verify API endpoints work with JSON data
- `test_data_quality.py` - Data quality validation

#### 6.3 Data Validation Scripts

```python
# examples/validate_all_json.py
def validate_all_makes() -> ValidationReport:
    """Validate all 55 JSON files and report issues"""


# examples/compare_data_sources.py
def compare_mssql_vs_json() -> ComparisonReport:
    """Compare MSSQL vs. JSON data for overlapping makes"""
```

## File Structure Changes

### New Files to Create

```
etl/
├── utils/
│   ├── make_name_mapper.py      # Make name normalization
│   └── engine_spec_parser.py    # Engine specification parsing
├── extractors/
│   └── json_extractor.py        # JSON data extraction
├── loaders/
│   └── json_manual_loader.py    # JSON-specific data loading
└── pipelines/
    └── manual_json_pipeline.py  # JSON processing pipeline
```

### Files to Modify

```
etl/
├── main.py                      # Add load-manual command
├── config.py                    # Add JSON processing config
└── loaders/
    └── postgres_loader.py       # Extend for JSON data types
```

## Implementation Order

### Week 1: Foundation

1. ✅ Create documentation structure
2. ⏳ Implement `MakeNameMapper` with validation
3. ⏳ Implement `EngineSpecParser` with L→I normalization
4. ⏳ Write unit tests for the utilities

### Week 2: Data Processing

1. ⏳ Implement `JsonExtractor` with validation
2. ⏳ Implement `ElectricVehicleHandler`
3. ⏳ Create data structures and type definitions
4. ⏳ Write integration tests for extraction

### Week 3: Data Loading

1. ⏳ Implement `JsonManualLoader` with clear/append modes
2. ⏳ Extend `PostgreSQLLoader` for JSON data types
3. ⏳ Implement the duplicate-handling strategy
4. ⏳ Write database integration tests

### Week 4: Pipeline & CLI

1. ⏳ Implement `ManualJsonPipeline`
2. ⏳ Add CLI commands with options
3. ⏳ Add configuration management
4. ⏳ End-to-end testing

### Week 5: Validation & Polish

1. ⏳ Comprehensive data validation
2. ⏳ Performance testing with all 55 files
3. ⏳ Error-handling improvements
4. ⏳ Documentation completion

## Success Metrics

- [ ] Process all 55 JSON files without errors
- [ ] Correct make name normalization (alfa_romeo → Alfa Romeo)
- [ ] Engine parsing with L→I normalization working
- [ ] Electric vehicle handling (default engines created)
- [ ] Clear/append modes working correctly
- [ ] API endpoints return data from JSON sources
- [ ] Acceptable performance (< 5 minutes for a full load)
- [ ] Comprehensive error reporting and logging

## Risk Mitigation

### Data Quality Risks

- **Mitigation**: Extensive validation before loading
- **Fallback**: Report data-quality issues and continue processing

### Performance Risks

- **Mitigation**: Batch processing with progress tracking
- **Fallback**: Process makes individually if a batch fails

### Schema Compatibility Risks

- **Mitigation**: Thorough testing against the existing schema
- **Fallback**: Schema migration scripts if needed

### Integration Risks

- **Mitigation**: Maintain compatibility with the existing MSSQL pipeline
- **Fallback**: Feature flag to disable JSON processing