k8s improvement

Eric Gullickson
2025-09-17 20:47:42 -05:00
parent a052040e3a
commit 17d27f4b92
29 changed files with 836 additions and 29 deletions

K8S-REDESIGN.md

@@ -0,0 +1,655 @@
# Docker Compose → Kubernetes Architecture Redesign
## Overview
This document describes a redesign of MotoVaultPro's Docker Compose architecture so that it mirrors Kubernetes deployment patterns. The goal is to keep all current functionality while preparing for a later K8s migration and improving the development experience.
## Current Architecture Analysis
### Existing Services (13 containers)
**MVP Platform Services (Microservices)**
- `mvp-platform-landing` - Marketing/landing page (nginx)
- `mvp-platform-tenants` - Multi-tenant management API (FastAPI)
- `mvp-platform-vehicles-api` - Vehicle data API (FastAPI)
- `mvp-platform-vehicles-etl` - Data processing pipeline (Python)
- `mvp-platform-vehicles-db` - Vehicle data storage (PostgreSQL)
- `mvp-platform-vehicles-redis` - Vehicle data cache (Redis)
- `mvp-platform-vehicles-mssql` - Monthly ETL source (SQL Server)
**Application Services (Modular Monolith)**
- `admin-backend` - Application API with feature capsules (Node.js)
- `admin-frontend` - React SPA (nginx)
- `admin-postgres` - Application database (PostgreSQL)
- `admin-redis` - Application cache (Redis)
- `admin-minio` - Object storage (MinIO)
**Infrastructure**
- `nginx-proxy` - Load balancer and SSL termination
- `platform-postgres` - Platform services database
- `platform-redis` - Platform services cache
### Current Limitations
1. **Single Network**: All services on default network
2. **Manual Routing**: nginx configuration requires manual updates
3. **Port Exposure**: Many services expose ports directly
4. **Configuration**: Environment variables scattered across services
5. **Service Discovery**: Hard-coded service names
6. **Observability**: Limited monitoring and debugging capabilities
## Target Kubernetes-like Architecture
### Network Segmentation
```yaml
networks:
  frontend:
    driver: bridge
    labels:
      - "com.motovaultpro.network=frontend"
      - "com.motovaultpro.purpose=public-facing"
  backend:
    driver: bridge
    internal: true
    labels:
      - "com.motovaultpro.network=backend"
      - "com.motovaultpro.purpose=api-services"
  database:
    driver: bridge
    internal: true
    labels:
      - "com.motovaultpro.network=database"
      - "com.motovaultpro.purpose=data-layer"
  platform:
    driver: bridge
    internal: true
    labels:
      - "com.motovaultpro.network=platform"
      - "com.motovaultpro.purpose=microservices"
```
### Service Placement Strategy
| Network | Services | Purpose | K8s Equivalent |
|---------|----------|---------|----------------|
| `frontend` | traefik, admin-frontend, mvp-platform-landing | Public-facing services | Public LoadBalancer services |
| `backend` | admin-backend, mvp-platform-tenants, mvp-platform-vehicles-api | API services | ClusterIP services |
| `database` | All PostgreSQL, Redis, MinIO | Data persistence | StatefulSets with PVCs |
| `platform` | Platform microservices communication | Internal service mesh | Service mesh networking |
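
As a rough illustration of the placement table, a service joins a network by listing it under its own `networks:` key. The sketch below assumes `admin-backend` must sit on both `backend` and `database` to reach its data stores. The image names are placeholders, not taken from the project.

```yaml
# Sketch only: image names and versions are hypothetical placeholders.
services:
  admin-backend:
    image: example/admin-backend:latest   # placeholder image
    networks:
      - backend    # where Traefik reaches the API
      - database   # where the API reaches admin-postgres
  admin-postgres:
    image: postgres:16                    # assumed version
    networks:
      - database   # internal-only network, no published ports

networks:
  backend:
    driver: bridge
    internal: true
  database:
    driver: bridge
    internal: true
```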
## Traefik Configuration
### Core Traefik Setup
```yaml
traefik:
  image: traefik:v3.0
  container_name: traefik
  networks:
    - frontend
    - backend
  ports:
    - "80:80"
    - "443:443"
    - "8080:8080" # Dashboard
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - ./config/traefik:/config:ro
    - ./certs:/certs:ro
  configs:
    - source: traefik-config
      target: /etc/traefik/traefik.yml
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.dashboard.rule=Host(`traefik.motovaultpro.local`)"
    - "traefik.http.routers.dashboard.tls=true"
    # Point the dashboard router at Traefik's built-in API/dashboard service
    - "traefik.http.routers.dashboard.service=api@internal"
```
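The service above mounts a static configuration at `/etc/traefik/traefik.yml`, which this document does not show. A minimal sketch of what that file might contain follows. The entry point names (`web`, `websecure`) and the insecure API mode are assumptions, chosen to match the published ports 80, 443, and 8080.

```yaml
# traefik.yml (static configuration): a sketch, not the project's actual file
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure      # send plain HTTP to the TLS entry point
          scheme: https
  websecure:
    address: ":443"

api:
  dashboard: true
  insecure: true             # assumption: serves the API and dashboard on :8080, local use only

providers:
  docker:
    endpoint: "unix:///var/run/docker.sock"
    exposedByDefault: false  # only containers labelled traefik.enable=true are routed
  file:
    filename: /config/middleware.yml   # dynamic config from the ./config/traefik mount
    watch: true
```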
### Service Discovery Labels
**Admin Frontend**
```yaml
admin-frontend:
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.admin-app.rule=Host(`admin.motovaultpro.com`)"
    - "traefik.http.routers.admin-app.tls=true"
    - "traefik.http.routers.admin-app.middlewares=secure-headers@file"
    - "traefik.http.services.admin-app.loadbalancer.server.port=3000"
    - "traefik.http.services.admin-app.loadbalancer.healthcheck.path=/"
```
**Admin Backend**
```yaml
admin-backend:
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.admin-api.rule=Host(`admin.motovaultpro.com`) && PathPrefix(`/api`)"
    - "traefik.http.routers.admin-api.tls=true"
    - "traefik.http.routers.admin-api.middlewares=api-auth@file,cors@file"
    - "traefik.http.services.admin-api.loadbalancer.server.port=3001"
    - "traefik.http.services.admin-api.loadbalancer.healthcheck.path=/health"
```
**Platform Landing**
```yaml
mvp-platform-landing:
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.landing.rule=Host(`motovaultpro.com`)"
    - "traefik.http.routers.landing.tls=true"
    - "traefik.http.routers.landing.middlewares=secure-headers@file"
    - "traefik.http.services.landing.loadbalancer.server.port=3000"
```
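For comparison, the admin routers above could translate to a standard Kubernetes Ingress roughly as follows. This is a sketch under assumptions: the Service names are taken to match the Compose service names, while the `traefik` ingress class and the `admin-tls` secret are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: admin-app
spec:
  ingressClassName: traefik          # assumed ingress controller
  tls:
    - hosts:
        - admin.motovaultpro.com
      secretName: admin-tls          # hypothetical TLS secret
  rules:
    - host: admin.motovaultpro.com
      http:
        paths:
          - path: /api               # mirrors the admin-api router
            pathType: Prefix
            backend:
              service:
                name: admin-backend
                port:
                  number: 3001
          - path: /                  # mirrors the admin-app router
            pathType: Prefix
            backend:
              service:
                name: admin-frontend
                port:
                  number: 3000
```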
### Middleware Configuration
```yaml
# config/traefik/middleware.yml
http:
  middlewares:
    secure-headers:
      headers:
        accessControlAllowMethods:
          - GET
          - OPTIONS
          - PUT
          - POST
          - DELETE
        accessControlAllowOriginList:
          - "https://admin.motovaultpro.com"
          - "https://motovaultpro.com"
        accessControlMaxAge: 100
        addVaryHeader: true
        browserXssFilter: true
        contentTypeNosniff: true
        forceSTSHeader: true
        frameDeny: true
        stsIncludeSubdomains: true
        stsPreload: true
        stsSeconds: 31536000
    cors:
      headers:
        accessControlAllowCredentials: true
        accessControlAllowHeaders:
          - "Authorization"
          - "Content-Type"
          - "X-Requested-With"
        accessControlAllowMethods:
          - "GET"
          - "POST"
          - "PUT"
          - "DELETE"
          - "OPTIONS"
        accessControlAllowOriginList:
          - "https://admin.motovaultpro.com"
          - "https://motovaultpro.com"
        accessControlMaxAge: 100
    api-auth:
      forwardAuth:
        address: "http://admin-backend:3001/auth/verify"
        authResponseHeaders:
          - "X-Auth-User"
          - "X-Auth-Roles"
```
## Enhanced Health Checks
### Standardized Health Check Pattern
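All services will implement the three Kubernetes probe roles:
1. **Startup Probe** - Service initialization
2. **Readiness Probe** - Service ready to accept traffic
3. **Liveness Probe** - Service health monitoring

Docker Compose supports only one `healthcheck` per service, so in the example `start_period` approximates the startup probe and the check command polls the readiness endpoint.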
```yaml
# Example: admin-backend
healthcheck:
  test: ["CMD", "node", "-e", "
    const http = require('http');
    const options = {
      hostname: 'localhost',
      port: 3001,
      path: '/health/ready',
      timeout: 2000
    };
    const req = http.request(options, (res) => {
      process.exit(res.statusCode === 200 ? 0 : 1);
    });
    req.on('error', () => process.exit(1));
    req.on('timeout', () => { req.destroy(); process.exit(1); });
    req.end();
    "]
  interval: 15s
  timeout: 5s
  retries: 3
  start_period: 45s
```
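Health checks can also gate startup order in Compose, loosely like readiness gating in Kubernetes. The sketch below assumes `admin-backend` should wait for its database and cache. The image names, versions, and the `postgres` user are placeholders.

```yaml
# Sketch only: images, versions, and the postgres user are assumptions.
services:
  admin-backend:
    image: example/admin-backend:latest   # placeholder image
    depends_on:
      admin-postgres:
        condition: service_healthy        # wait until the database check passes
      admin-redis:
        condition: service_healthy
  admin-postgres:
    image: postgres:16
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
  admin-redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
```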
### Health Endpoint Standards
All services must expose:
- `/health` - Basic health check
- `/health/ready` - Readiness probe
- `/health/live` - Liveness probe
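
On Kubernetes, the three endpoints could map onto the three probe types roughly as in this Pod sketch. Using `/health` for the startup probe is an assumption, the image is a placeholder, and the timings simply echo the Compose health check values used for `admin-backend`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: admin-backend
spec:
  containers:
    - name: admin-backend
      image: example/admin-backend:latest   # placeholder image
      ports:
        - containerPort: 3001
      startupProbe:
        httpGet:
          path: /health              # assumed mapping for the basic check
          port: 3001
        periodSeconds: 5
        failureThreshold: 9          # allows up to 45s of startup time
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 3001
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /health/live
          port: 3001
        periodSeconds: 15
        timeoutSeconds: 5
        failureThreshold: 3
```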
## Configuration Management
### Docker Configs (K8s ConfigMaps equivalent)
```yaml
configs:
  traefik-config:
    file: ./config/traefik/traefik.yml
  traefik-middleware:
    file: ./config/traefik/middleware.yml
  app-config-production:
    file: ./config/app/production.yml
  platform-config:
    file: ./config/platform/services.yml
```
### Docker Secrets (K8s Secrets equivalent)
```yaml
secrets:
  auth0-client-secret:
    file: ./secrets/auth0-client-secret.txt
  database-passwords:
    file: ./secrets/database-passwords.txt
  platform-api-keys:
    file: ./secrets/platform-api-keys.txt
  ssl-certificates:
    file: ./secrets/ssl-certs.txt
```
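Declaring a secret at the top level does not expose it to any container. Each service has to list the secrets it needs, and Compose then mounts them under `/run/secrets/<name>`. A minimal sketch follows, using a hypothetical per-service secret named `admin-postgres-password` and the official PostgreSQL image's `POSTGRES_PASSWORD_FILE` variable.

```yaml
# Sketch only: the secret name and file path are hypothetical.
services:
  admin-postgres:
    image: postgres:16                       # assumed version
    environment:
      # The official postgres image reads the password from this file
      POSTGRES_PASSWORD_FILE: /run/secrets/admin-postgres-password
    secrets:
      - admin-postgres-password              # grants access to this secret only

secrets:
  admin-postgres-password:
    file: ./secrets/admin-postgres-password.txt
```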
### Environment Configuration
```yaml
# config/app/production.yml
database:
  host: admin-postgres
  port: 5432
  name: motovaultpro
  pool_size: 20
redis:
  host: admin-redis
  port: 6379
  db: 0
auth0:
  domain: ${AUTH0_DOMAIN}
  audience: ${AUTH0_AUDIENCE}
platform:
  vehicles_api:
    url: http://mvp-platform-vehicles-api:8000
    timeout: 30s
  tenants_api:
    url: http://mvp-platform-tenants:8000
    timeout: 15s
```
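Compose substitutes `${VAR}` placeholders in the Compose file itself, but a file mounted through `configs` is delivered as-is. The application would therefore need to expand `${AUTH0_DOMAIN}` and `${AUTH0_AUDIENCE}` on its own. One possible wiring is sketched below, with a placeholder image and a hypothetical mount path.

```yaml
# Sketch only: the image and the target path are hypothetical.
services:
  admin-backend:
    image: example/admin-backend:latest      # placeholder image
    environment:
      AUTH0_DOMAIN: ${AUTH0_DOMAIN}          # interpolated by Compose from the shell or .env
      AUTH0_AUDIENCE: ${AUTH0_AUDIENCE}
    configs:
      - source: app-config-production
        target: /app/config/production.yml   # hypothetical path inside the container

configs:
  app-config-production:
    file: ./config/app/production.yml
```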
## Resource Management
### Resource Allocation Strategy
**Tier 1: Critical Services**
```yaml
deploy:
  resources:
    limits: { memory: 2G, cpus: '2.0' }
    reservations: { memory: 1G, cpus: '1.0' }
  restart_policy:
    condition: on-failure
    max_attempts: 3
```
**Tier 2: Supporting Services**
```yaml
deploy:
  resources:
    limits: { memory: 1G, cpus: '1.0' }
    reservations: { memory: 512M, cpus: '0.5' }
  restart_policy:
    condition: on-failure
    max_attempts: 3
```
**Tier 3: Infrastructure Services**
```yaml
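deploy:
  resources:
    limits: { memory: 512M, cpus: '0.5' }
    reservations: { memory: 256M, cpus: '0.25' }
  restart_policy:
    # Valid conditions here are none, on-failure, and any;
    # unless-stopped exists only for the service-level restart key
    condition: any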
```
### Service Tiers
| Tier | Services | Resource Profile | Priority |
|------|----------|------------------|----------|
| 1 | admin-backend, mvp-platform-vehicles-api, admin-postgres | High | Critical |
| 2 | admin-frontend, mvp-platform-tenants, mvp-platform-landing | Medium | Important |
| 3 | traefik, redis services, etl services | Low | Supporting |
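
One way to apply a tier without repeating its `deploy` block is a Compose extension field combined with a YAML anchor. The sketch below reuses the Tier 1 values from this section. The `x-tier-1` name and the images are illustrative assumptions.

```yaml
# Sketch only: the x-tier-1 name and image names are illustrative.
x-tier-1: &tier-1
  resources:
    limits: { memory: 2G, cpus: '2.0' }
    reservations: { memory: 1G, cpus: '1.0' }
  restart_policy:
    condition: on-failure
    max_attempts: 3

services:
  admin-backend:
    image: example/admin-backend:latest   # placeholder image
    deploy: *tier-1                       # reuse the Tier 1 profile
  admin-postgres:
    image: postgres:16                    # assumed version
    deploy: *tier-1
```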
## Migration Implementation Plan
### Phase 1: Infrastructure Foundation (Week 1)
**Objectives:**
- Implement Traefik service
- Create network segmentation
- Establish basic routing
**Tasks:**
1. Create new network topology
2. Add Traefik service with basic configuration
3. Migrate nginx routing to Traefik labels
4. Test SSL certificate handling
5. Verify all existing functionality
**Success Criteria:**
- All services accessible via original URLs
- SSL certificates working
- Health checks functional
- No performance degradation
### Phase 2: Service Discovery & Labels (Week 2)
**Objectives:**
- Convert all services to label-based discovery
- Implement middleware for security
- Add service health monitoring
**Tasks:**
1. Convert each service to Traefik labels
2. Implement security middleware
3. Add CORS and authentication middleware
4. Test service discovery and failover
5. Implement Traefik dashboard access
**Success Criteria:**
- All services discovered automatically
- Security middleware working
- Dashboard accessible and functional
- Mobile and desktop testing passes
### Phase 3: Configuration Management (Week 3)
**Objectives:**
- Implement Docker configs and secrets
- Standardize environment configuration
- Add monitoring and observability
**Tasks:**
1. Move configuration to Docker configs
2. Implement secrets management
3. Standardize health check endpoints
4. Add service metrics collection
5. Implement log aggregation
**Success Criteria:**
- No hardcoded secrets in compose files
- Centralized configuration management
- Enhanced monitoring capabilities
- Improved debugging experience
### Phase 4: Optimization & Documentation (Week 4)
**Objectives:**
- Optimize resource allocation
- Update development workflow
- Complete documentation
**Tasks:**
1. Implement resource limits and reservations
2. Update Makefile with new commands
3. Create troubleshooting documentation
4. Performance testing and optimization
5. Final validation of all features
**Success Criteria:**
- Optimized resource usage
- Updated development workflow
- Complete documentation
- All tests passing
## Development Workflow Enhancements
### New Makefile Commands
```makefile
.PHONY: traefik-dashboard traefik-logs service-discovery network-inspect health-check-all logs

# Traefik specific commands
# Port 8080 serves the dashboard over plain HTTP (insecure API mode); `open` is macOS-only
traefik-dashboard:
	@echo "Opening Traefik dashboard..."
	@open http://traefik.motovaultpro.local:8080/dashboard/
traefik-logs:
	@docker compose logs -f traefik
# Dumps Traefik's runtime configuration from its REST API
# (assumes the API listens on :8080 and the image provides busybox wget)
service-discovery:
	@echo "Discovered services:"
	@docker compose exec traefik wget -qO- http://localhost:8080/api/rawdata
network-inspect:
	@echo "Network topology:"
	@docker network ls --filter name=motovaultpro
	@docker network inspect motovaultpro_frontend motovaultpro_backend motovaultpro_database motovaultpro_platform
health-check-all:
	@echo "Checking health of all services..."
	@docker compose ps --format "table {{.Service}}\t{{.Status}}\t{{.Health}}"
# Enhanced existing commands
logs:
	@echo "Available log targets: all, traefik, backend, frontend, platform"
	@docker compose logs -f $(filter-out $@,$(MAKECMDGOALS))
# Catch-all so the words after `make logs` are not treated as missing targets
%:
	@:
```
### Enhanced Development Features
**Service Discovery Dashboard**
- Real-time service status
- Route configuration visualization
- Health check monitoring
- Request tracing
**Debugging Tools**
- Network topology inspection
- Service dependency mapping
- Configuration validation
- Performance metrics
**Testing Enhancements**
- Automated health checks
- Service integration testing
- Load balancing validation
- SSL certificate verification
## Observability & Monitoring
### Metrics Collection
```yaml
# Add to traefik configuration
metrics:
  prometheus:
    addEntryPointsLabels: true
    addServicesLabels: true
    addRoutersLabels: true
```
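With metrics enabled, a Prometheus instance could scrape Traefik with a job along these lines. This is a sketch that assumes a Prometheus container shares a network with `traefik`, and that metrics are served on Traefik's internal `traefik` entry point on port 8080. Neither is defined in this document.

```yaml
# prometheus.yml: a sketch; the Prometheus service itself is hypothetical
scrape_configs:
  - job_name: traefik
    scrape_interval: 15s
    static_configs:
      - targets: ["traefik:8080"]   # metrics path defaults to /metrics
```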
### Logging Strategy
**Centralized Logging**
- All services log to stdout/stderr
- Traefik access logs
- Service health check logs
- Application performance logs
**Log Levels**
- `ERROR`: Critical issues requiring attention
- `WARN`: Potential issues or degraded performance
- `INFO`: Normal operational messages
- `DEBUG`: Detailed diagnostic information (dev only)
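
Because every service logs to stdout/stderr, log retention can be set once and shared. The sketch below uses Docker's `json-file` driver with rotation. The `x-logging` name, the image, the size limits, and the `LOG_LEVEL` variable are illustrative assumptions.

```yaml
# Sketch only: names, limits, and LOG_LEVEL are illustrative.
x-logging: &default-logging
  driver: json-file
  options:
    max-size: "10m"   # rotate each log file at 10 MB
    max-file: "3"     # keep at most three rotated files

services:
  admin-backend:
    image: example/admin-backend:latest   # placeholder image
    logging: *default-logging
    environment:
      LOG_LEVEL: info                     # hypothetical variable read by the application
```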
### Health Monitoring
**Service Health Dashboard**
- Real-time service status
- Historical health trends
- Alert notifications
- Performance metrics
## Security Enhancements
### Network Security
**Network Isolation**
- Frontend network: Public-facing services only
- Backend network: API services with restricted access
- Database network: Data services with no external access
- Platform network: Microservices internal communication
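
On Kubernetes, the database-network rule would typically become a NetworkPolicy. The sketch below admits PostgreSQL traffic only from backend pods. The `tier` labels are hypothetical and would need to match whatever labels the real manifests use.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-allow-backend-only
spec:
  podSelector:
    matchLabels:
      tier: database          # hypothetical label on data-layer pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              tier: backend   # hypothetical label on API pods
      ports:
        - protocol: TCP
          port: 5432          # PostgreSQL
```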
**Access Control**
- Traefik middleware for authentication
- Service-to-service authentication
- Network-level access restrictions
- SSL/TLS encryption for all traffic
### Secret Management
**Secrets Rotation**
- Database passwords
- API keys
- SSL certificates
- Auth0 client secrets
**Access Policies**
- Least privilege principle
- Service-specific secret access
- Audit logging for secret access
- Encrypted secret storage
## Testing Strategy
### Automated Testing
**Integration Tests**
- Service discovery validation
- Health check verification
- SSL certificate testing
- Load balancing functionality
**Performance Tests**
- Service response times
- Network latency measurement
- Resource utilization monitoring
- Concurrent user simulation
**Security Tests**
- Network isolation verification
- Authentication middleware testing
- SSL/TLS configuration validation
- Secret management verification
### Manual Testing Procedures
**Development Workflow**
1. Service startup validation
2. Route accessibility testing
3. Mobile/desktop compatibility
4. Feature functionality verification
5. Performance benchmarking
**Deployment Validation**
1. Service discovery verification
2. Health check validation
3. SSL certificate functionality
4. Load balancing behavior
5. Failover testing
## Migration Rollback Plan
### Rollback Triggers
- Service discovery failures
- Performance degradation > 20%
- SSL certificate issues
- Health check failures
- Mobile/desktop compatibility issues
### Rollback Procedure
1. **Immediate**: Switch DNS to backup nginx configuration
2. **Quick**: Restore docker-compose.yml.backup
3. **Complete**: Revert all configuration changes
4. **Verify**: Run full test suite
5. **Monitor**: Ensure service stability
### Backup Strategy
- Backup current docker-compose.yml
- Backup nginx configuration
- Export service configurations
- Document current network topology
- Save working environment variables
## Success Metrics
### Performance Metrics
- **Service Startup Time**: < 30 seconds for all services
- **Request Response Time**: < 500ms for API calls
- **Health Check Response**: < 2 seconds
- **SSL Handshake Time**: < 1 second
### Reliability Metrics
- **Service Availability**: 99.9% uptime
- **Health Check Success Rate**: > 98%
- **Service Discovery Accuracy**: 100%
- **Failover Time**: < 10 seconds
### Development Experience Metrics
- **Development Setup Time**: < 5 minutes
- **Service Debug Time**: < 2 minutes to identify issues
- **Configuration Change Deployment**: < 1 minute
- **Test Suite Execution**: < 10 minutes
## Post-Migration Benefits
### Immediate Benefits
1. **Enhanced Observability**: Real-time service monitoring and debugging
2. **Improved Security**: Network segmentation and middleware protection
3. **Better Development Experience**: Automatic service discovery and routing
4. **Simplified Configuration**: Centralized configuration management
5. **K8s Preparation**: Architecture closely mirrors Kubernetes patterns
### Long-term Benefits
1. **Easier K8s Migration**: Direct translation to Kubernetes manifests
2. **Better Scalability**: Load balancing and resource management
3. **Improved Maintainability**: Standardized configuration patterns
4. **Enhanced Monitoring**: Built-in metrics and health monitoring
5. **Professional Development Environment**: Production-like local setup
## Conclusion
This redesign brings the Docker Compose architecture in line with Kubernetes deployment patterns while keeping all existing functionality. The phased migration limits disruption and delivers observability, security, and maintainability improvements along the way.
The resulting architecture is a foundation for a later Kubernetes migration, and it improves current development workflows through label-based service discovery, monitoring, and centralized configuration management.