Architecture Docs

This commit is contained in:
Eric Gullickson
2025-07-28 08:40:03 -05:00
parent 61336d807e
commit 4391cf11ed
7 changed files with 8587 additions and 0 deletions

K8S-PHASE-3.md
# Phase 3: Production Deployment (Weeks 9-12)
This phase focuses on deploying the modernized application with proper production configurations, monitoring, backup strategies, and operational procedures.
## Overview
Phase 3 transforms the development-ready Kubernetes application into a production-grade system with comprehensive monitoring, automated backup and recovery, secure ingress, and operational excellence. This phase ensures the system is ready for enterprise-level workloads with proper security, performance, and reliability guarantees.
## Key Objectives
- **Production Kubernetes Deployment**: Configure scalable, secure deployment manifests
- **Ingress and TLS Configuration**: Secure external access with proper routing
- **Comprehensive Monitoring**: Application and infrastructure observability
- **Backup and Disaster Recovery**: Automated backup strategies and recovery procedures
- **Migration Execution**: Seamless transition from legacy system
## 3.1 Kubernetes Deployment Configuration
**Objective**: Create production-ready Kubernetes manifests with proper resource management and high availability.
### Application Deployment Configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: motovault-app
  namespace: motovault
  labels:
    app: motovault
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: motovault
  template:
    metadata:
      labels:
        app: motovault
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
    spec:
      serviceAccountName: motovault-service-account
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 2000
        # Required by the "restricted" Pod Security Standard enforced on the namespace
        seccompProfile:
          type: RuntimeDefault
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - motovault
              topologyKey: kubernetes.io/hostname
          - weight: 50
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - motovault
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: motovault
        # Pin an immutable tag in production; ":latest" makes rollbacks non-reproducible
        image: motovault:v1.0.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        env:
        - name: ASPNETCORE_ENVIRONMENT
          value: "Production"
        - name: ASPNETCORE_URLS
          value: "http://+:8080"
        envFrom:
        - configMapRef:
            name: motovault-config
        - secretRef:
            name: motovault-secrets
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        volumeMounts:
        - name: tmp-volume
          mountPath: /tmp
        - name: app-logs
          mountPath: /app/logs
      volumes:
      - name: tmp-volume
        emptyDir: {}
      - name: app-logs
        emptyDir: {}
      terminationGracePeriodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: motovault-service
  namespace: motovault
  labels:
    app: motovault
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
    protocol: TCP
    name: http
  selector:
    app: motovault
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: motovault-pdb
  namespace: motovault
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: motovault
```
### Horizontal Pod Autoscaler Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: motovault-hpa
  namespace: motovault
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: motovault-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
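The HPA's basic rule, as documented by Kubernetes, is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), computed per metric, with the largest result used and then clamped to `minReplicas`/`maxReplicas`. The hypothetical helper below reproduces that arithmetic for the CPU (70%) and memory (80%) targets above as a back-of-the-envelope check. It ignores the controller's default 10% tolerance and the `behavior` policies, so real scaling steps will differ.

```bash
# Hypothetical helper: estimate the HPA's desired replica count.
# Usage: hpa_desired_replicas CURRENT CPU_UTIL CPU_TARGET MEM_UTIL MEM_TARGET MIN MAX
# Utilization values are integer percentages of the pod requests.
hpa_desired_replicas() {
  local current=$1 cpu=$2 cpu_target=$3 mem=$4 mem_target=$5 min=$6 max=$7
  # Integer ceiling of current * utilization / target, once per metric
  local by_cpu=$(( (current * cpu + cpu_target - 1) / cpu_target ))
  local by_mem=$(( (current * mem + mem_target - 1) / mem_target ))
  # The HPA acts on the largest recommendation across all metrics
  local desired=$(( by_cpu > by_mem ? by_cpu : by_mem ))
  # Clamp to minReplicas / maxReplicas
  (( desired < min )) && desired=$min
  (( desired > max )) && desired=$max
  echo "$desired"
}
```

For example, `hpa_desired_replicas 3 140 70 60 80 3 10` prints `6`: CPU asks for ceil(3 × 140 / 70) = 6 replicas, memory for ceil(3 × 60 / 80) = 3, and the larger value wins.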
### Implementation Tasks
#### 1. Create production namespace with security policies
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: motovault
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
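Before enforcing `restricted` on a namespace that already runs workloads, the Kubernetes documentation suggests a server-side dry run of the label change: the API server reports a warning for existing pods that would violate the profile, without changing anything. A minimal wrapper, assuming `kubectl` points at the target cluster:

```bash
# Preview Pod Security Standard violations for a namespace without changing it.
# Usage: pss_dry_run NAMESPACE [LEVEL]   (LEVEL defaults to "restricted")
pss_dry_run() {
  local ns=$1 level=${2:-restricted}
  # --dry-run=server asks the API server to evaluate the change; violations come
  # back as "Warning:" lines on stderr, which are merged into stdout here.
  kubectl label --dry-run=server --overwrite namespace "$ns" \
    "pod-security.kubernetes.io/enforce=${level}" 2>&1
}
```

Running `pss_dry_run motovault` before the Deployment is rolled out shows whether any existing pod would be rejected.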
#### 2. Configure resource quotas and limits
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: motovault-quota
  namespace: motovault
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    persistentvolumeclaims: "10"
    pods: "20"
```
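A quick worked check, using only values already in this section, that the quota leaves room for the HPA's maximum of 10 replicas plus the one extra pod the rolling update creates (`maxSurge: 1`). Other workloads in the namespace, such as the `motovault-postgres-1` pod used by the backup scripts, draw from the same quota, and once a quota covers CPU and memory every pod in the namespace must declare those requests and limits (or receive them from a LimitRange) to be admitted.

```bash
# Worked check: headroom left in the ResourceQuota at full scale-out.
max_replicas=10   # HPA maxReplicas
max_surge=1       # Deployment rollingUpdate.maxSurge
pods=$((max_replicas + max_surge))

# Per-pod requests/limits from the Deployment (CPU in millicores, memory in Mi)
cpu_req_m=250;  mem_req_mi=512
cpu_lim_m=500;  mem_lim_mi=1024

# ResourceQuota hard limits converted to the same units
quota_cpu_req_m=4000;  quota_mem_req_mi=8192
quota_cpu_lim_m=8000;  quota_mem_lim_mi=16384

# What remains for every other workload in the namespace
cpu_req_left_m=$((quota_cpu_req_m - pods * cpu_req_m))
mem_req_left_mi=$((quota_mem_req_mi - pods * mem_req_mi))
cpu_lim_left_m=$((quota_cpu_lim_m - pods * cpu_lim_m))
mem_lim_left_mi=$((quota_mem_lim_mi - pods * mem_lim_mi))

echo "pods=${pods} cpu_req_left=${cpu_req_left_m}m mem_req_left=${mem_req_left_mi}Mi"
echo "cpu_lim_left=${cpu_lim_left_m}m mem_lim_left=${mem_lim_left_mi}Mi"
```

It prints `pods=11 cpu_req_left=1250m mem_req_left=2560Mi` and `cpu_lim_left=2500m mem_lim_left=5120Mi`: at full scale-out the application alone requests 2.75 of the 4 quota CPUs, which is the budget to check against the database and storage pods.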
#### 3. Set up service accounts and RBAC
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: motovault-service-account
  namespace: motovault
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: motovault-role
  namespace: motovault
rules:
# envFrom injection is performed by the kubelet and needs no RBAC; keep these
# permissions only if the application reads ConfigMaps or Secrets through the API.
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: motovault-rolebinding
  namespace: motovault
subjects:
- kind: ServiceAccount
  name: motovault-service-account
  namespace: motovault
roleRef:
  kind: Role
  name: motovault-role
  apiGroup: rbac.authorization.k8s.io
```
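To confirm the Role grants exactly what is intended, `kubectl auth can-i` can impersonate the service account. A small wrapper, assuming the caller is allowed to impersonate service accounts:

```bash
# Check what the application's service account may do, by impersonating it.
# Usage: sa_can_i VERB RESOURCE   e.g. sa_can_i list secrets
sa_can_i() {
  local verb=$1 resource=$2
  # Service accounts are impersonated as system:serviceaccount:<namespace>:<name>
  kubectl auth can-i "$verb" "$resource" -n motovault \
    --as=system:serviceaccount:motovault:motovault-service-account
}
```

With the Role above, `sa_can_i get secrets` should print `yes` and `sa_can_i delete pods` should print `no`.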
#### 4. Configure pod anti-affinity for high availability
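- Spread pods across nodes (weight 100) and, secondarily, across availability zones (weight 50)
- Avoid single points of failure; because the rules are preferences, replicas can still share a node when capacity is short, so the PodDisruptionBudget (`minAvailable: 2`) protects availability during voluntary disruptions
- Balance availability against schedulability: soft rules never leave pods Pending for lack of a free node or zone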
#### 5. Implement rolling update strategy with zero downtime
- Configure progressive rollout with health checks
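- Automatic rollback on failure (a Deployment only halts a failed rollout; the rollback itself must be triggered by the pipeline, e.g. with `kubectl rollout undo`)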
- Canary deployment capabilities
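A Deployment does not roll itself back: when a rollout stalls, the controller only reports `ProgressDeadlineExceeded`, so the rollback has to be driven by the pipeline. A minimal sketch, assuming CI applies manifests with `kubectl` and using the Deployment name from section 3.1:

```bash
# Apply a manifest and roll back if the rollout does not complete in time.
# Usage: deploy_or_rollback MANIFEST [TIMEOUT]
deploy_or_rollback() {
  local manifest=$1 timeout=${2:-300s}
  kubectl apply -f "$manifest" || return 1
  # rollout status blocks until every replica is updated and ready, or the timeout expires
  if ! kubectl rollout status deployment/motovault-app -n motovault --timeout="$timeout"; then
    echo "Rollout failed; rolling back to the previous revision" >&2
    kubectl rollout undo deployment/motovault-app -n motovault
    return 1
  fi
}
```

In a pipeline this could be called as `deploy_or_rollback k8s/deployment.yaml 300s` (the manifest path is hypothetical); the non-zero exit still fails the job after the rollback.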
## 3.2 Ingress and TLS Configuration
**Objective**: Configure secure external access with proper TLS termination and routing.
### Ingress Configuration
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: motovault-ingress
  namespace: motovault
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    # ingress-nginx per-client-IP rate limit: 100 requests per minute
    nginx.ingress.kubernetes.io/limit-rpm: "100"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - motovault.example.com
    secretName: motovault-tls
  rules:
  - host: motovault.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: motovault-service
            port:
              number: 80
```
### TLS Certificate Management
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@motovault.example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
```
### Implementation Tasks
#### 1. Deploy cert-manager for automated TLS
```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
```
#### 2. Configure Let's Encrypt for SSL certificates
- Automated certificate provisioning and renewal
- DNS-01 or HTTP-01 challenge configuration
- Certificate monitoring and alerting
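For certificate monitoring, cert-manager records the expiry time in the Certificate's `status.notAfter`, and when the `cert-manager.io/cluster-issuer` Ingress annotation is used it names that Certificate after the TLS `secretName` (here `motovault-tls`). cert-manager normally renews well before expiry (by default once two-thirds of the lifetime has elapsed), so a low remaining lifetime means renewals are failing. A sketch of a check that could run from cron or CI; the 14-day default threshold is an arbitrary example:

```bash
# Report days until the TLS certificate expires; fail below a threshold.
# Usage: check_cert_expiry [MIN_DAYS]   (default 14)
# Assumes the Certificate is named motovault-tls; requires GNU date.
check_cert_expiry() {
  local min_days=${1:-14} not_after expiry now days
  not_after=$(kubectl get certificate motovault-tls -n motovault \
    -o jsonpath='{.status.notAfter}') || return 2
  expiry=$(date -d "$not_after" +%s) || return 2
  now=$(date +%s)
  days=$(( (expiry - now) / 86400 ))
  echo "motovault-tls expires in ${days} days (${not_after})"
  # Exit status is non-zero when fewer than MIN_DAYS remain
  (( days >= min_days ))
}
```

`check_cert_expiry 21` prints the remaining lifetime and exits non-zero when fewer than 21 days remain, which a cron wrapper can turn into an alert.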
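#### 3. Set up WAF and DDoS protection
The NetworkPolicy below admits traffic to the application pods only from the ingress controller's namespace. It reduces the attack surface but is not itself a WAF or DDoS control; those must be configured at the ingress controller or at the network edge.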
```yaml
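apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: motovault-ingress-policy
  namespace: motovault
spec:
  podSelector:
    matchLabels:
      app: motovault
  policyTypes:
  - Ingress
  ingress:
  # Allow the ingress controller. kubernetes.io/metadata.name is set automatically on
  # every namespace; adjust the value to the namespace your controller runs in.
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - protocol: TCP
      port: 8080
  # Allow Prometheus to scrape /metrics (assumes it runs in a "monitoring" namespace)
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 8080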
```
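NetworkPolicy objects only take effect when the cluster's CNI plugin enforces them, so it is worth probing the policy from a namespace that should be blocked. A sketch using a throwaway `curlimages/curl` pod; the pod name, image, and the default `cluster.local` domain are assumptions:

```bash
# Probe the application from a temporary pod in another namespace.
# Usage: probe_from_namespace NAMESPACE
probe_from_namespace() {
  local ns=$1
  # --rm deletes the pod afterwards; curl gives up after five seconds (-m 5)
  kubectl run "netpol-probe-$$" -n "$ns" --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sS -m 5 -o /dev/null http://motovault-service.motovault.svc.cluster.local/health/live
}
```

With the policy in force, `probe_from_namespace default` should end in a curl timeout (error 28); a successful request means the policy is not being enforced.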
#### 4. Configure rate limiting and security headers
- Request rate limiting per IP
- Security headers (HSTS, CSP, etc.)
- Request size limitations
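To verify the headers actually reach clients, inspect a live response. The helper below reads response headers on stdin and prints each expected header that is missing; the required list is an assumption to adapt to the policy you adopt. ingress-nginx adds HSTS on TLS hosts by default, while the others typically need explicit configuration.

```bash
# Print security headers missing from an HTTP response read on stdin.
# Usage: curl -sI https://motovault.example.com | missing_security_headers
missing_security_headers() {
  local headers required missing=0
  # Header names are case-insensitive, so compare in lowercase
  headers=$(tr '[:upper:]' '[:lower:]')
  for required in strict-transport-security content-security-policy \
                  x-content-type-options x-frame-options referrer-policy; do
    if ! grep -q "^${required}:" <<<"$headers"; then
      echo "$required"
      missing=1
    fi
  done
  # Non-zero exit when anything is missing, so CI can fail on it
  return "$missing"
}
```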
#### 5. Set up health check endpoints for load balancer
- Configure ingress health checks
- Implement graceful degradation
- Monitor certificate expiration
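For external health checks, the same `/health/ready` endpoint the readiness probe uses can be polled through the ingress, for example after a deployment or from a synthetic monitor. A minimal polling loop; the attempt count and interval are illustrative defaults:

```bash
# Poll a health endpoint until it returns HTTP 200 or the attempts run out.
# Usage: wait_for_healthy URL [ATTEMPTS] [INTERVAL_SECONDS]
wait_for_healthy() {
  local url=$1 attempts=${2:-30} interval=${3:-5} i code=000
  for ((i = 1; i <= attempts; i++)); do
    # -w prints only the status code; on connection errors treat it as 000
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$interval"
  done
  echo "unhealthy: last status ${code}" >&2
  return 1
}
```

For example: `wait_for_healthy https://motovault.example.com/health/ready 60 5`.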
## 3.3 Monitoring and Observability Setup
**Objective**: Implement comprehensive monitoring, logging, and alerting for production operations.
### Prometheus ServiceMonitor Configuration
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: motovault-metrics
  namespace: motovault
  labels:
    app: motovault
spec:
  selector:
    matchLabels:
      app: motovault
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
```
### Application Metrics Implementation
```csharp
using Prometheus; // prometheus-net

public class MetricsService
{
    private readonly Counter _httpRequestsTotal;
    private readonly Histogram _httpRequestDuration;
    private readonly Gauge _activeConnections;
    private readonly Counter _databaseOperationsTotal;
    private readonly Histogram _databaseOperationDuration;

    public MetricsService()
    {
        // Pass route templates rather than raw request paths as "endpoint"
        // to keep label cardinality bounded.
        _httpRequestsTotal = Metrics.CreateCounter(
            "motovault_http_requests_total",
            "Total number of HTTP requests",
            new[] { "method", "endpoint", "status_code" });
        _httpRequestDuration = Metrics.CreateHistogram(
            "motovault_http_request_duration_seconds",
            "Duration of HTTP requests in seconds",
            new[] { "method", "endpoint" });
        _activeConnections = Metrics.CreateGauge(
            "motovault_active_connections",
            "Number of active database connections");
        _databaseOperationsTotal = Metrics.CreateCounter(
            "motovault_database_operations_total",
            "Total number of database operations",
            new[] { "operation", "table", "status" });
        _databaseOperationDuration = Metrics.CreateHistogram(
            "motovault_database_operation_duration_seconds",
            "Duration of database operations in seconds",
            new[] { "operation", "table" });
    }

    public void RecordHttpRequest(string method, string endpoint, int statusCode, double duration)
    {
        _httpRequestsTotal.WithLabels(method, endpoint, statusCode.ToString()).Inc();
        _httpRequestDuration.WithLabels(method, endpoint).Observe(duration);
    }

    public void RecordDatabaseOperation(string operation, string table, bool success, double duration)
    {
        var status = success ? "success" : "error";
        _databaseOperationsTotal.WithLabels(operation, table, status).Inc();
        _databaseOperationDuration.WithLabels(operation, table).Observe(duration);
    }

    // Report the current number of open database connections
    public void SetActiveConnections(int count) => _activeConnections.Set(count);
}
```
### Grafana Dashboard Configuration
```json
{
  "dashboard": {
    "title": "MotoVaultPro Application Dashboard",
    "panels": [
      {
        "title": "HTTP Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum by (method, endpoint) (rate(motovault_http_requests_total[5m]))",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ]
      },
      {
        "title": "Response Time Percentiles",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(motovault_http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "50th percentile"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(motovault_http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "95th percentile"
          }
        ]
      },
      {
        "title": "Database Connection Pool",
        "type": "stat",
        "targets": [
          {
            "expr": "motovault_active_connections",
            "legendFormat": "Active Connections"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(motovault_http_requests_total{status_code=~\"5..\"}[5m]))",
            "legendFormat": "5xx errors"
          }
        ]
      }
    ]
  }
}
```
### Prometheus Alerting Rules
```yaml
groups:
- name: motovault.rules
  rules:
  - alert: HighErrorRate
    expr: sum(rate(motovault_http_requests_total{status_code=~"5.."}[5m])) > 0.1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "5xx responses are occurring at {{ $value | humanize }} requests/s (5-minute rate)"
  - alert: HighResponseTime
    expr: histogram_quantile(0.95, sum by (le) (rate(motovault_http_request_duration_seconds_bucket[5m]))) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
      description: "95th percentile response time is {{ $value }}s"
  - alert: DatabaseConnectionPoolExhaustion
    expr: motovault_active_connections > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Database connection pool nearly exhausted"
      description: "Active connections: {{ $value }}/100"
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total{namespace="motovault"}[15m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod is crash looping"
      description: "Pod {{ $labels.pod }} is restarting frequently"
```
### Implementation Tasks
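#### 1. Deploy Prometheus and Grafana stack
The bundle below installs only the Prometheus Operator and its CRDs. Prometheus and Alertmanager instances, kube-state-metrics (which provides the `kube_pod_container_status_restarts_total` metric used by the `PodCrashLooping` rule), and Grafana must be deployed separately, for example with the kube-prometheus-stack Helm chart.
```bash
# Server-side apply avoids the annotation size limit that client-side apply hits on these large CRDs
kubectl apply --server-side -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
```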
#### 2. Configure application metrics endpoints
- Add Prometheus metrics middleware
- Implement custom business metrics
- Configure metric collection intervals
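Once the metrics middleware is in place, a scrape of `/metrics` should declare every family that `MetricsService` registers. The check below reads the Prometheus text format on stdin and reports missing `# TYPE` lines. Run it after the application has handled some traffic, because a client library may not emit labeled metrics that have no observations yet.

```bash
# Verify that a Prometheus text exposition declares every MetricsService family.
# Usage: curl -s http://localhost:8080/metrics | check_metric_families
check_metric_families() {
  local body family missing=0
  body=$(cat)
  for family in motovault_http_requests_total \
                motovault_http_request_duration_seconds \
                motovault_active_connections \
                motovault_database_operations_total \
                motovault_database_operation_duration_seconds; do
    # Each family is announced by a "# TYPE <name> <type>" line
    if ! grep -q "^# TYPE ${family} " <<<"$body"; then
      echo "missing: ${family}"
      missing=1
    fi
  done
  return "$missing"
}
```

Locally this pairs with `kubectl port-forward -n motovault deploy/motovault-app 8080:8080`.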
#### 3. Set up centralized logging with structured logs
```csharp
using System.Text.Json;

builder.Services.AddLogging(loggingBuilder =>
{
    loggingBuilder.AddJsonConsole(options =>
    {
        options.JsonWriterOptions = new JsonWriterOptions { Indented = false };
        options.IncludeScopes = true;
        // Write UTC timestamps so the literal "Z" suffix is accurate
        options.UseUtcTimestamp = true;
        options.TimestampFormat = "yyyy-MM-dd'T'HH:mm:ss.fff'Z'";
    });
});
```
#### 4. Create operational dashboards and alerts
- Application performance dashboards
- Infrastructure monitoring dashboards
- Business metrics and KPIs
- Alert routing and escalation
#### 5. Implement distributed tracing
```csharp
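using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("motovault"))
    .WithTracing(tracing =>
    {
        tracing
            .AddAspNetCoreInstrumentation()   // OpenTelemetry.Instrumentation.AspNetCore
            .AddNpgsql()                      // Npgsql.OpenTelemetry
            .AddRedisInstrumentation()        // OpenTelemetry.Instrumentation.StackExchangeRedis
            // The Jaeger exporter is deprecated; current Jaeger versions accept OTLP directly
            .AddOtlpExporter();               // OpenTelemetry.Exporter.OpenTelemetryProtocol
    });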
```
## 3.4 Backup and Disaster Recovery
**Objective**: Implement comprehensive backup strategies and disaster recovery procedures.
### Velero Backup Configuration
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: motovault-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  template:
    includedNamespaces:
    - motovault
    includedResources:
    - "*"
    storageLocation: default
    ttl: 720h0m0s # 30 days
    snapshotVolumes: true
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: motovault-weekly-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0" # Weekly on Sunday at 3 AM
  template:
    includedNamespaces:
    - motovault
    includedResources:
    - "*"
    storageLocation: default
    ttl: 2160h0m0s # 90 days
    snapshotVolumes: true
```
### Database Backup Strategy
```bash
#!/bin/bash
# Automated database backup script
set -euo pipefail

BACKUP_DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="motovault_backup_${BACKUP_DATE}.sql"
S3_BUCKET="motovault-backups"

# Create database backup (--clean --if-exists lets the dump be restored over an existing database)
kubectl exec -n motovault motovault-postgres-1 -- \
  pg_dump -U postgres --clean --if-exists motovault > "${BACKUP_FILE}"

# Compress backup
gzip "${BACKUP_FILE}"

# Upload to S3/MinIO
aws s3 cp "${BACKUP_FILE}.gz" "s3://${S3_BUCKET}/database/"

# Clean up local file
rm "${BACKUP_FILE}.gz"

# Retain only last 30 days of backups (GNU date). "--output text" prints "None" when
# nothing matches, so filter it out; xargs -r skips the delete when the list is empty.
aws s3api list-objects-v2 \
  --bucket "${S3_BUCKET}" \
  --prefix "database/" \
  --query 'Contents[?LastModified<=`'"$(date -d "30 days ago" --iso-8601)"'`].[Key]' \
  --output text | \
  { grep -v '^None$' || true; } | \
  xargs -r -I {} aws s3 rm "s3://${S3_BUCKET}/{}"
```
### Disaster Recovery Procedures
```bash
#!/bin/bash
# Full system recovery script
set -euo pipefail

BACKUP_DATE=${1:-}
if [ -z "$BACKUP_DATE" ]; then
  echo "Usage: $0 <backup_date>"
  echo "Example: $0 20240120_020000"
  exit 1
fi

# Stop application
echo "Scaling down application..."
kubectl scale deployment motovault-app --replicas=0 -n motovault

# Restore database (file name follows the backup script's motovault_backup_<date> convention)
echo "Restoring database from backup..."
aws s3 cp "s3://motovault-backups/database/motovault_backup_${BACKUP_DATE}.sql.gz" .
gunzip "motovault_backup_${BACKUP_DATE}.sql.gz"
kubectl exec -i motovault-postgres-1 -n motovault -- \
  psql -U postgres -d motovault -v ON_ERROR_STOP=1 < "motovault_backup_${BACKUP_DATE}.sql"

# Restore MinIO data
echo "Restoring MinIO data..."
aws s3 sync "s3://motovault-backups/minio/${BACKUP_DATE}/" /tmp/minio_restore/
mc mirror --overwrite /tmp/minio_restore/ motovault-minio/motovault-files/

# Restart application
echo "Scaling up application..."
kubectl scale deployment motovault-app --replicas=3 -n motovault

# Verify health: rollout status waits for the new pods to be created and become ready
echo "Waiting for application to be ready..."
kubectl rollout status deployment/motovault-app -n motovault --timeout=300s
echo "Recovery completed successfully"
```
### Implementation Tasks
#### 1. Deploy Velero for Kubernetes backup
```bash
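# --prefix keeps Velero's data apart from the database/ and minio/ backups in the same bucket
# (Velero marks a storage location unavailable if its root holds unrelated top-level directories).
# --secret-file points to an AWS credentials file; use --no-secret when IAM roles grant access.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.7.0 \
  --bucket motovault-backups \
  --prefix velero \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2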
```
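After installation it is worth confirming that the backup storage location is available and that a schedule actually produces a backup, rather than waiting for the first 2 AM run. A sketch using standard `velero` subcommands; the `motovault-adhoc-` name prefix is an illustrative choice:

```bash
# Check the storage location, then run an on-demand backup from the daily schedule.
# Usage: velero_adhoc_backup
velero_adhoc_backup() {
  local name
  name="motovault-adhoc-$(date +%Y%m%d%H%M%S)"
  # The PHASE column should read "Available" for the default location
  velero backup-location get
  # --from-schedule reuses the schedule's template (namespaces, TTL, snapshots);
  # --wait blocks until the backup finishes
  velero backup create "$name" --from-schedule motovault-daily-backup --wait
  velero backup describe "$name"
}
```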
#### 2. Configure automated database backups
- Point-in-time recovery setup
- Incremental backup strategies
- Cross-region backup replication
#### 3. Implement MinIO backup synchronization
- Automated file backup to external storage
- Metadata backup and restoration
- Verification of backup integrity
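The recovery script restores files from `s3://motovault-backups/minio/<BACKUP_DATE>/`, but the database backup script does not produce that prefix. A minimal sketch of the missing half, assuming the `mc` alias `motovault-minio` that the recovery script uses and staging under `/tmp`; pass the same timestamp the database backup used so that a single `BACKUP_DATE` restores both:

```bash
# Back up the MinIO bucket into the layout the recovery script expects.
# Usage: backup_minio [BACKUP_DATE]   (defaults to the current timestamp)
backup_minio() {
  local backup_date=${1:-$(date +%Y%m%d_%H%M%S)}
  local staging="/tmp/minio_backup/${backup_date}"
  mkdir -p "$staging"
  # Copy objects out of MinIO, then push them to the backup bucket
  mc mirror motovault-minio/motovault-files/ "${staging}/" || return 1
  aws s3 sync "${staging}/" "s3://motovault-backups/minio/${backup_date}/" || return 1
  rm -rf "$staging"
  echo "$backup_date"
}
```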
#### 4. Create disaster recovery runbooks
- Step-by-step recovery procedures
- RTO/RPO definitions and testing
- Contact information and escalation procedures
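For RPO testing, the age of the newest database backup is a direct measure of how much data a restore would lose. The helper below derives it from the file name the backup script writes. It treats the timestamp as UTC, which holds only if the backup job runs with a UTC clock; that is an assumption.

```bash
# Hours since a database backup was taken, parsed from its file name
# (motovault_backup_YYYYMMDD_HHMMSS.sql.gz). Requires GNU date.
# Usage: backup_age_hours KEY [NOW_EPOCH]
backup_age_hours() {
  local key=$1 now=${2:-$(date +%s)} ts d t taken
  ts=${key##*motovault_backup_}   # strip any prefix such as database/
  ts=${ts%%.sql.gz}
  [[ $ts =~ ^[0-9]{8}_[0-9]{6}$ ]] || return 2
  d=${ts%_*}; t=${ts#*_}
  taken=$(date -u -d "${d:0:4}-${d:4:2}-${d:6:2} ${t:0:2}:${t:2:2}:${t:4:2}" +%s) || return 2
  echo $(( (now - taken) / 3600 ))
}
```

Against the bucket, the newest key can be found with `aws s3 ls s3://motovault-backups/database/ | sort | tail -n 1 | awk '{print $4}'`, and the result compared with the RPO defined in the runbook.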
#### 5. Set up backup monitoring and alerting
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
spec:
  groups:
  - name: backup.rules
    rules:
    - alert: BackupFailed
      # The failure counter only ever grows, so alert on increases within the last day
      expr: increase(velero_backup_failure_total[24h]) > 0
      labels:
        severity: critical
      annotations:
        summary: "Backup operation failed"
        description: "Velero backup has failed"
```
## Week-by-Week Breakdown
### Week 9: Production Kubernetes Configuration
- **Days 1-2**: Create production deployment manifests
- **Days 3-4**: Configure HPA, PDB, and resource quotas
- **Days 5-7**: Set up RBAC and security policies
### Week 10: Ingress and TLS Setup
- **Days 1-2**: Deploy and configure ingress controller
- **Days 3-4**: Set up cert-manager and TLS certificates
- **Days 5-7**: Configure security policies and rate limiting
### Week 11: Monitoring and Observability
- **Days 1-3**: Deploy Prometheus and Grafana stack
- **Days 4-5**: Configure application metrics and dashboards
- **Days 6-7**: Set up alerting and notification channels
### Week 12: Backup and Migration Preparation
- **Days 1-3**: Deploy and configure backup solutions
- **Days 4-5**: Create migration scripts and procedures
- **Days 6-7**: Execute migration dry runs and validation
## Success Criteria
- [ ] Production Kubernetes deployment with 99.9% availability
- [ ] Secure ingress with automated TLS certificate management
- [ ] Comprehensive monitoring with alerting
- [ ] Automated backup and recovery procedures tested
- [ ] Migration procedures validated and documented
- [ ] Security policies and network controls implemented
- [ ] Performance baselines established and monitored
## Testing Requirements
### Production Readiness Tests
- Load testing under expected traffic patterns
- Failover testing for all components
- Security penetration testing
- Backup and recovery validation
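For backup and recovery validation, a cheap check before a full test restore is to confirm that each compressed dump is intact and was not cut short. A sketch, relying on the fact that plain-format `pg_dump` output ends with a `-- PostgreSQL database dump complete` comment:

```bash
# Check that a compressed pg_dump file is intact and complete.
# Usage: verify_db_backup motovault_backup_20240120_020000.sql.gz
verify_db_backup() {
  local file=$1
  # gzip -t detects truncated or corrupted archives
  gzip -t "$file" || return 1
  # The completion comment sits near the end of a finished plain-format dump
  gzip -dc "$file" | tail -n 20 | grep 'PostgreSQL database dump complete' >/dev/null
}
```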
### Performance Tests
- Application response time under load
- Database performance with connection pooling
- Cache performance and hit ratios
- Network latency and throughput
### Security Tests
- Container image vulnerability scanning
- Network policy validation
- Authentication and authorization testing
- TLS configuration verification
## Deliverables
1. **Production Deployment**
- Complete Kubernetes manifests
- Security configurations
- Monitoring and alerting setup
- Backup and recovery procedures
2. **Documentation**
- Operational runbooks
- Security procedures
- Monitoring guides
- Disaster recovery plans
3. **Migration Tools**
- Data migration scripts
- Validation tools
- Rollback procedures
## Dependencies
- Production Kubernetes cluster
- External storage for backups
- DNS management for ingress
- Certificate authority for TLS
- Monitoring infrastructure
## Risks and Mitigations
### Risk: Extended Downtime During Migration
**Mitigation**: Blue-green deployment strategy with comprehensive rollback plan
### Risk: Data Integrity Issues
**Mitigation**: Extensive validation and parallel running during transition
### Risk: Performance Degradation
**Mitigation**: Load testing and gradual traffic migration
---
**Previous Phase**: [Phase 2: High Availability Infrastructure](K8S-PHASE-2.md)
**Next Phase**: [Phase 4: Advanced Features and Optimization](K8S-PHASE-4.md)