Incident Response Runbooks for Oneiric

# Incident Response Runbooks for Oneiric **Last Updated:** 2025-11-26 **Status:** Production Ready **Maintainer:** Platform Team ______________________________________________________________________ ## Table of Contents 1. \[[#overview|Overview]\] 1. \[[#incident-severity-levels|Incident Severity Levels]\] 1. \[[#general-incident-response-process|General Incident Response Process]\] 1. \[[#runbook-index|Runbook Index]\] 1. \[[#runbooks|Runbooks]\] - \[[#runbook-1-resolution-failures|1. Resolution Failures]\] - \[[#runbook-2-hot-swap-failures|2. Hot-Swap Failures]\] - \[[#runbook-3-remote-sync-failures|3. Remote Sync Failures]\] - \[[#runbook-4-cache-corruption|4. Cache Corruption]\] - \[[#runbook-5-memory-exhaustion|5. Memory Exhaustion]\] 1. \[[#post-incident-review|Post-Incident Review]\] 1. \[[#escalation-matrix|Escalation Matrix]\] ______________________________________________________________________ ## Overview This document provides step-by-step incident response procedures for common Oneiric operational issues. Each runbook includes: - **Symptoms:** How to identify the incident - **Diagnosis:** Tools and commands to investigate - **Resolution:** Step-by-step fix procedures - **Prevention:** How to avoid recurrence - **Escalation:** When and who to contact ### Quick Access | Alert | Runbook | Severity | Response Time | |-------|---------|----------|---------------| | `OneiricResolutionFailureRateHigh` | \[[#runbook-1-resolution-failures|#1]\] | Critical | < 5 min | | `OneiricLifecycleSwapFailureRateHigh` | \[[#runbook-2-hot-swap-failures|#2]\] | Critical | < 5 min | | `OneiricRemoteSyncConsecutiveFailures` | \[[#runbook-3-remote-sync-failures|#3]\] | Critical | < 15 min | | `OneiricDigestVerificationFailed` | \[[#runbook-4-cache-corruption|#4]\] | Critical | < 5 min | | `OneiricActiveInstancesExtremelyHigh` | \[[#runbook-5-memory-exhaustion|#5]\] | Critical | < 10 min | ______________________________________________________________________ ## Incident Severity Levels ```mermaid graph TD Alert["Incident Detected"] Classify{"Classify Severity"} P0["P0 - Critical < 5min response Page immediately"] P1["P1 - High < 15min response Slack + email"] P2["P2 - Medium < 1hr response Slack warning"] P3["P3 - Low < 4hr response Slack info"] Alert --> Classify Classify -->|"SLA Impact System Down"| P0 Classify -->|"Degraded Service"| P1 Classify -->|"Limited Impact"| P2 Classify -->|"Monitoring Issue"| P3 style P0 fill:#ffcccc style P1 fill:#ffe1cc style P2 fill:#fff4cc style P3 fill:#ccffcc ``` **Severity Classifications:** ### P0 - Critical (SLA Impact) - **Response Time:** < 5 minutes - **Resolution Target:** < 1 hour - **Notification:** Page on-call immediately - **Examples:** Resolution failures > 5%, security breaches, system down ### P1 - High (Degraded Service) - **Response Time:** < 15 minutes - **Resolution Target:** < 4 hours - **Notification:** Slack critical channel + email - **Examples:** Swap failures, remote sync issues, high latency ### P2 - Medium (Limited Impact) - **Response Time:** < 1 hour - **Resolution Target:** < 8 hours - **Notification:** Slack warning channel - **Examples:** Health check failures, cache growth, warnings ### P3 - Low (Monitoring Issue) - **Response Time:** < 4 hours - **Resolution Target:** < 24 hours - **Notification:** Slack info channel - **Examples:** Info alerts, maintenance notifications ______________________________________________________________________ ## General Incident Response Process ```mermaid graph LR subgraph "Phase 1: Acknowledge (< 5 min)" Ack1["Acknowledge alert in PagerDuty/AlertManager"] Ack2["Join incident channel Slack #incident-response"] Ack3["Announce response 'I'm investigating [INCIDENT]'"] Ack4["Silence related alerts reduce noise"] end subgraph "Phase 2: Diagnose (< 15 min)" Diag1["Check monitoring Grafana dashboards"] Diag2["Review logs Loki queries"] Diag3["Identify root cause Use runbook"] Diag4["Update incident channel with findings"] end subgraph "Phase 3: Resolve (Variable)" Res1["Follow runbook resolution steps"] Res2["Document actions in incident channel"] Res3["Verify fix check metrics/logs"] end subgraph "Phase 4: Post-Incident" Post1["Post-incident review document learnings"] Post2["Update runbooks if needed"] Post3["Close incident"] end Ack1 --> Ack2 --> Ack3 --> Ack4 Ack4 --> Diag1 Diag1 --> Diag2 --> Diag3 --> Diag4 Diag4 --> Res1 Res1 --> Res2 --> Res3 Res3 --> Post1 --> Post2 --> Post3 style Ack1 fill:#ffcccc style Diag1 fill:#fff4e1 style Res1 fill:#e1f5ff style Post1 fill:#ccffcc ``` **Phase Details:** ### Phase 1: Acknowledge (< 5 min) 1. **Acknowledge alert** in PagerDuty/AlertManager 1. **Join incident channel** (Slack #incident-response) 1. **Announce response:** "I'm investigating [INCIDENT]" 1. **Silence related alerts** to reduce noise ### Phase 2: Diagnose (< 15 min) 1. **Check monitoring:** Grafana dashboards, Prometheus alerts 1. **Review logs:** Loki queries for errors 1. **Identify root cause:** Use runbook diagnosis section 1. **Update incident channel** with findings ### Phase 3: Resolve (Variable) 1. **Follow runbook resolution steps** 1. **Document actions taken** in incident channel 1. **Verify fix:** Check metrics/logs for improvement 1. **Update stakeholders** on progress ### Phase 4: Verify (< 10 min) 1. **Confirm resolution:** Metrics/alerts return to normal 1. **Test functionality:** Run smoke tests 1. **Monitor for recurrence:** Watch for 30 minutes 1. **Remove silences** if stable ### Phase 5: Close (< 30 min) 1. **Update incident ticket** with resolution 1. **Schedule post-incident review** (within 48 hours) 1. **Document learnings** in incident log 1. **Close alerts** in AlertManager ______________________________________________________________________ ## Runbook Index | # | Runbook | Symptoms | Severity | Est. Time | |---|---------|----------|----------|-----------| | 1 | \[[#runbook-1-resolution-failures|Resolution Failures]\] | Components not resolving, errors | P0 | 15-30 min | | 2 | \[[#runbook-2-hot-swap-failures|Hot-Swap Failures]\] | Swaps failing, rollbacks | P0 | 20-45 min | | 3 | \[[#runbook-3-remote-sync-failures|Remote Sync Failures]\] | Cannot fetch manifests | P1 | 15-30 min | | 4 | \[[#runbook-4-cache-corruption|Cache Corruption]\] | Digest mismatches | P0 | 10-20 min | | 5 | \[[#runbook-5-memory-exhaustion|Memory Exhaustion]\] | OOMKilled, high memory | P0 | 20-40 min | ______________________________________________________________________ ## Runbooks ______________________________________________________________________ ## Runbook 1: Resolution Failures **Alert:** `OneiricResolutionFailureRateHigh` **Severity:** P0 - Critical **Response Time:** < 5 minutes **Owner:** Platform Team ### Symptoms - Alert firing: "Oneiric resolution failure rate exceeds 5%" - Components cannot be discovered/resolved - Application errors: "No candidate found for domain/key" - Grafana: Resolution success rate < 95% ### Impact - **User Impact:** Application features broken, requests failing - **SLA Impact:** High - service degradation - **Affected Domains:** Varies (check alert labels) ### Diagnosis **Step 1: Check Resolution Dashboard** ```bash # Access Grafana Resolution Dashboard open http://grafana:3000/d/oneiric-resolution # Key metrics to review: # - Resolution success rate (should be > 99%) # - Resolution failures by domain # - Recent error spikes ``` **Step 2: Query Failed Resolutions** ```promql # Failed resolutions in last 5 minutes sum(rate(oneiric_resolution_total{outcome="failed"}[5m])) by (domain, key) # Resolution error rate (1 - oneiric:resolution_success_rate_global:5m) * 100 ``` **Step 3: Check Logs for Errors** ```logql # Failed resolution logs {app="oneiric"} | json | event="resolver-decision" | outcome="failed" # Group by domain/key to find patterns {app="oneiric"} | json | event="resolver-decision" | outcome="failed" | line_format "{{.domain}}/{{.key}}" ``` **Step 4: Check Registered Candidates** ```bash # List all registered candidates uv run python -m oneiric.cli list --domain adapter --json # Check specific domain uv run python -m oneiric.cli list --domain service --json # Explain why resolution is failing uv run python -m oneiric.cli explain status --domain service ``` **Step 5: Common Root Causes** - ✅ **No candidates registered** - Check registration flow - ✅ **All candidates shadowed** - Review stack_level/priority - ✅ **Health check failures** - Investigate provider health - ✅ **Config file errors** - Validate YAML syntax - ✅ **Remote manifest issues** - Check remote sync status ### Resolution #### Scenario A: No Candidates Registered **Problem:** Resolver has no candidates for domain/key ```bash # Step 1: Check if candidates are registered uv run python -m oneiric.cli list --domain adapter # Step 2: If empty, check registration flow # - Verify plugins loaded # - Check local config files exist # - Verify remote manifest synced # Step 3: Check plugin diagnostics uv run python -m oneiric.cli plugins # Step 4: Check remote sync status uv run python -m oneiric.cli remote-status # Step 5: Force remote sync if stale uv run python -m oneiric.cli remote-sync --manifest <url> ``` **If plugins not loaded:** ```bash # Check plugin entry points python -c "import pkg_resources; print(list(pkg_resources.iter_entry_points('oneiric.adapters')))" # Re-install plugins uv pip install -e /path/to/plugin # Restart Oneiric # OR systemctl restart oneiric ``` #### Scenario B: All Candidates Shadowed **Problem:** Candidates exist but all are shadowed (inactive) ```bash # Step 1: List shadowed candidates uv run python -m oneiric.cli list --domain adapter --show-shadowed # Step 2: Check explain output for precedence uv run python -m oneiric.cli explain status --domain service # Step 3: Adjust selections in config # Edit settings/<domain>.yml vim settings/adapters.yml # Add explicit selection: # selections: # cache: redis # Force redis provider # Step 4: Reload config (watchers pick up changes automatically) # Or restart if watchers disabled: ``` **If stack_level issue:** ```bash # Option 1: Adjust ONEIRIC_STACK_ORDER env var export ONEIRIC_STACK_ORDER="myapp:20,oneiric:10,default:0" # Option 2: Edit metadata to increase stack_level # (requires code change in adapter registration) ``` #### Scenario C: Health Check Failures **Problem:** Candidates registered but health checks failing ```bash # Step 1: Check lifecycle status uv run python -m oneiric.cli status --domain adapter --key cache --json # Step 2: Review recent health check failures {app="oneiric"} | json | event="health-check-failed" # Step 3: Probe specific instance uv run python -m oneiric.cli health --probe --domain adapter --key cache # Step 4: Check provider configuration # Review settings/<domain>.yml for provider settings vim settings/adapters.yml # Step 5: Fix provider config (e.g., wrong Redis host) # Update settings and reload ``` #### Scenario D: Config File Errors **Problem:** Invalid YAML syntax in config files ```bash # Step 1: Validate YAML syntax python -c "import yaml; yaml.safe_load(open('settings/adapters.yml'))" # Step 2: Check for common issues: # - Missing colons # - Incorrect indentation # - Duplicate keys # Step 3: Fix syntax errors vim settings/adapters.yml # Step 4: Verify fix python -c "import yaml; yaml.safe_load(open('settings/adapters.yml'))" # Step 5: Reload config ``` ### Verification ```bash # Step 1: Check resolution success rate curl 'http://prometheus:9090/api/v1/query?query=oneiric:resolution_success_rate_global:5m' # Expected: > 0.99 # Step 2: Query recent resolutions {app="oneiric"} | json | event="resolver-decision" | outcome="success" # Step 3: Check Grafana dashboard open http://grafana:3000/d/oneiric-resolution # Verify success rate > 99% # Step 4: Test resolution manually uv run python -m oneiric.cli explain status --domain adapter # Step 5: Monitor for 15 minutes to ensure stability ``` ### Prevention 1. **Implement registration tests:** Unit tests verify candidates registered 1. **Config validation:** CI/CD validates YAML syntax before deploy 1. **Health check tuning:** Increase timeouts if providers slow to initialize 1. **Monitoring:** Alert on low candidate counts per domain 1. **Documentation:** Document registration process for each domain ### Escalation - **Initial Response:** Platform engineer (on-call) - **After 30 min:** Escalate to Platform Team Lead - **After 1 hour:** Escalate to Engineering Manager - **Contact:** [email protected], Slack #platform-oncall ______________________________________________________________________ ## Runbook 2: Hot-Swap Failures **Alert:** `OneiricLifecycleSwapFailureRateHigh` **Severity:** P0 - Critical **Response Time:** < 5 minutes **Owner:** DevOps Team ### Symptoms - Alert firing: "Oneiric swap failure rate exceeds 10%" - Hot-swap operations failing - Rollback operations occurring - Configuration changes not applying ### Impact - **User Impact:** Cannot deploy updates, stuck on old versions - **SLA Impact:** High - deployment blocked - **Affected Operations:** Configuration updates, version upgrades ### Diagnosis **Step 1: Check Lifecycle Dashboard** ```bash # Access Grafana Lifecycle Dashboard open http://grafana:3000/d/oneiric-lifecycle # Key metrics: # - Swap success rate (should be > 95%) # - Rollback rate # - Swap failure reasons ``` **Step 2: Query Failed Swaps** ```promql # Failed swaps in last 5 minutes sum(rate(oneiric_lifecycle_swap_total{outcome="failed"}[5m])) by (domain, key, provider) # Rollback rate rate(oneiric_lifecycle_swap_total{outcome="rollback"}[5m]) ``` **Step 3: Check Swap Logs** ```logql # Failed swap logs {app="oneiric"} | json | event="swap-failed" # Rollback logs {app="oneiric"} | json | event="swap-rollback" # Extract error messages {app="oneiric"} | json | event="swap-failed" | line_format "{{.domain}}/{{.key}}: {{.error}}" ``` **Step 4: Check Lifecycle Status** ```bash # Get lifecycle status for specific component uv run python -m oneiric.cli status --domain adapter --key cache --json # Check recent swap history {app="oneiric"} | json | event=~"swap-(complete|failed|rollback)" | line_format "{{.timestamp}}: {{.event}} {{.domain}}/{{.key}}" ``` **Step 5: Common Root Causes** - ✅ **Health check failures** - New instance fails health probes - ✅ **Factory errors** - Cannot instantiate new provider - ✅ **Timeout** - Swap takes longer than configured timeout - ✅ **Cleanup failures** - Old instance cleanup fails - ✅ **Hook errors** - Pre/post swap hooks fail ### Resolution #### Scenario A: Health Check Failures **Problem:** New instance fails health check during swap ```bash # Step 1: Review health check logs {app="oneiric"} | json | event="health-check-failed" | line_format "{{.provider}}: {{.error}}" # Step 2: Check provider configuration vim settings/adapters.yml # Verify connection strings, credentials, ports # Step 3: Test provider connectivity manually # Example for Redis: redis-cli -h redis-host -p 6379 PING # Step 4: Increase health check timeout if needed # Edit oneiric/core/config.py # lifecycle: # health_timeout: 30 # seconds # Step 5: Retry swap with longer timeout uv run python -m oneiric.cli swap --domain adapter --key cache --provider redis ``` **If provider dependency missing:** ```bash # Install missing dependencies uv pip install redis aioredis # Restart Oneiric ``` #### Scenario B: Factory Import Errors **Problem:** Cannot import or instantiate provider factory ```bash # Step 1: Check factory error logs {app="oneiric"} | json | event="swap-failed" | error=~".*ImportError.*" # Step 2: Verify factory path in metadata # Check adapter registration code grep -r "factory=" oneiric/adapters/*.py # Step 3: Test import manually python -c "from myapp.adapters.cache import RedisCache; print(RedisCache)" # Step 4: Fix import path if incorrect # Update adapter metadata registration vim myapp/adapters/__init__.py # Step 5: Restart and retry uv run python -m oneiric.cli swap --domain adapter --key cache --provider redis ``` #### Scenario C: Swap Timeout **Problem:** Swap operation exceeds timeout ```bash # Step 1: Check swap duration metrics histogram_quantile(0.95, rate(oneiric_lifecycle_swap_duration_ms_bucket[5m])) # Step 2: Review slow swap logs {app="oneiric"} | json | event="swap-complete" | duration_ms > 10000 # Step 3: Increase swap timeout # Edit settings vim settings/app.yml # Add lifecycle config: # lifecycle: # activation_timeout: 60 # health_timeout: 30 # cleanup_timeout: 30 # Step 4: Restart with new config # Step 5: Retry swap uv run python -m oneiric.cli swap --domain adapter --key cache --provider redis ``` #### Scenario D: Cleanup Failures **Problem:** Old instance cleanup fails but new instance active ```bash # Step 1: Check cleanup error logs {app="oneiric"} | json | event="cleanup-failed" # Step 2: Manually cleanup if safe # (Depends on provider - be cautious) # Step 3: Force swap to bypass cleanup uv run python -m oneiric.cli swap --domain adapter --key cache --provider redis --force # Step 4: Fix cleanup logic in provider # (Requires code change) # Step 5: Monitor for resource leaks # Check active instances count oneiric:system_active_instances_total:5m ``` ### Verification ```bash # Step 1: Check swap success rate curl 'http://prometheus:9090/api/v1/query?query=oneiric:lifecycle_swap_success_rate:5m' # Expected: > 0.95 # Step 2: Verify component active uv run python -m oneiric.cli status --domain adapter --key cache # Expected: state="ready", provider="redis" # Step 3: Test functionality # Run smoke test for swapped component # Step 4: Monitor for 15 minutes # Watch for rollbacks or failures ``` ### Prevention 1. **Health check tuning:** Increase timeouts for slow-initializing providers 1. **Factory validation:** Unit tests verify factory imports work 1. **Staged rollout:** Test swaps in staging before production 1. **Monitoring:** Alert on high rollback rates 1. **Cleanup hardening:** Ensure cleanup logic handles errors gracefully ### Escalation - **Initial Response:** DevOps engineer (on-call) - **After 30 min:** Escalate to DevOps Team Lead - **After 1 hour:** Escalate to Provider Owner - **Contact:** [email protected], Slack #devops-oncall ______________________________________________________________________ ## Runbook 3: Remote Sync Failures **Alert:** `OneiricRemoteSyncConsecutiveFailures` **Severity:** P1 - High **Response Time:** < 15 minutes **Owner:** Infrastructure Team ### Symptoms - Alert firing: "Remote sync failed 3+ consecutive times" - Cannot fetch remote manifests - Stale component versions - Remote sync duration high or timing out ### Impact - **User Impact:** Missing security updates, cannot pull new components - **SLA Impact:** Medium - stale versions but service operational - **Affected Operations:** Remote artifact updates, manifest changes ### Diagnosis **Step 1: Check Remote Sync Dashboard** ```bash # Access Grafana Remote Dashboard open http://grafana:3000/d/oneiric-remote # Key metrics: # - Sync success rate (should be > 99%) # - Last sync time # - Sync latency ``` **Step 2: Query Sync Status** ```bash # Check remote sync status uv run python -m oneiric.cli remote-status # Expected output: # - Last sync time # - Success/failure count # - Per-domain registrations ``` **Step 3: Check Sync Error Logs** ```logql # Remote sync errors {app="oneiric"} | json | event="remote-sync-error" # Network errors {app="oneiric"} | json | event="remote-sync-error" | error=~".*timeout.*|.*connection.*" # Signature verification errors {app="oneiric"} | json | event="signature-verification-failed" ``` **Step 4: Test Manifest URL** ```bash # Manually fetch manifest curl -v https://manifests.example.com/oneiric/manifest.yaml # Check DNS resolution nslookup manifests.example.com # Check network connectivity ping manifests.example.com ``` **Step 5: Common Root Causes** - ✅ **Network issues** - DNS, firewall, proxy blocking - ✅ **Signature verification failed** - Key rotation, manifest tampering - ✅ **Digest mismatch** - Artifact corruption or modified - ✅ **Circuit breaker open** - Too many consecutive failures - ✅ **Manifest syntax error** - Invalid YAML ### Resolution #### Scenario A: Network Issues **Problem:** Cannot reach remote manifest URL ```bash # Step 1: Check network connectivity ping manifests.example.com curl -v https://manifests.example.com/oneiric/manifest.yaml # Step 2: Check DNS resolution nslookup manifests.example.com host manifests.example.com # Step 3: Check firewall/proxy # Verify egress rules allow HTTPS to manifest host # Step 4: Test from Oneiric container # Step 5: If network OK, check circuit breaker # Wait for circuit breaker reset (default 60s) # Or restart to reset immediately ``` **If behind corporate proxy:** ```bash # Set proxy environment variables export HTTP_PROXY=http://proxy.example.com:8080 export HTTPS_PROXY=http://proxy.example.com:8080 export NO_PROXY=localhost,127.0.0.1 # Restart Oneiric with proxy vars ``` #### Scenario B: Signature Verification Failed **Problem:** ED25519 signature verification failing ```bash # Step 1: Check signature verification logs {app="oneiric"} | json | event="signature-verification-failed" # Step 2: Verify public key configured # Check settings/app.yml vim settings/app.yml # remote: # public_key: "ed25519_public_key_here" # Step 3: If key rotated, update config # Get new public key from manifest publisher # Update settings/app.yml # Restart Oneiric # Step 4: Temporarily disable signature verification (NOT RECOMMENDED) # Only for emergency debugging # remote: # require_signature: false # Step 5: Re-sign manifest with correct key # Contact manifest publisher/release engineer ``` **Security Note:** Signature verification failures may indicate: - Key rotation (legitimate) - MITM attack (security breach) - Manifest corruption (integrity issue) **If suspected security issue, escalate immediately to security team.** #### Scenario C: Digest Mismatch **Problem:** SHA256 digest doesn't match cached artifact ```bash # Step 1: Check digest errors {app="oneiric"} | json | event="digest-check-failed" # Step 2: Clear cache for affected artifact rm -rf .oneiric_cache/artifacts/<artifact_name> # Step 3: Force re-download uv run python -m oneiric.cli remote-sync --manifest <url> # Step 4: Verify digest matches # Download artifact manually and check SHA256 wget <artifact_url> sha256sum <artifact_file> # Step 5: If digest still mismatches, artifact corrupted # Contact release engineer to re-upload ``` #### Scenario D: Circuit Breaker Open **Problem:** Too many failures triggered circuit breaker ```bash # Step 1: Check circuit breaker state # Look for "circuit breaker open" in logs {app="oneiric"} | json | circuit_breaker="open" # Step 2: Wait for reset timeout (default 60s) # Or restart to reset immediately # Step 3: Fix underlying issue first # (network, signature, etc.) # Step 4: Retry sync after breaker resets uv run python -m oneiric.cli remote-sync --manifest <url> # Step 5: Adjust circuit breaker settings if too sensitive # Edit settings/app.yml # remote: # failure_threshold: 5 # Increase from 3 # reset_timeout: 120 # Increase from 60 ``` ### Verification ```bash # Step 1: Check sync success rate curl 'http://prometheus:9090/api/v1/query?query=oneiric:remote_sync_success_rate:5m' # Expected: > 0.99 # Step 2: Verify recent sync succeeded uv run python -m oneiric.cli remote-status # Check last_sync time is recent # Step 3: Verify artifacts registered uv run python -m oneiric.cli list --domain adapter # Should include remote-sourced candidates # Step 4: Monitor for 30 minutes # Watch for sync failures ``` ### Prevention 1. **Network monitoring:** Alert on DNS failures, connection timeouts 1. **Key rotation process:** Document procedure, test before production 1. **Manifest validation:** CI/CD validates manifest syntax before publish 1. **Circuit breaker tuning:** Adjust thresholds based on network reliability 1. **Artifact integrity:** Implement checksum verification in upload pipeline ### Escalation - **Initial Response:** Infrastructure engineer (on-call) - **After 30 min:** Escalate to Infrastructure Team Lead - **After 1 hour:** Escalate to Release Engineer - **If security issue:** Immediately escalate to Security Team - **Contact:** [email protected], Slack #infra-oncall ______________________________________________________________________ ## Runbook 4: Cache Corruption **Alert:** `OneiricDigestVerificationFailed` **Severity:** P0 - Critical (Security) **Response Time:** < 5 minutes **Owner:** Security Team + Platform Team ### Symptoms - Alert firing: "Artifact digest verification failed" - SHA256 mismatch errors - Corrupted cache files - Possible security breach indicators ### Impact - **User Impact:** Potentially running corrupted/malicious code - **Security Impact:** Critical - possible supply chain attack - **SLA Impact:** Critical - immediate action required ### Diagnosis **Step 1: Assess Security Risk** ```bash # Check if widespread or isolated {app="oneiric"} | json | event="digest-check-failed" | line_format "{{.artifact}}: {{.expected_digest}} != {{.actual_digest}}" # Multiple artifacts affected? = Possible attack # Single artifact? = Likely corruption # Check if digest changed in manifest curl https://manifests.example.com/oneiric/manifest.yaml | grep sha256 ``` **Step 2: Isolate Affected Systems** ```bash # If suspected attack: # 1. Stop Oneiric immediately systemctl stop oneiric # 2. Preserve evidence cp -r .oneiric_cache /tmp/evidence-$(date +%Y%m%d-%H%M%S) # 3. Notify security team # Slack #security-incidents ``` **Step 3: Investigate Root Cause** ```bash # Check disk errors (corruption) dmesg | grep -i error smartctl -a /dev/sda # Check file system integrity fsck /dev/sda1 # Check for unauthorized modifications stat .oneiric_cache/artifacts/<artifact> ls -la .oneiric_cache/artifacts/ ``` **Step 4: Verify Manifest Integrity** ```bash # Re-download manifest from trusted source curl -o manifest-fresh.yaml https://manifests.example.com/oneiric/manifest.yaml # Compare with cached version diff .oneiric_cache/manifest.yaml manifest-fresh.yaml # Verify signature # (Oneiric does this automatically, but verify manually) ``` ### Resolution #### Scenario A: Single Artifact Corruption (Disk Error) **Problem:** One artifact corrupted, likely disk issue ```bash # Step 1: Clear corrupted artifact rm -f .oneiric_cache/artifacts/<corrupted_artifact> # Step 2: Force re-download uv run python -m oneiric.cli remote-sync --manifest <url> # Step 3: Verify new digest matches {app="oneiric"} | json | event="digest-check-success" | artifact="<artifact>" # Step 4: Check disk health smartctl -a /dev/sda dmesg | grep -i error # Step 5: If disk failing, replace hardware # Schedule maintenance window ``` #### Scenario B: Multiple Artifacts Corrupted (Possible Attack) **Problem:** Many artifacts affected, possible security breach ```bash # Step 1: STOP ALL ONEIRIC INSTANCES # Step 2: Preserve evidence tar czf /tmp/oneiric-cache-$(date +%Y%m%d-%H%M%S).tar.gz .oneiric_cache/ # Step 3: Notify security team immediately # Slack #security-incidents # Email: [email protected] # Page: security-oncall # Step 4: Security team investigates # - Check manifest source for compromise # - Verify signature chain # - Analyze artifacts for malicious code # Step 5: Once cleared, full cache rebuild rm -rf .oneiric_cache/ uv run python -m oneiric.cli remote-sync --manifest <trusted_url> # Step 6: Restart with clean cache ``` #### Scenario C: Manifest Tampering **Problem:** Manifest digest changed, possible MITM ```bash # Step 1: Verify manifest signature # Oneiric logs signature verification result {app="oneiric"} | json | event="signature-verification-failed" # Step 2: If signature fails, DO NOT PROCEED # This indicates manifest tampering or MITM attack # Step 3: Notify security team # Escalate to P0 security incident # Step 4: Investigate network path # Check for proxy, firewall, CDN issues # Verify TLS certificates # Step 5: Once resolved, update public key if rotated # Or fix network issue if MITM ``` ### Verification ```bash # Step 1: Verify all digests match curl 'http://prometheus:9090/api/v1/query?query=rate(oneiric_remote_digest_checks_total{outcome="failed"}[5m])' # Expected: 0 # Step 2: Check recent digest verifications {app="oneiric"} | json | event="digest-check-success" # Step 3: List cached artifacts ls -lh .oneiric_cache/artifacts/ # Step 4: Verify no security alerts curl http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.labels.component=="security")' # Expected: empty # Step 5: Monitor for 1 hour # Watch for recurrence ``` ### Prevention 1. **Disk monitoring:** Alert on disk errors, SMART failures 1. **Manifest signing:** Always verify ED25519 signatures 1. **Network security:** Use TLS, certificate pinning 1. **Access control:** Restrict who can publish manifests 1. **Audit logging:** Log all manifest/artifact changes 1. **Incident response drills:** Practice security scenarios ### Escalation - **Initial Response:** Security engineer (on-call) + Platform engineer - **Immediate:** Notify Security Team Lead - **Immediately:** Notify CISO if widespread compromise suspected - **Contact:** [email protected], Slack #security-incidents - **PagerDuty:** security-critical escalation policy **NOTE:** This is a security incident. Follow your organization's security incident response procedures. ______________________________________________________________________ ## Runbook 5: Memory Exhaustion **Alert:** `OneiricActiveInstancesExtremelyHigh` **Severity:** P0 - Critical **Response Time:** < 10 minutes **Owner:** Platform Team + SRE ### Symptoms - Alert firing: "200+ active instances, risk of memory exhaustion" - High memory usage - Application slowdown or crashes ### Impact - **User Impact:** Application crash, service unavailability - **SLA Impact:** Critical - service down risk - **Affected Systems:** Entire Oneiric runtime ### Diagnosis **Step 1: Check Memory Metrics** ```bash # Access Grafana Performance Dashboard open http://grafana:3000/d/oneiric-performance # Check: # - Active instances count # - Memory usage estimate # - Memory growth rate ``` **Step 2: Query Active Instances** ```promql # Total active instances oneiric:system_active_instances_total:5m # By domain sum(oneiric_lifecycle_active_instances) by (domain) # Memory estimate (50MB per instance) oneiric:system_memory_usage_estimate_bytes:5m / (1024^3) ``` **Step 3: Check System Memory** ```bash # Container memory usage # System memory (bare metal) free -h vmstat 1 ``` **Step 4: Identify Instance Leak** ```bash # Check lifecycle status uv run python -m oneiric.cli status --domain adapter --json # List all active instances {app="oneiric"} | json | event="instance-activated" | line_format "{{.domain}}/{{.key}}: {{.provider}}" # Check for instances not being cleaned up # Compare activation vs cleanup counts sum(oneiric_lifecycle_swap_total) - sum(oneiric_lifecycle_cleanup_total) ``` ### Resolution #### Scenario A: Instance Leak (Cleanup Not Called) **Problem:** Old instances not being cleaned up after swaps ```bash # Step 1: Check cleanup logs {app="oneiric"} | json | event=~"cleanup-(started|complete|failed)" # Step 2: If cleanup not called, code bug # Emergency: Restart to force cleanup # Step 3: Monitor instance count after restart oneiric:system_active_instances_total:5m # Step 4: Fix cleanup logic (code change required) # Ensure lifecycle.swap() calls cleanup_old() # Step 5: Deploy fix # Build new image, deploy to production ``` #### Scenario B: Memory Leak in Provider **Problem:** Provider holding references, not being garbage collected ```bash # Step 1: Profile memory usage # Install memray uv pip install memray # Step 2: Run with memory profiling memray run --live-port 8000 -m oneiric.cli orchestrate # Step 3: Access live dashboard open http://localhost:8000 # Step 4: Identify leaking provider # Look for increasing memory in flamegraph # Step 5: Fix provider cleanup # Ensure __del__() or cleanup() releases resources # Step 6: Deploy fix ``` #### Scenario C: Too Many Domains/Keys **Problem:** Legitimately high instance count, need more memory ```bash # Step 1: Calculate required memory # Formula: instances * 50MB + 500MB overhead # 200 instances = 200 * 50 + 500 = 10.5GB # Step 4: Increase memory limit (systemd) sudo systemctl edit oneiric # Add: # [Service] # MemoryMax=12G sudo systemctl daemon-reload sudo systemctl restart oneiric # Step 5: Monitor memory usage ``` #### Scenario D: Rapid Swapping Loop **Problem:** Components swapping rapidly, creating/destroying instances ```bash # Step 1: Check swap rate rate(oneiric_lifecycle_swap_total[5m]) # Step 2: Identify rapidly swapping components sum(rate(oneiric_lifecycle_swap_total[5m])) by (domain, key) # Step 3: Check for config watcher thrashing # Look for rapid config file changes {app="oneiric"} | json | event="config-changed" | line_format "{{.timestamp}}: {{.file}}" # Step 4: Pause watchers temporarily # Stop config watcher # (Requires code change or CLI command) # Step 5: Fix root cause # - Debounce config changes # - Increase watcher poll interval # - Fix config that's changing rapidly ``` ### Verification ```bash # Step 1: Check instance count stabilized curl 'http://prometheus:9090/api/v1/query?query=oneiric:system_active_instances_total:5m' # Expected: < 100 (reasonable) # Step 2: Check memory usage dropped # Step 3: Verify no OOMKills # Expected: false # Expected: no recent events # Step 4: Monitor for 30 minutes # Watch for memory growth ``` ### Prevention 1. **Cleanup enforcement:** Unit tests verify cleanup called after swaps 1. **Memory limits:** Set appropriate limits per environment 1. **Monitoring:** Alert on high instance counts before critical 1. **Resource pooling:** Reuse instances where possible 1. **Profiling:** Regular memory profiling in staging ### Escalation - **Initial Response:** Platform engineer (on-call) - **After 20 min:** Escalate to SRE Team - **After 1 hour:** Escalate to Engineering Manager - **Contact:** [email protected], [email protected], Slack #platform-oncall ______________________________________________________________________ ## Post-Incident Review After resolving any P0 or P1 incident, schedule a post-incident review within 48 hours. ### PIR Template ```markdown # Post-Incident Review: [INCIDENT NAME] **Date:** 2025-11-26 **Incident ID:** INC-12345 **Severity:** P0 - Critical **Duration:** 45 minutes **Participants:** Alice (Platform), Bob (SRE), Carol (Engineering Manager) ## Summary [1-2 sentence summary of incident] ## Timeline | Time | Event | |------|-------| | 14:00 | Alert fired: OneiricResolutionFailureRateHigh | | 14:02 | On-call acknowledged, began investigation | | 14:10 | Root cause identified: config file syntax error | | 14:15 | Fix deployed, config validated | | 14:30 | Metrics returned to normal | | 14:45 | Incident closed | ## Impact - **User Impact:** 45 minutes of degraded service - **Affected Users:** ~1,000 users - **Revenue Impact:** ~$500 (estimated) ## Root Cause [Detailed explanation of root cause] ## What Went Well - Alert fired within 1 minute of issue - On-call responded quickly (< 2 min) - Runbook was accurate and helpful - Fix deployed rapidly ## What Went Wrong - Config validation not in CI/CD pipeline - Lack of staging environment testing - Insufficient monitoring of config changes ## Action Items | Action | Owner | Due Date | Priority | |--------|-------|----------|----------| | Add config validation to CI/CD | Alice | 2025-12-01 | P0 | | Implement staging environment | Bob | 2025-12-15 | P1 | | Add config change monitoring | Carol | 2025-12-10 | P2 | ## Lessons Learned [Key takeaways and learnings] ``` ______________________________________________________________________ ## Escalation Matrix | Severity | Initial Response | After 30 min | After 1 hour | After 2 hours | |----------|-----------------|--------------|--------------|---------------| | **P0** | On-call engineer | Team Lead | Engineering Manager | Director of Engineering | | **P1** | On-call engineer | Team Lead | Engineering Manager | - | | **P2** | On-call engineer | Team Lead | - | - | | **P3** | On-call engineer | - | - | - | ### Contact Information | Team | Email | Slack | PagerDuty | |------|-------|-------|-----------| | **Platform** | [email protected] | #platform-oncall | platform-escalation | | **Security** | [email protected] | #security-incidents | security-critical | | **DevOps** | [email protected] | #devops-oncall | devops-escalation | | **Infrastructure** | [email protected] | #infra-oncall | infra-escalation | | **SRE** | [email protected] | #sre-oncall | sre-escalation | ______________________________________________________________________ ## Additional Resources - **Monitoring Dashboards:** http://grafana:3000/dashboards - **Prometheus Alerts:** http://prometheus:9090/alerts - **AlertManager:** http://alertmanager:9093 - **Loki Logs:** http://grafana:3000/explore (Loki datasource) - **Maintenance Runbooks:** `docs/runbooks/MAINTENANCE.md` - **Troubleshooting Guide:** `docs/runbooks/TROUBLESHOOTING.md` ______________________________________________________________________ **Document Version:** 1.0 **Last Reviewed:** 2025-11-26 **Next Review:** 2026-02-26 **Feedback:** [email protected]

Related Documents

評估系統

Monitoring Guide - HwpBridge

T20_enhancement_proposals

LLM Judge — Setup & Operations