# Incident Response Plan

> **Scope:** Comprehensive incident response plan covering roles, communication, escalation, and postmortem processes. For the incident lifecycle summary and severity table overview, see [gitops-workflow.md](./gitops-workflow.md#incident-response-framework). For rollback procedures, see [gitops-workflow.md](./gitops-workflow.md#rollback-procedures). This document is the detailed operational reference for handling incidents.

sholaj

May 2, 2026

---

## Severity Classification

### Severity Table

| Severity | Definition | Customer Impact | Response SLA | Update Cadence | Resolution Target | Examples |
|---|---|---|---|---|---|---|
| **SEV-1** | Complete service outage, data loss risk, or active security breach | All or most customers cannot use the platform | Acknowledge: 5 min, Engage: 10 min | Every 15 min | 1 hour to mitigate | Full API outage; database unavailable; data breach detected; all pods crash-looping |
| **SEV-2** | Major degradation affecting >25% of customers or a critical feature is unavailable | Significant subset of customers impacted; core workflows broken | Acknowledge: 10 min, Engage: 15 min | Every 30 min | 4 hours to mitigate | AI inference service down; error rate > 5%; database replication lag > 30s; payment processing failure |
| **SEV-3** | Minor degradation with workaround available | Small number of customers affected; non-critical feature impaired | Acknowledge: 30 min, Engage: 2 hours | Every 2 hours | Next business day | Slow queries on non-critical endpoint; single dashboard unavailable; non-critical background job failures |
| **SEV-4** | Cosmetic or minimal impact issue | No meaningful customer impact | Acknowledge: 4 hours | As needed | Within 1 week | UI rendering glitch; non-critical metric missing from dashboard; minor log formatting issue |

### Severity Decision Guide

When in doubt, **escalate up** (declare a higher severity). It is always acceptable to downgrade a severity after triage; it is never acceptable to under-declare and discover later that impact was worse than assessed.

- If **any** customer reports being unable to use the platform → SEV-1
- If monitoring shows error budget burning at > 14.4x rate → SEV-1
- If a security alert fires (intrusion detection, credential exposure) → SEV-1
- If error rate exceeds 5% but service is partially functional → SEV-2
- If a single non-critical component is degraded → SEV-3

---

## Incident Commander (IC) Responsibilities

The Incident Commander is the single point of authority during an incident. For SEV-1 and SEV-2, an IC must be assigned within the first 10 minutes.

### Who Can Be IC

- Primary on-call SRE (default IC for all incidents)
- Staff SRE (takes over IC for SEV-1 or complex multi-service incidents)
- Tech Lead (can serve as IC if SRE is unavailable)

### IC Duties

| Responsibility | Detail |
|---|---|
| **Declare the incident** | Set severity, open the incident Slack channel, page required responders |
| **Coordinate response** | Assign roles (see below), ensure parallel workstreams are not conflicting |
| **Manage communication** | Post regular updates to the incident channel and status page |
| **Make decisions** | Authorise rollbacks, break-glass deployments, or data recovery actions |
| **Track timeline** | Maintain a running timeline of events, decisions, and actions in the incident channel |
| **Prevent scope creep** | Keep the team focused on mitigation first, root cause second |
| **Declare resolution** | Confirm service is restored, monitoring is stable, and schedule the postmortem |
| **Never debug** | The IC coordinates — they do not investigate or write code. If the IC needs to debug, they hand off IC to someone else. |

### Supporting Roles (SEV-1 War Room)

| Role | Responsibility | Assigned To |
|---|---|---|
| **Incident Commander** | Overall coordination and decision-making | On-call SRE or Staff SRE |
| **Technical Lead** | Directs investigation and remediation efforts | Senior engineer familiar with affected service |
| **Communications Lead** | Handles all external communications (status page, customer updates) | Engineering Manager or designated person |
| **Scribe** | Documents timeline, actions, and decisions in real time | Any available engineer |

---

## Communication Plan

### Internal Communication

#### Slack Channel Naming Convention

```
#inc-YYYY-MM-DD-short-title
```

Examples:

- `#inc-2026-03-20-api-outage`
- `#inc-2026-03-20-db-connection-exhaustion`

#### Channel Setup (Automated via Slack Workflow)

When an incident is declared, the following happens automatically:

1. Incident channel is created with the naming convention above
2. On-call SRE, Staff SRE, and relevant service owners are invited
3. Channel topic is set to: `SEV-X | IC: @name | Status: Investigating`
4. PagerDuty incident link is pinned

#### Internal Update Template

```
--- INCIDENT UPDATE ---
Time: HH:MM UTC
Status: [Investigating / Identified / Mitigating / Monitoring / Resolved]
Severity: SEV-X
IC: @name

Current understanding:
[What we know about the issue]

Actions in progress:
- [Action 1] — @owner
- [Action 2] — @owner

Next update: HH:MM UTC
```

### External Communication

#### Status Page Updates (Statuspage.io)

| Severity | Status Page Action | Update Template |
|---|---|---|
| **SEV-1** | Post immediately (within 10 min of declaration) | "We are experiencing a service disruption affecting [component]. Our team is actively investigating. We will provide updates every 15 minutes." |
| **SEV-2** | Post within 20 min if customer-facing impact confirmed | "We are experiencing degraded performance for [component]. Some users may experience [symptoms]. Our team is working on a resolution." |
| **SEV-3** | Post only if customer-reported or affecting a published SLA | "We are aware of intermittent issues with [component]. A workaround is available: [workaround]. We are working on a permanent fix." |
| **SEV-4** | No external communication | — |

#### Customer Communication (Enterprise Accounts)

For SEV-1 and SEV-2 incidents affecting enterprise customers with contractual SLAs:

1. **Customer Success** is notified via `#customer-incidents` Slack channel within 15 minutes
2. **Account Manager** sends a direct communication to affected enterprise customers within 30 minutes
3. **Post-resolution** — affected enterprise customers receive a written incident summary within 24 hours

---

## Escalation Flowchart

```mermaid
flowchart TD
    classDef platform fill:#1B4D3E,stroke:#0f3429,color:white
    classDef aws fill:#2563EB,stroke:#1d4ed8,color:white
    classDef external fill:#F59E0B,stroke:#d97706,color:white
    classDef security fill:#DC2626,stroke:#b91c1c,color:white
    classDef supporting fill:#6B7280,stroke:#4b5563,color:white

    DETECT["Alert Fires\nPagerDuty pages primary on-call"]:::external -->|"page sent"| ACK{Acknowledged\nwithin 5 min?}
    ACK -->|"acknowledged"| ASSESS["Impact Assessment\nOn-call assesses scope"]:::platform
    ACK -->|"no response"| PAGE2["Secondary Page\nPagerDuty pages secondary on-call"]:::external
    PAGE2 -->|"check response"| ACK2{Acknowledged\nwithin 5 min?}
    ACK2 -->|"acknowledged"| ASSESS
    ACK2 -->|"no response"| PAGE_MGR["Manager Page\nPagerDuty pages Engineering Manager"]:::external
    PAGE_MGR -->|"manager responds"| ASSESS

    ASSESS -->|"classify severity"| SEV{Determine\nseverity}
    SEV -->|"SEV-3 or SEV-4"| TICKET["Create Ticket\nResolve during business hours"]:::supporting

    SEV -->|"SEV-2"| S2_CHANNEL["Open Incident Channel\nAssign IC"]:::platform
    S2_CHANNEL -->|"begin response"| S2_INVESTIGATE["Investigate & Mitigate\nFollow runbook"]:::platform
    S2_INVESTIGATE -->|"check status"| S2_TIMER{Mitigated\nwithin 30 min?}
    S2_TIMER -->|"resolved"| S2_MONITOR["Confirm Resolution\nMonitor for 1 hour"]:::supporting
    S2_TIMER -->|"not resolved"| S2_ESCALATE["Escalate\nStaff SRE, consider SEV-1"]:::security
    S2_ESCALATE -->|"continue mitigation"| S2_INVESTIGATE

    SEV -->|"SEV-1"| S1_DECLARE["Declare SEV-1\nCritical incident"]:::security
    S1_DECLARE -->|"mobilise team"| S1_CHANNEL["Incident Channel\nPage all required responders"]:::platform
    S1_CHANNEL -->|"activate war room"| S1_WARROOM["War Room\nAssign IC + roles"]:::platform
    S1_WARROOM -->|"focus on recovery"| S1_MITIGATE["Mitigation\nRollback / feature flag / redirect"]:::platform
    S1_MITIGATE -->|"check status"| S1_TIMER{Mitigated\nwithin 30 min?}
    S1_TIMER -->|"resolved"| S1_MONITOR["Confirm Resolution\nMonitor for 2 hours"]:::supporting
    S1_TIMER -->|"not resolved"| S1_VP["Executive Escalation\nVP Engineering"]:::security
    S1_VP -->|"assess root cause"| S1_AWS{AWS issue\nsuspected?}
    S1_AWS -->|"yes"| AWS_CASE["AWS Support Case\nSeverity 1"]:::aws
    S1_AWS -->|"no, continue internally"| S1_MITIGATE

    S2_MONITOR -->|"schedule review"| POSTMORTEM["Schedule Postmortem\nWithin 48 hours"]:::supporting
    S1_MONITOR -->|"schedule review"| POSTMORTEM
```

---

## On-Call Protocol

### Rotation Schedule

| Role | Rotation Length | Handoff | Coverage |
|---|---|---|---|
| **Primary on-call** | 1 week (Mon 09:00 – Mon 09:00 UTC) | Written handoff in `#sre-oncall` | 24/7 response |
| **Secondary on-call** | 1 week (same cycle) | Written handoff in `#sre-oncall` | Escalation backup |
| **Staff SRE (tertiary)** | Always reachable | N/A | SEV-1 escalation |

### PagerDuty Escalation Policy

```
Level 1: Primary on-call → 5 min timeout
Level 2: Secondary on-call → 5 min timeout
Level 3: Engineering Manager → 10 min timeout
Level 4: VP Engineering (SEV-1 only)
```

### Handoff Process

Every Monday at 09:00 UTC, the outgoing on-call engineer posts a handoff summary:

```
On-Call Handoff — YYYY-MM-DD
Outgoing: @[name]
Incoming: @[name]

Active issues:
- [Issue 1]: [status, next steps, link]
- [Issue 2]: [status, next steps, link]

Resolved this week:
- [Issue]: [brief summary]

Things to watch:
- [Upcoming deployment, known flaky alert, etc.]
```

### On-Call Expectations

- **Response time:** Acknowledge PagerDuty alert within 5 minutes, 24/7
- **Availability:** Must have laptop and internet access at all times during on-call shift
- **Escalation:** If you cannot respond (e.g., traveling, ill), escalate to secondary before going unavailable
- **Compensation:** On-call compensation per company policy (out of scope for this document)
- **Fatigue management:** No engineer should be primary on-call for more than 1 week in any 4-week period

---

## War Room Protocol (SEV-1)

### Activation Criteria

A war room is activated for:

- All SEV-1 incidents
- SEV-2 incidents that are not mitigated within 30 minutes
- Any incident where the IC requests additional coordination

### War Room Checklist

1. IC opens the incident Slack channel (naming: `#inc-YYYY-MM-DD-title`)
2. IC assigns roles: Technical Lead, Communications Lead, Scribe
3. IC posts initial situation summary with known facts only
4. Communications Lead posts first status page update within 10 minutes
5. Technical Lead directs investigation — assigns specific tasks to responders
6. Scribe maintains running timeline in the incident channel
7. IC posts updates every 15 minutes (even if update is "no change")
8. All communication happens in the incident channel — no side conversations in DMs
9. Non-essential personnel should not join the channel unless invited
10. IC declares "mitigated" when customer impact has stopped, then "resolved" when the fix is confirmed stable

### War Room Rules

- **Mitigation first, root cause second** — restore service before investigating why
- **One voice** — The IC makes decisions. Suggestions go through the IC.
- **No blame** — Focus on facts and actions, not on who caused what
- **Time-box investigations** — If an investigation path has not yielded results in 15 minutes, try a different approach
- **Capture everything** — Every action, hypothesis, and decision is logged by the scribe

---

## Postmortem Process

### When a Postmortem Is Required

| Severity | Postmortem Required | Timeline |
|---|---|---|
| SEV-1 | Always | Within 48 hours |
| SEV-2 | Always | Within 5 business days |
| SEV-3 | If error budget impact > 10% or customer-reported | Within 10 business days |
| SEV-4 | Never | — |

### Postmortem Timeline Template

```markdown
## Incident: [Title] — [YYYY-MM-DD]

**Severity:** SEV-X
**Duration:** HH:MM (from detection to resolution)
**Customer Impact:** [number of customers affected, what they experienced]
**Error Budget Impact:** [X% of 28-day budget consumed by this incident]
**IC:** @[name]
**Author:** @[name]

---

## Summary

[2-3 sentence summary of what happened and the business impact]

## Timeline (all times UTC)

| Time | Event |
|---|---|
| HH:MM | First alert fired: [alert name] |
| HH:MM | On-call acknowledged |
| HH:MM | Severity declared: SEV-X |
| HH:MM | Incident channel opened: #inc-... |
| HH:MM | Root cause identified: [brief] |
| HH:MM | Mitigation applied: [what was done] |
| HH:MM | Service restored, monitoring |
| HH:MM | Incident resolved |

## Detection

- How was the incident detected? (Alert / Customer report / Internal observation)
- Time from impact start to detection: [X minutes]
- Was detection fast enough? If not, what alert is missing?

## Root Cause

[Technical root cause — be specific]

## Contributing Factors

[What made this possible, what made it worse, what delayed resolution]

## 5 Whys Analysis

1. **Why** did [symptom] happen? → Because [cause 1]
2. **Why** did [cause 1] happen? → Because [cause 2]
3. **Why** did [cause 2] happen? → Because [cause 3]
4. **Why** did [cause 3] happen? → Because [cause 4]
5. **Why** did [cause 4] happen? → Because [root cause]

## What Went Well

- [Things that worked: fast detection, effective runbook, good coordination]

## What Could Be Improved

- [Things that didn't work: slow escalation, missing runbook, gaps in monitoring]

## Action Items

| Priority | Action | Owner | Due Date | Ticket |
|---|---|---|---|---|
| P1 | [Prevent recurrence] | @name | YYYY-MM-DD | JIRA-XXX |
| P2 | [Improve detection] | @name | YYYY-MM-DD | JIRA-XXX |
| P3 | [Improve process] | @name | YYYY-MM-DD | JIRA-XXX |
```

### Postmortem Meeting

**Attendees:** IC, responders, service owners, Engineering Manager

**Agenda (60 minutes):**

1. **Timeline review** (15 min) — Walk through the timeline, fill in gaps
2. **Root cause and contributing factors** (15 min) — 5 Whys analysis
3. **What went well / what could be improved** (15 min) — Open discussion, blameless
4. **Action items** (15 min) — Assign owners and due dates, prioritise

**Ground rules:**

- Blameless — focus on systems and processes, not individuals
- Assume good intent — everyone involved was doing their best with the information they had
- Specificity — "we need better monitoring" is not an action item; "add alert for connection pool > 80%" is

### Action Item Tracking

- All action items are created as tickets (GitHub Issues or Jira) within 24 hours of the postmortem meeting
- P1 items (prevent recurrence) must be completed within 1 sprint
- P2 items (improve detection) must be completed within 2 sprints
- P3 items (improve process) are added to the backlog and prioritised normally
- Action item completion is reviewed in the monthly SLO review (see [sli-slo-definitions.md](../observability/sli-slo-definitions.md))

---

## Incident Tooling

| Tool | Purpose | Access |
|---|---|---|
| **PagerDuty** | Alert routing, on-call scheduling, escalation | All SRE + engineering on-call |
| **Slack** | Incident coordination, communication | All engineering |
| **Grafana** | Metrics dashboards, SLO status, deployment annotations | All engineering (read), SRE (write) |
| **Loki** | Log search and analysis | All engineering |
| **Tempo** | Distributed trace analysis | All engineering |
| **ArgoCD** | Deployment status, rollback execution | SRE + Tech Leads (admin), Engineers (read) |
| **AWS Console** | EC2, RDS, EKS status checks | SRE (production access), Engineers (read-only) |
| **Statuspage.io** | External status communication | Communications Lead + SRE |

### Grafana Dashboard Quick Links During Incidents

| Dashboard | When to Use |
|---|---|
| Platform Overview | First dashboard to check — SLO status, error budgets, active alerts |
| API Service | API error rate spikes, latency issues |
| AI Inference | AI service failures, queue depth issues |
| EKS Cluster | Node issues, pod scheduling problems, resource utilisation |
| Database | Connection exhaustion, query latency, replication lag |

See [monitoring-framework.md](../observability/monitoring-framework.md#dashboard-structure) for full dashboard details.

---

## Practice: Game Days and Chaos Engineering

### Game Day Schedule

| Frequency | Exercise | Scope |
|---|---|---|
| **Monthly** | Tabletop exercise | Walk through a scenario verbally — test decision-making and communication |
| **Quarterly** | Live incident simulation | Inject a failure in staging and run the full incident response process |
| **Bi-annually** | Chaos engineering in production | Controlled fault injection in production (with customer notification if needed) |

### Game Day Scenarios (Rotating)

1. **API service outage** — Simulate complete API failure, practice rollback and customer communication
2. **Database failover** — Trigger Aurora failover, validate application reconnection and data integrity
3. **Node failure cascade** — Terminate multiple EC2 instances simultaneously, validate Karpenter recovery
4. **Dependency failure** — Simulate external API timeout, validate circuit breaker behaviour
5. **Security incident** — Simulate credential exposure, practice rotation and audit procedures
6. **Region degradation** — Simulate AZ failure, validate pod distribution and traffic routing

### Chaos Engineering Tools

| Tool | Use Case |
|---|---|
| **Litmus Chaos** | Kubernetes-native chaos experiments (pod kill, network delay, CPU stress) |
| **AWS Fault Injection Simulator (FIS)** | AWS resource failures (instance termination, AZ impairment, API throttling) |
| **Custom scripts** | Application-level fault injection via feature flags |

### Chaos Engineering Rules

1. **Never in production without a documented hypothesis and rollback plan**
2. **Start small** — single pod kill before multi-node termination
3. **Business hours only** — full engineering team available
4. **Monitor continuously** — all Grafana dashboards open during experiments
5. **Stop immediately** if customer impact exceeds the agreed blast radius
6. **Document findings** — every chaos experiment gets a write-up with findings and action items

---

## References

- [GitOps Workflow](./gitops-workflow.md) — Rollback procedures, deployment pipeline, incident lifecycle summary
- [Change Control](./change-control.md) — Emergency change process, change freeze during incidents
- [Monitoring Framework](../observability/monitoring-framework.md) — Alert configuration, dashboards, log management
- [Alerting Runbooks](../observability/alerting-runbooks.md) — Specific alert investigation and remediation steps
- [SLI/SLO Definitions](../observability/sli-slo-definitions.md) — Error budget policy, SLO review cadence
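
As an illustration, the Severity Decision Guide near the top of this document can be read as an ordered rule check. The following is a minimal Python sketch under stated assumptions: the `Signals` fields and the `classify` function are hypothetical names and not part of any existing tooling, and the SEV-4 fallback is an assumption because the guide lists no explicit SEV-4 rule.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical triage inputs; field names are illustrative only."""
    customer_cannot_use_platform: bool = False
    burn_rate: float = 0.0  # error budget burn-rate multiple (e.g. 14.4)
    security_alert: bool = False
    error_rate: float = 0.0  # fraction of failed requests, 0.0 to 1.0
    noncritical_component_degraded: bool = False


def classify(s: Signals) -> str:
    """Return the highest severity whose decision-guide rule matches."""
    # SEV-1 rules: customer lockout, fast budget burn, or a security alert.
    if s.customer_cannot_use_platform or s.burn_rate > 14.4 or s.security_alert:
        return "SEV-1"
    # SEV-2 rule: error rate above 5% while the service still partly works.
    if s.error_rate > 0.05:
        return "SEV-2"
    # SEV-3 rule: a single non-critical component is degraded.
    if s.noncritical_component_degraded:
        return "SEV-3"
    # Assumption of this sketch: nothing matched, so treat it as SEV-4.
    return "SEV-4"
```

For example, `classify(Signals(error_rate=0.06))` returns `"SEV-2"`. Rules are checked from most to least severe, so an incident matching several rules is declared at the highest matching severity, in line with the guide's "escalate up" principle.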

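
The PagerDuty Escalation Policy listed under On-Call Protocol can also be expressed as a small timeline calculation. This sketch assumes each level is paged as soon as the previous level's timeout elapses without acknowledgement; `POLICY` and `paged_so_far` are hypothetical names used for illustration, and the authoritative policy is whatever is configured in PagerDuty, not code like this.

```python
# (level, who is paged, minutes to wait before escalating further)
# Values are copied from the escalation policy in this document.
POLICY = [
    (1, "Primary on-call", 5),
    (2, "Secondary on-call", 5),
    (3, "Engineering Manager", 10),
    (4, "VP Engineering", None),  # final level, SEV-1 only
]


def paged_so_far(minutes_unacknowledged: float, severity: str) -> list[str]:
    """List everyone paged after the alert has gone unacknowledged this long."""
    paged = []
    level_starts_at = 0  # minute at which the current level gets paged
    for level, who, timeout in POLICY:
        if level == 4 and severity != "SEV-1":
            break  # the VP level applies to SEV-1 incidents only
        if minutes_unacknowledged < level_starts_at:
            break  # this level has not been reached yet
        paged.append(who)
        if timeout is None:
            break  # no further levels
        level_starts_at += timeout
    return paged
```

Under these assumptions, an unacknowledged SEV-1 reaches the VP Engineering level 20 minutes after the first page (5 + 5 + 10), while a SEV-2 stops at the Engineering Manager.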