# SRO-001 On-Call & Incident Response
On-call and incident management is the practice of organizing 24/7 support for services, responding to incidents quickly, and resolving them effectively to minimize impact on users.
---
## On-Call Organization
### Team Structure
```yaml
sre_organization:
  on_call_structure:
    primary_on_call:
      title: "Primary On-Call Engineer"
      responsibilities:
        - first_response_to_alerts
        - initial_incident_triage
        - communication_with_stakeholders
        - escalation_decisions
      shift_duration: "12_hours"  # or 24 hours
      timezone: "primary"
    secondary_on_call:
      title: "Secondary On-Call Engineer"
      responsibilities:
        - escalation_from_primary
        - critical_incidents
        - complex_troubleshooting
        - technical_decisions
      availability: "15_minutes_to_respond"
    incident_commander:
      title: "Incident Commander (for major incidents)"
      responsibilities:
        - overall_incident_coordination
        - stakeholder_communication
        - resource_coordination
        - decision_making
      activated: "for_sev1_and_sev2_incidents"
    support_engineer:
      title: "Support Engineer (optional)"
      responsibilities:
        - l1_l2_support_tasks
        - routine_operations
        - documentation
      escalation_path: "primary -> secondary"
```
### On-Call Schedule
**Weekly Rotation:**
```
Week 1: Alice (Primary), Bob (Secondary)
Week 2: Charlie (Primary), David (Secondary)
Week 3: Eve (Primary), Frank (Secondary)
Week 4: Grace (Primary), Henry (Secondary)
```
**Daily Rotation:**
```
Monday: Alice (Day), Bob (Night)
Tuesday: Charlie (Day), David (Night)
Wednesday: Eve (Day), Frank (Night)
Thursday: Grace (Day), Henry (Night)
Friday: Ivan (Day), Judy (Night)
Saturday: Kevin (Day), Laura (Night)
Sunday: Mike (Day), Nancy (Night)
```
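A rotation like the weekly one above can be generated rather than maintained by hand. The following is a minimal sketch, assuming engineers are listed in (primary, secondary) order and that the rotation simply wraps around; the function name `weekly_rotation` and the start date are illustrative, not part of any scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(engineers, start, weeks):
    """Pair engineers as (primary, secondary) and assign one pair per week."""
    if len(engineers) < 2 or len(engineers) % 2:
        raise ValueError("need an even number of engineers (at least two)")
    # Consecutive names form a pair: [A, B, C, D] -> [(A, B), (C, D)]
    pairs = [(engineers[i], engineers[i + 1]) for i in range(0, len(engineers), 2)]
    schedule = []
    # cycle() restarts from the first pair once every pair has had a week
    for week, (primary, secondary) in zip(range(weeks), cycle(pairs)):
        schedule.append({
            "week_start": start + timedelta(weeks=week),
            "primary": primary,
            "secondary": secondary,
        })
    return schedule


team = ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Henry"]
schedule = weekly_rotation(team, date(2024, 1, 1), weeks=5)  # hypothetical start date
for shift in schedule:
    print(shift["week_start"], shift["primary"], shift["secondary"])
```

With eight engineers the fifth week wraps back to the first pair. In practice the tools listed below handle overrides, holidays, and swaps, which this sketch ignores.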
**Tools:**
- PagerDuty
- VictorOps / Splunk On-Call
- Opsgenie
- Grafana OnCall
---
## On-Call Process
### Before Shift
```bash
#!/bin/bash
# pre-shift-checklist.sh
echo "Pre-Shift Checklist"
# 1. Check work environment
# - Check laptop, internet, phone
# - Verify VPN access
# - Test access to monitoring systems
# 2. Review recent incidents
# - Review last 7 days of alerts
# - Check unresolved issues
# - Review trending problems
# 3. Sync with previous on-call
# - Pass context
# - Check action items
# - Review runbooks
# 4. Set up notifications
# - Check pager/mobile
# - Verify Slack/email
# - Test escalation chains
```
### During Shift
```bash
#!/bin/bash
# on-call-whats-happening.sh
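# NOTE: the URLs below are placeholders; the real PagerDuty, Datadog, and
# Grafana APIs use different hosts and require authentication.
echo "Current Situation"
# Active Incidents
echo "=== Active Incidents ==="
curl -s https://pagerduty.com/api/incidents | jq '.incidents[] | select(.status=="triggered")'
# Recent Alerts
printf '\n=== Recent Alerts (Last Hour) ===\n'
# Assumption: .date is an ISO 8601 UTC timestamp; `date -d` is GNU date syntax.
since="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)"
curl -s https://datadog.com/api/alerts | jq --arg since "$since" '.alerts[] | select(.date > $since)'
# System Health
printf '\n=== System Health ===\n'
curl -s https://grafana.com/api/dashboards/health | jq '.'
# Escalation Policy
printf '\n=== Current On-Call ===\n'
curl -s https://pagerduty.com/api/oncalls | jq '.oncalls[].user.summary'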
```
### After Shift
```bash
#!/bin/bash
# post-shift-handover.sh
echo "Shift Handover Summary"
echo "Date: $(date)"
# Incident Summary (printf is used because plain echo does not expand \n in bash)
printf '\n=== Incidents During Shift ===\n'
echo "Total: 3"
echo "Sev1: 0"
echo "Sev2: 1 (Payment gateway latency)"
echo "Sev3: 2 (Database backup delay, Cache refresh)"
# Action Items
printf '\n=== Action Items ===\n'
echo "1. Fix database connection pooling (Jira TICKET-123)"
echo "2. Update runbook for cache issues (assigned: Bob)"
echo "3. Review payment service alerts (trending)"
# Notes for next shift
printf '\n=== Notes for Next On-Call ===\n'
echo "- Database connections spiking around 15:00 UTC"
echo "- New deployment scheduled for tomorrow"
echo "- Increased latency in EU region"
```
---
## Incident Severity Levels
### Severity 1 (Sev1) - Critical
```yaml
sev1:
  definition: "Complete service unavailability for most users"
  examples:
    - website_down: "Website is completely down"
    - payment_system_down: "Unable to process payments"
    - data_loss: "Data loss"
    - major_security_incident: "System breach"
  response:
    response_time: "immediate"  # < 5 minutes
    resolution_target: "4_hours"
    participants:
      - primary_on_call
      - secondary_on_call
      - incident_commander
      - engineering_manager
      - stakeholders
  communication:
    frequency: "30_minutes_to_stakeholders"
    channels: ["incident_management_tool", "slack_war_room", "email", "conference_bridge"]
    escalation: "immediate_vp_eng"
  postmortem:
    required: true
    timeline: "5_business_days"
```
### Severity 2 (Sev2) - Major
```yaml
sev2:
  definition: "Significant service degradation or important feature unavailability"
  examples:
    - partial_outage: "50% of requests fail"
    - severe_performance: "P99 latency > 5000ms"
    - backup_failures: "Backup failures"
    - degraded_functionality: "Key features not working"
  response:
    response_time: "15_minutes"
    resolution_target: "8_hours"
    participants:
      - primary_on_call
      - secondary_on_call
      - incident_commander
      - relevant_team_lead
  communication:
    frequency: "1_hour"
    channels: ["incident_tool", "slack"]
    escalation: "on_slo_breach"
  postmortem:
    required: true
    timeline: "10_business_days"
```
### Severity 3 (Sev3) - Moderate
```yaml
sev3:
  definition: "Minimal user impact or workaround available"
  examples:
    - minor_degradation: "10% of requests slower"
    - non_critical_functionality: "'Nice to have' feature"
    - scheduled_task_failure: "Background task errors"
    - monitoring_gap: "Data not being collected"
  response:
    response_time: "30_minutes"
    resolution_target: "24_hours"
    participants:
      - primary_on_call
      - optional: secondary_on_call
  communication:
    frequency: "as_needed"
    channels: ["incident_tool"]
    escalation: "if_not_resolved_in_24h"
  postmortem:
    required: false
    timeline: "optional"
```
### Severity 4 (Sev4) - Low
```yaml
sev4:
  definition: "Cosmetic or informational issues, no user impact"
  examples:
    - cosmetic_issues: "Wrong color in UI"
    - doc_updates: "Documentation inaccuracy"
    - low_priority_alerts: "Monitoring warnings"
    - questions: "Technical questions"
  response:
    response_time: "business_hours"
    resolution_target: "72_hours"
    participants:
      - on_call_as_needed
  communication:
    frequency: "none"
    channels: ["ticket_system"]
    escalation: "rare"
  postmortem:
    required: false
```
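The response and resolution targets in the four severity definitions can be checked mechanically. The sketch below copies those targets into a lookup table and reports which ones an incident has missed. It assumes Sev1 "immediate" means 5 minutes (per the comment in the Sev1 block) and treats Sev4 "business_hours" as having no fixed response deadline; `breached_targets` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Targets copied from the severity definitions above.
# Sev4 has no fixed response deadline ("business_hours"), so it is None.
SEVERITY_TARGETS = {
    1: {"response": timedelta(minutes=5), "resolution": timedelta(hours=4)},
    2: {"response": timedelta(minutes=15), "resolution": timedelta(hours=8)},
    3: {"response": timedelta(minutes=30), "resolution": timedelta(hours=24)},
    4: {"response": None, "resolution": timedelta(hours=72)},
}


def breached_targets(severity, started_at, acknowledged_at, now):
    """Return the list of targets ("response", "resolution") already missed.

    acknowledged_at may be None if nobody has acknowledged the incident yet.
    """
    targets = SEVERITY_TARGETS[severity]
    breaches = []
    response = targets["response"]
    if response is not None:
        # An unacknowledged incident is measured against the current time
        responded_by = acknowledged_at or now
        if responded_by - started_at > response:
            breaches.append("response")
    if now - started_at > targets["resolution"]:
        breaches.append("resolution")
    return breaches


start = datetime(2024, 1, 15, 14, 30)
print(breached_targets(2, start, datetime(2024, 1, 15, 14, 32), datetime(2024, 1, 15, 16, 45)))
```

The example call uses the Sev2 timings from the incident report later in this document: acknowledged in 2 minutes and resolved in 2 hours 15 minutes, so neither target is missed.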
---
## Incident Lifecycle
```
Alert/Detection → Triage → Response → Investigation → Resolution → Post-Incident
```
### 1. Detection
```python
class IncidentDetection:
    def __init__(self, monitoring_tools):
        self.monitoring = monitoring_tools

    def detect_incident(self):
        """Detect incidents from different sources"""
        # From monitoring (Prometheus, Datadog)
        alerts = self.monitoring.get_active_alerts(
            severity=['warning', 'critical']
        )
        # From logs (ERROR, FATAL)
        log_errors = self.monitoring.get_log_spikes(
            level=['error', 'fatal'],
            threshold=100  # errors per minute
        )
        # From user complaints
        user_reports = self.monitoring.get_user_complaints(
            source=['support_tickets', 'twitter', 'status_page']
        )
        # Correlation - merge related problems
        # (correlate_signals is assumed to be implemented elsewhere)
        incidents = self.correlate_signals(
            alerts + log_errors + user_reports
        )
        return incidents
```
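The detection class calls `correlate_signals` without showing how related signals are merged. One simple approach, sketched here under assumptions, is to group signals that hit the same service within a short time window of each other. The `Signal` type, its fields, and the 10-minute window are all hypothetical; real correlation usually also considers dependencies between services.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Signal:
    service: str       # affected service name
    source: str        # e.g. "alert", "logs", "user_report"
    seen_at: datetime  # when the signal was observed


def correlate_signals(signals, window=timedelta(minutes=10)):
    """Group signals for the same service that arrive within `window`
    of the previous signal in that service's most recent group."""
    groups = []
    latest_group = {}  # service -> most recent group for that service
    for signal in sorted(signals, key=lambda s: s.seen_at):
        group = latest_group.get(signal.service)
        if group and signal.seen_at - group[-1].seen_at <= window:
            group.append(signal)
        else:
            group = [signal]
            groups.append(group)
            latest_group[signal.service] = group
    return groups


signals = [
    Signal("payments", "alert", datetime(2024, 1, 15, 14, 30)),
    Signal("payments", "logs", datetime(2024, 1, 15, 14, 34)),
    Signal("search", "user_report", datetime(2024, 1, 15, 14, 35)),
    Signal("payments", "alert", datetime(2024, 1, 15, 15, 30)),
]
groups = correlate_signals(signals)
for group in groups:
    print(group[0].service, [s.source for s in group])
```

Each resulting group is a candidate incident. The late payments alert forms its own group because it arrives well after the window has closed.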
### 2. Triage (Assessment and Classification)
```python
class IncidentTriage:
    def triage_incident(self, incident):
        """Assess incident and determine severity"""
        # Impact assessment
        impact = self.calculate_impact(
            affected_users=incident.affected_users,
            affected_requests=incident.error_rate,
            functionality=incident.affected_components
        )
        # Severity classification
        if impact == 'total_outage' or incident.security_breach:
            severity = 1
        elif impact == 'significant_degradation' or incident.revenue_impact:
            severity = 2
        elif impact == 'minor_degradation':
            severity = 3
        else:
            severity = 4
        # Workflow based on severity
        self.start_incident_workflow(severity, incident)
        return {
            'severity': severity,
            'impact': impact,
            'incident_id': self.generate_id()
        }
```
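The triage code depends on `calculate_impact` returning one of the labels `total_outage`, `significant_degradation`, or `minor_degradation`, but the mapping itself is not given. A minimal sketch follows, assuming both inputs are fractions between 0 and 1 and using made-up thresholds (90%, 30%, 5%) that each team would need to tune. It takes fewer inputs than the call in the class above, so it is a simplification rather than a drop-in implementation.

```python
def calculate_impact(affected_user_fraction, error_rate):
    """Map raw measurements to the impact labels used during triage.

    Both arguments are fractions in [0, 1]. Thresholds are illustrative
    assumptions, not values from the source document.
    """
    for value in (affected_user_fraction, error_rate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be fractions between 0 and 1")
    # Classify on whichever measurement looks worse
    worst = max(affected_user_fraction, error_rate)
    if worst >= 0.9:
        return "total_outage"
    if worst >= 0.3:
        return "significant_degradation"
    if worst >= 0.05:
        return "minor_degradation"
    return "negligible"


print(calculate_impact(affected_user_fraction=0.2, error_rate=0.5))
```

With these thresholds, the Sev2 example "50% of requests fail" maps to `significant_degradation` and the Sev3 example "10% of requests slower" maps to `minor_degradation`, which matches the severity tables.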
### 3. Response (Reaction)
```yaml
incident_response:
  declare_incident:
    actions:
      - create_incident_channel: "#inc-2024-001"
      - start_incident_timer: true
      - assign_incident_commander: based_on_severity
      - notify_oncall: primary_and_secondary
      - page_if_sev1_or_sev2: true
  initial_communication:
    template: |
      INCIDENT {{ incident.id }} - {{ incident.severity }}
      Title: {{ incident.title }}
      Started: {{ incident.started_at }}
      Affected: {{ incident.affected_services }}
      Commander: {{ incident.commander }}
      Channel: {{ incident.slack_channel }}
      https://incidents.example.com/{{ incident.id }}
    recipients:
      - engineering_team_slack
      - oncall_pagerduty
      - stakeholders_if_sev1
      - status_page_if_customer_impact
```
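The `{{ incident.field }}` placeholders in the template above follow the style of common template engines. To illustrate how such a message is filled in, here is a small standard-library sketch that substitutes placeholders from a dictionary. The `render` function and the sample values are illustrative assumptions; a production setup would more likely use a real template engine.

```python
import re

TEMPLATE = """INCIDENT {{ incident.id }} - {{ incident.severity }}
Title: {{ incident.title }}
Commander: {{ incident.commander }}
Channel: {{ incident.slack_channel }}
https://incidents.example.com/{{ incident.id }}"""

PLACEHOLDER = re.compile(r"\{\{\s*incident\.(\w+)\s*\}\}")


def render(template, incident):
    """Replace every {{ incident.<field> }} with the matching dict value.

    A missing field raises KeyError, so an incomplete announcement
    fails loudly instead of going out with blanks.
    """
    return PLACEHOLDER.sub(lambda match: str(incident[match.group(1)]), template)


incident = {  # sample values for illustration only
    "id": "INC-2024-001",
    "severity": "Sev2",
    "title": "Payment gateway latency",
    "commander": "Alice",
    "slack_channel": "#inc-2024-001",
}
message = render(TEMPLATE, incident)
print(message)
```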
### 4. Investigation
```python
class IncidentInvestigation:
    def investigate(self, incident):
        """Investigate incident"""
        steps = []
        # Collect data
        steps.append(self.collect_metrics(
            timeframe=incident.timeframe,
            metrics=['error_rate', 'latency', 'throughput']
        ))
        steps.append(self.collect_logs(
            services=incident.affected_services,
            level=['error', 'warn', 'fatal'],
            timeframe=incident.timeframe
        ))
        steps.append(self.collect_traces(
            error_spans=incident.error_spans,
            timeframe=incident.timeframe
        ))
        # Analysis
        possible_causes = self.analyze_patterns(steps)
        # Test hypotheses until one is confirmed
        for cause in possible_causes:
            verified = self.test_hypothesis(cause, incident)
            if verified:
                incident.root_cause = cause
                break
        return incident
```
### 5. Resolution (Solution)
```bash
#!/bin/bash
# incident-resolution.sh
echo "Incident Resolution Process"
# 1. Test the fix
echo "1. Testing fix in staging..."
if ./test-fix.sh --environment=staging; then
    echo "Fix validated in staging"
else
    echo "Fix failed in staging"
    exit 1
fi
# 2. Deploy fix
echo "2. Deploying fix to production..."
./deploy.sh --service="$AFFECTED_SERVICE" --check-only
# 3. Monitor after deployment
echo "3. Monitoring after fix..."
./monitor-deployment.sh --service="$AFFECTED_SERVICE" --duration=15m
# 4. Verify recovery
echo "4. Verifying recovery..."
./verify-recovery.sh --incident="$INCIDENT_ID"
# 5. Communication
echo "5. Notifying stakeholders..."
./notify.sh --incident="$INCIDENT_ID" --message="resolved"
```
### 6. Post-Incident (Post-Incident Analysis)
```markdown
# Incident Report Template
## Summary
- **Incident ID:** INC-2024-001
- **Date:** 2024-01-15 14:30 UTC
- **Duration:** 2 hours 15 minutes
- **Severity:** Sev2
- **Affected:** Payment service
## Timeline
- 14:30: Alert triggered (error rate > 50%)
- 14:32: On-call acknowledged
- 14:35: Incident declared, war room opened
- 14:45: Root cause identified (DB connection pool exhaustion)
- 15:00: Mitigation deployed (increase pool size)
- 16:45: Service fully recovered
## Root Cause
Database connection pool size insufficient after deployment of v2.1
## Impact
- 30% of payment transactions failed
- ~$50,000 revenue impact
- 500+ affected customers
## Lessons Learned
1. Load testing didn't simulate real traffic patterns
2. Connection pool monitoring missing
3. Rollback procedure unclear
## Action Items
- [ ] Add connection pool metrics (Owner: Alice, Due: 2024-01-22)
- [ ] Improve load testing scenarios (Owner: Bob, Due: 2024-01-29)
- [ ] Document rollback procedures (Owner: Charlie, Due: 2024-01-20)
```
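The timeline in a report like this one is enough to derive the headline response metrics. As a rough sketch, the snippet below takes the timestamps from the example timeline and computes minutes between events. It assumes all events fall on the same day; the event key names and `minutes_between` are illustrative.

```python
from datetime import datetime

# Timestamps taken from the example timeline above (all UTC, same day)
TIMELINE = {
    "alert_triggered": "14:30",
    "acknowledged": "14:32",
    "declared": "14:35",
    "root_cause_identified": "14:45",
    "mitigation_deployed": "15:00",
    "recovered": "16:45",
}


def minutes_between(timeline, start_event, end_event):
    """Whole minutes elapsed between two named timeline events."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[start_event], fmt)
    end = datetime.strptime(timeline[end_event], fmt)
    return int((end - start).total_seconds() // 60)


time_to_acknowledge = minutes_between(TIMELINE, "alert_triggered", "acknowledged")
time_to_mitigate = minutes_between(TIMELINE, "alert_triggered", "mitigation_deployed")
time_to_resolve = minutes_between(TIMELINE, "alert_triggered", "recovered")
print(time_to_acknowledge, time_to_mitigate, time_to_resolve)
```

The resolve time comes out at 135 minutes, which agrees with the "2 hours 15 minutes" duration stated in the summary.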
---
## Communication
### Incident Communication Matrix
```yaml
communication_channels:
  internal:
    team:
      # Quoted because an unquoted " #" starts a YAML comment
      channel: "slack #incident-war-room"
      frequency: real_time
      content: technical_details
    company:
      channel: "slack #general-updates"
      frequency: hourly
      content: high_level_status
  external:
    customers:
      channel: status_page
      frequency: every_30_minutes_sev1
      content: impact_summary
    enterprise_clients:
      channel: email + phone
      frequency: every_15_minutes_sev1
      content: detailed_status
    partners:
      channel: status_page + email
      frequency: hourly
      content: service_impact
```
### Status Page Updates
```markdown
# Status Page Template
## In Progress (14:45 UTC)
**Performance Degradation - Payment Service**
**Status:** Investigating
**Started:** 14:30 UTC
**Impact:** Some users experiencing slow payment processing
**ETA:** Under investigation
We're currently investigating reports of slow payment processing.
Our engineering team is working to identify the root cause.
Updates will be posted here every 30 minutes.
---
## Update (15:15 UTC)
**Status:** Identified
We've identified the issue: Database connection pool exhaustion.
**Mitigation:** We're increasing connection pool size
**Expected Resolution:** 15:45 UTC
---
## Resolved (16:45 UTC)
**Status:** Fully Resolved
Service is fully recovered. All payment processing back to normal.
**Root Cause:** Database connection pool insufficient after new deployment
**Duration:** 2 hours 15 minutes
**Next Steps:** Post-incident review scheduled for tomorrow
```
---
## Tools
### PagerDuty
```yaml
# escalation_policy.yaml
escalation_policy:
  name: "Critical Service On-Call"
  repeat: 3
  escalation_rules:
    - level: 1
      timeout_minutes: 5
      targets:
        - type: user
          id: primary_oncall
    - level: 2
      timeout_minutes: 10
      targets:
        - type: user
          id: secondary_oncall
        - type: schedule
          id: engineering_team
    - level: 3
      timeout_minutes: 15
      targets:
        - type: user
          id: engineering_manager
        - type: user
          id: vp_engineering
```
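To see how the timeouts in this policy play out, the sketch below works out which escalation level is being notified after a given number of minutes without acknowledgement. It assumes each level is notified for its own `timeout_minutes` before the next takes over, and it reads `repeat: 3` as three passes through the rules in total. Both are interpretations of the config above, not documented PagerDuty behavior, and `active_level` is a hypothetical helper.

```python
# Levels and timeouts copied from the escalation policy above
ESCALATION_RULES = [
    {"level": 1, "timeout_minutes": 5},
    {"level": 2, "timeout_minutes": 10},
    {"level": 3, "timeout_minutes": 15},
]


def active_level(minutes_unacknowledged, rules=ESCALATION_RULES, passes=3):
    """Return the level currently being notified, or None once the
    policy has run through all of its passes without acknowledgement."""
    if minutes_unacknowledged < 0:
        raise ValueError("elapsed minutes cannot be negative")
    cycle_length = sum(rule["timeout_minutes"] for rule in rules)
    if minutes_unacknowledged >= cycle_length * passes:
        return None
    # Position within the current pass through the rules
    elapsed = minutes_unacknowledged % cycle_length
    for rule in rules:
        if elapsed < rule["timeout_minutes"]:
            return rule["level"]
        elapsed -= rule["timeout_minutes"]


for minutes in (0, 5, 15, 30, 90):
    print(minutes, active_level(minutes))
```

Under these assumptions an unacknowledged page reaches the engineering manager and VP after 15 minutes, and the policy stops paging after 90 minutes.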
### Incident Management Tool
```python
class IncidentManagement:
    def __init__(self, slack, pagerduty, status_page):
        self.slack = slack
        self.pagerduty = pagerduty
        self.status_page = status_page

    def create_incident(self, incident):
        """Create incident in all systems"""
        # Create Slack channel
        channel = self.slack.create_channel(
            name=f"inc-{incident.id}",
            purpose=f"Incident: {incident.title}"
        )
        # Create PagerDuty incident
        pd_incident = self.pagerduty.create_incident(
            title=incident.title,
            urgency=incident.severity,
            assignee=incident.commander
        )
        # Create Status Page incident (Sev1 and Sev2 only)
        sp_incident = None  # stays None for Sev3/Sev4, avoiding a NameError below
        if incident.severity <= 2:
            sp_incident = self.status_page.create_incident(
                name=incident.title,
                impact=incident.impact_level,
                components=incident.affected_components
            )
        # Log in system (IncidentRecord is assumed to be defined elsewhere)
        return IncidentRecord(
            id=incident.id,
            slack_channel=channel,
            pagerduty_id=pd_incident.id,
            status_page_id=sp_incident.id if sp_incident else None
        )
```
---
## On-Call and Incident Metrics
```python
ON_CALL_METRICS = {
    # Time to detection
    'incident_detection_time': 'Time from problem to alert',
    # Time to response
    'incident_acknowledgement_time': 'Time to acknowledge',
    'on_call_response_time': 'On-call response time',
    # Time to resolution
    'incident_resolution_time': 'Time to full resolution',
    'incident_mitigation_time': 'Time to impact mitigation (MTTM)',
    # Performance
    'incident_count_by_severity': 'Incident count by severity',
    'on_call_alerts_per_shift': 'Alerts per shift',
    'false_positive_rate': 'False positive rate',
    # Workload
    'on_call_interruptions': 'Interruptions per shift',
    'after_hours_pages': 'Pages outside business hours',
    'weekend_incidents': 'Weekend incidents',
    # Quality
    'postmortems_completed': 'Postmortems completed',
    'action_items_completed': 'Action items completed',
    'recurring_incidents': 'Recurring incidents'
}
```
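Several of these metrics are simple aggregates over incident records. The following sketch computes mean acknowledgement time, mean resolution time, incident count by severity, and false positive rate. The record fields and the sample data are made up for illustration; only the first record mirrors the timings of the example incident report.

```python
from collections import Counter
from statistics import mean

# Hypothetical records; times are minutes since the alert fired
INCIDENTS = [
    {"severity": 2, "ack_minutes": 2, "resolve_minutes": 135, "false_positive": False},
    {"severity": 3, "ack_minutes": 10, "resolve_minutes": 60, "false_positive": False},
    {"severity": 3, "ack_minutes": 6, "resolve_minutes": 15, "false_positive": True},
]


def summarize(incidents):
    """Aggregate a list of incident records into a few on-call metrics.

    Timing and severity metrics ignore false positives so that noise
    does not flatter the response numbers.
    """
    real = [i for i in incidents if not i["false_positive"]]
    return {
        "mean_ack_minutes": mean(i["ack_minutes"] for i in real),
        "mean_resolve_minutes": mean(i["resolve_minutes"] for i in real),
        "incident_count_by_severity": dict(Counter(i["severity"] for i in real)),
        "false_positive_rate": (len(incidents) - len(real)) / len(incidents),
    }


summary = summarize(INCIDENTS)
for name, value in summary.items():
    print(name, value)
```

Excluding false positives from the timing averages is a design choice; some teams prefer to include them to reflect the real interruption load on the on-call engineer.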
---
*On-Call & Incident Management - 24/7 support and incident management practice*