AWS Triage Playbook

Purpose

This playbook provides step-by-step guidance for triaging AWS incidents, identifying root causes, and applying immediate remediation.

1. Initial Assessment

Identify the affected service (EC2, S3, Lambda, DynamoDB, etc.)
Check monitoring dashboards and logs:
- CloudWatch metrics
- CloudTrail logs
- Trusted Advisor alerts
Determine impact:
- Number of affected resources
- Business-critical impact
Set severity level:
- Sev 1: Critical outage
- Sev 2: Major degradation
- Sev 3: Minor impact
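
The assessment steps above can be sketched as two small shell helpers. This is a minimal sketch, assuming the AWS CLI is installed and configured; the function names and the impact keywords (`outage`, `degradation`, `minor`) are illustrative and not part of the playbook.

```shell
# List CloudWatch alarms currently in the ALARM state (name and reason only).
list_firing_alarms() {
  aws cloudwatch describe-alarms \
    --state-value ALARM \
    --query 'MetricAlarms[].[AlarmName,StateReason]' \
    --output text
}

# Map an impact keyword to the playbook's severity levels.
# The keywords are hypothetical labels chosen for this sketch.
severity_for() {
  case "$1" in
    outage)      echo "Sev 1" ;;
    degradation) echo "Sev 2" ;;
    minor)       echo "Sev 3" ;;
    *)           echo "unknown impact: $1" >&2; return 1 ;;
  esac
}
```

For example, `severity_for degradation` prints `Sev 2`, which can be recorded in the incident ticket alongside the output of `list_firing_alarms`.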

2. Evidence Collection

Gather logs from:
- CloudWatch
- Application logs
- CloudTrail (security group and IAM changes)
Record timestamps of error occurrence
Note any recent deployments or changes
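
One possible way to gather this evidence for a fixed incident window is sketched below. It assumes the window is given in epoch seconds, that application errors contain the literal string `ERROR`, and that the function names are hypothetical helpers rather than anything defined by the playbook.

```shell
# Convert epoch seconds to the epoch milliseconds that CloudWatch Logs expects.
to_millis() {
  echo "$(( $1 * 1000 ))"
}

# Pull ERROR lines from one log group within the incident window.
# $1 = log group name, $2 = start (epoch seconds), $3 = end (epoch seconds)
fetch_error_logs() {
  aws logs filter-log-events \
    --log-group-name "$1" \
    --start-time "$(to_millis "$2")" \
    --end-time "$(to_millis "$3")" \
    --filter-pattern "ERROR" \
    --query 'events[].[timestamp,message]' \
    --output text
}

# List CloudTrail events from one service in the same window.
# $1 = event source (e.g. iam.amazonaws.com), $2 = start, $3 = end (epoch seconds)
fetch_change_events() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=EventSource,AttributeValue=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --query 'Events[].[EventTime,EventName,Username]' \
    --output text
}
```

Redirecting each function's output to a timestamped file keeps the evidence together with the recorded error times.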

3. Containment

Stop further damage:
- Isolate affected resources
- Temporarily roll back recent configuration changes, if it is safe to do so
- Revert suspicious IAM changes or disable the affected credentials
Notify relevant teams and stakeholders
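
As an illustration of the containment steps for an EC2 instance and an IAM user, the sketch below assumes a quarantine security group with no rules has already been created. The function names are hypothetical, and whether these actions are safe depends on the workload.

```shell
# Record the security groups currently attached to an instance, so they can be restored.
# $1 = instance ID
current_groups() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[].Instances[].SecurityGroups[].GroupId' \
    --output text
}

# Replace all security groups on an instance with a single quarantine group.
# $1 = instance ID, $2 = ID of a pre-created security group with no rules (assumed to exist)
isolate_instance() {
  aws ec2 modify-instance-attribute \
    --instance-id "$1" \
    --groups "$2"
}

# Deactivate (not delete) an access key so it can be re-enabled if the change was legitimate.
# $1 = IAM user name, $2 = access key ID
deactivate_access_key() {
  aws iam update-access-key \
    --user-name "$1" \
    --access-key-id "$2" \
    --status Inactive
}
```

Running `current_groups` before `isolate_instance` preserves the original configuration for the later rollback and for the incident record.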

4. Investigation & Root Cause Analysis

Check configuration drift:
- VPC settings, security groups, subnets
Validate permissions:
- IAM policies, S3 bucket policies
Review service-specific logs:
- Lambda logs, EC2 system logs, RDS logs
Correlate changes with errors:
- Recent deploys
- Scheduled jobs or scripts
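
The checks above might be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the error time is supplied in epoch seconds, and the IAM policy simulator may not reflect every policy that applies at runtime, so its result is a hint rather than proof.

```shell
# List security groups that allow ingress from anywhere (0.0.0.0/0).
find_open_security_groups() {
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output text
}

# Ask IAM whether a principal is allowed to perform one action on one resource.
# $1 = principal ARN, $2 = action (e.g. s3:GetObject), $3 = resource ARN
check_permission() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names "$2" \
    --resource-arns "$3" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}

# List write (non-read-only) API calls in the lookback window before an error.
# $1 = error time (epoch seconds), $2 = lookback in seconds
changes_before_error() {
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=ReadOnly,AttributeValue=false \
    --start-time "$(( $1 - $2 ))" \
    --end-time "$1" \
    --query 'Events[].[EventTime,EventName,Username]' \
    --output text
}
```

For instance, `changes_before_error 1700003600 3600` lists the write calls made in the hour before the error, which can then be compared against recent deploys and scheduled jobs.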

5. Remediation

Apply fixes in a controlled environment first
Test impact on non-production resources
Implement fix in production after validation
Document steps taken

6. Post-Incident

Conduct retrospective / root cause review
Update playbook with lessons learned
Automate guardrails to prevent recurrence
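
One example of an automated guardrail is a CloudWatch alarm on the metric that would have caught the incident earlier. The sketch below assumes the incident involved a Lambda function and that an SNS topic for notifications already exists; the function name, alarm name prefix, and threshold are illustrative choices.

```shell
# Create an alarm that fires when a Lambda function reports any error in a 5-minute period.
# $1 = function name, $2 = ARN of an existing SNS topic to notify (assumed)
create_lambda_error_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "lambda-errors-$1" \
    --namespace AWS/Lambda \
    --metric-name Errors \
    --dimensions "Name=FunctionName,Value=$1" \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$2"
}
```

Treating missing data as not breaching keeps the alarm quiet while the function is idle; a stricter setting may suit functions that are expected to run continuously.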

7. Useful Commands & References

EC2

aws ec2 describe-instances
aws ec2 describe-security-groups

S3

aws s3 ls
aws s3api get-bucket-policy --bucket <bucket-name>

## Overview
This project demonstrates practical Cloud Support & CloudOps skills by working with AWS services such as VPC, DynamoDB, CloudTrail, RDS, IAM, EC2, CloudWatch, S3, Lambda.

## Skills Demonstrated
Automation, monitoring, incident response, troubleshooting, and Infrastructure as Code using Terraform/CloudFormation.

## Usage
Clone the repo and follow the scripts or Terraform configurations to deploy and test resources. The project is designed to simulate realistic AWS incidents.

## What I Learned
I gained hands-on experience troubleshooting AWS incidents, applying automation, monitoring with CloudWatch, and ensuring cloud reliability.
