Loading...
Loading...
Loading...
Complete guide for setting up and managing alerts in production FFmpeg RTMP deployments.
# Alerting Guide for FFmpeg RTMP
Complete guide for setting up and managing alerts in production FFmpeg RTMP deployments.
## Table of Contents
1. [Overview](#overview)
2. [Alert Categories](#alert-categories)
3. [Setup Instructions](#setup-instructions)
4. [Alert Configuration](#alert-configuration)
5. [Notification Channels](#notification-channels)
6. [Testing Alerts](#testing-alerts)
7. [Incident Response](#incident-response)
8. [Maintenance and Tuning](#maintenance-and-tuning)
---
## Overview
The FFmpeg RTMP system includes comprehensive alerting rules for:
- **Critical incidents** requiring immediate action (page on-call)
- **Warnings** that need investigation (notify team)
- **Performance degradation** for proactive monitoring (info/ticket)
### Alert Severity Levels
| Severity | Response | Notification | Example |
|----------|----------|--------------|---------|
| **critical** | Immediate (page) | PagerDuty + Slack | Master down, all workers offline |
| **warning** | Within 1 hour | Slack + Email | Single worker down, high queue |
| **info** | Next business day | Slack (monitoring) | Performance degradation, low utilization |
---
## Alert Categories
### Critical Alerts (Page Immediately)
**FFmpegMasterNodeDown**
- **Trigger:** Master unreachable for 2+ minutes
- **Impact:** Complete system outage
- **Action:** Check service, logs, restart if needed
**FFmpegAllWorkersDown**
- **Trigger:** Zero workers available for 5+ minutes
- **Impact:** No processing capacity
- **Action:** Check worker services, network connectivity
**FFmpegCriticalFailureRate**
- **Trigger:** >50% jobs failing for 10+ minutes
- **Impact:** System critically degraded
- **Action:** Check worker resources, logs, input files
**FFmpegQueueCritical**
- **Trigger:** Queue > 2000 jobs for 15+ minutes
- **Impact:** Severe delays, SLA violations
- **Action:** Add emergency capacity, check workers
**FFmpegMasterDiskCritical**
- **Trigger:** Master disk < 5% free for 5+ minutes
- **Impact:** Database writes failing
- **Action:** Clean logs, remove backups, expand disk
### Warning Alerts (Investigate)
**FFmpegWorkerNodeDown**
- **Trigger:** Individual worker down 5+ minutes
- **Impact:** Reduced capacity
- **Action:** Check service on worker node
**FFmpegHighFailureRate**
- **Trigger:** >10% jobs failing for 10+ minutes
- **Impact:** Elevated error rate
- **Action:** Investigate failed jobs, check resources
**FFmpegQueueWarning**
- **Trigger:** Queue > 500 jobs for 15+ minutes
- **Impact:** Increasing delays
- **Action:** Monitor growth, plan capacity
**FFmpegWorkerCapacityHigh**
- **Trigger:** Worker > 85% capacity for 30+ minutes
- **Impact:** Limited headroom
- **Action:** Consider adding capacity
**FFmpegHighJobLatency**
- **Trigger:** P95 queue wait > 5 minutes for 15+ minutes
- **Impact:** Slow job processing
- **Action:** Add workers or increase concurrency
**FFmpegWorkerDiskWarning**
- **Trigger:** Worker disk < 20% free for 10+ minutes
- **Impact:** May reject jobs soon
- **Action:** Clean temp files
**FFmpegMasterDiskWarning**
- **Trigger:** Master disk < 15% free for 10+ minutes
- **Impact:** Approaching critical
- **Action:** Clean logs, plan expansion
### Performance Alerts (Informational)
**FFmpegSlowJobExecution**
- **Trigger:** Jobs 50% slower than baseline
- **Impact:** Degraded performance
- **Action:** Investigate CPU, disk, network
**FFmpegLowWorkerUtilization**
- **Trigger:** Worker < 20% utilized for 2+ hours
- **Impact:** Inefficient resource use
- **Action:** Consider cost optimization
---
## Setup Instructions
### Prerequisites
- Prometheus installed and configured
- Alertmanager installed
- Notification channels configured (Slack, PagerDuty, email)
### Step 1: Deploy Alert Rules
**Option A: Docker Compose**
```bash
# Copy alert rules
cp docs/prometheus/ffmpeg-rtmp-alerts.yml deployment/prometheus/rules/
# Mount in docker-compose.yml
volumes:
- ./deployment/prometheus/rules:/etc/prometheus/rules:ro
# Update prometheus.yml
rule_files:
- '/etc/prometheus/rules/*.yml'
# Restart Prometheus
docker-compose restart prometheus
```
**Option B: System Installation**
```bash
# Copy rules to Prometheus directory
sudo cp docs/prometheus/ffmpeg-rtmp-alerts.yml /etc/prometheus/rules/
# Fix permissions
sudo chown prometheus:prometheus /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml
# Validate rules
promtool check rules /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml
# Update /etc/prometheus/prometheus.yml
rule_files:
- '/etc/prometheus/rules/*.yml'
# Reload Prometheus
sudo systemctl reload prometheus
# OR
curl -X POST http://localhost:9090/-/reload
```
### Step 2: Deploy Alertmanager Configuration
**Docker Compose:**
```bash
# Copy config
cp docs/prometheus/alertmanager.yml deployment/prometheus/
# Update docker-compose.yml
alertmanager:
image: prom/alertmanager:latest
volumes:
- ./deployment/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
ports:
- "9093:9093"
# Start Alertmanager
docker-compose up -d alertmanager
```
**System Installation:**
```bash
# Install Alertmanager
sudo apt-get install prometheus-alertmanager
# Copy config
sudo cp docs/prometheus/alertmanager.yml /etc/alertmanager/
# Set environment variables
echo 'SMTP_PASSWORD=your-password' | sudo tee -a /etc/default/alertmanager
echo 'PAGERDUTY_SERVICE_KEY=your-key' | sudo tee -a /etc/default/alertmanager
# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml
# Start service
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
```
### Step 3: Configure Prometheus to Use Alertmanager
**Add to prometheus.yml:**
```yaml
alerting:
alertmanagers:
- static_configs:
- targets:
- 'localhost:9093' # or alertmanager:9093 for Docker
```
**Reload Prometheus:**
```bash
# Docker
docker-compose restart prometheus
# System
sudo systemctl reload prometheus
```
---
## Alert Configuration
### Customizing Thresholds
Edit `docs/prometheus/ffmpeg-rtmp-alerts.yml`:
```yaml
# Example: Change queue critical threshold
- alert: FFmpegQueueCritical
expr: jobs_queued_total{job="ffmpeg-master"} > 3000 # Changed from 2000
for: 10m # Changed from 15m
```
### Adding Custom Alerts
```yaml
- name: custom_alerts
interval: 1m
rules:
- alert: CustomAlert
expr: your_metric > threshold
for: duration
labels:
severity: warning
component: custom
annotations:
summary: "Alert summary"
description: "Detailed description"
```
### Alert Label Best Practices
Always include these labels:
- `severity`: critical, warning, or info
- `component`: master, worker, queue, database, etc.
- `team`: Team responsible for response
---
## Notification Channels
### Slack Setup
1. **Create Slack App:**
```
https://api.slack.com/apps → Create New App
```
2. **Enable Incoming Webhooks:**
```
Features → Incoming Webhooks → Activate
```
3. **Add to Workspace:**
```
Add New Webhook to Workspace
Select channel: #ffmpeg-alerts-critical
Copy webhook URL
```
4. **Configure Alertmanager:**
```bash
# Save webhook URL
echo "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" | \
sudo tee /etc/alertmanager/slack_webhook_url
# Or set in alertmanager.yml
global:
slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```
**Channels to Create:**
- `#ffmpeg-alerts-critical` - Critical alerts (page team)
- `#ffmpeg-alerts` - Warning alerts
- `#ffmpeg-ops` - Operations team alerts
- `#ffmpeg-monitoring` - Performance/info alerts
### PagerDuty Setup
1. **Create Service:**
```
PagerDuty → Services → New Service
Name: FFmpeg RTMP Production
```
2. **Add Integration:**
```
Integrations → Add Integration
Type: Prometheus
Copy Integration Key
```
3. **Configure Alertmanager:**
```bash
export PAGERDUTY_SERVICE_KEY="your-integration-key"
# Add to /etc/default/alertmanager or docker-compose.yml
```
4. **Set Escalation Policy:**
```
Escalation Policies → Create Policy
Level 1: On-call engineer (immediate)
Level 2: Team lead (after 15 min)
Level 3: Manager (after 30 min)
```
### Email Setup
**Gmail Example:**
```yaml
global:
smtp_from: '[email protected]'
smtp_smarthost: 'smtp.gmail.com:587'
smtp_auth_username: '[email protected]'
smtp_auth_password: 'app-specific-password' # Generate in Gmail settings
smtp_require_tls: true
receivers:
- name: 'email-ops'
email_configs:
- to: '[email protected]'
headers:
Subject: '[FFmpeg RTMP] {{ .GroupLabels.alertname }}'
```
**SendGrid Example:**
```yaml
global:
smtp_smarthost: 'smtp.sendgrid.net:587'
smtp_auth_username: 'apikey'
smtp_auth_password: '${SENDGRID_API_KEY}'
```
---
## Testing Alerts
### Test Alert Rules
```bash
# Validate syntax
promtool check rules /etc/prometheus/rules/ffmpeg-rtmp-alerts.yml
# Test specific query
promtool query instant http://localhost:9090 \
'up{job="ffmpeg-master"} == 0'
```
### Test Alertmanager Config
```bash
# Validate config
amtool check-config /etc/alertmanager/alertmanager.yml
# Test routing
amtool config routes test \
--config.file=/etc/alertmanager/alertmanager.yml \
--tree \
severity=critical component=master
```
### Send Test Alert
**Manual test alert:**
```bash
curl -X POST http://localhost:9093/api/v1/alerts -d '[
{
"labels": {
"alertname": "TestAlert",
"severity": "warning",
"component": "test"
},
"annotations": {
"summary": "This is a test alert",
"description": "Testing the alerting pipeline"
},
"startsAt": "'$(date -Iseconds)'",
"endsAt": "'$(date -Iseconds -d '+5 minutes')'"
}
]'
```
**Trigger real alert (temporary):**
```bash
# Stop master to trigger alert
sudo systemctl stop ffmpeg-master
# Wait 2+ minutes for alert to fire
# Check Prometheus: http://localhost:9090/alerts
# Check Alertmanager: http://localhost:9093
# Restart master
sudo systemctl start ffmpeg-master
```
### Verify Alert Delivery
1. **Check Prometheus Alerts:**
```
http://localhost:9090/alerts
```
2. **Check Alertmanager:**
```
http://localhost:9093/#/alerts
```
3. **Check Notification Channels:**
- Slack: Look for test message
- PagerDuty: Check incidents page
- Email: Check inbox
---
## Incident Response
### Alert Response Workflow
1. **Receive Alert**
- Critical: PagerDuty page
- Warning: Slack notification
- Info: Email/ticket
2. **Acknowledge**
- PagerDuty: Acknowledge incident
- Slack: React with 👀 emoji
- Update team on status
3. **Investigate**
- Click dashboard link in alert
- Check master/worker logs
- Review recent changes
4. **Remediate**
- Follow runbook for alert type
- Apply fix
- Monitor for resolution
5. **Close**
- Verify alert resolved in Alertmanager
- Update incident notes
- Schedule postmortem if needed
### Runbooks
See **[docs/INCIDENT_PLAYBOOKS.md](INCIDENT_PLAYBOOKS.md)** for detailed response procedures:
- Master Down
- All Workers Down
- High Failure Rate
- Queue Overload
- Disk Space Issues
- Performance Degradation
---
## Maintenance and Tuning
### Regular Maintenance
**Weekly:**
- Review alert statistics
- Check for noisy alerts (tune thresholds)
- Verify notification delivery
**Monthly:**
- Review and update alert thresholds
- Update runbook documentation
- Test disaster recovery procedures
### Silence Alerts During Maintenance
**Web UI:**
```
http://localhost:9093/#/silences
Click "New Silence"
Set matchers: alertname=FFmpegWorkerNodeDown, instance=worker1
Duration: 2 hours
Comment: Scheduled maintenance
```
**CLI:**
```bash
amtool silence add \
alertname=FFmpegWorkerNodeDown \
instance=worker1 \
--duration=2h \
--comment="Scheduled maintenance"
```
**API:**
```bash
curl -XPOST http://localhost:9093/api/v1/silences -d '{
"matchers": [
{"name": "alertname", "value": "FFmpegWorkerNodeDown"},
{"name": "instance", "value": "worker1"}
],
"startsAt": "'$(date -Iseconds)'",
"endsAt": "'$(date -Iseconds -d '+2 hours)'",
"createdBy": "[email protected]",
"comment": "Scheduled maintenance"
}'
```
### Alert Tuning
**Reduce false positives:**
- Increase `for` duration
- Add more specific label matchers
- Adjust thresholds based on baseline
**Reduce alert fatigue:**
- Use inhibition rules
- Group related alerts
- Increase repeat_interval
**Example tuning:**
```yaml
# Before: Too sensitive
- alert: FFmpegHighFailureRate
expr: rate(jobs_failed_total[5m]) > 0.1
for: 5m
# After: More reasonable
- alert: FFmpegHighFailureRate
expr: |
(rate(jobs_failed_total[5m]) /
(rate(jobs_completed_total[5m]) + rate(jobs_failed_total[5m]))) > 0.1
for: 10m # Increased duration
```
### Metrics for Alert Health
**Alert firing rate:**
```promql
rate(alertmanager_alerts_received_total[1h])
```
**Alert resolution time:**
```promql
histogram_quantile(0.95,
rate(alertmanager_notification_latency_seconds_bucket[1h]))
```
**Failed notifications:**
```promql
rate(alertmanager_notifications_failed_total[1h])
```
---
## Troubleshooting
### Alerts Not Firing
1. **Check Prometheus targets:**
```
http://localhost:9090/targets
Ensure ffmpeg-master and ffmpeg-worker are UP
```
2. **Test alert expression:**
```
http://localhost:9090/graph
Run the expr from alert rule
Should return data when condition met
```
3. **Check alert state:**
```
http://localhost:9090/alerts
State should be: inactive → pending → firing
```
### Alerts Not Notifying
1. **Check Alertmanager received alert:**
```
http://localhost:9093/#/alerts
```
2. **Check routing:**
```bash
amtool config routes show
```
3. **Check notification logs:**
```bash
# Docker
docker-compose logs alertmanager | grep -i notification
# System
journalctl -u alertmanager -f
```
4. **Test receiver manually:**
```bash
# Test Slack webhook
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK \
-d '{"text":"Test message"}'
```
### Common Issues
**Issue: Slack notifications not arriving**
- Check webhook URL is correct
- Verify Slack app has permissions
- Check Alertmanager logs for errors
**Issue: PagerDuty not creating incidents**
- Verify integration key is correct
- Check PagerDuty service is active
- Ensure escalation policy configured
**Issue: Too many alerts**
- Review inhibition rules
- Increase alert thresholds
- Add grouping/aggregation
---
## Additional Resources
- **[Prometheus Alerting Documentation](https://prometheus.io/docs/alerting/latest/overview/)**
- **[Alertmanager Configuration](https://prometheus.io/docs/alerting/latest/configuration/)**
- **[PromQL Guide](https://prometheus.io/docs/prometheus/latest/querying/basics/)**
- **[Incident Playbooks](INCIDENT_PLAYBOOKS.md)**
- **[Production Operations](PRODUCTION_OPERATIONS.md)**
---
**Version**: 1.0
**Last Updated**: 2026-01-05
**Status**: Production Ready
This guide will help you set up the complete GhostWriter stack: Backend (Go), Frontend (Web), and iOS App.
This guide provides instructions for comparing AGS data from the Google Cloud Storage (GCS) bucket with existing dataset layouts stored in Google Drive/Google Sheets.
1. In chat, type: `/edit <filename>`
**Project:** Zumodra HR/Management SaaS