# Phase 2B.1: Monitoring & Alerting Blueprint

**Purpose**: Real-time system health monitoring and automated incident response

**Scope**: Testnet E2E through mainnet production

**Update Frequency**: Every 60 seconds

---

## 🎯 Monitoring Hierarchy

### Tier 1: Critical System Invariants (HALT on violation)

These must NEVER be violated. Any violation halts the relayer immediately.

```
Critical Invariant Violations (Triggers Relayer HALT)
├─ SUPPLY_MISMATCH
│  ├─ Trigger: locked_db != chain_locked (difference > 0.001 SIGIL)
│  ├─ Action: Stop all operations, alert on-call
│  └─ Log Level: CRITICAL
│
├─ DUPLICATE_BATCH
│  ├─ Trigger: UNIQUE(merkle_root) constraint violation
│  ├─ Action: Rollback operation, investigate indexer
│  └─ Log Level: CRITICAL
│
├─ PROOF_VERIFICATION_FAIL
│  ├─ Trigger: merkle proof rejects valid withdrawal
│  ├─ Action: Investigate merkle logic, halt settlement
│  └─ Log Level: CRITICAL
│
└─ NONCE_DESYNC
   ├─ Trigger: expected_nonce != actual_nonce on chain
   ├─ Action: Rescan chain, realign nonce state
   └─ Log Level: CRITICAL
```

### Tier 2: Operational Health (Alert but don't halt)

These indicate problems that need attention but don't require immediate stop.

```
Operational Health Alerts (Alert, don't halt)
├─ RPC_TIMEOUT
│  ├─ Trigger: 3+ consecutive RPC call failures
│  ├─ Action: Switch RPC endpoint, retry with backoff
│  └─ Log Level: ERROR
│
├─ DB_CONNECTION_LOST
│  ├─ Trigger: 5+ database connection failures
│  ├─ Action: Check DB health, restart connection pool
│  └─ Log Level: ERROR
│
├─ HIGH_LATENCY
│  ├─ Trigger: Cycle time >2s (target <1s)
│  ├─ Action: Investigate bottleneck, adjust batch size
│  └─ Log Level: WARNING
│
├─ GAS_ESTIMATE_MISS
│  ├─ Trigger: Actual gas > estimate * 1.5
│  ├─ Action: Increase gas limit buffer, log for analysis
│  └─ Log Level: WARNING
│
└─ QUEUE_BUILDUP
   ├─ Trigger: Pending operations > 10
   ├─ Action: Increase batch frequency or size
   └─ Log Level: WARNING
```

### Tier 3: Operational Metrics (Track but don't alert)

These are tracked for trends and long-term optimization.

```
Operational Metrics (Tracked, no alert threshold)
├─ Settlement Latency
│  ├─ Metric: Time from batch creation to anchor confirmation
│  └─ Target: <5 minutes
│
├─ Proof Verification Success Rate
│  ├─ Metric: successful_verifications / total_attempts
│  └─ Target: >99.9%
│
├─ Duplicate Event Rate
│  ├─ Metric: duplicate_events / total_events
│  └─ Target: 0%
│
├─ RPC Success Rate
│  ├─ Metric: successful_calls / total_calls
│  └─ Target: >99%
│
└─ Database Performance
   ├─ Metric: Query time percentiles (p50, p95, p99)
   └─ Target: p95 <100ms
```

---

## 📊 Dashboard Specification

### Dashboard 1: System Health (Main)

**Refresh**: Every 60 seconds

**Audience**: Operations team, monitoring

```
┌─────────────────────────────────────────────────────────┐
│           SIGILBRIDGE RELAYER - SYSTEM STATUS           │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌────────────────────────┐  ┌──────────────────────┐   │
│  │ SUPPLY INVARIANT       │  │ RELAYER STATUS       │   │
│  ├────────────────────────┤  ├──────────────────────┤   │
│  │ DB Total: 1000.00 ✅   │  │ Status: RUNNING ✅   │   │
│  │ Chain Total: 1000.00 ✅│  │ Uptime: 48h 23m      │   │
│  │ Delta: 0.00 ✅         │  │ Cycles: 2,847 ✅     │   │
│  │ Status: HEALTHY        │  │ Errors: 0 ✅         │   │
│  └────────────────────────┘  └──────────────────────┘   │
│                                                         │
│  ┌────────────────────────┐  ┌──────────────────────┐   │
│  │ SETTLEMENT PROGRESS    │  │ NETWORK STATUS       │   │
│  ├────────────────────────┤  ├──────────────────────┤   │
│  │ Deposits Indexed: 1,234│  │ Execution Chain: OK  │   │
│  │ Batches Created: 847   │  │ Cronos RPC: OK       │   │
│  │ Batches Anchored: 847  │  │ DB Connection: OK    │   │
│  │ Success Rate: 99.8%    │  │ All Systems: SYNCED  │   │
│  └────────────────────────┘  └──────────────────────┘   │
│                                                         │
│  Last Update: 2026-02-09 14:32:15 UTC                   │
└─────────────────────────────────────────────────────────┘
```

### Dashboard 2: Detailed Metrics

```
┌─────────────────────────────────────────────────────────┐
│              DETAILED PERFORMANCE METRICS               │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  INDEX CYCLE (target: <100ms)                           │
│  ████████░░ 89ms ✅                                     │
│                                                         │
│  BATCH CREATION (target: <50ms)                         │
│  ████░░░░░░ 38ms ✅                                     │
│                                                         │
│  EXECUTION LATENCY (target: <500ms)                     │
│  ████████░░ 425ms ✅                                    │
│                                                         │
│  CONFIRMATION DEPTH (target: ≥10 blocks)                │
│  ████████████ 12 blocks ✅                              │
│                                                         │
│  PROOF VERIFICATION (success rate)                      │
│  ██████████ 100.0% ✅                                   │
│                                                         │
│  RPC SUCCESS RATE (target: >99%)                        │
│  ██████████ 99.7% ✅                                    │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

### Dashboard 3: Alerts & Events

```
┌─────────────────────────────────────────────────────────┐
│              ACTIVE ALERTS & RECENT EVENTS              │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  CRITICAL ALERTS: 0                                   ✅│
│  ERROR ALERTS: 0                                      ✅│
│  WARNING ALERTS: 0                                    ✅│
│                                                         │
│  Recent Events:                                         │
│  ─────────────────────────────────────────────          │
│  14:30:02 ✅ Batch #847 anchored (root: 0xa3f...)       │
│  14:29:45 ✅ Invariant check passed (delta: 0.00)       │
│  14:28:32 ✅ 12 deposits indexed, added to batches      │
│  14:27:15 ✅ Proof verification successful (batch 846)  │
│  14:15:00 🔄 RPC retry #1 on execution chain            │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

---

## 🚨 Alert Configuration

### Alert 1: Supply Mismatch (CRITICAL)

```yaml
Alert: SUPPLY_MISMATCH
Trigger: |
  locked_db_total != chain_locked_total
  AND abs(difference) > 0.001 SIGIL
Severity: CRITICAL
Action:
  - Halt relayer immediately
  - Page on-call engineer
  - Create incident ticket
  - Log full balance breakdown
  - Save DB snapshot for audit
Notification:
  - Email to [email protected]
  - "Slack to #incidents"
  - PagerDuty trigger (critical)
  - SMS to on-call
Recovery:
  1. Investigate root cause
  2. Verify chain state (Web3 call)
  3. Check DB logs for corruption
  4. Manually correct data if needed
  5. Re-validate invariant
  6. Resume relayer
```

### Alert 2: Duplicate Batch (CRITICAL)

```yaml
Alert: DUPLICATE_BATCH_DETECTED
Trigger: |
  UNIQUE(merkle_root) constraint violation
  OR duplicate batch number in DB
Severity: CRITICAL
Action:
  - Halt batch creation immediately
  - Page on-call engineer
  - Investigate indexer logic
  - Check for event replay attack
  - Audit event processing logs
Notification:
  - Email to [email protected]
  - "Slack to #incidents (with logs)"
  - PagerDuty critical
Recovery:
  1. Stop indexer immediately
  2. Review last 100 events
  3. Identify duplicate source
  4. Manually remove duplicate from DB (if safe)
  5. Restart indexer with manual event replay
  6. Verify no further duplicates
```

### Alert 3: RPC Timeout (ERROR)

```yaml
Alert: RPC_TIMEOUT
Trigger: |
  3 consecutive RPC call failures
  OR any single call timeout >10s
Severity: ERROR (not critical)
Action:
  - Log error with retry count
  - Switch to fallback RPC endpoint
  - Retry with exponential backoff
  - Alert ops team if persists >5min
Notification:
  - Log at ERROR level
  - Slack notification if >10 failures
  - Email if >1 hour of issues
Backoff Strategy:
  - Attempt 1: immediate
  - Attempt 2: wait 1s
  - Attempt 3: wait 2s
  - Attempt 4: wait 5s
  - Attempt 5: switch RPC + restart
```

### Alert 4: Database Connection Lost (ERROR)

```yaml
Alert: DB_CONNECTION_LOST
Trigger: |
  5+ consecutive database connection failures
Severity: ERROR
Action:
  - Halt all DB operations
  - Attempt connection pool restart
  - Alert ops team
  - Check PostgreSQL logs on server
Notification:
  - Log at ERROR level
  - Email to [email protected]
  - "Slack to #ops"
Recovery:
  1. Check database server health
  2. Verify network connectivity
  3. Check PostgreSQL logs
  4. Restart connection pool
  5. Verify connectivity
  6. Resume relayer
```

### Alert 5: High Latency (WARNING)

```yaml
Alert: HIGH_LATENCY
Trigger: |
  Cycle time > 2 seconds (target: <1 second)
Severity: WARNING
Action:
  - Continue operations (don't halt)
  - Log warning with latency measurements
  - Alert ops for investigation
  - Check for bottlenecks
Notification:
  - Log at WARNING level
  - Slack notification if >5min sustained
Investigation:
  1. Check RPC latency (should be <500ms)
  2. Check database query times
  3. Check batch size (might be too large)
  4. Check confirmation depth setting
  5. Analyze merkle tree generation time
```

---

## 📈 Monitoring Infrastructure

### Logging Level Configuration

```yaml
Logging Levels:
  CRITICAL:  # System invariants violated
    - Supply mismatch detected
    - Duplicate batch created
    - Nonce desynchronized
    - Proof verification failed

  ERROR:  # Operational failures
    - RPC connection lost
    - Database error
    - Transaction reverted
    - Proof generation failed

  WARNING:  # Degraded performance
    - High latency
    - Gas estimate miss
    - Queue buildup
    - Low success rate

  INFO:  # Normal operations
    - Batch created (batch_id, deposits_count)
    - Deposit indexed (deposit_id, amount)
    - Proof verified (withdrawal_id, valid=True)
    - Cycle completed (duration)

  DEBUG:  # Detailed traces
    - RPC call details (method, args, result)
    - DB query execution (sql, duration)
    - Merkle tree construction steps
    - Nonce management state
```

### Log Storage & Retention

```
Log Destinations:
├─ Console: All levels (for development)
├─ File: /var/log/sigilbridge/relayer.log (rotate daily)
│  ├─ Retention: 30 days
│  └─ Size: Max 500MB per file
├─ Database: CRITICAL/ERROR only (for audit)
│  └─ Retention: 90 days
└─ Centralized Logging: ELK or similar
   ├─ All levels
   ├─ Real-time indexing
   └─ 1-year retention for compliance
```

### Metrics Collection

```python
# Metrics to track (export to Prometheus/CloudWatch)
metrics = {
    # Counter: Total operations by type
    "relayer_deposits_indexed_total": 1234,
    "relayer_batches_created_total": 847,
    "relayer_batches_anchored_total": 847,
    "relayer_withdrawals_claimed_total": 234,

    # Gauge: Current state
    "relayer_pending_deposits": 12,
    "relayer_pending_batches": 0,
    "relayer_db_locked_sigil": 1000.00,
    "relayer_chain_locked_sigil": 1000.00,
    "relayer_invariant_delta": 0.00,

    # Histogram: Latency measurements
    "relayer_index_cycle_seconds": [0.045, 0.089, 0.052, ...],
    "relayer_batch_creation_seconds": [0.038, 0.041, 0.039, ...],
    "relayer_settlement_latency_seconds": [0.425, 0.389, 0.412, ...],
    "relayer_proof_verification_seconds": [0.012, 0.011, 0.013, ...],

    # Rate: Operations per minute
    "relayer_index_rate": 234.5,       # deposits per minute
    "relayer_batch_rate": 12.3,        # batches per minute
    "relayer_settlement_rate": 12.1,   # settlements per minute
}
```

---

## 🚑 Incident Response Runbook

### Incident: Supply Mismatch Detected

**Severity**: CRITICAL

**Alert**: Supply invariant violated (db_total != chain_total)

**On-Call**: Page immediately

**Timeline**:
- T+0m: Alert fires, relayer halts
- T+1m: On-call acknowledges
- T+2m: Begin investigation
- T+5m: Root cause identified
- T+10m: Decision to rollback or fix forward
- T+20m: Issue mitigated
- T+30m: Post-mortem meeting scheduled

**Investigation Steps**:

1. **Verify Alert** (T+1m)
   ```sql
   -- Expected locked amount according to the database
   SELECT SUM(amount) FROM deposit_confirmed;     -- e.g. 1000.00
   SELECT SUM(amount) FROM withdrawal_confirmed;  -- e.g. 234.50
   -- Calculate expected: locked_db = 1000.00 - 234.50 = 765.50

   -- Query chain (contract calls, not SQL):
   --   SIGIL.balanceOf(cronos_lockbox) = ?
   --   ExecutionSigilBridged.totalSupply() = ?
   -- Calculate actual: locked_chain = cronos_lockbox + execution_bridged
   ```

2. **Determine Direction** (T+3m)
   ```
   If locked_db > locked_chain:
       SIGIL was LOST (worse scenario)
       Action: Audit settlement transactions

   If locked_db < locked_chain:
       SIGIL was CREATED (inflation)
       Action: Check minting logic
   ```

3. **Audit Transactions** (T+5m)
   ```sql
   -- Check recent batch anchoring
   SELECT * FROM settlement_batch
   WHERE status='anchored'
   ORDER BY created_at DESC LIMIT 10;

   -- Check proof executions
   SELECT * FROM proof_executions
   ORDER BY created_at DESC LIMIT 10;

   -- Check where SIGIL went
   SELECT * FROM relayer_operations
   ORDER BY created_at DESC LIMIT 20;
   ```

4. **Fix Forward or Rollback** (T+10m)
   - **Option A (Fix Forward)**: If the issue is identified in code, deploy a fix
   - **Option B (Rollback)**: If a transaction has bad effects, manually reverse it

5. **Validate** (T+20m)
   ```bash
   ./scripts/validate_invariant.py
   # Should show: INVARIANT VALID
   ```

6. **Resume** (T+25m)
   ```bash
   relayer resume --from-last-checkpoint
   ```

---

## 📞 Escalation Policy

| Severity | Response Time | Escalation |
|----------|---------------|------------|
| CRITICAL | <5 min | Page on-call immediately |
| ERROR | <15 min | Email + Slack |
| WARNING | <60 min | Slack notification |
| INFO | <24h | Daily report |

---

## ✅ Monitoring Readiness Checklist

- [ ] All alerts configured and tested
- [ ] Dashboard deployed and accessible
- [ ] Logging pipeline operational
- [ ] Metrics exported to monitoring system
- [ ] Runbook accessible to on-call
- [ ] Team trained on incident response
- [ ] PagerDuty/alerting tool configured
- [ ] Backup RPC endpoints configured
- [ ] Database backups automated
- [ ] 24/7 monitoring active

---

**Document Created**: February 9, 2026

**System Status**: Ready for production monitoring

**Next Phase**: Deploy and test monitoring on testnet
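
---

The supply invariant behind the SUPPLY_MISMATCH alert, together with the runbook's "Determine Direction" step, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the relayer's actual implementation: `check_supply_invariant` and `InvariantResult` are hypothetical names, and both totals are assumed to have been fetched already (from the database and from the chain) and passed in as `Decimal` values. The 0.001 SIGIL tolerance and the lost/created interpretation come from this document.

```python
from dataclasses import dataclass
from decimal import Decimal

# Tolerance taken from the SUPPLY_MISMATCH trigger (0.001 SIGIL).
TOLERANCE = Decimal("0.001")


@dataclass(frozen=True)
class InvariantResult:
    """Outcome of one invariant check (hypothetical structure)."""
    delta: Decimal   # locked_db - locked_chain
    healthy: bool    # True when |delta| does not exceed the tolerance
    direction: str   # "balanced", "lost", or "created"


def check_supply_invariant(
    locked_db: Decimal,
    locked_chain: Decimal,
    tolerance: Decimal = TOLERANCE,
) -> InvariantResult:
    """Compare the DB and on-chain locked totals (hypothetical helper)."""
    delta = locked_db - locked_chain
    if abs(delta) <= tolerance:
        return InvariantResult(delta, True, "balanced")
    # Direction follows the runbook: DB > chain means SIGIL was lost,
    # DB < chain means SIGIL was created.
    direction = "lost" if delta > 0 else "created"
    return InvariantResult(delta, False, direction)
```

A caller would halt the relayer and page on-call whenever `healthy` is false, and use `direction` to pick the audit path (settlement transactions for "lost", minting logic for "created"). `Decimal` is used instead of `float` so that a tolerance of 0.001 is compared exactly.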
