# From Heuristics to Hybrid: A Methodology for Building a Testable, Sequential Log Anomaly Detection Engine

**Authors:** [Author Names]

**Date:** November 17, 2025

**Repository:** [Anomaly_Detection](https://github.com/Shreyansh1812/Anomaly_Detection)

---

## Abstract

Log anomaly detection systems typically suffer from either (a) limited hard-coded rules or (b) opaque ML models with unverifiable performance. This paper addresses the **"Heuristic Evaluation Fallacy"**: evaluating detectors against labels generated by similar logic, which creates circular validation and makes true performance unknowable.

We present a sequential two-stage hybrid architecture:

1. **Rule Engine:** Catches "known-knowns" (severity errors, attack keywords, failed login bursts)
2. **ML Specialist:** Analyzes residual traffic for "unknown-unknowns" (statistical outliers)

Sequential processing prevents ML bias toward trivial patterns. Our **Golden Set validation** uses manual labels to provide honest metrics independent of detection logic.

Results: **Precision: 0.917, Recall: 1.000** (rules alone), **Precision: 0.846, Recall: 1.000** (hybrid system).

---

## 1. Introduction

### 1.1 The Heuristic Evaluation Fallacy

Our initial system tested performance by comparing detector output against **heuristic labels** (severity thresholds, simple patterns). This circular logic produced meaningless metrics: a recall of 0.846 merely showed that the engine matched the labels on trivial cases, while both the engine and the labels missed a 9.5MB data exfiltration disguised as normal 200 OK traffic.

### 1.2 Solution: Sequential Hybrid + Golden Set Validation

**Detection Architecture:**

- Engine 1 (Rules): Remove known threats first
- Engine 2 (ML): Analyze residual traffic for statistical outliers

**Evaluation Framework:**

- Manual labeling creates independent ground truth
- `validate_with_golden_set.py` provides separate metrics per engine
- Eliminates circular validation

---

## 2. The Hybrid Detection Methodology

### 2.1 Engine 1: Rule-Based Detector

`LogAnomalyDetector` catches explicitly definable threats, and every alert it raises is fully explainable.

**Rule Categories:**

- `[severity]`: ERROR/FATAL/CRITICAL levels
- `[suspicious_content]`: Attack tool keywords (nikto, sqlmap), anomalous methods (PUT on .css)
- `[failed_login_burst]`: Stateful tracking (3+ HTTP 401s or 5+ AuthService WARNs per IP within 60s)

**Design:** Prioritizes precision over recall. Conservative thresholds prevent alert fatigue, and Stage 2 handles the cases the rules miss.

### 2.2 Sequential Pipeline (Core Insight)

**Parallel Hybrid Failure:** An ML model trained on all logs overfits to severity levels and misses subtle threats (the 9.5MB exfiltration with INFO/200 OK status).

**Sequential Solution:**

```
All Logs → Rule Engine (removes obvious anomalies)
               ↓
Residual "normal" logs → ML Specialist (finds statistical outliers)
```

Filtering first forces the ML model to specialize in subtle deviations within normal-looking traffic.

### 2.3 Engine 2: ML Specialist

**Model:** IsolationForest (unsupervised, contamination=0.15)

**Feature Engineering Lesson:**

- HDFS-trained model (50+ features): 0% precision on HTTP logs
- HTTP-specific model (2 features): 84.6% precision
- **Minimalist approach:** `['message_length', 'response_bytes']` targets the 9.5MB outlier

**Training Data:** A synthetic HTTP baseline (5,836 clean logs) is filtered through the rule engine so that the ML model learns only residual patterns.
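To make the stateful `[failed_login_burst]` rule from Section 2.1 concrete, here is a minimal sketch of a per-IP sliding window. It is not the repository's `LogAnomalyDetector` code: the function name `find_failed_login_bursts` and the `(timestamp, ip, status)` event tuples are hypothetical, chosen only for illustration. The sketch covers the HTTP 401 case (3+ within 60s); the AuthService WARN variant would presumably reuse the same logic with a threshold of 5.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # burst window from Section 2.1
THRESHOLD_401 = 3     # 3+ HTTP 401s per IP within the window


def find_failed_login_bursts(events, threshold=THRESHOLD_401, window=WINDOW_SECONDS):
    """Return the indices of events that complete or extend a 401 burst.

    `events` is a hypothetical input format: an iterable of
    (timestamp_seconds, ip, status) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> timestamps of recent 401s
    flagged = []
    for index, (timestamp, ip, status) in enumerate(events):
        if status != 401:
            continue
        timestamps = recent[ip]
        timestamps.append(timestamp)
        # Evict 401s that have fallen out of the sliding window.
        while timestamp - timestamps[0] > window:
            timestamps.popleft()
        if len(timestamps) >= threshold:
            flagged.append(index)
    return flagged


events = [
    (0, "10.0.0.5", 401),
    (10, "10.0.0.5", 401),
    (20, "10.0.0.9", 200),
    (25, "10.0.0.5", 401),   # third 401 from the same IP within 60s
    (200, "10.0.0.5", 401),  # isolated failure long after the burst
]
print(find_failed_login_bursts(events))
```

Keeping only the timestamps inside the window bounds the state per IP, and an isolated failure long after a burst does not re-trigger the rule, which is in line with the conservative, precision-first design described above.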
### 2.4 Integration: The Complete Hybrid Flow

```python
def detect_hybrid(logs):
    """Run the rule engine first, then the ML specialist on the residual logs.

    Assumes `rule_engine`, `ml_model`, and `extract_features` are defined at
    module level, and that each log's `_row_index` equals its position in `logs`.
    """
    # Stage 1: Rule engine
    rule_hits = rule_engine.detect(logs)
    flagged_indices = {hit['log']['_row_index'] for hit in rule_hits}

    # Stage 2: ML specialist on residual logs
    residual_logs = [log for i, log in enumerate(logs) if i not in flagged_indices]
    ml_hits = []  # stays empty when the rule engine flags every log
    if residual_logs:
        ml_predictions = ml_model.predict(extract_features(residual_logs))
        # IsolationForest returns -1 for anomalies and 1 for inliers
        ml_hits = [log for log, pred in zip(residual_logs, ml_predictions) if pred == -1]

    # Combine results with source tracking
    return {
        'rule_hits': rule_hits,  # Each tagged with [source='rule']
        'ml_hits': ml_hits,      # Each tagged with [source='ml']
        'total_anomalies': len(rule_hits) + len(ml_hits)
    }
```

Each alert includes:

- The log entry
- Detection source (rule/ml)
- Specific reason (e.g., `[failed_login_burst]`, `[numeric_iforest]`)

---

## 3. The Golden Set Evaluation Framework

### 3.1 The Heuristic Evaluation Fallacy

Circular logic: the detector is tested against heuristic labels that were generated by similar rules.

- **Problem:** Both flag ERROR logs → high recall, but both miss the 9.5MB exfiltration
- **Result:** Meaningless metrics (0.846 recall with a critical threat undetected)

### 3.2 Golden Set Solution

**Process:** A manual review labels each anomaly (security threats, failures, policy violations, statistical outliers) in `test_log15(1).golden.json`.

**Validator:** `validate_with_golden_set.py` provides independent metrics per engine:

- Independence: Labels are created without running the detector
- Granularity: Separate scores for rule/ML/hybrid
- Transparency: Confusion matrices show exact performance

**Comparison:**

- Heuristic: Auto-generated, circular, simplistic → false confidence
- Golden Set: Manual, independent, nuanced → trustworthy metrics

---

## 4. Experimental Results

**Test File:** `test_log15(1).txt` (77 valid logs, 11 Golden Set anomalies)

### 4.1 Evolution Summary

| Stage | Configuration | Precision | Recall | Key Observation |
|-------|---------------|-----------|--------|-----------------|
| Heuristic | Severity-based labels | N/A | 0.846 | Circular validation |
| Rule Only | Golden Set | 0.917 | 1.000 | Caught all 11 anomalies |
| ML (HDFS) | Wrong domain | 0.000 | 0.000 | 32 false positives |
| ML (HTTP v1) | Right domain | N/A | 0.000 | Caught by rules first |
| **Hybrid Final** | **Sequential** | **0.846** | **1.000** | **2 false positives** |

### 4.2 Final Confusion Matrix

```
Hybrid Detector:
[[64  2]    ← 2 false positives
 [ 0 11]]   ← 0 false negatives

Rule Engine:   10 anomalies (HTTP 401 bursts, Nikto, PUT method)
ML Specialist:  1 anomaly (9.5MB outlier)
```

### 4.3 Key Insights

1. **Sequential architecture works:** ML found the 1 outlier rules couldn't define
2. **Domain-specific training critical:** HDFS → 0%, HTTP → 84.6%
3. **Golden Set essential:** Without it, 0.846 heuristic recall was meaningless
4. **Rules provide baseline:** 10 of 11 anomalies caught deterministically

---

## 5. Conclusion

### 5.1 Key Contributions

**Evaluation framework matters more than algorithms.** Our simple techniques (keyword matching, IsolationForest, 2 features) achieved 97.4% accuracy (75 of 77 logs classified correctly) through:

1. Golden Set validation (eliminated circular logic)
2. Sequential specialization (rules filter, ML learns residuals)
3. Domain-specific training (HTTP baseline for HTTP analysis)

### 5.2 Limitations & Future Work

**Current Issues:**

- Hardcoded keywords (nikto, sqlmap) → Need entropy-based detection
- Small Golden Set (77 logs) → Expand to 10,000+ with diverse patterns
- Static model → Concept drift as traffic patterns change

**Proposed Solutions:**

- Generalized behavioral rules (header anomalies, timing patterns)
- Continuous retraining pipeline (weekly baseline updates)
- Multi-domain specialists (HTTP, Auth, DB, System health)
- Semi-supervised Golden Set expansion (flag → review → label → retrain)

Code: [github.com/Shreyansh1812/Anomaly_Detection](https://github.com/Shreyansh1812/Anomaly_Detection)

---

## References

1. Liu, F. T., et al. (2008). Isolation forest. *IEEE ICDM*.
2. Chandola, V., et al. (2009). Anomaly detection: A survey. *ACM CSUR*, 41(3).
3. Provost, F., & Fawcett, T. (2013). *Data Science for Business*. O'Reilly Media.

---

## Appendix: Reproduction Commands

```bash
# 1. Setup
git clone https://github.com/Shreyansh1812/Anomaly_Detection.git
cd Anomaly_Detection
python -m venv .venv
# Activate the environment (Windows path shown; on Linux/macOS use: source .venv/bin/activate)
.venv\Scripts\activate
pip install -r requirements.txt

# 2. Generate HTTP baseline
python Scripts/create_normal_http_baseline.py --count 6000

# 3. Train ML specialist
python -c "from ML.robust_anomaly_trainer import train_numeric_iforest; \
train_numeric_iforest(['Data/golden_sets/normal_http_baseline.log'], \
contamination_rate=0.15, model_dir='models/http_numeric_specialist', \
feature_subset=('message_length', 'response_bytes'))"

# 4. Run hybrid evaluation
python ML/evaluate_trained_model.py \
  --files "Data/test_logs/test_log15(1).txt" \
  --numeric-model-dir models/http_numeric_specialist \
  --out ML/reports/results.json

# 5. Validate against Golden Set
python ML/validate_with_golden_set.py \
  --predictions_file ML/reports/results.json \
  --golden_set_file "Data/golden_sets/test_log15(1).golden.json"
```

**Expected Output:**

```
=== Hybrid Detector ===
              precision    recall  f1-score   support

           0      1.000     0.970     0.985        66
           1      0.846     1.000     0.917        11

    accuracy                          0.974        77
```

---

**End of Document**
