Loading...
Loading...
Loading...
**Authors:** [Author Names]
# From Heuristics to Hybrid: A Methodology for Building a Testable, Sequential Log Anomaly Detection Engine
**Authors:** [Author Names]
**Date:** November 17, 2025
**Repository:** [Anomaly_Detection](https://github.com/Shreyansh1812/Anomaly_Detection)
---
## Abstract
Log anomaly detection systems typically suffer from either (a) limited hard-coded rules or (b) opaque ML models with unverifiable performance. This paper addresses the **"Heuristic Evaluation Fallacy"**—evaluating detectors against labels generated by similar logic, creating circular validation that makes true performance unknowable.
We present a sequential two-stage hybrid architecture:
1. **Rule Engine:** Catches "known-knowns" (severity errors, attack keywords, failed login bursts)
2. **ML Specialist:** Analyzes residual traffic for "unknown-unknowns" (statistical outliers)
Sequential processing prevents ML bias toward trivial patterns. Our **Golden Set validation** uses manual labels to provide honest metrics independent of detection logic. Results: **Precision: 0.917, Recall: 1.000** (rules alone), **Precision: 0.846, Recall: 1.000** (hybrid system).
---
## 1. Introduction
### 1.1 The Heuristic Evaluation Fallacy
Our initial system tested performance by comparing detector output against **heuristic labels** (severity thresholds, simple patterns). This circular logic produced meaningless metrics—a recall of 0.846 merely showed the engine matched trivial cases while both missed a 9.5MB data exfiltration disguised as normal 200 OK traffic.
### 1.2 Solution: Sequential Hybrid + Golden Set Validation
**Detection Architecture:**
- Engine 1 (Rules): Remove known threats first
- Engine 2 (ML): Analyze residual traffic for statistical outliers
**Evaluation Framework:**
- Manual labeling creates independent ground truth
- `validate_with_golden_set.py` provides separate metrics per engine
- Eliminates circular validation
---
## 2. The Hybrid Detection Methodology
### 2.1 Engine 1: Rule-Based Detector
`LogAnomalyDetector` catches explicitly definable threats with perfect explainability:
**Rule Categories:**
- `[severity]`: ERROR/FATAL/CRITICAL levels
- `[suspicious_content]`: Attack tool keywords (nikto, sqlmap), anomalous methods (PUT on .css)
- `[failed_login_burst]`: Stateful tracking (3+ HTTP 401s or 5+ AuthService WARNs per IP within 60s)
**Design:** Prioritizes precision over recall—conservative thresholds prevent alert fatigue, Stage 2 handles missed cases.
### 2.2 Sequential Pipeline (Core Insight)
**Parallel Hybrid Failure:** ML trained on all logs overfits to severity levels, missing subtle threats (9.5MB exfiltration with INFO/200 OK status).
**Sequential Solution:**
```
All Logs → Rule Engine (removes obvious anomalies)
↓
Residual "normal" logs → ML Specialist (finds statistical outliers)
```
Filtering first forces ML to specialize in subtle deviations within normal-looking traffic.
### 2.3 Engine 2: ML Specialist
**Model:** IsolationForest (unsupervised, contamination=0.15)
**Feature Engineering Lesson:**
- HDFS-trained model (50+ features): 0% precision on HTTP logs
- HTTP-specific model (2 features): 84.6% precision
- **Minimalist approach:** `['message_length', 'response_bytes']` targets 9.5MB outlier
**Training Data:** Synthetic HTTP baseline (5,836 clean logs) filtered through rule engine ensures ML learns only residual patterns.
### 2.4 Integration: The Complete Hybrid Flow
```python
def detect_hybrid(logs):
# Stage 1: Rule engine
rule_hits = rule_engine.detect(logs)
flagged_indices = {hit['log']['_row_index'] for hit in rule_hits}
# Stage 2: ML specialist on residual logs
residual_logs = [log for i, log in enumerate(logs)
if i not in flagged_indices]
if residual_logs:
ml_predictions = ml_model.predict(extract_features(residual_logs))
ml_hits = [log for log, pred in zip(residual_logs, ml_predictions)
if pred == -1] # -1 = anomaly in IsolationForest
# Combine results with source tracking
return {
'rule_hits': rule_hits, # Each tagged with [source='rule']
'ml_hits': ml_hits, # Each tagged with [source='ml']
'total_anomalies': len(rule_hits) + len(ml_hits)
}
```
Each alert includes:
- The log entry
- Detection source (rule/ml)
- Specific reason (e.g., `[failed_login_burst]`, `[numeric_iforest]`)
---
## 3. The Golden Set Evaluation Framework
### 3.1 The Heuristic Evaluation Fallacy
Circular logic: detector tested against heuristic labels using similar rules.
- **Problem:** Both flag ERROR logs → high recall, but both miss 9.5MB exfiltration
- **Result:** Meaningless metrics (0.846 recall with critical threat undetected)
### 3.2 Golden Set Solution
**Process:** Manual review labels each anomaly (security threats, failures, policy violations, statistical outliers) in `test_log15(1).golden.json`.
**Validator:** `validate_with_golden_set.py` provides independent metrics per engine:
- Independence: Created without running detector
- Granularity: Separate scores for rule/ML/hybrid
- Transparency: Confusion matrices show exact performance
**Comparison:**
- Heuristic: Auto-generated, circular, simplistic → false confidence
- Golden Set: Manual, independent, nuanced → trustworthy metrics
---
## 4. Experimental Results
**Test File:** `test_log15(1).txt` (77 valid logs, 11 Golden Set anomalies)
### 4.1 Evolution Summary
| Stage | Configuration | Precision | Recall | Key Issue |
|-------|--------------|-----------|--------|----------|
| Heuristic | Severity-based labels | N/A | 0.846 | Circular validation |
| Rule Only | Golden Set | 0.917 | 1.000 | Caught all 11 anomalies |
| ML (HDFS) | Wrong domain | 0.000 | 0.000 | 32 false positives |
| ML (HTTP v1) | Right domain | N/A | 0.000 | Caught by rules first |
| **Hybrid Final** | **Sequential** | **0.846** | **1.000** | **2 false positives** |
### 4.2 Final Confusion Matrix
```
Hybrid Detector:
[[64 2] ← 2 false positives
[ 0 11]] ← 0 false negatives
Rule Engine: 10 anomalies (HTTP 401 bursts, Nikto, PUT method)
ML Specialist: 1 anomaly (9.5MB outlier)
```
### 4.3 Key Insights
1. **Sequential architecture works:** ML found the 1 outlier rules couldn't define
2. **Domain-specific training critical:** HDFS → 0%, HTTP → 84.6%
3. **Golden Set essential:** Without it, 0.846 heuristic recall was meaningless
4. **Rules provide baseline:** 10 of 11 anomalies caught deterministically
---
## 5. Conclusion
### 5.1 Key Contributions
**Evaluation framework matters more than algorithms.** Our simple techniques (keyword matching, IsolationForest, 2 features) achieved 97.4% accuracy through:
1. Golden Set validation (eliminated circular logic)
2. Sequential specialization (rules filter, ML learns residuals)
3. Domain-specific training (HTTP baseline for HTTP analysis)
### 5.2 Limitations & Future Work
**Current Issues:**
- Hardcoded keywords (nikto, sqlmap) → Need entropy-based detection
- Small Golden Set (77 logs) → Expand to 10,000+ with diverse patterns
- Static model → Concept drift as traffic patterns change
**Proposed Solutions:**
- Generalized behavioral rules (header anomalies, timing patterns)
- Continuous retraining pipeline (weekly baseline updates)
- Multi-domain specialists (HTTP, Auth, DB, System health)
- Semi-supervised Golden Set expansion (flag → review → label → retrain)
Code: [github.com/Shreyansh1812/Anomaly_Detection](https://github.com/Shreyansh1812/Anomaly_Detection)
---
## References
1. Liu, F. T., et al. (2008). Isolation forest. *IEEE ICDM*.
2. Chandola, V., et al. (2009). Anomaly detection: A survey. *ACM CSUR*, 41(3).
3. Provost, F., & Fawcett, T. (2013). *Data Science for Business*. O'Reilly Media.
---
## Appendix: Reproduction Commands
```bash
# 1. Setup
git clone https://github.com/Shreyansh1812/Anomaly_Detection.git
cd Anomaly_Detection
python -m venv .venv && .venv\Scripts\activate
pip install -r requirements.txt
# 2. Generate HTTP baseline
python Scripts/create_normal_http_baseline.py --count 6000
# 3. Train ML specialist
python -c "from ML.robust_anomaly_trainer import train_numeric_iforest; \
train_numeric_iforest(['Data/golden_sets/normal_http_baseline.log'], \
contamination_rate=0.15, model_dir='models/http_numeric_specialist', \
feature_subset=('message_length', 'response_bytes'))"
# 4. Run hybrid evaluation
python ML/evaluate_trained_model.py \
--files "Data/test_logs/test_log15(1).txt" \
--numeric-model-dir models/http_numeric_specialist \
--out ML/reports/results.json
# 5. Validate against Golden Set
python ML/validate_with_golden_set.py \
--predictions_file ML/reports/results.json \
--golden_set_file "Data/golden_sets/test_log15(1).golden.json"
```
**Expected Output:**
```
=== Hybrid Detector ===
precision recall f1-score support
0 1.000 0.970 0.985 66
1 0.846 1.000 0.917 11
accuracy 0.974 77
```
---
**End of Document**
- Without a harness, you **can't compare** prompts, models, retrieval configs, or costs.
Evaluate, benchmark, and regression-test AI/LLM systems. Covers evaluation framework design, benchmark creation, human evaluation protocols, automated evaluation (LLM-as-judge), regression testing, statistical significance, and continuous evaluation pipelines.
<img width="1388" height="298" alt="full_diagram" src="https://github.com/user-attachments/assets/12a2371b-8be2-4219-9b48-90503eb43c69" />
A list of all public EEG-datasets. This list of EEG-resources is not exhaustive. If you find something new, or have explored any unfiltered link in depth, please update the repository.