# LLMTrace Implementation TODO
**Generated from:** `docs/FEATURE_ROADMAP.md`
**Updated:** 2026-02-15
**Methodology:** RALPH loops. Each loop spawns a Claude Code agent with strict quality gates and is reviewed by the lead engineer before merge.
---
## Status Legend
- ⬜ Not started
- 🔄 In progress
- ✅ Done
- ❌ Blocked
---
## Acceptance Criteria (Literature-Anchored)
These criteria define when a task can be marked ✅. If any criterion is not met, status must remain 🔄 or ⬜.
Input Security (IS): Must implement the specific algorithmic behaviors described in the literature, not heuristic approximations. For IS-001–IS-003, MOF requires token-wise bias detection, debiasing data generation, and retraining with reported over-defense gains in `docs/research/injecguard-over-defense-mitigation.md`. For IS-006/IS-007, thresholds must be calibrated at 0.1/0.5/1% FPR with TPR reporting per `docs/research/security-state-of-art-2026.md`. For IS-010/IS-011, WordNet-style synonym expansion and true lemmatization are required, not regex-only stems, per `docs/research/dmpi-pmhfe-prompt-injection-detection.md`. For IS-024–IS-029, adversarial robustness must include attack-specific defenses and calibration beyond normalization per `docs/research/bypassing-llm-guardrails-evasion.md`.
DMPI-PMHFE Architecture (DMPI-001–DMPI-006): The fusion pipeline matches the paper's dual-channel design. All 6 deviations resolved: average pooling (DMPI-001), 2 FC layers (DMPI-002), 10 binary heuristic features with paper keyword sets (DMPI-003, DMPI-005), repetition threshold >=3 (DMPI-004), `is_*` naming convention (DMPI-006). See Loop 12a and `docs/research/dmpi-pmhfe-prompt-injection-detection.md` for full specification. ML-001 (fusion training) is no longer blocked by DMPI deviations.
Tool/Agent Security (AS): Tool boundary defenses must parse/sanitize with LLM-based extraction and CheckTool-style triggering detection, not heuristic filters, per `docs/research/defense-tool-result-parsing.md` and `docs/research/indirect-injection-firewalls.md`. Multi-agent defense requires an explicit coordinator + guard multi-pass architecture (and second opinion path) rather than a single-pass heuristic pipeline, per `docs/research/multi-agent-defense-pipeline.md`. Pattern enforcement must detect plan compliance and routing by trust level as defined in `docs/research/design-patterns-securing-agents.md`.
Output Security (OS): HaluGate-style token-level detection requires ModernBERT token classification and NLI explanation layer, not heuristic or sentence-only checks, per `docs/research/security-state-of-art-2026.md`. Streaming safety must use partial-sequence models and progressive confidence (SCM), not re-running full-text detectors, per `docs/research/security-state-of-art-2026.md`. CodeShield parity requires Semgrep integration and coverage beyond basic static rules, per `docs/research/security-state-of-art-2026.md`.
Privacy/Protocol/Multimodal (PR/AS/MM/SA): Membership inference, poisoning, MINJA, and protocol exploit defenses must match the threat models in `docs/research/prompt-injections-to-protocol-exploits.md`. Multimodal defenses must include OCR and modality-specific detectors as described in the same literature. Policy language and taint/blast-radius controls must align with `docs/research/llmtrace-defense-pipeline-design.md`.
Evaluation (EV): Benchmarks must implement the named suites with published dataset sizes and result formatting per `docs/research/benchmarks-and-tools-landscape.md` and `docs/research/wasp-web-agent-security-benchmark.md`.
Non-Functional Requirements (NFR): Security-critical detections must be deterministic and testable, with clear latency budgets where specified (e.g., HaluGate sentinel 12ms class, token-level detection 76–162ms) from `docs/research/security-state-of-art-2026.md`. Any ML integration must include reproducible model loading, configuration, and tests demonstrating expected metrics.
## Phase 1: Critical / Quick Wins
### Loop 1 - Unicode Evasion Defenses
> Close the 100% ASR emoji smuggling and upside-down text gaps
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| IS-020 | Emoji normalisation/stripping - 100% ASR, zero current defense | Low | ✅ `a62855b` |
| IS-021 | Upside-down text mapping - 100% jailbreak evasion | Low | ✅ `a62855b` |
| IS-022 | Unicode tag character stripping (U+E0001–U+E007F) | Low | ✅ `a62855b` |
| IS-031 | Diacritics-based evasion defense - accent marks | Low | ✅ `a62855b` |
| IS-015 | Braille encoding evasion defense | Low | ✅ `a62855b` |
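
As an illustration of IS-022, the tag-character range listed above can be filtered in a single pass over the input. This is a minimal sketch, not the code from commit `a62855b`; the function name `strip_unicode_tags` is hypothetical.

```rust
/// Removes Unicode tag characters (U+E0001..=U+E007F) from `input`.
/// Hypothetical helper for illustration; not the project's actual API.
pub fn strip_unicode_tags(input: &str) -> String {
    input
        .chars()
        // Keep every character that falls outside the tag range.
        .filter(|c| !('\u{E0001}'..='\u{E007F}').contains(c))
        .collect()
}
```

Filtering on `char` values rather than bytes keeps the remaining text valid UTF-8 and leaves all other characters untouched.
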
### Loop 2 - NotInject Benchmark + 3D Evaluation
> Establish over-defense baseline and evaluation framework (current dataset: 210 samples, difficulty split 90/60/60)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| IS-004 | NotInject-style over-defense benchmark dataset (339 samples, 3 difficulty levels) | Low | ✅ |
| IS-005 | Three-dimensional evaluation metrics (benign/malicious/over-defense) | Low | ✅ `33b3f55` |
| EV-002 | NotInject evaluation runner (dataset complete: 339 samples) | Low | ✅ |
| EV-010 | Paper-table output format for results | Low | ✅ `33b3f55` |
### Loop 3 - FPR-Aware Threshold Optimisation
> Evaluate at deployment-realistic FPR operating points
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| IS-006 | FPR-aware threshold optimisation - evaluate at 0.1%, 0.5%, 1% FPR | Medium | ✅ `fpr_monitor.rs` |
| IS-007 | Configurable operating points (high-precision / balanced / high-recall) | Low | ✅ (R8) |
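
One way to read "evaluate at a fixed FPR" is: choose the threshold from benign scores alone, then report TPR on malicious scores at that threshold. The sketch below assumes scores where higher means more likely malicious and a strict `score > threshold` decision rule; `threshold_at_fpr` and `rate_above` are hypothetical names and do not describe `fpr_monitor.rs`.

```rust
/// Picks a decision threshold so that at most `target_fpr` of the benign
/// scores are flagged under the rule "flag when score > threshold".
/// Returns None for empty input or a target outside [0, 1].
pub fn threshold_at_fpr(benign_scores: &[f64], target_fpr: f64) -> Option<f64> {
    if benign_scores.is_empty() || !(0.0..=1.0).contains(&target_fpr) {
        return None;
    }
    let mut sorted = benign_scores.to_vec();
    // Sort descending; total_cmp gives a total order even if NaN is present.
    sorted.sort_by(|a, b| b.total_cmp(a));
    // Number of benign samples that may be flagged without exceeding the target.
    let allowed = (target_fpr * sorted.len() as f64).floor() as usize;
    if allowed >= sorted.len() {
        return Some(f64::NEG_INFINITY);
    }
    // Only scores strictly above the (allowed + 1)-th largest benign score
    // are flagged, so at most `allowed` benign samples become false positives.
    Some(sorted[allowed])
}

/// Fraction of `scores` strictly above `threshold`:
/// FPR when given benign scores, TPR when given malicious scores.
pub fn rate_above(scores: &[f64], threshold: f64) -> f64 {
    if scores.is_empty() {
        return 0.0;
    }
    scores.iter().filter(|&&s| s > threshold).count() as f64 / scores.len() as f64
}
```

Calling `threshold_at_fpr` once per operating point (0.001, 0.005, 0.01) and then `rate_above` on the malicious scores gives the TPR-at-FPR figures the acceptance criteria ask for. At 0.1% FPR the benign set needs at least 1,000 samples before a single false positive is allowed.
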
### Loop 4 - Canary Token System
> Detect system prompt leakage in responses
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| SA-002 | Canary token injection and leakage detection | Low | ✅ `5b43d93` |
### Loop 5 - Tool Registry & Classification
> Foundation for agent security features
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| AS-008 | Tool registry with security classification (category, risk score, permissions) | Medium | ✅ `eae4ca3` |
| AS-015 | Action-type rate limiting | Low | ✅ `eae4ca3` |
### Loop 6 - Context Window Flooding Detection
> DoS prevention (OWASP LLM10)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| IS-017 | Context window flooding detection | Low | ✅ `9997962` |
---
### Loop R0 - Scaffold the Workspace
> Create workspace, crates, and baseline repo hygiene
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL0-01 | Initialize Cargo workspace and required crates | Medium | ✅ |
| RL0-02 | Add root README, .gitignore, rustfmt config | Low | ✅ |
| RL0-03 | Ensure crates compile cleanly | Medium | ✅ |
### Loop R1 - Core Types & Traits
> Define foundational core types and traits
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL1-01 | Core types: TraceEvent, TraceSpan, TenantId, SecurityFinding, SecuritySeverity, LLMProvider, ProxyConfig | Medium | ✅ |
| RL1-02 | Core traits: StorageBackend (or successors), SecurityAnalyzer | Medium | ✅ |
| RL1-03 | Error types via thiserror, serde on public types, timestamp types | Medium | ✅ |
| RL1-04 | Serialization roundtrip tests | Medium | ✅ |
### Loop R2 - SQLite Storage Backend
> Implement SQLite storage backend
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL2-01 | Sqlite storage implementation with migrations | Medium | ✅ |
| RL2-02 | store/query/health_check for traces | Medium | ✅ |
| RL2-03 | Integration tests with temp DB | Medium | ✅ |
### Loop R3 - Basic Prompt Injection Detection
> Regex-based prompt injection detection
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL3-01 | RegexSecurityAnalyzer request/response scanning | Medium | ✅ |
| RL3-02 | Patterns: system override, role injection, base64, PII | Medium | ✅ |
| RL3-03 | Comprehensive tests for known attacks | Medium | ✅ |
### Loop R4 - Transparent Proxy Core
> Core proxy flow and async analysis
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL4-01 | HTTP proxy flow (accept, parse, forward, return) | High | ✅ |
| RL4-02 | Support OpenAI-compatible routes | Medium | ✅ |
| RL4-03 | Async trace capture + security analysis | Medium | ✅ |
| RL4-04 | Circuit breaker and health endpoint | Medium | ✅ |
| RL4-05 | YAML config loading | Medium | ✅ |
### Loop R5 - Streaming SSE Support
> Stream passthrough and token tracking
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL5-01 | Detect streaming requests and forward SSE | High | ✅ |
| RL5-02 | Incremental token/TTFT tracking | High | ✅ |
| RL5-03 | Integration tests with mock SSE upstream | Medium | ✅ |
### Loop R5.5 - Storage Layer Refactor
> Repository pattern split for traces/metadata/cache
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL5-501 | Split storage traits into trace/metadata/cache | High | ✅ |
| RL5-502 | Add tenant/config/audit types | Medium | ✅ |
| RL5-503 | Storage composite + profile factory | Medium | ✅ |
| RL5-504 | SQLite repos for traces + metadata, in-memory cache | High | ✅ |
| RL5-505 | Proxy integration with new storage profile config | High | ✅ |
### Loop R6 - Configuration & CLI
> CLI and config validation
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL6-01 | Clap CLI with proxy/validate subcommands | Medium | ✅ |
| RL6-02 | Example config + env var overrides | Medium | ✅ |
| RL6-03 | Structured logging | Low | ✅ |
### Loop R7 - Python Bindings
> PyO3 bindings and tests
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL7-01 | PyO3 crate setup + Python API | High | ✅ |
| RL7-02 | Python tests via maturin | Medium | ✅ |
### Loop R8 - Integration Test & Polish
> End-to-end proxy + docs
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL8-01 | Integration test with proxy + mock upstream | High | ✅ |
| RL8-02 | Top-level README, LICENSE | Low | ✅ |
## Phase 2: Major Features
### Loop 7 - Tool-Boundary Firewalling
> The "minimize & sanitize" approach - reported low ASR on paper benchmarks (scope-specific)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| AS-001 | Tool-Input Firewall (Minimizer) - heuristic minimizer; no LLM-based minimization | High | 🔄 |
| AS-002 | Tool-Output Firewall (Sanitizer) - heuristic sanitizer; no LLM-based parsing | High | 🔄 |
| AS-003 | Tool context awareness - tool context defined but not used in minimizer/sanitizer | Medium | 🔄 |
| AS-004 | ParseData - extract minimal required data from tool outputs (LLM-based parsing not implemented) | High | 🔄 |
| AS-005 | Format constraint validation - heuristic rules only (no schema-driven parsing) | Medium | 🔄 |
| AS-006 | CheckTool - detect tool-output-triggered tool calls (heuristic only) | High | 🔄 |
| AS-007 | Tool output sanitization against injection triggers (heuristic only) | High | 🔄 |
### Loop 8 - Model Ensemble Diversification
> Replace single-model reliance with multi-architecture ensemble
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| ML-002 | InjecGuard model integration | Medium | ✅ `10a2369` |
| ML-003 | Meta Prompt Guard 2 integration (86M + 22M) | Medium | ✅ `10a2369` |
| ML-006 | Multi-model ensemble voting with diverse architectures - InjecGuard wired as 3rd detector, majority voting replaces union merge | Medium | ✅ |
| ML-004 | PIGuard model integration | Medium | ✅ |
| ML-007 | Model hot-swapping without proxy restart | Medium | ⬜ |
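
The difference between the union merge and the majority voting mentioned under ML-006 can be sketched with plain boolean votes. This assumes each detector contributes one unweighted yes/no vote; the real ensemble may weight votes by confidence, and both function names are hypothetical.

```rust
/// Flags the input only when strictly more than half of the detectors flag it.
pub fn majority_vote(votes: &[bool]) -> bool {
    let flagged = votes.iter().filter(|&&v| v).count();
    flagged * 2 > votes.len()
}

/// Union merge for comparison: flags the input if any single detector flags it.
pub fn union_merge(votes: &[bool]) -> bool {
    votes.iter().any(|&v| v)
}
```

With three detectors, a lone positive vote flags the input under union merge but not under majority voting, which is how the change trades a little recall for fewer false positives.
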
### Loop 9 - Action-Selector Pattern Enforcement
> Provable security patterns at proxy level
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| AS-010 | Action-Selector pattern - enforce action allowlists at proxy level | Medium | ✅ `89ba304` |
| AS-012 | Context-Minimization - strip unnecessary context | Medium | ✅ `89ba304` |
| AS-011 | Plan-then-execute pattern detection | High | ⬜ |
| AS-014 | Plan compliance monitoring for declared security patterns | High | ⬜ |
| AS-013 | Dual LLM routing for trusted/untrusted data | High | ⬜ |
| AS-016 | Trust-based routing by data source | High | ⬜ |
### Loop 10 - Multi-Agent Defense Coordination
> Coordinator + Guard architecture - reported low ASR on paper benchmarks (scope-specific)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| AS-020 | Coordinator agent - pre-input classification (policy/heuristic pipeline only) | High | 🔄 |
| AS-021 | Guard agent - post-generation validation (policy/heuristic pipeline only) | High | 🔄 |
| AS-022 | Hierarchical coordinator pipeline (safe routing/refusal) | High | ⬜ |
| AS-023 | Second opinion pass for borderline cases (no true multi-agent LLM pass) | Medium | 🔄 |
| AS-024 | Policy store - centralised security rules (in-memory, not externalized) | Medium | 🔄 |
| AS-025 | Multi-step action correlation across requests | High | ✅ |
| AS-026 | Multi-turn persistence detection for gradual bypass attempts | High | ✅ |
### Loop 11 - MCP Protocol Monitoring
> First-mover in protocol-level security
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| AS-030 | MCP monitoring - detect manipulation and server-side attacks | High | ✅ `mcp_monitor.rs` |
| AS-035 | Toxic Agent Flow defense - GitHub MCP vulnerability (generic MCP scanning only) | Medium | 🔄 |
| AS-036 | ToolHijacker defense - tool selection manipulation (generic MCP scanning only) | High | 🔄 |
### Loop 12 - Advanced Prompt Injection Detection
> Synonym expansion, lemmatisation, P2SQL
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| IS-010 | Synonym expansion for attack patterns (manual synonym regex, not WordNet) | Medium | 🔄 |
| IS-011 | Lemmatisation before pattern matching (basic stemming, not true lemmatization) | Low | 🔄 |
| IS-012 | P2SQL injection detection (regex only, no structured SQL parsing) | Medium | 🔄 |
| IS-013 | Long-context jailbreak detection (position-aware sliding window) | High | ⬜ |
| IS-014 | Automated jailbreak defense (GPTFuzz-style genetic templates) | High | ⬜ |
| IS-016 | Multi-turn extraction detection (session-aware probing) | High | 🔄 |
| IS-040 | Data format coverage expansion (17 formats) | Medium | ⬜ |
| IS-041 | Multi-language trigger detection | High | ⬜ |
| IS-018 | "Important Messages" header attack hardening | Low | 🔄 |
| IS-050 | Perplexity-based anomaly detection for GCG-optimized strings in tool outputs | Medium | ⬜ |
| IS-051 | Adaptive monitoring scope (input-only vs hybrid) to control attack surface | Medium | ⬜ |
| IS-052 | Adversarial string propagation blocking in tool outputs (perplexity threshold) | High | ⬜ |
### Loop 12a - DMPI-PMHFE Architecture Alignment
> Resolve 6 architectural deviations between codebase and DMPI-PMHFE paper (arXiv 2506.06384). All 6 resolved (DMPI-001, DMPI-002, DMPI-003, DMPI-004, DMPI-005, DMPI-006).
> Loop 15 (Fusion Training Pipeline) is no longer blocked by DMPI deviations.
> Reference: `docs/research/dmpi-pmhfe-prompt-injection-detection.md`
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| DMPI-001 | **Average pooling instead of CLS token** - Implemented `masked_mean_pool()` in `ml_detector.rs`. Added `PoolingStrategy` enum (Cls/MeanPool), defaulting to MeanPool. BERT and DeBERTa paths both use attention-mask-aware average pooling over all non-padding tokens, matching paper spec. `DebertaV2ContextPooler` is now optional (only loaded for Cls strategy). Architecture doc: `docs/architecture/DMPI_001_AVERAGE_POOLING.md`. | Medium | :white_check_mark: |
| DMPI-002 | **2 FC layers instead of 3** - Removed `HIDDEN_2` and `fc3`; collapsed to `fc1(783->256)->ReLU->fc2(256->2)->SoftMax` matching paper spec. Input dim changes from 783 to 778 once DMPI-003 is also applied (768 + 10 = 778). Architecture doc: `docs/architecture/DMPI_002_TWO_FC_LAYERS.md`. | Medium | :white_check_mark: |
| DMPI-003 | **10 binary features instead of 15 mixed** - Replaced 15-dim vector (8 binary + 7 numeric) with 10 binary features matching paper Appendix A. Removed all numeric features. Added keyword-based detection for `is_ignore`, `is_format_manipulation`, `is_immoral`. Reordered to paper spec. Architecture doc: `docs/architecture/DMPI_003_TEN_BINARY_FEATURES.md`. | High | :white_check_mark: |
| DMPI-004 | **Repetition threshold >=3 instead of >10** - Named constant `REPETITION_THRESHOLD = 3`. Word-level and phrase-level conditions changed to `>= REPETITION_THRESHOLD`. Expanded `COMMON_WORDS` (+37 words) and added `COMMON_PHRASES` exclusion list (29 common English bigrams) to control false positives at the lower threshold. | Low | :white_check_mark: |
| DMPI-005 | **Missing paper features: is_immoral, is_ignore, is_format_manipulation** - All 3 missing features now implemented as keyword-in-text checks in `feature_extraction.rs`. `is_ignore` (index 0): ignore, reveal, disregard, forget, overlook, regardless. `is_format_manipulation` (index 4): encode, disguising, morse, binary, hexadecimal. `is_immoral` (index 7): hitting, amoral, immoral, deceit, irresponsible, offensive, violent, unethical, smack, fake, illegal, biased. Resolved as part of DMPI-003. Architecture doc: `docs/architecture/DMPI_003_TEN_BINARY_FEATURES.md`. | Medium | :white_check_mark: |
| DMPI-006 | **Feature naming alignment to paper convention** - All 8 finding types renamed to paper's `is_*` convention: `flattery_attack->is_incentive`, `urgency_attack->is_urgent`, `roleplay_attack->is_hypothetical`, `impersonation_attack->is_systemic`, `covert_attack->is_covert`, `excuse_attack->is_immoral`, `many_shot_attack->is_shot_attack`, `repetition_attack->is_repeated_token`. Updated in `lib.rs`, `feature_extraction.rs`, and documentation. | Low | ✅ |
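
To make the pooling change in DMPI-001 concrete, the following sketch averages token embeddings over non-padding positions using plain vectors. It is not the `masked_mean_pool()` implementation in `ml_detector.rs` (which operates on model tensors); the name `masked_mean_pool_sketch` and the `Vec`-based shapes are assumptions made for illustration.

```rust
/// Averages token embeddings over positions whose attention mask is non-zero.
/// `hidden` is [seq_len][hidden_dim]; `mask` is [seq_len] with 1 = real token
/// and 0 = padding. Returns None when shapes disagree or every token is masked.
pub fn masked_mean_pool_sketch(hidden: &[Vec<f32>], mask: &[u8]) -> Option<Vec<f32>> {
    if hidden.len() != mask.len() || hidden.is_empty() {
        return None;
    }
    let dim = hidden[0].len();
    let mut sum = vec![0.0f32; dim];
    let mut count = 0usize;
    for (row, &m) in hidden.iter().zip(mask) {
        if row.len() != dim {
            return None;
        }
        if m != 0 {
            // Accumulate only real (non-padding) tokens.
            for (acc, &x) in sum.iter_mut().zip(row) {
                *acc += x;
            }
            count += 1;
        }
    }
    if count == 0 {
        return None;
    }
    Some(sum.into_iter().map(|s| s / count as f32).collect())
}
```

Dividing by the number of unmasked tokens rather than the sequence length is what keeps padding from diluting the pooled vector.
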
### Loop 13 - Hallucination Detection Upgrade
> HaluGate-style token-level detection
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| OS-001 | Token-level hallucination detection (ModernBERT) | High | ⬜ |
| OS-002 | NLI explanation layer for flagged spans | High | ⬜ |
| OS-003 | ModernBERT sentinel pre-classifier | Medium | ⬜ |
| OS-004 | Tool-call result as ground truth for fact-checking | Medium | ⬜ |
| OS-005 | Semantic entropy-based detection | High | ⬜ |
| OS-006 | Citation validation | High | ⬜ |
| ML-005 | ModernBERT support (for token/sentinel classifiers) | High | ⬜ |
### Loop 14 - Content Safety Expansion
> Llama Guard integration, bias detection
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| OS-022 | Llama Guard 3 integration (14 harm categories) | Medium | ⬜ |
| OS-021 | Bias detection in responses | Medium | ⬜ |
| OS-020 | Constitutional classifiers for output moderation | High | ⬜ |
| OS-023 | Language detection for unexpected output switches | Low | ⬜ |
| OS-024 | Sentiment analysis for manipulative content | Low | ⬜ |
| OS-030 | CodeShield-style code security expansion | High | 🔄 |
| OS-031 | Semgrep rule integration for code outputs | High | ⬜ |
| OS-032 | Supply chain security in code (typosquatting, confusion) | High | ⬜ |
### Loop 15 - Fusion Training Pipeline
> Train the fusion classifier with real data
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| ML-001 | Joint end-to-end training for fusion FC layer | High | ✅ |
| ML-014 | Curated training dataset (61k benign + 16k injection) | Medium | ✅ |
| IS-001 | Token-wise bias detection for over-defense | High | ⬜ |
| IS-002 | Adaptive debiasing data generation (1–3 token combos) | High | ⬜ |
| IS-003 | MOF retraining pipeline on debiased data | High | ⬜ |
| ML-010 | MOF training pipeline (token bias → debiasing → retraining) | High | ⬜ |
| ML-011 | Data-centric augmentation across 17 formats | Medium | ⬜ |
| ML-015 | GradSafe integration | High | ⬜ |
| ML-016 | GCG adversarial sample generation (Python/PyTorch tooling; shared with EV-017) | High | ⬜ |
| ML-020 | ONNX runtime support for inference | Medium | ⬜ |
| ML-021 | INT8/INT4 quantized model loading | Medium | ⬜ |
| ML-022 | Batched inference for GPU utilization | Medium | ⬜ |
### Loop 16 - Benchmark Evaluation Suite
> Evaluate against all major benchmarks
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| EV-001 | AgentDojo evaluation (97 environments) | Medium | ⬜ (requires Python framework, not static dataset) |
| EV-003 | InjecAgent evaluation (2108 indirect injection samples) | Medium | ✅ |
| EV-004 | ASB evaluation (400 agent security attack samples) | Medium | ✅ |
| EV-005 | WASP evaluation | Medium | ⬜ (requires live web environment) |
| EV-006 | CyberSecEval 2 prompt injection evaluation (251 attack samples per DMPI-PMHFE [28]) | Medium | ✅ `7ce0cf9` |
| EV-007 | MLCommons AILuminate jailbreak benchmark (1200 demo prompts) | Medium | ✅ |
| EV-008 | HPI attack approximation (55 instances, 8-category taxonomy from arXiv:2509.14285) | Low | ✅ (best-effort 55-attack approximation) |
| EV-009 | Automated CI-integrated benchmark runner | Medium | ✅ `b15f4f0` |
| EV-011 | safeguard-v2 evaluation (2060 samples) | Low | ✅ |
| EV-012 | deepset-v2 evaluation (355 samples) | Low | ✅ |
| EV-013 | Ivanleomk-v2 evaluation (610 samples) | Low | ✅ |
| EV-014 | BIPIA evaluation (400 samples: 200 benign + 200 indirect injection, 3 scenarios) | Medium | ✅ |
| EV-015 | HarmBench evaluation (400 harmful behaviors, jailbreak/safety ASR) | Medium | ✅ |
| EV-016 | AgentDojo Slack suite adaptive attack evaluation (Agent-as-a-Proxy resilience, 89 samples) | High | ⬜ |
| EV-017 | Multi-objective GCG adversarial robustness red-team testing against LLMTrace ensemble | High | ⬜ |
| EV-018 | Cross-model transfer attack resistance testing across ensemble members | Medium | ✅ |
| EV-019 | Tensor Trust prompt hijacking/extraction evaluation (1000 sampled attacks) | Low | ✅ |
| EV-020 | Harelix mixed-techniques evaluation (1174 samples, tri-class) | Low | ❌ (dataset deleted from HuggingFace) |
| EV-021 | Jackhhao jailbreak-classification over-defense test (1306 samples, balanced) | Low | ✅ |
---
### Loop R9 - REST Query API
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL9-01 | Trace/span query endpoints + pagination | High | ✅ |
| RL9-02 | Security findings endpoint | Medium | ✅ |
| RL9-03 | API tests | Medium | ✅ |
### Loop R10 - LLM Provider Auto-Detection
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL10-01 | Provider detection by path/header/host | Medium | ✅ |
| RL10-02 | Provider-specific response parsing | Medium | ✅ |
| RL10-03 | Provider detection tests | Medium | ✅ |
### Loop R11 - Cost Estimation Engine
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL11-01 | Pricing table + estimate_cost API | Medium | ✅ |
| RL11-02 | Custom pricing config | Medium | ✅ |
| RL11-03 | Tests for pricing | Medium | ✅ |
### Loop R12 - Alert Engine (Webhooks)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL12-01 | Webhook alerting with thresholds + cooldown | Medium | ✅ |
| RL12-02 | Mock webhook tests | Medium | ✅ |
### Loop R13 - Tenant Management API
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL13-01 | Tenant CRUD endpoints + audit | High | ✅ |
| RL13-02 | Auto-create tenant on first request | Medium | ✅ |
| RL13-03 | API tests | Medium | ✅ |
### Loop R14 - ClickHouse TraceRepository
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL14-01 | ClickHouse TraceRepository implementation | High | ✅ |
| RL14-02 | Feature-gated ClickHouse tests | High | ✅ |
### Loop R15 - PostgreSQL MetadataRepository
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL15-01 | Postgres MetadataRepository + migrations | High | ✅ |
| RL15-02 | Postgres integration tests | High | ✅ |
### Loop R16 - Redis CacheLayer
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL16-01 | Redis CacheLayer implementation | Medium | ✅ |
| RL16-02 | Cache TTL and invalidation tests | Medium | ✅ |
### Loop R17 - Data Retention & Purging
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL17-01 | Retention policies + purge job | Medium | 🔄 |
| RL17-02 | Purge audit logging | Medium | ⬜ |
### Loop R18 - Agent Action Analysis
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL18-01 | AgentAction model + auto-parse tool calls | High | ✅ |
| RL18-02 | Actions reporting API + query filters | High | ✅ |
| RL18-03 | Action security analysis + storage | High | ✅ |
| RL18-04 | Python SDK action reporting | Medium | ✅ |
## Phase 3: Research Frontier
### Loop 17 - Multimodal Security
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| MM-001 | Image injection detection | High | ⬜ |
| MM-004 | OCR-based text extraction from images | Medium | ⬜ |
| MM-002 | Audio injection detection | High | ⬜ |
| MM-003 | Cross-modal consistency checking | High | ⬜ |
| MM-005 | Steganography detection (image/audio) | High | ⬜ |
| MM-006 | Video frame injection detection | High | ⬜ |
### Loop 18 - Protocol Security (A2A/ANP)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| AS-031 | A2A protocol security | High | ⬜ |
| AS-032 | ANP protocol security | High | ⬜ |
| AS-033 | Dynamic trust management | High | ⬜ |
| AS-034 | Inter-agent trust verification | High | ⬜ |
### Loop 19 - Streaming Content Monitor
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| OS-010 | Purpose-built partial-sequence detection models | High | ⬜ |
| OS-011 | Training-inference gap mitigation (partial sequence training) | High | ⬜ |
| OS-012 | Token-level harm annotations | High | ⬜ |
| OS-013 | Progressive confidence scoring | Medium | ⬜ |
### Loop 20 - Advanced Privacy
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| PR-001 | Membership inference defense | High | ⬜ |
| PR-002 | Data extraction prevention | High | ⬜ |
| PR-003 | Federated learning poisoning defense | High | ⬜ |
| PR-004 | Vector/embedding poisoning detection | High | ⬜ |
| PR-005 | RAG retrieval anomaly monitoring | Medium | ⬜ |
| PR-006 | Multi-language PII detection (non-Latin scripts) | High | 🔄 |
| PR-007 | Context-aware PII enhancement (lemma-based boosting) | Medium | 🔄 |
| PR-009 | Compliance mapping to GDPR/HIPAA/CCPA entities | Medium | 🔄 |
| PR-010 | Memory poisoning detection (MINJA) | High | ⬜ |
| PR-011 | Cross-session state integrity | High | ⬜ |
| PR-008 | Custom PII entity type plugins | Medium | ⬜ |
| PR-012 | Speculative side-channel defense | High | ⬜ |
### Loop 21 - Policy Language
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| SA-001 | Declarative policy specification (Colang/OPA-style) | High | ⬜ |
| SA-003 | Taint tracking | High | ⬜ |
| SA-004 | Blast radius reduction for tool access | Medium | ⬜ |
| SA-005 | Backdoor detection (prompt/parameter level) | High | ⬜ |
| SA-006 | Composite backdoor detection (CBA-style) | High | ⬜ |
| SA-007 | Data poisoning detection (PoisonedRAG) | High | ⬜ |
| SA-008 | Social engineering simulation defense | High | ⬜ |
| SA-009 | Contagious recursive blocking defense | High | ⬜ |
| SA-010 | GuardReasoner integration | High | ⬜ |
### Loop 22 - Adversarial ML Robustness
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| IS-024 | AML evasion resistance (TextFooler, BERT-Attack, BAE) - normalization only, no attack-specific defenses | High | 🔄 |
| IS-025 | Ensemble diversification against transferability - no transferability testing or training | High | 🔄 |
| IS-026 | Adversarial training integration (TextAttack samples) | High | ⬜ |
| IS-027 | Adaptive thresholding for evasion indicators | Medium | ⬜ |
| IS-028 | Multi-pass normalisation (aggressive + conservative + semantic-preserving) | Medium | 🔄 |
| ML-012 | Adversarial training on TextAttack samples | High | ⬜ (needs training pipeline) |
| ML-013 | Robust training with Unicode/character injection samples | High | ⬜ |
| IS-029 | Confidence calibration (Platt scaling) - temperature scaling only | Medium | 🔄 |
| IS-023 | Character smuggling variants (comprehensive unicode exploitation) | Medium | 🔄 |
| IS-030 | Word-importance transferability mitigation | High | ⬜ |
### Loop 23 - E2E Accuracy Optimization (Post Stress Test)
> After wiring OperatingPoint, threshold filtering, over-defence suppression, and score capping for single-detector findings, the E2E stress test reached **83.7% accuracy, 84.7% F1** on a 153-sample corpus (79 malicious, 74 benign) from 13+ benchmark datasets. The remaining 15 FPs and 10 FNs require ML-level fixes documented below.
> Reference: `docs/FEATURE_ROADMAP.md` section 3.4.4 for full analysis.
>
> **Review findings (2026-02-15, AI Engineer + MLOps Engineer):**
> - Combined ML-030 + ML-033 impact is NOT additive; realistic combined: -7 to -12 FPs.
> - ML-030 must precede ML-033 (calibrating before fine-tuning is wasted work).
> - ML-030 triggers ML-001 re-evaluation (fusion classifier needs re-validation after base model changes).
> - ML-033 supersedes IS-029 (Loop 22). IS-029 remains for temperature scaling only; ML-033 adds proper Platt scaling.
> - IS-060 elevated to P0 (4 FNs, largest single FN category, indirect injection is most dangerous for agent systems).
> - IS-070 elevated to P1 (shell injection in agent contexts is high-severity).
> - ML-034 elevated to P1 (encoding bypass is an active evasion vector).
> - Acceptance criterion for ALL items: full benchmark suite recall must not decrease by >1pp.
> - MLOps prerequisites (OPS-001 through OPS-008) must be addressed before deploying model changes.
> - Recommended execution order: ML-032 + ML-034 (patch evasion vectors) -> ML-033 (calibrate existing system) -> IS-070 (expand detection) -> ML-030 (model fine-tuning, highest risk last).
**Infrastructure already wired (this session):**
- `SecurityAnalysisConfig.operating_point` + `SecurityAnalysisConfig.over_defence` config fields
- `EnsembleSecurityAnalyzer::filter_by_thresholds()` applying per-category confidence gates
- `EnsembleSecurityAnalyzer::apply_over_defence()` suppressing auxiliary-only findings (no injection corroboration)
- Single-detector score cap at 60 (Medium) in `add_security_finding()`
- 3 new regex patterns: `roleplay_lets` (jailbreak), `authority_claim_update` (is_systemic), `disable_safety` (prompt_injection)
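
The single-detector score cap in the list above can be pictured as a simple clamp. This is a sketch under the assumption that scores are integers on a 0-100 scale and that the caller knows how many detectors corroborate a finding; it does not reproduce `add_security_finding()`, and the function and constant names are hypothetical.

```rust
/// Cap applied to findings backed by a single detector (60 = Medium).
const SINGLE_DETECTOR_SCORE_CAP: u8 = 60;

/// Clamps the score when at most one detector supports the finding;
/// corroborated findings keep their original score.
pub fn cap_single_detector_score(score: u8, corroborating_detectors: usize) -> u8 {
    if corroborating_detectors <= 1 {
        score.min(SINGLE_DETECTOR_SCORE_CAP)
    } else {
        score
    }
}
```
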
**Dependency chain:**
```
OPS-001..OPS-008 (prerequisites)
|
v
ML-032 + ML-034 (patch evasion vectors, low risk)
|
v
ML-033 (calibrate existing system, supersedes IS-029)
|
v
IS-060 + IS-070 (new detection capabilities)
|
v
ML-030 (fine-tune DeBERTa, highest risk)
|
v
ML-001 re-evaluation (fusion classifier re-validation)
|
v
ML-031 (multilingual calibration, depends on language detection infra)
```
**MLOps prerequisites (must complete before deploying ML changes):**
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| OPS-001 | **Externalize SecurityConfig to file/env** - All thresholds, operating points, and model identifiers loadable from config file or environment variables. Hardcoded values serve only as fallback defaults. Required before ML-033 (recalibrated thresholds currently require code change and rebuild). | Medium | ⬜ |
| OPS-002 | **Pin model revisions with SHA** - Use `hf_hub` revision/commit SHA parameters for all model downloads. Add SHA256 integrity verification of SafeTensors files. Current code downloads latest revision on cold start, risking silent behavior changes. | Low | ⬜ |
| OPS-003 | **Model inference metrics** - Expose per-model inference latency histograms, raw score distributions, classification outcome counters, ensemble agreement rate. Essential for validating model changes in production. | Medium | ⬜ |
| OPS-004 | **Model version manifest** - `models.toml` declaring model name, revision SHA, expected SafeTensors SHA256, deployment timestamp. Enables rollback by reverting manifest to previous version. | Low | ⬜ |
| OPS-005 | **CI regression test gate** - CI step that loads ensemble, runs fixed regression set (50-100 canonical examples), asserts no accuracy regression beyond threshold. Blocks merge on failure. | Medium | ⬜ |
| OPS-006 | **Training infrastructure for ML-030** - Define training environment (Python/PyTorch), GPU provisioning, SafeTensors export validation step ensuring layer-name compatibility with Candle loader. Define artifact registry for fine-tuned weights. | High | ⬜ |
| OPS-007 | **Expand calibration dataset to 1,000+ samples** - 153 stress test samples are insufficient for Platt scaling. Collect stratified samples across injection types. Keep the calibration holdout separate from the ML-030 training set (at least 30% of NotInject reserved for calibration). | Medium | ⬜ |
| OPS-008 | **Shadow-mode inference** - Run new model ensemble in parallel without affecting response path, log predictions for offline comparison. Required for safe validation of ML-030 fine-tuned model before production cutover. | High | ⬜ |
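
As an illustration of the OPS-001 fallback rule (externally supplied value first, hardcoded default last), here is a minimal sketch. The environment variable name `LLMTRACE_INJECTION_THRESHOLD`, the default value, and the function names are all hypothetical; the `[0, 1]` range check is an added assumption, not a stated requirement.

```rust
use std::env;

/// Hard-coded fallback, used only when no valid override is supplied.
/// The value 0.5 is a placeholder, not a calibrated threshold.
pub const DEFAULT_INJECTION_THRESHOLD: f64 = 0.5;

/// Parse an optional raw override, falling back to `default` when the value
/// is absent, unparsable, or outside [0, 1].
pub fn threshold_from(raw: Option<&str>, default: f64) -> f64 {
    raw.and_then(|s| s.trim().parse::<f64>().ok())
        .filter(|v| (0.0..=1.0).contains(v))
        .unwrap_or(default)
}

/// Read the threshold from a hypothetical environment variable.
pub fn injection_threshold_from_env() -> f64 {
    let raw = env::var("LLMTRACE_INJECTION_THRESHOLD").ok();
    threshold_from(raw.as_deref(), DEFAULT_INJECTION_THRESHOLD)
}
```

Keeping the parsing in `threshold_from` separate from the environment lookup lets the fallback behaviour be unit-tested without mutating process state.
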
**ML accuracy work items:**
| ID | Feature | Complexity | Priority | Status |
|----|---------|-----------|----------|--------|
| ML-032 | **Short-input confidence scaling** - For inputs < 10 tokens, scale the confidence threshold linearly from 0.95 (at 1 token) to the normal threshold (at 10 tokens). Do NOT bypass ML entirely: that would create blind spots for short attacks like "Ignore all previous instructions" (5 tokens). Estimated impact: -1 FP. | Low | P1 | ⬜ |
| ML-034 | **Encoding decoder preprocessor** - Before ML inference, apply decoding pipeline: base64, rot13, leetspeak, hex, binary, upside-down text, Cyrillic homoglyphs. Add content-type heuristic before decoding (skip base64 if string contains spaces/punctuation). Specify latency cap (5ms max). Must define integration plan with existing `jailbreak_detector.rs` encoding detection (augment, not replace). 7/11 encoding evasion test cases detected (64%); 4 misses are encoded payloads without plaintext injection markers. | Medium | P1 | ⬜ |
| ML-033 | **Confidence recalibration (Platt scaling)** - Apply Platt scaling (logistic regression) to recalibrate DeBERTa output probabilities. Supersedes IS-029 temperature scaling. Requires OPS-007 (1,000+ calibration samples). Calibration set MUST be disjoint from ML-030 training set. Specify per-model vs post-ensemble calibration. Re-derive operating point thresholds after calibration (current HighRecall/Balanced/HighPrecision values become invalid). Estimated impact: -2 to -4 FPs. Depends on: OPS-007. | Medium | P1 | ⬜ |
| IS-060 | **Spotlighting/datamarking for indirect injection** - Split input into instruction zones and data zones using configurable boundary markers. Apply injection detection only to data zones. Sub-tasks: (a) zone boundary detection heuristics for common data formats (HTML tables, email headers, CSV, JSON data fields), (b) config-declared boundary support, (c) ensemble integration (feed datamarking results into existing voting). Targets 4 BIPIA FNs (40% of all FNs). Reference: `docs/research/spotlighting-indirect-injection-defense.md` (datamarking reduces ASR from >50% to <3%). | High | P0 | ⬜ |
| IS-070 | **Shell command injection detection** - Detect dangerous shell commands (curl with exfiltration, python -c with socket, wget, reverse shell, rm -rf) in prompt content. Extend existing RL3-02 regex patterns (do not duplicate). Distinct from prompt injection; targets 2 FN code execution attacks. Critical for agent systems with tool-use capabilities. | Medium | P1 | ⬜ |
| ML-030 | **DeBERTa fine-tuning on NotInject dataset** - Fine-tune `protectai/deberta-v3-base-prompt-injection-v2` using 339 NotInject samples + 15 stress test FPs + 10-20 "creative writing instruction" samples as hard negatives. Mix with full training set (61k benign + 16k injection from ML-014) to prevent catastrophic forgetting. Training: 3 epochs, lr=2e-5, batch_size=16. Reserve 20% of NotInject for validation. Acceptance criteria: F1 >= 0.88 on held-out set, no per-class recall regression > 2%, full benchmark suite pass. Estimated impact: -5 to -10 FPs. Depends on: OPS-002, OPS-004, OPS-005, OPS-006. Triggers: ML-001 re-evaluation. | High | P0 | ⬜ |
| ML-031 | **Multilingual calibration** - Two sub-tasks: (a) add language detection to ensemble pipeline (e.g., `lingua-rs` or trigram detector), (b) calibrate per-language confidence thresholds using holdout set. Collect 1,000+ benign Chinese samples (traditional + simplified, technical/conversational/educational). Fine-tuning is a separate future item. Estimated impact: -2 FPs. Depends on: ML-030. | Medium | P2 | ⬜ |
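
The linear interpolation described for ML-032 can be sketched as follows. This is a minimal illustration, not the planned implementation: the function and constant names are hypothetical, an empty input is treated like a 1-token input, and a normal threshold already at or above 0.95 is returned unchanged (both are assumptions not stated in the item).

```rust
/// Inputs with at least this many tokens use the normal threshold.
pub const SHORT_INPUT_TOKENS: usize = 10;
/// Threshold applied to a 1-token input.
pub const SHORT_INPUT_MAX_THRESHOLD: f64 = 0.95;

/// Interpolate the confidence threshold between 0.95 (1 token) and
/// `normal_threshold` (10 tokens). ML inference is never skipped; only
/// the bar a short input must clear is raised.
pub fn short_input_threshold(token_count: usize, normal_threshold: f64) -> f64 {
    if token_count >= SHORT_INPUT_TOKENS || normal_threshold >= SHORT_INPUT_MAX_THRESHOLD {
        return normal_threshold;
    }
    // Assumption: treat an empty input like a 1-token input.
    let n = token_count.max(1) as f64;
    // t is 0.0 at 1 token and 1.0 at 10 tokens.
    let t = (n - 1.0) / (SHORT_INPUT_TOKENS as f64 - 1.0);
    SHORT_INPUT_MAX_THRESHOLD + t * (normal_threshold - SHORT_INPUT_MAX_THRESHOLD)
}
```

With a normal threshold of 0.5, a 5-token input such as "Ignore all previous instructions" would need a confidence of about 0.75 to be flagged, so a high-confidence short attack is still caught.
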
---
### Loop R19 - ML Prompt Injection Detection (Candle)
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL19-01 | Candle ML detector + ensemble integration | High | ✅ |
| RL19-02 | ML config wiring + fallback | Medium | ✅ |
| RL19-03 | Benchmark + tests | Medium | 🔄 |
### Loop R20 - OpenTelemetry Ingestion Gateway
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL20-01 | OTLP/HTTP endpoint + mapping | High | ✅ |
| RL20-02 | OTEL ingestion tests | Medium | ✅ |
### Loop R21 - Web Dashboard
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL21-01 | Next.js dashboard scaffolding + pages | High | ✅ |
| RL21-02 | API client + charts + Docker | High | ✅ |
### Loop R22 - CI/CD Pipeline
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL22-01 | CI workflow (fmt/clippy/test) | Medium | ✅ |
| RL22-02 | Release workflow + image scan | Medium | ✅ |
### Loop R23 - RBAC & Auth
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL23-01 | API keys + role enforcement | High | ✅ |
| RL23-02 | Tenant isolation | High | ✅ |
### Loop R24 - Compliance Reporting
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL24-01 | Report generator + API | High | ✅ |
| RL24-02 | Optional PDF export | Medium | ⬜ |
### Loop R25 - gRPC Ingestion Gateway
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL25-01 | gRPC ingestion server + proto | High | ✅ |
| RL25-02 | Streaming ingestion support | High | ✅ |
### Loop R26 - Kubernetes Operator + Helm
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL26-01 | Helm chart + deployment docs | High | ✅ |
| RL26-02 | Optional CRD operator | High | ⬜ |
### Loop R27 - WASM Bindings
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL27-01 | wasm-bindgen crate + JS API | Medium | ✅ |
| RL27-02 | WASM tests | Medium | ✅ |
### Loop R28 - Node.js Bindings
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL28-01 | napi-rs bindings + TS types | Medium | ✅ |
| RL28-02 | Node tests | Medium | ✅ |
### Loop R29 - Statistical Anomaly Detection
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL29-01 | Anomaly detector + config | High | ✅ |
| RL29-02 | Alert integration + tests | High | ✅ |
### Loop R30 - Real-time Streaming Security Analysis
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL30-01 | Streaming incremental analysis | High | ✅ |
| RL30-02 | Mid-stream alerting tests | High | ✅ |
### Loop R31 - Expanded PII Detection
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL31-01 | International PII patterns + suppression | High | ✅ |
| RL31-02 | PII redaction modes + tests | High | ✅ |
### Loop R32 - ML PII via NER
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL32-01 | NER model integration + ensemble | High | ✅ |
| RL32-02 | NER tests | Medium | ✅ |
### Loop R33 - ML Inference Monitoring + Warm-up
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL33-01 | Inference timing + preload | Medium | ✅ |
| RL33-02 | Warm-up tests | Medium | ✅ |
### Loop R34 - Multi-Channel Alerting
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL34-01 | Slack (Block Kit) + PagerDuty (Events API v2) done; Email channel TODO | High | 🔄 |
| RL34-02 | Deduplication done; escalation stub only (no full escalation policy engine) | High | 🔄 |
### Loop R35 - Externalize Pricing + OWASP Tests
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL35-01 | Pricing config externalization | Medium | ✅ |
| RL35-02 | OWASP LLM Top 10 test suite | High | ✅ |
### Loop R36 - Graceful Shutdown + Signal Handling
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL36-01 | SIGTERM/SIGINT handling + task drain | High | ✅ |
| RL36-02 | Shutdown tests | Medium | ✅ |
### Loop R37 - Prometheus Metrics Endpoint
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL37-01 | Metrics endpoint + instrumentation | High | ✅ |
| RL37-02 | Metrics tests | Medium | ✅ |
### Loop R38 - Database Migration Management
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL38-01 | Migration tooling + CLI | High | ✅ |
| RL38-02 | Migration tests | Medium | ✅ |
### Loop R39 - Secrets Hardening + Startup Probe
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL39-01 | Secrets hardening + startup probe | Medium | ✅ |
### Loop R40 - Integration Tests in CI + Container Scanning
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL40-01 | Compose-based integration tests in CI | High | ✅ |
| RL40-02 | Container scanning in release | Medium | ✅ |
### Loop R41 - Per-tenant Rate Limiting + Compliance Persistence
| ID | Feature | Complexity | Status |
|----|---------|-----------|--------|
| RL41-01 | Tenant rate limiting middleware | High | ✅ |
| RL41-02 | Compliance report persistence + API | High | ✅ |
## Quality Gates (enforced on every loop)
1. **cargo fmt --all --check** - zero diffs
2. **cargo clippy --workspace -- -D warnings** - zero warnings
3. **cargo test --workspace** - zero failures (pre-existing failures must be fixed)
4. **Lead engineer review** - diff reviewed before commit
5. **CI green** - verified after push
## Notes
- IS-007 (Configurable operating points) completed in R8 commit `41e219b`. Fully wired to proxy config and ensemble in 2026-02-15 session: `SecurityAnalysisConfig.operating_point` field drives `EnsembleSecurityAnalyzer::with_operating_point()`, `filter_by_thresholds()` applies per-category confidence gates, `over_defence` flag enables auxiliary-only suppression.
- R11 (code_security module) completed in commit `b08dccc`, tests fixed in `aa9ab98`
- Each loop targets a coherent feature set that can be tested independently
- Phase 1 focuses on closing critical 100% ASR gaps and establishing evaluation baseline
- RALPH quality policy: no placeholders/mocks; if spec requires ML, implement real ML inference (regex fallback only when model weights unavailable).
- AS-004/AS-006/AS-007 are 🔄 because literature expects LLM-based parsing/sanitization for tool outputs; current implementation is heuristic only.
- AS-020/AS-021/AS-023/AS-024 are 🔄 because literature expects multi-agent LLM coordination; current implementation is heuristic/policy-only.
- IS-024/IS-027/IS-028/IS-029 are 🔄 because only normalization/temperature scaling exists (no attack-specific defenses or Platt scaling). ML-033 (Loop 23) supersedes IS-029 for Platt scaling; IS-029 remains for temperature-scaling-only scope.
- PR-006 is 🔄 because full non-Latin PII coverage and a custom-entity plugin architecture are not fully implemented.
- Tool parsing expectations come from `docs/research/defense-tool-result-parsing.md` and `docs/research/indirect-injection-firewalls.md`.
- Multi-agent expectations come from `docs/research/multi-agent-defense-pipeline.md`.
- Adversarial robustness expectations come from `docs/research/bypassing-llm-guardrails-evasion.md`.
- Over-defense mitigation expectations come from `docs/research/injecguard-over-defense-mitigation.md`.
- Benchmark coverage expectations come from `docs/research/benchmarks-and-tools-landscape.md` and `docs/research/wasp-web-agent-security-benchmark.md`.
- CyberSecEval 2 benchmark expectations (EV-006) come from `docs/research/cyberseceval2-llm-security-benchmark.md`. The 251 attack sample count is sourced from DMPI-PMHFE (arXiv 2506.06384) which used the CyberSecEval 2 prompt injection dataset; the full paper covers additional suites (500 code interpreter abuse prompts, exploit generation, FRR).
- BIPIA benchmark expectations (EV-014) come from `docs/research/bipia-indirect-prompt-injection-benchmark.md`. First indirect prompt injection benchmark (KDD 2025, arXiv 2312.14197): 86,250 test prompts, 50 attack types, 25-model baseline. Boundary token defense (`<data>`/`</data>`) is most impactful intervention (1064% ASR increase without it) and is implementable at proxy level (relevant to AS-001/AS-002).
- Agent-as-a-Proxy attack implications (EV-016) come from `docs/research/agent-as-a-proxy-attacks.md`. Monitoring-based defenses (including LLMTrace proxy monitoring) are fundamentally fragile: 90%+ ASR via GCG-optimized adversarial strings. Validates that structural defenses (AS-001/AS-002 sanitization, boundary tokens) are more robust than observation-based monitoring. High-perplexity detection in tool outputs is a viable countermeasure.
- IS-050 -> IS-052 dependency: IS-052 (adversarial string propagation blocking) depends on IS-050 (perplexity-based anomaly detection) for surprisal scoring. IS-050 must be implemented first. IS-052 runs before AS-002 in the tool-output sanitization pipeline.
- IS-050 -> IS-051 implicit dependency: IS-051 (adaptive monitoring scope) auto-switches to input-only mode when IS-050 detects sustained high-perplexity anomalies in tool outputs (suggests active adaptive attack). IS-050 must be implemented first for auto-switching; manual override works independently.
- ML-016 and EV-017 share GCG Python/PyTorch offline tooling (`tools/gcg/` or `scripts/adversarial/`). Not part of the Rust proxy runtime.
- EV-016 and EV-001 share AgentDojo benchmark infrastructure. EV-016 focuses on Slack suite (89 samples) with adaptive (GCG) attacks; EV-001 covers the full 97 environments.
- EV-018 depends on ML-006 (ensemble must be wired before transfer resistance can be tested).
- ML-016 (GCG adversarial sample generation) is in Loop 15 (Fusion Training Pipeline). Requires Python/PyTorch offline tooling, not Rust proxy code. Shared with EV-017.
- Token-level perplexity detection expectations (IS-050) come from `docs/research/token-level-perplexity-detection.md`. PGM-based per-token detection with GPT-2 124M (CPU-only, <1GB) achieves perfect sequence-level detection and 0.93+ token-level F1. O(n) DP algorithm. Core implementation reference for IS-050.
- Perplexity-based attack detection expectations (IS-050) come from `docs/research/perplexity-based-attack-detection.md`. Two-feature LightGBM (PPL + token length) achieves 99.1% F2 on GCG attacks. GCG mean PPL 3525 vs benign ~30-45. Perplexity alone is insufficient (false positives on code/non-English); token length as second feature resolves this.
- Task Shield alignment expectations (ML-016) come from `docs/research/task-shield-alignment-defense.md`. Task-alignment defense ("does this serve the user?") achieves 2.07% ASR with 69.79% utility on GPT-4o. ContributesTo scoring at message boundaries. Directly informs ML-016 goal-drift detector design; provides EV-016 baseline comparison targets.
- Spotlighting expectations (IS-004, AS-001/AS-002) come from `docs/research/spotlighting-indirect-injection-defense.md`. Datamarking reduces ASR from >50% to <3% with zero NLP quality impact. Dynamic/randomized tokens essential. Encoding (base64) achieves 0% ASR but requires GPT-4-class models. Validates and extends boundary tag approach.
- Instruction hierarchy expectations (IS-004, SA-003) come from `docs/research/instruction-hierarchy-defense.md`. Privilege hierarchy (system > user > tool) via SFT+RLHF. +63.1 pp on system message extraction defense. Validates proxy-level boundary tags as complement to model-level hierarchy. Over-refusal is main trade-off (-22.7 pp).
- DMPI-001–DMPI-006 (Loop 12a) were prerequisites for ML-001 (Loop 15). All 6 deviations are now resolved; the fusion classifier architecture matches the DMPI-PMHFE specification. See `docs/research/dmpi-pmhfe-prompt-injection-detection.md` for the authoritative paper breakdown.
- DMPI-003 and DMPI-005 resolved together: feature vector is now 10 binary dimensions matching paper Appendix A. See `docs/architecture/DMPI_003_TEN_BINARY_FEATURES.md`.
- DMPI-006 (naming) resolved: all 8 finding types renamed to paper's `is_*` convention.
- EV-002 is ✅ because the NotInject dataset is 339 samples with equal difficulty tiers (113/113/113).
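
The perplexity notes for IS-050 above rely on sequence perplexity, PPL = exp(-(1/N) Σ ln p_i), paired with token length as a second feature. A minimal sketch of computing that feature pair follows, assuming per-token natural-log probabilities are already available from a language model. The function names are hypothetical, and the downstream classifier is out of scope here.

```rust
/// Perplexity of a sequence from per-token natural-log probabilities.
/// Returns `None` for an empty sequence, where perplexity is undefined.
pub fn perplexity(token_logprobs: &[f64]) -> Option<f64> {
    if token_logprobs.is_empty() {
        return None;
    }
    // Mean negative log-likelihood per token.
    let mean_nll = -token_logprobs.iter().sum::<f64>() / token_logprobs.len() as f64;
    Some(mean_nll.exp())
}

/// The two features named in the notes: (perplexity, token length).
pub fn ppl_features(token_logprobs: &[f64]) -> Option<(f64, usize)> {
    perplexity(token_logprobs).map(|ppl| (ppl, token_logprobs.len()))
}
```

A sequence in which every token has probability 0.25 has perplexity 4, which makes a convenient sanity check before wiring in real model outputs.
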