---
description: How Hermes Agent runs the entire local AI inference stack on Intel Arc GPU, automates research pipelines, and coordinates multi-model switching — all from a terminal.
tags:
- hermesagentchallenge
- ai
- agents
- productivity
---
*This is a submission for the [Hermes Agent Challenge](https://dev.to/challenges/hermes-agent-2026-05-15): Write About Hermes Agent*

---
## What I Built
A self-managing AI workspace powered by [Hermes Agent](https://hermes-agent.nousresearch.com): an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.
**Hardware:** GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.
The system manages:
- **Local LLM inference** via llama.cpp on Intel Arc SYCL (iGPU)
- **Automated research pipelines** feeding structured docs into a persistent vault
- **Multi-model testing and benchmarking** — 9+ models across 9B to 35B parameters
- **Cron-driven monitoring** — market data, system health, memory management
- **Self-maintaining skills** — the agent updates its own skills and docs when things change
---
## Architecture
```plaintext
[ User Goals ]
│
▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
│ ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx) ← daily driver
│ ├── Qwen3-Coder-30B-A3B (65K ctx) ← coding specialist
│ ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx) ← reasoning
│ └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx) ← stable but reasoning-only
│
├── research-vault/ (research & docs)
└── hermes-config/ (skills, plugins, cron jobs)
```
The agent runs as a Hermes session with:
- **Persistent memory** — notes about the environment, user preferences, tool quirks, project conventions
- **Durable skills** — 40+ specialized procedures for devops, mlops, research, etc.
- **Toolsets** — terminal, browser, file, cron, git, and more
- **Full system access** — builds, debugs, tunes, and documents everything autonomously
### GMKtec EVO-T1 Hardware
The host is a **GMKtec EVO-T1** mini-PC:
- **CPU:** Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- **iGPU:** Intel Arc 140T (8 Xe-cores, 128 vector engines, shares system DDR5 as VRAM)
- **RAM:** 64GB DDR5-5600 (~58GB addressable by GPU)
- **Power:** ~45W sustained under full load
- **Form factor:** ~0.6L, pocketable
The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. Reaching that context size required a build-level SYCL fix: removing the `-ze-intel-greater-than-4GB-buffer-required` linker flag (a Level Zero option, not a kernel setting) and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`. Without the fix, JIT compilation crashed at large context sizes. The agent diagnosed the problem and applied the change.

---
## How It Was Built
All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.
### Step 1: Local Inference Server (llama.cpp on Intel Arc)
The agent built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.
The critical subtlety: different models need different context sizes. `CTX_SIZE` must be set per model, not globally. A 9B coder model gets 130K; a 27B model gets 65K. The agent handles this via model-specific startup configs.
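
The per-model sizing described above can be sketched as a small lookup placed in front of `llama-server`. This is an illustration, not the agent's actual startup config: the aliases (`sushi`, `dsflash`, `coder30b`, `reason35b`), the `MODEL_DIR` layout, the 8192 default, and the port are all assumptions. The token counts are the power-of-two values nearest the 130K and 65K figures, and `server_cmd` only prints the command so it can be inspected before running.

```bash
#!/usr/bin/env bash
# Hypothetical aliases: map each one to the context size it can afford.
ctx_for_model() {
  case "$1" in
    sushi|dsflash)      echo 131072 ;;  # 9B models: ~130K context
    coder30b|reason35b) echo 65536  ;;  # 30B+ models: ~65K context
    *)                  echo 8192   ;;  # assumed conservative default
  esac
}

# Print the llama-server command line for one model alias.
# -m, -c, -ngl, --host and --port are standard llama-server options.
# Assumed layout: $MODEL_DIR/<alias>.gguf
server_cmd() (
  model_alias=$1
  model_dir=${MODEL_DIR:-./models}
  printf 'llama-server -m %s/%s.gguf -c %s -ngl 99 --host 127.0.0.1 --port 8080\n' \
    "$model_dir" "$model_alias" "$(ctx_for_model "$model_alias")"
)
```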
**Major SYCL fix:** The SYCL backend had a critical bug — the `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device when any operation fell back from GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict to GPU-only eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.
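
A minimal sketch of the two changes, assuming the flag appears as plain text in the CMake file. The helper names are hypothetical and the file path is passed in by the caller rather than guessed. Review the resulting diff before rebuilding, since the flag may be paired with a pass-through option on the same line that also needs removing.

```bash
#!/usr/bin/env bash
FLAG='-ze-intel-greater-than-4GB-buffer-required'

# Remove every occurrence of the flag (and the whitespace before it)
# from the CMake file given as $1, keeping the file's permissions.
strip_4gb_flag() (
  file=$1
  tmp=$(mktemp) || exit 1
  sed "s/[[:space:]]*${FLAG}//g" "$file" > "$tmp" || exit 1
  cat "$tmp" > "$file"
  rm -f "$tmp"
)

# Restrict SYCL to the Level Zero GPU so nothing falls back to the CPU device.
# Call this in the shell that later launches llama-server.
export_gpu_only() {
  export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
}
```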
### Step 2: Hermes Agent Configuration
Configured Hermes with:
- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions
### Step 3: Cron Jobs for Automation
The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:
- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)
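
As an illustration of what one scheduled check might do, the sketch below counts failed SSH logins per source IP. It is not the agent's actual job: the function name is hypothetical, the log format is assumed to be standard OpenSSH `Failed password ... from <ip>` lines (on Ubuntu these usually land in `/var/log/auth.log`), and the default threshold of 3 is an assumption. Example call: `failed_ssh_ips /var/log/auth.log 3`.

```bash
#!/usr/bin/env bash
# Print "count ip" for every source IP with at least $2 failed SSH logins
# in the log file $1. The threshold defaults to 3.
failed_ssh_ips() {
  awk -v min="${2:-3}" '
    /Failed password/ {
      # The source address is the field right after the word "from".
      for (i = 1; i < NF; i++) if ($i == "from") hits[$(i + 1)]++
    }
    END { for (ip in hits) if (hits[ip] >= min) print hits[ip], ip }
  ' "$1" | sort -rn
}
```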
### Step 4: Research Pipeline (research vault)
The agent does autonomous research and documents findings in a structured vault:
```plaintext
research-vault/
├── challenges/ # Dev challenge research, compatibility patches
├── research/ # Hardware, model, compatibility research
├── blogs/ # Technical blog articles
└── study/ # Learning notes, tutorials
```
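
The layout above could be created and kept under version control with a few helpers. This is a sketch under assumptions: the function names, the `VAULT` variable, and the dated file-naming scheme are hypothetical, and `vault_commit` expects the vault to already be a git repository with a configured remote.

```bash
#!/usr/bin/env bash
# Create the four top-level vault directories under $1 (default: research-vault).
vault_init() (
  root=${1:-research-vault}
  for dir in challenges research blogs study; do
    mkdir -p "$root/$dir" || exit 1
  done
)

# Build a path for a new dated note: <vault>/<section>/<YYYY-MM-DD>-<slug>.md
note_path() {
  printf '%s/%s/%s-%s.md\n' "${VAULT:-research-vault}" "$1" "$(date +%F)" "$2"
}

# Stage, commit, and push everything in the vault with the message in $1.
vault_commit() {
  git -C "${VAULT:-research-vault}" add -A &&
  git -C "${VAULT:-research-vault}" commit -m "$1" &&
  git -C "${VAULT:-research-vault}" push
}
```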
---
## Model Lineup
The system coordinates multiple GGUF models depending on task type:
| Model | Architecture | Params | Context | Quant | Role | Notes |
|-------|-------------|--------|---------|-------|------|-------|
| **Qwen3.5-9B-Sushi-Coder-RL** | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| **Qwen3-Coder-30B-A3B** | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| **Qwen3.6-35B-UD-IQ4_NL** | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| **Qwen3.5-9B-DeepSeek-V4-Flash** | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| **Qwopus3.5-9B-Coder-MTP** | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |
### Why These Models
- **Sushi 9B** is the only production-viable 9B model for agentic work on this hardware — passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, retained multi-turn context correctly
- **Coder 30B** is a MoE model (30B total, 3B active parameters) so decode is fast despite the large parameter count — 11.52 t/s decode vs 8.24 t/s for the 9B model
- **DS-V4-Flash** is useful for quick reasoning tasks where you don't need structured output — 190 t/s prefill makes it fast for short prompts
- **27B class models** fill the gap between 9B and 35B — reasonable quality without the VRAM overhead of the larger model in the shared memory pool
---
## Agentic Benchmark Results
Each 9B model ran the same six-test agentic suite at 131K context:
| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|-------|-----------|----------|------------|------------|---------|
| **Sushi 9B** | 6/6 | 0 | Yes (3/3) | 561s | Best |
| **DS-V4-Flash** | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| **Qwopus MTP** | 2/6 | 4 | No (0/3) | 256s | Broken |
### Key Findings
**Sushi 9B (production daily driver):**
- Only model to pass all 6 agentic tests and also produce valid JSON (DS-V4-Flash passed 6/6 but failed JSON validation)
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, leaving ~48GB of headroom in the 58GB shared pool, so no OOM risk)
- Best instruction following (10 constraints, 4 paragraphs)
**Qwopus MTP (deprecated):**
- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination — corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge — not fixable by configuration
**DS-V4-Flash (secondary):**
- Stable, but all output is in reasoning_content only (content field empty)
- Coherent reasoning but cannot produce valid structured JSON in content
- Fast prefill (190 t/s) but 8.24 t/s decode
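
The difference between the Sushi and DS-V4-Flash results can be checked mechanically on a chat-completion response. The sketch below is not the benchmark's actual scoring code: it assumes `jq` is installed, that the response follows the OpenAI-style `choices[0].message` shape with the `content` and `reasoning_content` fields mentioned above, and that "valid JSON" means the content parses as an object or array. The function name is hypothetical. Pipe a saved response into it, for example `classify_reply < response.json`.

```bash
#!/usr/bin/env bash
# Read one chat-completion response on stdin and print one of:
# valid-json, text-only, reasoning-only, empty.
classify_reply() (
  reply=$(cat)
  content=$(printf '%s' "$reply" | jq -r '.choices[0].message.content // ""')
  reasoning=$(printf '%s' "$reply" | jq -r '.choices[0].message.reasoning_content // ""')
  if [ -n "$content" ] &&
     printf '%s' "$content" | jq -e 'type == "object" or type == "array"' >/dev/null 2>&1; then
    echo valid-json       # content parses as a JSON object or array
  elif [ -n "$content" ]; then
    echo text-only        # content present but not structured JSON
  elif [ -n "$reasoning" ]; then
    echo reasoning-only   # the failure mode described for DS-V4-Flash
  else
    echo empty
  fi
)
```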
### Technical Decisions Validated
1. **Local-first, cloud-fallback**: All inference runs locally by default. Cloud is used only for models not running locally.
2. **Per-model context sizing**: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
3. **Skills over prompting**: Every recurring workflow is encoded as a skill file. The system maintains itself.
4. **Git-backed vault**: All research auto-commits to GitHub. The workspace is the artifact.
5. **Automated security monitoring**: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord — the workspace defends itself.
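
The local-first decision can be sketched as a health probe. This is an assumption about how such a fallback could work, not Hermes Agent's own provider logic: `pick_base_url`, `LOCAL_URL`, and `CLOUD_URL` are hypothetical names, the probe uses llama-server's `/health` endpoint, and the cloud default is OpenRouter's public API base URL.

```bash
#!/usr/bin/env bash
LOCAL_URL=${LOCAL_URL:-http://localhost:8080}
CLOUD_URL=${CLOUD_URL:-https://openrouter.ai/api/v1}

# Print the base URL to use: the local server if it answers its health
# check within two seconds, otherwise the cloud provider.
pick_base_url() {
  if curl -sf --max-time 2 "$LOCAL_URL/health" >/dev/null 2>&1; then
    echo "$LOCAL_URL/v1"
  else
    echo "$CLOUD_URL"
  fi
}
```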
---
## Security Infrastructure
The server runs automated security monitoring set up by Hermes Agent:
- **UFW firewall** — default deny incoming, SSH only from LAN + Tailscale
- **fail2ban** — auto-ban after 3 failed SSH attempts
- **Cron: security-monitor** — every 30 min, checks brute-force, new devices, firewall, services, gateway
- **Cron: vulnerability-feed-monitor** — every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- **Discord alerts** — CRITICAL and HIGH severity findings posted automatically
- **Pentest tools** — nmap, masscan, tcpdump, arp-scan, netcat, wireshark
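
The firewall policy above maps onto a handful of `ufw` commands. The sketch below prints them by default so they can be reviewed first; set `DRY_RUN=0` to execute. The LAN range is passed in by the caller, `tailscale0` is assumed as the Tailscale interface name, and the helper names are hypothetical. The three-attempt fail2ban limit would correspond to `maxretry = 3` in the `[sshd]` jail.

```bash
#!/usr/bin/env bash
# Echo commands by default; set DRY_RUN=0 to run them (needs sudo).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Usage: harden_ssh <lan-cidr> [tailscale-interface]
harden_ssh() (
  lan_cidr=$1              # e.g. 192.168.1.0/24, supplied by the caller
  ts_if=${2:-tailscale0}   # assumed Tailscale interface name
  run sudo ufw default deny incoming
  run sudo ufw allow from "$lan_cidr" to any port 22 proto tcp
  run sudo ufw allow in on "$ts_if" to any port 22 proto tcp
  run sudo ufw enable
)
```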
---
## Key Numbers
- **58GB** shared VRAM on Intel Arc 140T
- **130K** context window (Sushi 9B)
- **9.7GB** total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- **48GB** VRAM headroom at 130K ctx
- **8.24 t/s** decode speed (Sushi 9B)
- **166 t/s** prefill speed (Sushi 9B)
- **190 t/s** prefill speed (DS-V4-Flash)
- **~36-37s** per generation turn (Sushi 9B at 256 max_tokens)
- **0** HTTP 500 errors across 6 agentic tests (Sushi 9B)
- **9+** GGUF models tested (9B through 35B parameters)
- **6+ months** of continuous local inference development by Hermes Agent
- **Automated security monitoring** — log analysis, intrusion detection, CVE feed monitoring, Discord alerts
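
Two of these figures can be cross-checked with simple arithmetic. Assuming a turn generates the full 256 tokens, decoding alone takes about 31 s at 8.24 t/s, which leaves roughly 5 to 6 s of the ~36-37 s turn for prefill and overhead. The headroom figure is the shared pool minus the 9B footprint. The helper names are hypothetical.

```bash
#!/usr/bin/env bash
# One-decimal division and subtraction via awk (shell arithmetic is integer-only).
div()   { LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }
minus() { LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a - b }'; }

decode_s=$(div 256 8.24)    # seconds to decode 256 tokens at 8.24 t/s
headroom=$(minus 58 9.7)    # GB left in the 58GB pool after a 9B model loads
echo "decode: ${decode_s}s, headroom: ${headroom}GB"
```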
---
## Demo / How to Replicate
The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.
Minimal setup:
```bash
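# 1. Clone and build llama.cpp with SYCL (requires the Intel oneAPI Base Toolkit)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
source /opt/intel/oneapi/setvars.sh   # default oneAPI install path; adjust if different
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j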
# 2. Install Hermes Agent
pip install hermes-agent
# 3. Configure local server
hermes config set providers.local.base_url http://localhost:8080/v1
# 4. Download and add your first model
# (example: Qwen3.5-9B at Q4_K_M quantization)
hermes models add --alias coder --path ./models/your-model.gguf --context-size 131072
```
---
*All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed the goals and validated the results. The agent executed every step, from build-flag surgery to final documentation.*