How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU

---
description: How Hermes Agent runs the entire local AI inference stack on an Intel Arc GPU, automates research pipelines, and coordinates multi-model switching, all from a terminal.
tags:
  - hermesagentchallenge
  - ai
  - agents
  - productivity
---

*This is a submission for the [Hermes Agent Challenge](https://dev.to/challenges/hermes-agent-2026-05-15): Write About Hermes Agent*

---

## What I Built

A self-managing AI workspace powered by [Hermes Agent](https://hermes-agent.nousresearch.com): an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.

**Hardware:** GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600), a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

- **Local LLM inference** via llama.cpp on Intel Arc SYCL (iGPU)
- **Automated research pipelines** feeding structured docs into a persistent vault
- **Multi-model testing and benchmarking**: 9+ models across 9B to 35B parameters
- **Cron-driven monitoring**: market data, system health, memory management
- **Self-maintaining skills**: the agent updates its own skills and docs when things change

---

## Architecture

```plaintext
[ User Goals ]
      │
      ▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
      │                 ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx)      ← daily driver
      │                 ├── Qwen3-Coder-30B-A3B (65K ctx)             ← coding specialist
      │                 ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx)           ← reasoning
      │                 └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx)   ← stable but reasoning-only
      │
      ├── research-vault/   (research & docs)
      └── hermes-config/    (skills, plugins, cron jobs)
```

The agent runs as a Hermes session with:

- **Persistent memory**: notes about the environment, user preferences, tool quirks, project conventions
- **Durable skills**: 40+ specialized procedures for devops, mlops, research, etc.
- **Toolsets**: terminal, browser, file, cron, git, and more
- **Full system access**: builds, debugs, tunes, and documents everything autonomously

### GMKtec EVO-T1 Hardware

The host is a **GMKtec EVO-T1** mini-PC:

- **CPU:** Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- **iGPU:** Intel Arc 140T (8 Xe cores / 128 vector engines, shares system DDR5 as VRAM)
- **RAM:** 64GB DDR5-5600 (~58GB addressable by GPU)
- **Power:** ~45W sustained under full load
- **Form factor:** ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical SYCL build fix (removing the `-ze-intel-greater-than-4GB-buffer-required` linker flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`) was required to prevent JIT compilation crashes at large context sizes. The agent diagnosed and applied it.

---

## How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

### Step 1: Local Inference Server (llama.cpp on Intel Arc)

Built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.

The critical subtlety: different models need different context sizes. `CTX_SIZE` must be set per model, not globally. A 9B coder model gets 130K; a 27B model gets 65K. The agent handles this via model-specific startup configs.

**Major SYCL fix:** The SYCL backend had a critical bug. The `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict execution to the GPU eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.

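
The per-model context sizing and the GPU-only device selector from Step 1 could be wired together roughly as follows. This is a minimal sketch and not the agent's actual startup config: the function names `ctx_size_for` and `launch_model`, the name-based matching, the `DRY_RUN` switch, and the 8192 default are all hypothetical, and `-ngl 99` (offload all layers) is an assumption. The port matches the `localhost:8080` endpoint used elsewhere in this post.

```bash
#!/usr/bin/env bash
# Hypothetical per-model launcher sketch; names and mappings are illustrative.

# Map a model file name to a context size (sizes follow the lineup in this post).
ctx_size_for() {
  case "$1" in
    *30B*|*35B*) echo 65536  ;;  # larger models: 65K context
    *9B*)        echo 131072 ;;  # 9B models: ~130K context
    *)           echo 8192   ;;  # conservative default (assumption)
  esac
}

# Start llama-server for one model, or print the command when DRY_RUN=1 (default).
launch_model() {
  local model="$1"
  local ctx
  ctx="$(ctx_size_for "$model")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "ONEAPI_DEVICE_SELECTOR=level_zero:gpu llama-server -m $model -c $ctx -ngl 99 --port 8080"
  else
    # Restrict SYCL to the Level Zero GPU device so nothing falls back to the CPU device.
    ONEAPI_DEVICE_SELECTOR=level_zero:gpu \
      llama-server -m "$model" -c "$ctx" -ngl 99 --port 8080
  fi
}
```

Calling `launch_model ./models/your-model.gguf` prints the command it would run; setting `DRY_RUN=0` starts the server instead. Matching on the file name keeps the sketch short, but an explicit lookup table per model would be safer for models such as the deprecated MTP merge that need a smaller context than their size class suggests.
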
### Step 2: Hermes Agent Configuration

Configured Hermes with:

- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions

### Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)

### Step 4: Research Pipeline (research vault)

The agent does autonomous research and documents findings in a structured vault:

```plaintext
research-vault/
├── challenges/   # Dev challenge research, compatibility patches
├── research/     # Hardware, model, compatibility research
├── blogs/        # Technical blog articles
└── study/        # Learning notes, tutorials
```

---

## Model Lineup

The system coordinates multiple GGUF models depending on task type:

| Model | Architecture | Params | Context | Quant | Role | Notes |
|-------|-------------|--------|---------|-------|------|-------|
| **Qwen3.5-9B-Sushi-Coder-RL** | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| **Qwen3-Coder-30B-A3B** | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| **Qwen3.6-35B-UD-IQ4_NL** | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| **Qwen3.5-9B-DeepSeek-V4-Flash** | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| **Qwopus3.5-9B-Coder-MTP** | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |

### Why These Models

- **Sushi 9B** is the only production-viable 9B model for agentic work on this hardware: it passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, and retained multi-turn context correctly
- **Coder 30B** is a MoE model (30B total, 3B active parameters), so decode is fast despite the large parameter count: 11.52 t/s decode vs 8.24 t/s for the 9B model
- **DS-V4-Flash** is useful for quick reasoning tasks where you don't need structured output; 190 t/s prefill makes it fast for short prompts
- **27B class models** fill the gap between 9B and 35B: reasonable quality without the VRAM overhead of the larger model in the shared memory pool

---

## Agentic Benchmark Results

Ran agentic evaluations across all 9B models at 131K context:

| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|-------|-----------|----------|------------|------------|---------|
| **Sushi 9B** | 6/6 | 0 | Yes (3/3) | 561s | Best |
| **DS-V4-Flash** | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| **Qwopus MTP** | 2/6 | 4 | No (0/3) | 256s | Broken |

### Key Findings

**Sushi 9B (production daily driver):**

- Only model to pass all 6 agentic tests without errors
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk on 58GB headroom)
- Best instruction following (10 constraints, 4 paragraphs)

**Qwopus MTP (deprecated):**

- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination: corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge, not fixable by configuration

**DS-V4-Flash (secondary):**

- Stable, but all output is in `reasoning_content` only (`content` field empty)
- Coherent reasoning, but cannot produce valid structured JSON in `content`
- Fast prefill (190 t/s) but 8.24 t/s decode

### Technical Decisions Validated

1. **Local-first, cloud-fallback**: All inference runs locally by default. The cloud is used only for models that are not running locally.
2. **Per-model context sizing**: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
3. **Skills over prompting**: Every recurring workflow is encoded as a skill file. The system maintains itself.
4. **Git-backed vault**: All research auto-commits to GitHub. The workspace is the artifact.
5. **Automated security monitoring**: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord. The workspace defends itself.

---

## Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

- **UFW firewall**: default deny incoming, SSH only from LAN + Tailscale
- **fail2ban**: auto-ban after 3 failed SSH attempts
- **Cron: security-monitor**: every 30 min, checks brute-force, new devices, firewall, services, gateway
- **Cron: vulnerability-feed-monitor**: every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- **Discord alerts**: CRITICAL and HIGH severity findings posted automatically
- **Pentest tools**: nmap, masscan, tcpdump, arp-scan, netcat, wireshark

---

## Key Numbers

- **58GB** shared VRAM on Intel Arc 140T
- **130K** context window (Sushi 9B)
- **9.7GB** total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- **48GB** VRAM headroom at 130K ctx
- **8.24 t/s** decode speed (Sushi 9B)
- **166 t/s** prefill speed (Sushi 9B)
- **190 t/s** prefill speed (DS-V4-Flash)
- **~36-37s** per generation turn (Sushi 9B at 256 max_tokens)
- **0** HTTP 500 errors across 6 agentic tests (Sushi 9B)
- **9+** GGUF models tested (9B through 35B parameters)
- **6+ months** of continuous local inference development by Hermes Agent
- **Automated security monitoring**: log analysis, intrusion detection, CVE feed monitoring, Discord alerts

---

## Demo / How to Replicate

The entire setup (llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation) was built and maintained by Hermes Agent.

Minimal setup:

```bash
# 1. Clone and build llama.cpp with the SYCL backend
#    (needs the Intel oneAPI toolkit; adjust the setvars.sh path to your install)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release

# 2. Install Hermes Agent
pip install hermes-agent

# 3. Configure local server
hermes config set providers.local.base_url http://localhost:8080/v1

# 4. Download and add your first model
#    (example: Qwen3.5-9B at Q4_K_M quantization)
hermes models add --alias coder --path ./models/your-model.gguf --context-size 131072
```

---

*All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from kernel flag surgery to final documentation.*
