    How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU

    I am Starrzan May 30, 2026


---
description: How Hermes Agent runs the entire local AI inference stack on Intel Arc GPU, automates research pipelines, and coordinates multi-model switching, all from a terminal.
tags:
  - hermesagentchallenge
  - ai
  - agents
  - productivity
---

*This is a submission for the [Hermes Agent Challenge](https://dev.to/challenges/hermes-agent-2026-05-15): Write About Hermes Agent*

---

## What I Built

A self-managing AI workspace powered by [Hermes Agent](https://hermes-agent.nousresearch.com), where an autonomous agent runs the local inference stack on Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.

**Hardware:** GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600), a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

- **Local LLM inference** via llama.cpp on Intel Arc SYCL (iGPU)
- **Automated research pipelines** feeding structured docs into a persistent vault
- **Multi-model testing and benchmarking**: 9+ models from 9B to 35B parameters
- **Cron-driven monitoring**: market data, system health, memory management
- **Self-maintaining skills**: the agent updates its own skills and docs when things change

---

## Architecture

```plaintext
[ User Goals ]
      │
      ▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
      │                ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx)     ← daily driver
      │                ├── Qwen3-Coder-30B-A3B (65K ctx)            ← coding specialist
      │                ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx)          ← reasoning
      │                └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx)  ← stable but reasoning-only
      │
      ├── research-vault/   (research & docs)
      └── hermes-config/    (skills, plugins, cron jobs)
```

The agent runs as a Hermes session with:

- **Persistent memory**: notes about the environment, user preferences, tool quirks, project conventions
- **Durable skills**: 40+ specialized procedures for devops, mlops, research,
etc.
- **Toolsets**: terminal, browser, file, cron, git, and more
- **Full system access**: builds, debugs, tunes, and documents everything autonomously

### GMKtec EVO-T1 Hardware

The host is a **GMKtec EVO-T1** mini-PC:

- **CPU:** Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- **iGPU:** Intel Arc 140T (8 Xe cores, 128 vector engines, shares system DDR5 as VRAM)
- **RAM:** 64GB DDR5-5600 (~58GB addressable by GPU)
- **Power:** ~45W sustained under full load
- **Form factor:** ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical build-level SYCL fix (removing the Level Zero `-ze-intel-greater-than-4GB-buffer-required` linker flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu`) was required to prevent JIT compilation crashes at large context sizes. The agent diagnosed and applied it.

---

## How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

### Step 1: Local Inference Server (llama.cpp on Intel Arc)

Built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.

The critical subtlety: different models need different context sizes. `CTX_SIZE` must be set per model, not globally. A 9B coder model gets 130K; a 27B model gets 65K. The agent handles this via model-specific startup configs.

**Major SYCL fix:** The SYCL backend had a critical bug: the `-ze-intel-greater-than-4GB-buffer-required` linker flag in `ggml-sycl/CMakeLists.txt` caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing this flag and setting `ONEAPI_DEVICE_SELECTOR=level_zero:gpu` to restrict execution to the GPU eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.
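The per-model context sizing and the GPU-only device selector described above could be captured in a small launcher. The sketch below is a minimal illustration, not the agent's actual startup config: the `MODELS` table, file paths, port, and the `build_launch` helper are hypothetical, the exact token counts (131072 and 65536) and the `-ngl 99` offload setting are assumptions, and only the `llama-server` flags (`-m`, `-c`, `-ngl`, `--host`, `--port`) and the `ONEAPI_DEVICE_SELECTOR` variable are taken from llama.cpp and oneAPI.

```python
import os
import shlex

# Hypothetical per-model settings. Context sizes follow the article's
# 130K-class / 65K split; the exact token counts are assumptions.
MODELS = {
    "sushi-9b": {"path": "models/sushi-9b.gguf", "ctx": 131072},
    "coder-30b": {"path": "models/coder-30b.gguf", "ctx": 65536},
    "reasoning-35b": {"path": "models/reasoning-35b.gguf", "ctx": 65536},
}


def build_launch(alias, port=8080, server_bin="llama-server"):
    """Return (argv, env) for starting llama-server with a model-specific context size."""
    cfg = MODELS[alias]  # raises KeyError for an unknown alias
    argv = [
        server_bin,
        "-m", cfg["path"],
        "-c", str(cfg["ctx"]),  # context size is chosen per model, never globally
        "-ngl", "99",           # assumption: offload all layers to the iGPU
        "--host", "127.0.0.1",
        "--port", str(port),
    ]
    env = dict(os.environ)
    # Restrict SYCL to the Level Zero GPU so no operation falls back to the CPU device.
    env["ONEAPI_DEVICE_SELECTOR"] = "level_zero:gpu"
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch("coder-30b")
    print(shlex.join(argv))
    # To actually start the server: subprocess.Popen(argv, env=env)
```

Keeping the context size in a per-model table means switching models cannot silently reuse a context window sized for a smaller model.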

### Step 2: Hermes Agent Configuration

Configured Hermes with:

- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions

### Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)

### Step 4: Research Pipeline (research vault)

The agent does autonomous research and documents findings in a structured vault:

```plaintext
research-vault/
├── challenges/   # Dev challenge research, compatibility patches
├── research/     # Hardware, model, compatibility research
├── blogs/        # Technical blog articles
└── study/        # Learning notes, tutorials
```

---

## Model Lineup

The system coordinates multiple GGUF models depending on task type:

| Model | Architecture | Params | Context | Quant | Role | Notes |
|-------|-------------|--------|---------|-------|------|-------|
| **Qwen3.5-9B-Sushi-Coder-RL** | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| **Qwen3-Coder-30B-A3B** | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| **Qwen3.6-35B-UD-IQ4_NL** | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| **Qwen3.5-9B-DeepSeek-V4-Flash** | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| **Qwopus3.5-9B-Coder-MTP** | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |

### Why These Models

- **Sushi 9B** is the only production-viable 9B model for agentic work on this hardware: it passed all 6 agentic
tests with 0 HTTP 500 errors, produced valid JSON, and retained multi-turn context correctly
- **Coder 30B** is a MoE model (30B total, 3B active parameters), so decode is fast despite the large parameter count: 11.52 t/s decode vs 8.24 t/s for the 9B model
- **DS-V4-Flash** is useful for quick reasoning tasks where you don't need structured output; 190 t/s prefill makes it fast for short prompts
- **27B class models** fill the gap between 9B and 35B, giving reasonable quality without the VRAM overhead of the larger model in the shared memory pool

---

## Agentic Benchmark Results

I ran a six-test agentic evaluation across the three 9B models at 131K context:

| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|-------|-----------|----------|------------|------------|---------|
| **Sushi 9B** | 6/6 | 0 | Yes (3/3) | 561s | Best |
| **DS-V4-Flash** | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| **Qwopus MTP** | 2/6 | 4 | No (0/3) | 256s | Broken |

### Key Findings

**Sushi 9B (production daily driver):**

- Only model to pass all 6 agentic tests without errors
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk within the 58GB GPU-addressable pool)
- Best instruction following (10 constraints, 4 paragraphs)

**Qwopus MTP (deprecated):**

- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination: corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge and is not fixable by configuration

**DS-V4-Flash (secondary):**

- Stable, but all output lands in `reasoning_content` only (content field empty)
- Coherent reasoning, but cannot produce valid structured JSON in content
- Fast prefill (190 t/s), but decode is only 8.24 t/s

### Technical Decisions Validated

1. **Local-first, cloud-fallback**: All inference runs locally by default.
   The cloud is used only for models that are not running locally.
2. **Per-model context sizing**: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
3. **Skills over prompting**: Every recurring workflow is encoded as a skill file. The system maintains itself.
4. **Git-backed vault**: All research auto-commits to GitHub. The workspace is the artifact.
5. **Automated security monitoring**: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord, so the workspace defends itself.

---

## Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

- **UFW firewall**: default deny incoming, SSH only from LAN + Tailscale
- **fail2ban**: auto-ban after 3 failed SSH attempts
- **Cron: security-monitor**: every 30 min, checks brute-force attempts, new devices, firewall, services, gateway
- **Cron: vulnerability-feed-monitor**: every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- **Discord alerts**: CRITICAL and HIGH severity findings posted automatically
- **Pentest tools**: nmap, masscan, tcpdump, arp-scan, netcat, wireshark

---

## Key Numbers

- **58GB** shared VRAM on Intel Arc 140T
- **130K** context window (Sushi 9B)
- **9.7GB** total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- **48GB** VRAM headroom at 130K ctx
- **8.24 t/s** decode speed (Sushi 9B)
- **166 t/s** prefill speed (Sushi 9B)
- **190 t/s** prefill speed (DS-V4-Flash)
- **~36-37s** per generation turn (Sushi 9B at 256 max_tokens)
- **0** HTTP 500 errors across 6 agentic tests (Sushi 9B)
- **9+** GGUF models tested (9B through 35B parameters)
- **6+ months** of continuous local inference development by Hermes Agent
- **Automated security monitoring**: log analysis, intrusion detection, CVE feed monitoring, Discord alerts

---

## Demo / How to Replicate

The entire setup (llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation) was built and maintained by Hermes Agent.

Minimal setup:

```bash
# 1. Clone and build llama.cpp with SYCL
#    Load the oneAPI environment first (path assumes a default oneAPI install)
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# The SYCL backend is built with the Intel compilers (icx/icpx)
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# 2. Install Hermes Agent
pip install hermes-agent

# 3. Configure local server
hermes config set providers.local.base_url http://localhost:8080/v1

# 4. Download and add your first model
# (example: Qwen3.5-9B at Q4_K_M quantization)
hermes models add --alias coder --path ./models/your-model.gguf --context-size 131072
```

---

*All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from build-flag surgery to final documentation.*
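
The benchmark separates models that return valid JSON in `content` from models that only fill `reasoning_content`. A check along those lines might look like the sketch below. It assumes the OpenAI-style chat completion shape (`choices[0].message`) returned by the local `/v1` endpoint; the `classify_response` helper and its labels are hypothetical and are not the author's benchmark suite.

```python
import json


def classify_response(response):
    """Label a chat completion dict as 'json-valid', 'reasoning-only', 'invalid-json', or 'empty'."""
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    if not content:
        # The model answered only in its reasoning channel, or not at all.
        return "reasoning-only" if reasoning else "empty"
    try:
        json.loads(content)
    except json.JSONDecodeError:
        return "invalid-json"
    return "json-valid"


if __name__ == "__main__":
    # Hand-written sample payload, not captured server output.
    sample = {"choices": [{"message": {"content": '{"status": "ok"}'}}]}
    print(classify_response(sample))
```

Running a check like this against each model's responses is one way to reproduce the "JSON Valid" column without reading outputs by hand.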
