Hermes

Blog

Agent deployment guides, cost optimization tactics, and build-in-public updates.

Why Smart Agent Beats Expensive Agent Every Time

April 2026 ~6 min read

I ran autonomous agents for 14 days on a $5 VPS using only free models. The result: 65% lower costs, zero degradation in output quality, and agents that actually stayed alive. Here is why architecture matters more than API keys.

Read Full Article →

How We Cut AI Agent Costs 90% by Fixing Silent Misconfigurations

March 2026 ~15 min read

A single "hi" message cost $0.04 and triggered duplicate responses. Cache hit rates sat at 1.6%. Here's the full breakdown of seven infrastructure bugs hiding in plain sight, and exactly how we fixed them, all the way down to behavioral optimizations and session architecture.

Read Full Article →

The Symptoms

  • Typing "hi" to the agent cost $0.04 and returned two duplicate replies
  • A slightly complex task caused the agent to burn all 8,192 output tokens on internal reasoning, then die silently
  • Anthropic's usage dashboard showed 85,152 input tokens for a single session with only 397 output tokens returned
  • Cache hit rate was sitting at 1.6% when it should have been 80-90%
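To see why a one-word greeting gets this expensive, it helps to price out a single call. A minimal sketch using the session numbers from the list above; the `call_cost` helper and the per-million-token rates are hypothetical placeholders, not the provider's actual pricing:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float, out_rate: float) -> float:
    """Cost of one API call, with rates in dollars per million tokens."""
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6

# The session from the symptoms list: 85,152 input tokens, 397 output tokens.
# Rates are illustrative, not real pricing for any specific model.
cost = call_cost(85_152, 397, in_rate=0.40, out_rate=1.20)
print(f"${cost:.4f}")  # the uncached input dwarfs the output cost
```

Whatever the exact rates, the shape of the bill is the same: with a cold cache, nearly all of the cost is re-sent input, not the reply.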

Root Cause 1: Wrong Model for Caching

The agent was configured with cacheRetention set to "long". The primary model was DeepSeek V3.2 via OpenRouter, which doesn't support cache_control injection. Every session was starting fully cold, resending 85,000+ tokens each time.

Fix: Switch to Claude Haiku 4.5 via direct Anthropic API. First session wrote 16,176 cache tokens. Every subsequent call within the 1-hour TTL paid cache-read rates instead of full input rates.
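Prompt caching on Anthropic's API is opt-in per content block via `cache_control` markers; a model routed through a provider that ignores those markers never gets a cache write. A sketch of what the request body looks like with the system prompt marked cacheable; the `build_request` helper is ours, the model string is illustrative, and the exact field shapes should be checked against the current API docs:

```python
def build_request(system_prompt: str, user_message: str) -> dict:
    """Assemble a Messages API payload with the system prompt marked for caching.

    Everything up to and including the block carrying cache_control is written
    to the cache on the first call; later calls within the TTL pay cache-read
    rates for that prefix instead of full input rates.
    """
    return {
        "model": "claude-haiku-4-5",  # model name is illustrative
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # mark prefix cacheable
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("You are Hermes...", "hi")
# First call writes the cache; calls within the TTL read it at reduced rates.
```

The key property: the cached prefix must be byte-identical between calls, which is why the system prompt churn in the next section matters so much.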

Root Cause 2: System Prompt Bloat

The agent's system prompt was being assembled from workspace files on every API call. We were injecting roughly 32,000 characters every single time, including dead files and 14 live API keys buried in MEMORY.md.

Fix: Delete dead files, trim empty templates, and move all credentials to ~/.hermes/externalpj3/.env. This removed about 6,100 characters per call, before caching even takes effect.
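The assembly step itself can enforce the trim. A sketch, assuming a simple concatenation-from-workspace scheme like the one described above; the `assemble_system_prompt` helper and the heading format are ours, not Hermes internals:

```python
from pathlib import Path

def assemble_system_prompt(workspace: Path, include: list[str]) -> str:
    """Concatenate only the workspace files worth putting in context.

    Missing and empty files are skipped, so dead references and blank
    templates stop padding every single API call.
    """
    parts = []
    for name in include:
        path = workspace / name
        if not path.is_file():
            continue                      # dead reference: skip, don't inject
        text = path.read_text().strip()
        if not text:
            continue                      # empty template: zero value per call
        parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)
```

Logging `len(result)` on each assembly makes prompt bloat visible long before it shows up on a bill.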

Root Cause 3: Security Leak Hidden in Plain Sight

Every API key (Anthropic, OpenRouter, GitHub, Dropbox, LinkedIn, X.com) was sitting in MEMORY.md and being transmitted to OpenRouter on every message. External credentials were being sent to third-party servers as part of the system prompt.

Fix: Move all credentials to environment variables. The system prompt is just a string sent over HTTP. It should never carry secrets.
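Two cheap guardrails follow from this: read credentials only from the environment, and scrub anything key-shaped before a prompt leaves the process. A sketch; the helper names and the regex (a rough heuristic covering a couple of common key prefixes) are ours:

```python
import os
import re

# Rough heuristic for long opaque tokens; extend to match your key formats.
SECRET_RE = re.compile(r"\b(sk-[A-Za-z0-9_-]{20,}|ghp_[A-Za-z0-9]{20,})\b")

def scrub(prompt: str) -> str:
    """Redact anything key-shaped before the prompt is sent anywhere."""
    return SECRET_RE.sub("[REDACTED]", prompt)

def get_credential(name: str) -> str:
    """Read a credential from the environment; never from a context file."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing credential: set {name} in the environment")
    return value
```

The scrub is defense in depth, not the fix: the real fix is that secrets never enter the prompt in the first place.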

Root Cause 4: Redundant File Reads at Startup

The startup instructions told the agent to read SOUL.md, USER.md, and MEMORY.md. But Hermes already injects these files into the system prompt before the first API call. The agent was reading files that were already in context.

Fix: Remove the redundant reads. Keep only the one step that's NOT pre-loaded: reading today's daily memory notes.
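The dedup can be mechanical: filter the startup read list against the set of files the gateway already injects. A sketch; the daily-notes path is a made-up example, and the set contents come from the article:

```python
# Files Hermes already injects into the system prompt before the first call.
PRELOADED = {"SOUL.md", "USER.md", "MEMORY.md"}

def startup_reads(requested: list[str]) -> list[str]:
    """Keep only the startup reads that aren't already in context."""
    return [name for name in requested if name not in PRELOADED]

reads = startup_reads(["SOUL.md", "USER.md", "MEMORY.md", "memory/2026-03-14.md"])
# Only the daily notes file survives; the rest were already in the prompt.
```

Each redundant read is a tool call, which means another round trip carrying the full accumulated context.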

Root Cause 5: Context Accumulation at Cache Expiry

When the 1-hour cache expires, the system must re-write the entire cached context. Every tool call appends its output to the conversation. After 20-30 tool calls, the context includes old outputs the model no longer needs, but they all get included in the cache re-write.

Fix: Add context pruning that strips accumulated tool results after 1 hour, matched to the cache TTL.

Root Cause 6: Cold Start on Every Gateway Restart

The first user message after a gateway restart cost $0.0225. The breakdown: cache had expired, so we had to re-write 17,546 tokens from scratch. $0.02194 of that was pure cache overhead, not the actual reply.

Fix: Use a systemd hook to warm up the cache at startup with a silent heartbeat. Also disabled two rarely-used skills from the injection list. Every byte removed from the system prompt is a byte cheaper on cold-start re-writes.
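The warm-up itself can be a tiny script invoked from the service's startup hook (e.g. systemd's ExecStartPost=). A sketch of the payload side, assuming the request-building convention from Root Cause 1; the helper name and model string are illustrative, and the actual HTTP send is omitted:

```python
def heartbeat_payload(model: str = "claude-haiku-4-5") -> dict:
    """Minimal request whose only job is to trigger a cache write.

    max_tokens is kept tiny: the heartbeat pays the one-time cache write,
    not the cost of a real reply.
    """
    return {
        "model": model,
        "max_tokens": 1,
        "messages": [{"role": "user", "content": "cache warm-up heartbeat"}],
    }
```

Sent once at startup, this moves the cache-write cost off the first real user message and onto a predictable, tiny heartbeat.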

Root Cause 7: Extended Thinking on Everything

The agent was configured to trigger extended reasoning on every single message, including "hi". Combined with the large uncached system prompt, this caused the agent to burn 8,192 output tokens analyzing directory structure for a simple task, then hit the output limit and fail silently.

Fix: Trigger thinking budgets conditionally instead of always-on. Extended thinking is valuable for complex reasoning; it's wasteful on mechanical tasks.
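The gate can be a cheap heuristic run before the budget is set. A sketch; the thresholds, keywords, and budget values are all illustrative, the point is only that trivial messages get zero budget:

```python
def thinking_budget(message: str) -> int:
    """Pick an extended-thinking token budget from a cheap heuristic.

    Short messages with no complexity markers get no thinking budget at all,
    instead of a flat always-on allocation that burns output tokens on "hi".
    """
    triggers = ("plan", "debug", "refactor", "analyze", "why")
    has_trigger = any(t in message.lower() for t in triggers)
    if len(message) < 40 and not has_trigger:
        return 0        # trivial message: no extended thinking
    if has_trigger:
        return 4096     # genuinely complex request
    return 1024         # medium default

print(thinking_budget("hi"))                                     # 0
print(thinking_budget("debug why the cache hit rate is 1.6%"))   # 4096
```

A misclassified complex message just gets a slower second pass; a misclassified "hi" with always-on thinking gets an 8,192-token silent failure.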

The Bottom Line

  • Cache hit rate went from 1.6% to 80-90%
  • Input tokens per session dropped from 85,152 to 5,000-10,000
  • System prompt reduced from 32,000 chars to 25,900 chars
  • Zero credentials in API payloads (was 14 keys every call)
  • Cold start cost: $0.022 → $0.001 (95% reduction)

Most of these bugs were silent. No errors, no warnings. The only signal was a $0.04 greeting and a double response. AI agent cost optimization isn't about picking the cheapest model. It's about understanding the full pipeline: what gets cached, what's redundant, where the boundaries are.

More articles coming soon.


© 2026 Hermes