Hermes 3 as an agent: function calling and tool use on your own server

Hermes 3 from Nous Research is a Llama 3.1 fine-tune built for function calling and role-based prompting. This article covers what it can do as an agent and why teams pick it over OpenAI when data cannot leave the perimeter.

Hermes 3 is a Llama 3.1 fine-tune (8B / 70B / 405B) from Nous Research. The main difference from stock Llama is training on synthetic data for function calling, JSON output, role conditioning and multi-step reasoning. On tool-use benchmarks it matches GPT-4 and Claude 3 Sonnet on typical scenarios, while the weights are open and the model runs locally.

Basic agent loop: request → model decides to call a tool → execution → answer.

Why pick Hermes over OpenAI

  • Data stays inside. Health records, finance, legal text, customer chat — places where compliance or NDA forbids sending traffic to a vendor cloud
  • Cost at scale. With batched serving, an 8B model on an A10/A100 can cost cents per 1M tokens on your own hardware; OpenAI API pricing is roughly $0.50-$2.50 per 1M, depending on the model
  • Control. You can fine-tune on your data, version it, lock behavior
  • No RPM limits. On your hardware you are bounded only by your GPU
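
The cost claim above is simple arithmetic on GPU price and throughput. The sketch below shows it; the $1.00/hour GPU price and the 2000 tokens/sec batched throughput are hypothetical inputs, not measurements, and only the ~60 tokens/sec single-stream figure comes from this article.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_sec: float) -> float:
    """USD needed to generate 1M tokens at a sustained aggregate throughput."""
    seconds_needed = 1_000_000 / tokens_per_sec
    return gpu_usd_per_hour * seconds_needed / 3600


# Hypothetical GPU price of $1.00/hour in both cases.
single_stream = cost_per_million_tokens(1.00, 60)    # one user at ~60 tokens/sec
batched = cost_per_million_tokens(1.00, 2000)        # assumed aggregate batched throughput

print(f"single stream: ${single_stream:.2f} per 1M tokens")
print(f"batched:       ${batched:.2f} per 1M tokens")
```

Under these assumptions the per-token cost only drops to cents when the GPU is kept busy with batched traffic; a single interactive stream on the same card is far more expensive per token.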

Function calling out of the box

Hermes 3 is trained on the `<tool_call>` format: the model itself decides when a tool is needed and emits JSON with the function name and arguments. You pass the tool definitions in the system prompt, and the model generates the call:

<|im_start|>system
You have access to tools:
<tools>
[{"name": "search_orders", "description": "Find orders by client phone", "parameters": {"type": "object", "properties": {"phone": {"type": "string"}}, "required": ["phone"]}}]
</tools>
<|im_end|>
<|im_start|>user
Find orders for client +14155551234
<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "search_orders", "arguments": {"phone": "+14155551234"}}
</tool_call>

Your code parses the JSON, runs the actual function, returns the result wrapped in `<tool_response>`, and the model writes the final reply to the user.
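
The parse-execute-respond step described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `search_orders` is a hypothetical stand-in for a real lookup, and the `{"name": ..., "content": ...}` shape inside `<tool_response>` is an assumption about what you choose to feed back to the model.

```python
import json
import re

# Captures the JSON body of every <tool_call>...</tool_call> block.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def search_orders(phone: str) -> list[dict]:
    # Hypothetical stand-in for a real CRM or database query.
    return [{"order_id": 1001, "phone": phone, "status": "shipped"}]


# Registry of callable tools, keyed by the name the model is told about.
TOOLS = {"search_orders": search_orders}


def run_tool_calls(model_output: str) -> list[str]:
    """Execute each tool call in the model output and wrap each result."""
    responses = []
    for raw in TOOL_CALL_RE.findall(model_output):
        try:
            call = json.loads(raw)
            result = TOOLS[call["name"]](**call["arguments"])
            payload = {"name": call["name"], "content": result}
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Report the failure to the model instead of crashing the loop.
            payload = {"error": f"{type(exc).__name__}: {exc}"}
        responses.append(
            f"<tool_response>\n{json.dumps(payload)}\n</tool_response>"
        )
    return responses


output = (
    "<tool_call>\n"
    '{"name": "search_orders", "arguments": {"phone": "+14155551234"}}\n'
    "</tool_call>"
)
for response in run_tool_calls(output):
    print(response)
```

Each returned string is appended to the conversation as the tool turn, after which the model is asked to generate again and produce the user-facing reply.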

Deployment

  • vLLM — high-throughput runtime with batched inference and an OpenAI-compatible API. Launch: `vllm serve NousResearch/Hermes-3-Llama-3.1-8B --enable-auto-tool-choice --tool-call-parser hermes`
  • Ollama — easiest for prototyping, not built for heavily batched serving. `ollama pull hermes3:8b`, then `ollama run hermes3:8b`
  • llama.cpp — minimal hardware requirements, GGUF quantization. Acceptable for CPU-only deployment as a last resort
  • TGI (text-generation-inference) — for production load with Kubernetes autoscaling
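
Once one of these servers exposes an OpenAI-compatible endpoint, a tool-calling request can be sketched with the standard library alone. The URL assumes vLLM on its default port 8000 on the same host, and the `search_orders` tool is the illustrative one from the prompt example; adjust both to your setup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: local vLLM on its default port
MODEL = "NousResearch/Hermes-3-Llama-3.1-8B"


def build_chat_request(user_text: str) -> dict:
    """Build an OpenAI-style chat completion payload with one tool attached."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "search_orders",
                "description": "Find orders by client phone",
                "parameters": {
                    "type": "object",
                    "properties": {"phone": {"type": "string"}},
                    "required": ["phone"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def send(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint and decode the reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def extract_tool_calls(reply: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs; arguments arrive as a JSON string."""
    calls = reply["choices"][0]["message"].get("tool_calls") or []
    return [
        (c["function"]["name"], json.loads(c["function"]["arguments"]))
        for c in calls
    ]
```

With the server running, `extract_tool_calls(send(build_chat_request("Find orders for client +14155551234")))` should yield the parsed call, provided the server was started with tool calling enabled.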

Hardware

  • 8B — RTX 4090 / A10 (24 GB VRAM). Enough for 1-2 concurrent users in real time. ~60 tokens/sec
  • 8B quantized (Q5_K_M) — RTX 3090 / 4070 Ti / 4060 Ti 16 GB. 35-50 tokens/sec, slight quality drop
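  • 70B — A100 80 GB or 2× A100 40 GB, which assumes quantized weights (16-bit weights alone take about 140 GB). Needed for harder tasks (long reasoning, legal text)
  • 405B — a cluster of 8× H100, which in practice means FP8 weights (16-bit weights at about 810 GB exceed the 640 GB of one 8× 80 GB node). GPT-4 territory. Enterprise scale only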
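
The sizing above follows from parameter count times bytes per parameter. A rough sketch, using nominal parameter counts (8, 70, 405 billion) and counting weights only; KV cache, activations and runtime overhead come on top:

```python
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30


for name, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    sizes = {f"{bits}-bit": round(weight_gib(params, bits), 1) for bits in (16, 8, 4)}
    print(name, sizes)
```

This is why the 8B model fits a 24 GB card at 16-bit precision with room left for the KV cache, while the 70B model needs quantization to fit into 80 GB.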

Agent scenarios where Hermes fits

  • Internal assistant over corporate docs (RAG + tool call into search)
  • Telegram/MAX bot that calls into CRM, inventory, accounting
  • Auto-handler for inbound requests: classification → routing → draft reply → human approval
  • Co-pilot for content managers: drafts text, fetches data from DB, formats output
  • QA bot over logs and metrics — give it an MCP server with Grafana/Prometheus access

What it cannot do

  • Long-context inference — Hermes supports 128K, but 70B+ at long context demands serious VRAM for the KV cache
  • Hard reasoning — for math and code, DeepSeek-R1 / Claude 3.5 Sonnet are still better
  • Multimodal (images, audio) — Hermes is text-only. For vision use LLaVA or Qwen2-VL separately
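
To put a number on the long-context point, the KV cache size per sequence can be estimated from the model configuration. The sketch assumes Hermes 3 keeps the published Llama 3.1 architecture unchanged (8B: 32 layers, 70B: 80 layers, both with 8 KV heads of dimension 128) and a 16-bit cache.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size for one sequence, in GiB (the factor 2 covers keys and values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 2**30


# (layers, kv_heads, head_dim), assumed from the Llama 3.1 base models.
CONFIGS = {"8B": (32, 8, 128), "70B": (80, 8, 128)}

for name, (layers, kv_heads, head_dim) in CONFIGS.items():
    for tokens in (8_192, 131_072):
        size = kv_cache_gib(layers, kv_heads, head_dim, tokens)
        print(f"{name} at {tokens} tokens: {size:.1f} GiB per sequence")
```

Under these assumptions a single full 128K-token sequence costs tens of GiB of cache on the 70B model, in addition to the weights, and every concurrent sequence adds its own share.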

Pitfalls

  • Tool calling must be enabled and the parser set correctly in vLLM (`--enable-auto-tool-choice --tool-call-parser hermes`). Without them the OpenAI-compatible endpoint will not return structured tool calls
  • The 8B sometimes hallucinates function names — add JSON-schema validation on output
  • Russian and other non-English languages — coverage is uneven for technical terms; fine-tune on your corpus or use YandexGPT/GigaChat for the chat layer
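
For the hallucinated-function-name pitfall, a validation gate in front of the dispatcher can be sketched as below. It is a simplified stand-in for full JSON Schema validation: `TOOL_SCHEMAS` is a hypothetical registry that maps each tool to its required argument names and Python types, and nested or optional arguments are out of scope here.

```python
import json

# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOL_SCHEMAS = {
    "search_orders": {"phone": str},
}


def validate_tool_call(raw_json: str) -> tuple[bool, str]:
    """Check a model-emitted tool call before executing anything."""
    try:
        call = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return False, f"unknown tool: {name!r}"

    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    expected = TOOL_SCHEMAS[name]
    missing = expected.keys() - args.keys()
    extra = args.keys() - expected.keys()
    if missing or extra:
        return False, f"missing={sorted(missing)} extra={sorted(extra)}"

    for key, expected_type in expected.items():
        if not isinstance(args[key], expected_type):
            return False, f"{key} must be {expected_type.__name__}"
    return True, "ok"


print(validate_tool_call('{"name": "find_orders", "arguments": {"phone": "+1"}}'))
print(validate_tool_call('{"name": "search_orders", "arguments": {"phone": "+1"}}'))
```

Returning the error message to the model as a tool response usually gives it a chance to retry with a valid name, which is cheaper than failing the whole request.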