A typical agent turn isn't one decision. It's a dozen. Some of them are hard — picking the right tool, planning a refactor, reasoning about a failure. Most of them aren't — classifying the user's intent, summarizing a tool result, deciding if a call is safe enough to auto-approve.
You should not pay smart-model rates for the easy decisions. Liminal doesn't.
The split
Liminal runs two models per session:
- Main model (
AGENT_MODEL, defaultdeepseek/deepseek-v4-pro) — runs the full ReAct turn. Picks tools. Reasons. Produces the actual answer. - Fast model (
AGENT_FAST_MODEL, defaultdeepseek/deepseek-v4-flash) — handles a list of supporting jobs:- Intent classification — what kind of turn is this? Knowledge query? Code task? Introspection? Determines reasoning budget and which model to route to.
- Output distillation — shrink huge tool outputs to artifact pointers so they don't pollute context.
- Query rewriting — multi-query expansion before vector recall.
- Critic / verifier — sanity-check the final answer when it's code- or path-heavy.
- Safety judge — single-token 0/1 classifier on whether a tool call is safe enough to skip approval.
- Auto-dream — overnight memory consolidation.
Every one of those is a full LLM call. None of them needs the main model's reasoning power. Routing them to the fast model is a 5-10× cost reduction on supporting work — which on a long session adds up to most of the bill.
The routing layer
Per-turn intent classification (AGENT_INTENT_INFERENCE) runs on the fast model and decides whether the turn itself belongs on the fast or main model. Pure-knowledge questions ("what does the --bootstrap flag do?") and introspection ("what's in my memory about X?") route to fast. Code tasks, multi-tool plans, and anything ambiguous stay on main.
The threshold is tunable. AGENT_INTENT_FAST_THRESHOLD=0.8 is the default — only route to fast when the classifier is confident.
Prompt caching on top
On DeepInfra / GMICloud / Novita providers (we pin DeepInfra by default), cache_control: ephemeral markers let the harness's static 14k-token prefix bill at the cache-read rate on rounds 2+ of a turn. That's roughly a 13× discount on prefix tokens for the rest of the loop.
Cache hit rate is logged per turn to the JSONL trace:
[HARNESS] prompt_cache: cached=12k/14k (86%)
You can verify the discount is actually landing. Cache hits drift to zero if the provider falls back to a different upstream — which is why AGENT_PROVIDER_ALLOW_FALLBACKS=0 is on by default. Reliability of cache > availability of any individual call.
What it costs in practice
A typical multi-tool refactor (10-15 tool calls, two-tier routing, prompt cache hit on rounds 2+):
| Component | Tokens | Cost | |-----------|--------|------| | Main model — round 1 (cold prefix) | ~14k in, ~1k out | ~$0.022 | | Main model — rounds 2-12 (cached prefix) | ~80k cached, ~5k out | ~$0.011 | | Fast model — intent + distill + safety | ~20k in, ~2k out | ~$0.001 | | Total | | ~$0.034 |
Without two-tier routing and caching, the same turn lands closer to $0.18. The discount is structural — it's what you get for being clever about which decisions actually need a smart model.
What we didn't do
We didn't try to make a fancy router. The classifier is one LLM call returning a single label. The routing rules are about ten lines of code. The complexity is in the harness, not the routing — the routing is supposed to be boring.
We also didn't try to do single-token speculative decoding, multi-step rollouts, or any of the other things you see in research papers. Most of the cost in real sessions is supporting work, not main-model reasoning. Cut that first.
— The Vireon Dynamics team