How Liminal cuts AI costs with two-tier model routing

A typical agent turn isn't one decision: it's a dozen. Most are cheap (intent labels, distillation, safety checks). Liminal routes those to a fast model and keeps the main model for the work that actually needs depth.

The split

Role	Env default	Purpose
Main	`deepseek/deepseek-v4-pro`	Full ReAct turn: tools, reasoning, final answer
Fast	`deepseek/deepseek-v4-flash`	Intent, distill, query rewrite, critic, safety judge, auto-dream

Every fast-model call is a full completion: but at a fraction of the price. On long sessions, supporting work is most of the bill.

Prompt caching on top

On providers that support ephemeral cache markers, the static harness prefix (~14k tokens) bills at cache-read rates on rounds 2+. often ~13× cheaper on prefix tokens for the rest of the loop.

[HARNESS] prompt_cache: cached=12k/14k (86%)

Cache hits are logged per turn in .agent_sessions/*.jsonl. We disable provider fallbacks by default so cache behavior stays predictable.

What it costs in practice

A typical multi-tool refactor (10–15 tool calls, two-tier routing, cache on rounds 2+):

Component	Approx. cost
Main: round 1 (cold prefix)	~$0.022
Main: rounds 2–12 (cached)	~$0.011
Fast: intent + distill + safety	~$0.001
Total	~$0.034

Without routing and caching, the same turn is closer to $0.18. The discount is structural: not a coupon.

Inside the harness →ReAct loop, tools, compression, and recovery: the full architecture post.

See routing on your workload

Install Liminal, run a real task, and read cache + intent lines in the session JSONL.

Install guide Product overview

. The Vireon Dynamics team

The split#

Prompt caching on top#

What it costs in practice#

See routing on your workload

Related reading

The split

Prompt caching on top

What it costs in practice