Cutting LLM Costs: Model Routing, Caching and Right-Sizing

The fastest way to kill an AI project's ROI isn't a technical failure — it's a bill that grows linearly with usage while value doesn't. Almost every inflated bill we audit has the same root cause: every request goes to the most capable model available. Here is the cost discipline we build into production agents.

Route by difficulty, not by habit

In a typical operations workflow, most requests are routine: a standard invoice, a common support question, a well-formed record. A small, fast model handles these correctly at a fraction of the price. The hard cases — ambiguous documents, multi-step decisions — get escalated to a stronger model. The router itself is cheap: a classification pass that costs a fraction of a cent and decides where the real work runs.

Think of it as staffing. You don't send your most senior engineer to reset passwords. Model routing is the same judgment, encoded.
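To make that concrete, here is a minimal sketch of a difficulty router. The model names, the request fields (`ambiguous`, `steps`) and the rule-based `classify_difficulty` function are all hypothetical stand-ins; in practice the classification pass would more likely be a small model or a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable

SMALL_MODEL = "small-fast-model"      # hypothetical model name
STRONG_MODEL = "large-capable-model"  # hypothetical model name


@dataclass
class Route:
    model: str
    reason: str


def classify_difficulty(request: dict) -> str:
    """Stand-in for the cheap classification pass (illustrative rules only)."""
    if request.get("ambiguous") or request.get("steps", 1) > 1:
        return "hard"
    return "routine"


def route(request: dict,
          classifier: Callable[[dict], str] = classify_difficulty) -> Route:
    """Send routine work to the small model and escalate everything else."""
    label = classifier(request)
    if label == "routine":
        return Route(SMALL_MODEL, "routine request")
    # Unknown labels escalate too, so a classifier mistake costs money
    # rather than quality.
    return Route(STRONG_MODEL, f"escalated: {label}")
```

The design choice worth noting is the default: anything the classifier does not confidently label as routine goes to the stronger model.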

Cache the parts that never change

An agent's prompt is mostly static: instructions, tool schemas, policy text. Prompt caching lets the provider reuse that static prefix instead of reprocessing it on every call. For agents that act dozens of times per workflow, this alone is a large saving for very little engineering effort: it mainly requires structuring prompts so the stable content comes first and stays identical from call to call, since caches typically match on an exact prefix.
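A minimal sketch of that prompt structure follows. It makes no provider API calls, because caching mechanisms and parameters differ between providers; it only shows the ordering. The constants, the function names and the fingerprint check are our own illustrative assumptions.

```python
import hashlib
import json

# Static content: identical on every call, so it belongs at the front.
INSTRUCTIONS = "You are an invoice-processing agent. Follow company policy."
TOOL_SCHEMAS = [{"name": "lookup_vendor", "parameters": {"vendor_id": "string"}}]
POLICY_TEXT = "Invoices over the approval limit require a second reviewer."


def build_static_prefix() -> str:
    """Serialize the stable content deterministically (sorted keys, fixed order)."""
    tools = json.dumps(TOOL_SCHEMAS, sort_keys=True)
    return f"{INSTRUCTIONS}\n\nTOOLS:\n{tools}\n\nPOLICY:\n{POLICY_TEXT}\n\n"


def build_prompt(task: str, timestamp: str) -> str:
    """Stable prefix first; per-call content (timestamp, task) strictly after it."""
    return build_static_prefix() + f"TIME: {timestamp}\nTASK: {task}\n"


def prefix_fingerprint() -> str:
    """Hash of the prefix; a change between deploys suggests cache hits will drop."""
    return hashlib.sha256(build_static_prefix().encode("utf-8")).hexdigest()
```

The common mistake this guards against is putting a timestamp or request ID at the top of the prompt, which makes every prefix unique and defeats the cache.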

Budget the output, not just the input

Output tokens cost several times more than input tokens, and unconstrained models are verbose. Production agents should emit structured, bounded output — a JSON decision, not an essay about the decision. Tight output schemas are simultaneously a cost control, a latency win and a reliability win, which is why they're the default in everything we ship.
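As an illustration, a bounded output contract can be enforced with a few lines of validation. The field names, the allowed actions and the 200-character cap below are hypothetical; a real agent would likely also set the provider's output-token limit and use its structured-output feature where one exists.

```python
import json

ALLOWED_ACTIONS = {"approve", "reject", "escalate"}  # hypothetical decision set
MAX_REASON_CHARS = 200                               # hypothetical cap on free text


def parse_decision(raw: str) -> dict:
    """Validate a model reply against a small, bounded decision schema.

    Raises ValueError for anything that is not the expected JSON object
    (json.JSONDecodeError is itself a subclass of ValueError).
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"action", "reason"}:
        raise ValueError(f"unexpected fields: {sorted(data)}")
    action, reason = data["action"], data["reason"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"invalid action: {action!r}")
    if not isinstance(reason, str) or len(reason) > MAX_REASON_CHARS:
        raise ValueError("reason must be a string within the length cap")
    return data
```

A reply that fails validation can be retried or escalated, which is where the reliability benefit comes from: the caller never has to interpret free-form prose.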

Batch what isn't urgent

Nightly reports, backfills, re-indexing — none of it needs a real-time endpoint. Batch APIs run the same work at roughly half price with a relaxed deadline. The architecture decision is simply tagging each job as interactive or deferrable at design time.
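One way to encode that design-time tag, sketched with hypothetical names (`Urgency`, `Job`, `partition`): every job declares its urgency, and a dispatcher splits the work between the real-time endpoint and a batch queue. Submitting to an actual batch API is provider-specific and left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Urgency(Enum):
    INTERACTIVE = "interactive"  # a user or system is waiting on the answer
    DEFERRABLE = "deferrable"    # a relaxed deadline is acceptable


@dataclass(frozen=True)
class Job:
    name: str
    urgency: Urgency  # required field: no job ships without a tag


def partition(jobs: Iterable[Job]) -> tuple[list[Job], list[Job]]:
    """Split jobs into (real-time, batch) lists based on their tag."""
    jobs = list(jobs)
    realtime = [j for j in jobs if j.urgency is Urgency.INTERACTIVE]
    batch = [j for j in jobs if j.urgency is Urgency.DEFERRABLE]
    return realtime, batch
```

Making `urgency` a required field means the interactive-or-deferrable question has to be answered when the job is defined, not discovered later on the invoice.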

majority: share of routine requests a small model handles correctly

~50%: typical batch-API discount for deferred work

4×+: output-vs-input token price gap to engineer around

The order to do it in

Caching first (a prompt-structure change, near-zero risk), output budgets second, routing third (needs an eval suite so you can prove the small model is safe on the easy tier), and batching wherever a deadline allows it. Each step is independently measurable — which means the bill goes down in increments you can show your CFO.
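To show what "independently measurable" can look like, here is a rough cost estimator that applies the steps one at a time. Every number in it is a hypothetical placeholder (prices per million tokens, the cache-read multiplier, the batch multiplier, the workload), so substitute your provider's actual pricing. Routing is omitted for brevity; it would add a second price pair for the share of calls sent to the small model.

```python
def cost_usd(calls: int, in_tok: int, out_tok: int,
             in_price: float = 3.0,               # hypothetical $ per 1M input tokens
             out_price: float = 15.0,             # hypothetical $ per 1M output tokens
             cached_share: float = 0.0,           # fraction of input served from cache
             cache_read_multiplier: float = 0.1,  # hypothetical cached-token price factor
             batch_share: float = 0.0,            # fraction of calls sent via batch
             batch_multiplier: float = 0.5) -> float:  # roughly half price
    """Estimate total spend for a workload of identical calls."""
    cached_in = in_tok * cached_share
    fresh_in = in_tok - cached_in
    per_call = (fresh_in * in_price
                + cached_in * in_price * cache_read_multiplier
                + out_tok * out_price) / 1_000_000
    total = calls * per_call
    # Batched calls pay the discounted rate; the rest pay full price.
    return total * (1 - batch_share) + total * batch_share * batch_multiplier


# Apply the steps in order and record the bill after each one.
baseline = cost_usd(100_000, 2_000, 500)
with_cache = cost_usd(100_000, 2_000, 500, cached_share=0.8)
with_budget = cost_usd(100_000, 2_000, 150, cached_share=0.8)
with_batch = cost_usd(100_000, 2_000, 150, cached_share=0.8, batch_share=0.3)
```

Recomputing the estimate after each change, and comparing it with the actual invoice, gives one number per step rather than a single before-and-after figure.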