Clients arrive having read the leaderboards, asking which model is “best”. It's the wrong question. Benchmark rankings measure performance on benchmark tasks; your invoice-routing workflow is not a benchmark task. Here's the process we actually use to pick models on client projects. It has four steps, and none of them starts with a leaderboard.

Step 1: Define the job before the model

“Process our invoices” decomposes into distinct LLM jobs: read a document, classify an exception, draft a message, decide a route. Each job has different demands. Extraction needs precision on structured output; drafting needs tone; routing needs calibrated uncertainty. The moment you decompose the workflow, “which model?” becomes several smaller, easier questions — and the answer is rarely the same model for all of them.

Step 2: Build the eval before the shortlist

Take 50–200 real examples from your own history and define what a correct output looks like for each one. This is a day or two of work, and it converts model selection from a debate into a measurement. Every candidate model sits the same exam; the winner is whichever passes at the lowest cost and latency. No eval, no opinion, ours included.

The leaderboard question is “which model is smartest?”. The business question is “which is the cheapest model that passes our eval?”. Those answers differ far more often than vendors would like you to know.
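The "cheapest model that passes our eval" rule can be sketched in a few lines of Python. This is a minimal illustration, not our actual harness: the `Example` and `Candidate` types, the exact-match scoring, the per-call costs and the 0.95 threshold are all assumptions made for the example, and the three "models" are toy lambdas standing in for real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str      # input taken from your own history
    expected: str    # the answer a human agreed is correct


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_call: float        # hypothetical unit cost; plug in real pricing
    run: Callable[[str], str]   # wraps whatever API or local model you call


def pass_rate(candidate: Candidate, examples: Sequence[Example]) -> float:
    """Fraction of examples answered exactly as expected."""
    if not examples:
        raise ValueError("the eval set is empty")
    hits = sum(candidate.run(ex.prompt).strip() == ex.expected for ex in examples)
    return hits / len(examples)


def cheapest_that_passes(
    candidates: Sequence[Candidate],
    examples: Sequence[Example],
    threshold: float,
) -> Optional[Candidate]:
    """Return the lowest-cost candidate meeting the threshold, or None."""
    passing = [c for c in candidates if pass_rate(c, examples) >= threshold]
    return min(passing, key=lambda c: c.cost_per_call, default=None)


# Toy stand-ins for real models and real history, purely for illustration.
EXAMPLES = [
    Example("Invoice total exceeds PO by 12%", "price_mismatch"),
    Example("No PO number on invoice", "missing_po"),
]
ANSWERS = {ex.prompt: ex.expected for ex in EXAMPLES}

small = Candidate("small", 0.001, lambda p: ANSWERS.get(p, "unknown"))
large = Candidate("large", 0.030, lambda p: ANSWERS.get(p, "unknown"))
weak = Candidate("weak", 0.0001, lambda p: "unknown")

winner = cheapest_that_passes([large, small, weak], EXAMPLES, threshold=0.95)
print(winner.name if winner else "no candidate passed")  # prints "small"
```

Exact string matching suits classification and extraction jobs; drafting jobs would need a different scorer, and a latency budget could be added as a second filter alongside the pass threshold.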

Step 3: Start at the cheap tier and escalate on evidence

Test the small, fast tier first. If it passes your eval, you're done: you just avoided paying frontier prices for work that didn't need frontier capability. Where it fails, check whether the failures cluster around a common cause before reaching for the bigger model; clustered failures are often fixable with a better prompt or one retrieval improvement. In production this becomes routing: the cheap tier handles the routine majority, the strong tier handles the escalations.
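One way to picture that production routing is a cheap-first call with escalation on low confidence. The sketch below is an assumption-heavy illustration: the `Answer` type, the `confidence` field, the 0.8 cut-off and both stub models are hypothetical, and how you obtain a trustworthy confidence signal (a self-reported score, a validator, a schema check) is a design decision this example does not settle.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Answer:
    text: str
    confidence: float   # assumed to lie in [0, 1]; how you obtain it is up to you


Model = Callable[[str], Answer]


def route(
    task: str,
    cheap: Model,
    strong: Model,
    min_confidence: float = 0.8,
) -> tuple[str, Answer]:
    """Try the cheap tier first; escalate only when it is unsure."""
    first = cheap(task)
    if first.confidence >= min_confidence:
        return "cheap", first
    return "strong", strong(task)


# Stub models for illustration only.
def cheap_stub(task: str) -> Answer:
    sure = "standard" in task
    return Answer("approve", 0.95 if sure else 0.40)


def strong_stub(task: str) -> Answer:
    return Answer("hold for review", 0.90)


print(route("standard invoice, PO matches", cheap_stub, strong_stub)[0])     # prints "cheap"
print(route("credit note with partial refund", cheap_stub, strong_stub)[0])  # prints "strong"
```

The threshold is best set from the eval rather than by feel: pick the lowest value at which the cheap tier's accepted answers still meet the pass bar.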

Step 4: Design for swappability

Whatever you choose will be the wrong choice within a year, because prices drop, models improve and providers deprecate. The defence is architectural: keep prompts and tool schemas provider-agnostic behind one interface, and keep the eval suite as the permanent referee. When a new model ships, evaluating it should take an afternoon (run the suite, read the report), not turn into a migration project.
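A minimal sketch of what "one interface" can mean in practice, assuming a Python codebase. The `TextModel` protocol, the `CannedModel` stand-in, the prompt wording and the model names are all hypothetical; a real adapter would wrap a provider SDK inside `complete()`, and nothing here reflects any particular vendor's API.

```python
from typing import Protocol


class TextModel(Protocol):
    """The single interface the rest of the application depends on."""

    name: str

    def complete(self, prompt: str) -> str: ...


class CannedModel:
    """Stand-in adapter. A real adapter would call a provider SDK in complete()."""

    def __init__(self, name: str, reply: str) -> None:
        self.name = name
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


# The prompt lives outside any adapter, so it survives a provider change.
CLASSIFY_PROMPT = "Classify the exception on this invoice in one word:\n{invoice}"


def classify_exception(model: TextModel, invoice_text: str) -> str:
    """Application code: knows the interface, not the provider."""
    reply = model.complete(CLASSIFY_PROMPT.format(invoice=invoice_text))
    return reply.strip().lower()


current = CannedModel("provider-a-small", "Missing_PO\n")
challenger = CannedModel("provider-b-small", "missing_po")

# Same call site, different model behind it.
for model in (current, challenger):
    print(model.name, classify_exception(model, "Invoice with no PO number"))
```

Because the calling code only sees `TextModel`, trying a new model means writing one adapter and rerunning the eval suite against it.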

50–200: real examples make selection a measurement
cheapest-that-passes: the actual selection criterion
1 afternoon: to re-evaluate when a new model ships

What about open source and self-hosting?

Worth it in two honest cases: hard data-residency requirements that rule out API providers, or sustained volume high enough that GPU economics beat per-token pricing. Below that bar, self-hosting trades a bill you can see for an ops burden you can't. Run the same eval either way — residency changes where the model runs, not the standard it has to meet.
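The volume case comes down to simple arithmetic: self-hosting wins once the per-token saving, multiplied by monthly volume, exceeds the fixed monthly cost. The sketch below shows that calculation with made-up figures; the function name, the cost inputs and every number in the example are assumptions, and the fixed cost only means something if it includes the ops burden (on-call, upgrades, idle capacity), not just GPU rental.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_mtok: float,
    selfhost_price_per_mtok: float = 0.0,
) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API.

    Prices are per million tokens. Returns infinity if self-hosting
    saves nothing per token, since no volume can then recover the fixed cost.
    """
    saving_per_mtok = api_price_per_mtok - selfhost_price_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")
    return fixed_monthly_cost / saving_per_mtok * 1_000_000


# Made-up numbers: 6,000 per month for GPUs and ops, 2.00 per million tokens
# via API, 0.50 per million tokens marginal cost when self-hosted.
volume = break_even_tokens_per_month(6_000, 2.00, 0.50)
print(f"{volume:,.0f} tokens per month")  # prints "4,000,000,000 tokens per month"
```

With these illustrative inputs the bar sits at four billion tokens a month, sustained; a workload that only touches that level in peak weeks would still be paying for idle hardware the rest of the time.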