Clients arrive having read the leaderboards, asking which model is “best”. It's the wrong question. Benchmark rankings measure performance on benchmark tasks; your invoice-routing workflow is not a benchmark task. Here's the process we actually use to pick models on client projects. It has four steps, and none of them starts with a leaderboard.
Step 1: Define the job before the model
“Process our invoices” decomposes into distinct LLM jobs: read a document, classify an exception, draft a message, decide a route. Each job has different demands. Extraction needs precision on structured output; drafting needs tone; routing needs calibrated uncertainty. The moment you decompose the workflow, “which model?” becomes several smaller, easier questions — and the answer is rarely the same model for all of them.
Step 2: Build the eval before the shortlist
Take 50–200 real examples from your own history and define what correct looks like. This is a day or two of work and it converts model selection from a debate into a measurement. Every candidate model runs the same exam; the winner is whichever clears your pass threshold at the lowest cost and latency. No eval, no opinion, ours included.
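To make "same exam, lowest cost and latency" concrete, here is a minimal sketch of an eval harness. Everything named in it is an assumption for illustration: the `Example`, `Candidate`, and `Report` types, the flat `cost_per_call` figure, and the tie-breaking rule (cost first, then latency). A real harness would call your provider's client where the sketch takes a plain function.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Example:
    prompt: str    # a real input from your own history
    expected: str  # what "correct" looks like for that input


@dataclass(frozen=True)
class Candidate:
    name: str
    call: Callable[[str], str]  # stand-in for a real model client (hypothetical)
    cost_per_call: float        # assumed flat cost per request


@dataclass(frozen=True)
class Report:
    name: str
    pass_rate: float
    cost_per_call: float
    mean_latency_s: float


def run_eval(candidate: Candidate,
             examples: List[Example],
             is_correct: Callable[[str, str], bool]) -> Report:
    """Run every example through one candidate and summarise the result."""
    if not examples:
        raise ValueError("the eval set must contain at least one example")
    passed = 0
    model_seconds = 0.0
    for ex in examples:
        started = time.perf_counter()
        output = candidate.call(ex.prompt)
        model_seconds += time.perf_counter() - started  # time the model call only
        if is_correct(output, ex.expected):
            passed += 1
    return Report(candidate.name, passed / len(examples),
                  candidate.cost_per_call, model_seconds / len(examples))


def pick_winner(reports: List[Report], min_pass_rate: float) -> Optional[Report]:
    """Among candidates that clear the bar, pick the cheapest, then the fastest."""
    passing = [r for r in reports if r.pass_rate >= min_pass_rate]
    if not passing:
        return None  # nothing passed: fix the prompt or widen the shortlist
    return min(passing, key=lambda r: (r.cost_per_call, r.mean_latency_s))
```

The `is_correct` callback is where the day or two of work goes: exact match suits extraction, while drafting usually needs a rubric or a human check.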
Step 3: Start at the cheap tier and escalate on evidence
Test the small, fast tier first. If it passes your eval, you're done — you just avoided paying frontier prices for work that didn't need frontier capability. Where it fails, check whether the failures cluster (often fixable with a better prompt or one retrieval improvement) before reaching for the bigger model. In production this becomes routing: the cheap tier handles the routine majority, the strong tier handles the escalations.
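In production, the cheap-first pattern might look like the sketch below. The escalation trigger is an assumption: we use a hypothetical `confidence` score returned with each answer, but a schema validation failure or a business rule could serve just as well. The `Answer` type, the `route` function, and the `0.8` threshold are illustrative, not a recommendation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Answer:
    text: str
    confidence: float  # hypothetical 0..1 score; how you obtain it is up to you


# A tier is anything that maps a prompt to an Answer (stand-in for a real client).
Tier = Callable[[str], Answer]


def route(prompt: str, cheap: Tier, strong: Tier,
          threshold: float = 0.8) -> Tuple[str, Answer]:
    """Try the cheap tier first; escalate only when it is not confident enough."""
    first = cheap(prompt)
    if first.confidence >= threshold:
        return "cheap", first
    # Escalation path: the strong tier handles what the cheap tier could not.
    return "strong", strong(prompt)
```

Logging which tier handled each request tells you what share of traffic escalates, and that number drives the production bill.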
Step 4: Design for swappability
Whatever you choose will be the wrong choice within a year — prices drop, models improve, providers deprecate. The defence is architectural: keep prompts and tool schemas provider-agnostic behind one interface, and keep the eval suite as the permanent referee. When a new model ships, evaluating it should be an afternoon (run the suite, read the report), not a migration project.
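One way to sketch "one interface" in code is below. It assumes a hypothetical `ModelClient` protocol and `CompletionRequest` type; `EchoClient` is a stand-in for a real provider adapter, which would translate the request into that provider's API call. Application code depends only on the protocol, so swapping models means writing one new adapter and rerunning the suite.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class CompletionRequest:
    system: str  # provider-agnostic prompt parts; no vendor-specific fields here
    user: str


class ModelClient(Protocol):
    """The single interface application code is allowed to depend on."""

    def complete(self, request: CompletionRequest) -> str: ...


class EchoClient:
    """Stand-in adapter for illustration; a real one would wrap a provider SDK."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, request: CompletionRequest) -> str:
        return f"{self.label}:{request.user}"


def classify_exception(client: ModelClient, invoice_text: str) -> str:
    """Application logic: knows the prompt, knows nothing about the provider."""
    request = CompletionRequest(
        system="Classify the invoice exception.",
        user=invoice_text,
    )
    return client.complete(request)
```

Because `classify_exception` takes the client as an argument, the eval suite can run the same function against every adapter without touching application code.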
What about open-source and self-hosting?
Worth it in two honest cases: hard data-residency requirements that rule out API providers, or sustained volume high enough that GPU economics beat per-token pricing. Below that bar, self-hosting trades a bill you can see for an ops burden you can't. Run the same eval either way — residency changes where the model runs, not the standard it has to meet.
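The volume case comes down to break-even arithmetic, sketched below. It simplifies heavily: self-hosting is treated as one flat monthly figure (hardware plus the ops time you'd actually spend), and API pricing as a single blended price per million tokens. The function names are ours, and every number you feed in should be your own quote, not a figure from this post.

```python
def breakeven_tokens_per_month(self_host_monthly_cost: float,
                               api_price_per_million_tokens: float) -> float:
    """Monthly token volume at which a flat self-hosting bill equals the API bill."""
    if api_price_per_million_tokens <= 0:
        raise ValueError("API price must be positive")
    return self_host_monthly_cost / api_price_per_million_tokens * 1_000_000


def cheaper_option(monthly_tokens: float,
                   self_host_monthly_cost: float,
                   api_price_per_million_tokens: float) -> str:
    """Compare the two bills at a given volume under the flat-cost assumption."""
    api_cost = monthly_tokens / 1_000_000 * api_price_per_million_tokens
    return "self-host" if self_host_monthly_cost < api_cost else "api"
```

If the ops burden is left out of `self_host_monthly_cost`, the break-even point will look lower than it really is.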