The Raven Group
AI

Model selection isn't a model decision

September 28, 2025 · 3 min read

Teams new to building AI features tend to treat model selection as the central decision: which model is best? Should we use GPT or Claude or Gemini? Open source or hosted? This framing leads to long evaluations that conclude with "Claude is slightly better at this category and GPT at that one," and the team ships against whichever one was favored on the day the decision got made. Six months later, a better model from a different provider exists, and switching is a six-week project.

The more useful framing: model selection is an evaluation decision, not a model decision. The best AI feature you can ship is the one with a model swap that takes a day, not a quarter. The architectural pattern that gets you there is straightforward — abstract the model interaction behind a small internal API, run evals against the actual outputs you care about, and let the model choice be a config setting that points to whichever provider is winning this month.

What the evals should test is the thing your product actually does, not generic benchmarks. If your feature summarizes customer emails, your eval is fifty customer emails with hand-written gold-standard summaries, and a scoring function (LLM-as-judge, or a simple rubric) that tells you whether the model's output is acceptable. With this in place, every new model that comes out is a one-day experiment: swap the config, run the evals, look at the numbers. Switch or don't.
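The eval harness can be similarly small. The sketch below assumes the email-summarization example: a list of gold-standard cases, a pluggable scoring function, and a pass rate as the output. The `keyword_rubric` scorer is a deliberately simple stand-in for an LLM-as-judge or a richer rubric.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    email: str          # the input your product actually sees
    gold_summary: str   # a hand-written acceptable summary

def run_evals(summarize: Callable[[str], str],
              cases: List[EvalCase],
              score: Callable[[str, str], bool]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = sum(score(summarize(c.email), c.gold_summary) for c in cases)
    return passed / len(cases)

# Toy rubric: acceptable if the output mentions every gold keyword.
# A real harness would swap in an LLM-as-judge call here.
def keyword_rubric(output: str, gold: str) -> bool:
    return all(word.lower() in output.lower() for word in gold.split())
```

With fifty real cases in `cases` and your production `summarize` function (which internally calls whatever model the config points at), the one-day experiment is literally `run_evals(summarize, cases, keyword_rubric)` before and after the config change.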

The companies that ship durable AI features in 2025 aren't the ones that picked the right model in 2024. They're the ones who built the swap-in/swap-out architecture in 2024 and have changed models four times since. Models will keep getting better, cheaper, and more specialized. The decision you're making isn't which model to use; it's whether your codebase can change its mind without a rewrite.

Want to talk about something in this post? Get in touch.