
Technical Strategy

You’re probably using the wrong model. Here’s how to choose.

Defaulting to the most capable model feels safe. It isn’t. Model selection is one of the biggest cost and latency levers in your AI product — and most founders never touch it.

The question comes up in almost every early conversation I have with a new founder: which model should we use? And the answer I hear them give themselves, before I’ve said anything, is almost always some version of: “GPT-4o, probably. It’s the best.”

It might not be the best for what you’re doing. And even when it is, using it for everything — every classification call, every formatting step, every retrieval rerank — is like hiring a senior engineer to sort your inbox. The capability is there. The judgment isn’t in question. But the cost-to-task ratio is way off, and at scale it starts to really hurt.

Model selection is a decision that most teams make once, casually, and then never revisit. That’s a mistake. It has more impact on your API bill and your product’s responsiveness than almost any other technical choice — and unlike your data model or your architecture, it’s relatively easy to change later if you plan for it from the start.

The actual decision factors

Four things should drive your model choice for any given task. Not benchmarks. Not which company released the most impressive demo last month. These four:

Reasoning complexity. Does this task require multi-step reasoning, nuanced judgment, or creative synthesis? Or is it essentially pattern-matching — extract this field, classify this input into one of five categories, reformat this JSON? The gap between “simple processing” and “genuine reasoning” is real, and for the former, smaller models perform nearly as well as frontier ones. A well-prompted small model can classify a customer support ticket with essentially the same accuracy as a frontier model. It cannot write a nuanced legal memo with the same reliability. Know which one you’re building.

Data sensitivity. If your product handles healthcare records, legal documents, financial data, or anything where the privacy implications of sending data to a third-party API are real — this narrows your options considerably. Closed API models (OpenAI, Anthropic, Google) mean your data leaves your infrastructure. For many verticals, that’s either a legal problem or a sales problem the moment a large customer does due diligence. Self-hosted open models are the answer there, and the quality gap has closed enough that it’s a real option for most tasks.

Latency requirement. How fast does the user need a response? A background report that runs overnight has very different constraints than a conversational interface where someone is waiting for a reply. Larger models are slower. Context length matters too: a prompt that fills a 128k-token window takes longer to process than a tight 2k one. If you’re building a real-time product, latency should be an input to your model choice, not an afterthought.

Cost at scale. This is the one that surprises founders most. The cost difference between frontier models and their smaller counterparts can be an order of magnitude or more per token. That’s barely noticeable during development. At a few thousand active users, it becomes meaningful. At tens of thousands, it can make a real dent in your margins. Do the math early: estimate your average prompt and response lengths, multiply by your expected call volume, and price it out. It’s a 15-minute exercise and it will inform your architecture.
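
Here’s that 15-minute exercise compressed into a few lines of Python. Every number below is made up; substitute your own traffic figures and your provider’s current rate card:

```python
# Back-of-envelope API cost estimate. All numbers are illustrative;
# plug in your own traffic and your provider's current prices.

price_per_1m_input = 2.50     # USD per 1M input tokens (hypothetical)
price_per_1m_output = 10.00   # USD per 1M output tokens (hypothetical)

avg_prompt_tokens = 1_200
avg_response_tokens = 400
calls_per_user_per_day = 20
active_users = 5_000

cost_per_call = (
    avg_prompt_tokens / 1e6 * price_per_1m_input
    + avg_response_tokens / 1e6 * price_per_1m_output
)
monthly_cost = cost_per_call * calls_per_user_per_day * active_users * 30
print(f"~${monthly_cost:,.0f}/month")  # ~$21,000/month at these numbers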

A decision framework

This is how I walk through it with founders. Start with complexity, layer in privacy, then adjust for latency.

Model selection isn't one decision — it's a decision per task type. Most products have 2-3 distinct task profiles, each with different requirements.


The key insight here: most AI products have more than one task type inside them, and those tasks have different requirements. A product that classifies incoming requests, retrieves relevant data, and then generates a response is doing three different things. There’s no rule that says all three have to hit the same model.

Model routing: using different models for different jobs

Once you accept that different tasks have different requirements, the natural conclusion is model routing: intentionally sending different types of requests to different models based on what they need. This isn’t exotic. It’s just applying the same common sense you’d apply to any other infrastructure decision.

A simple example: a small, fast model handles classification and routing. It reads the incoming request and decides what kind of task it is. Then, based on that classification, either a cheap small model (for simple extraction or formatting) or a more capable model (for generation that needs real quality) handles the actual work. You’re not sending every request through your most expensive model. You’re reserving it for requests that actually need it.
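
A sketch of what that routing can look like in code. The model identifiers, the three task labels, and the _call_llm helper are all placeholders, not a real API; wire in whatever client you actually use:

```python
# Minimal routing sketch. Model identifiers and task labels are illustrative.

CHEAP_MODEL = "small-fast-model"    # hypothetical identifier
CAPABLE_MODEL = "frontier-model"    # hypothetical identifier

def _call_llm(model: str, prompt: str) -> str:
    """Stand-in for your actual provider SDK call."""
    raise NotImplementedError("wire up your LLM client here")

def classify_task(request: str) -> str:
    """Cheap model buckets the request; this could just as well be rules-based."""
    label = _call_llm(
        CHEAP_MODEL,
        "Classify this request as EXTRACT, FORMAT, or GENERATE. "
        "Reply with one word.\n\n" + request,
    )
    return label.strip().upper()

def handle(request: str) -> str:
    # Reserve the expensive model for generation that needs real quality.
    task = classify_task(request)
    model = CAPABLE_MODEL if task == "GENERATE" else CHEAP_MODEL
    return _call_llm(model, request)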

This pattern shows up in context engineering (posts earlier in this series touch on it) and in evaluation layers. The routing logic itself is often straightforward — sometimes it’s rules-based, sometimes it’s a small classifier. But the downstream cost and latency savings can be significant.

The abstraction you should build from day one

Here’s a mistake I’ve seen hurt teams: hardcoding model names directly in application logic. Scattered throughout the codebase: model: "gpt-4o", model: "gpt-4o", model: "gpt-4o". When a cheaper model becomes available — or when a new frontier model is released that changes what’s worth paying for — swapping it out means touching thirty files. Not ideal.

The alternative is simple: a thin wrapper around your model calls with a named model config. You define a small set of tiers in one place — something like FAST, STANDARD, BEST — and map them to actual model identifiers in a config file. Your application code calls the tier, not the model directly. When you want to swap in a new model, you change one config value and everything routes correctly.
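
A minimal version of that wrapper, assuming the tier map lives in a hypothetical models.json file (tier names and model identifiers are illustrative, not recommendations):

```python
import json
import pathlib

# models.json maps tier names to concrete identifiers, e.g.
# {"FAST": "small-model-v1", "STANDARD": "mid-model-v1", "BEST": "frontier-v1"}
MODEL_TIERS = json.loads(pathlib.Path("models.json").read_text())

def complete(tier: str, prompt: str) -> str:
    """Application code names a tier (FAST/STANDARD/BEST), never a model."""
    model = MODEL_TIERS[tier]  # the only place a tier resolves to a model
    return _call_llm(model, prompt)

def _call_llm(model: str, prompt: str) -> str:
    """Stand-in for your actual provider SDK call."""
    raise NotImplementedError("wire up your LLM client here")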

This also makes it trivial to A/B test models later. You can split traffic between model versions and compare output quality using your evals. Without the abstraction, that experiment requires significant refactoring. With it, it’s a config change.
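
Building on the wrapper above, one sketch of a per-request split (the candidate model name and the 10% figure are arbitrary examples):

```python
import random

def resolve_model(tier: str) -> str:
    """Per-request resolution so a tier can split traffic between candidates."""
    if tier == "BEST" and random.random() < 0.10:  # 10% to the candidate
        return "frontier-model-v2"  # hypothetical new model under test
    return MODEL_TIERS[tier]
```

Log which model served each response so your evals can compare the two arms; in production you’d likely want sticky assignment per user rather than a coin flip per request.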

Open-source models: when they’re worth it

A few years ago, the quality gap between open models and API models made this an easy choice for most teams: just use the API. That gap has genuinely narrowed. Llama 3 70B, DeepSeek, Qwen — these are real options for production workloads, not just research curiosities.

But self-hosting has real costs that founders underestimate. GPU infrastructure, model serving, scaling, uptime — all of that becomes your problem. The economics only make sense if you’re calling the model very frequently (where the per-token savings justify the infrastructure overhead), or if data privacy requirements leave you no other choice.
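
To see why utilization dominates that math, here’s a rough break-even sketch with deliberately illustrative numbers (throughput in particular varies wildly with model, hardware, and batching):

```python
# Rough break-even check for self-hosting. All numbers are hypothetical.

api_cost_per_1m = 0.50       # USD per 1M tokens via API
gpu_cost_per_hour = 2.00     # USD for a rented GPU
tokens_per_second = 1_000    # serving throughput

self_hosted_per_1m = gpu_cost_per_hour / (tokens_per_second * 3600 / 1e6)
print(f"self-hosted: ${self_hosted_per_1m:.2f} per 1M tokens at full utilization")
# ~$0.56 per 1M here: competitive with the API only if the GPU stays busy,
# and before counting serving, scaling, and on-call time.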

For most V1 products: use the APIs. They’re faster to start with, require no infrastructure management, and the model quality is high. Plan your abstraction layer so you could route to a self-hosted model later if you needed to — but don’t pay the infrastructure tax before it’s justified.

The model is not the product

One more thing worth saying explicitly: the model is a component, not a strategy. I see founders talk about their model choice as if it’s a defensible moat — as if being the company that uses Claude gives you some lasting edge over the one that uses GPT. It doesn’t. Models are commodities, and they’re getting cheaper and more capable every few months.

What’s defensible is everything around the model. Your data. Your context engineering. Your evals that tell you when quality has degraded. Your routing logic that gets the right task to the right model at the right cost. The model is table stakes. The architecture on top of it is where the product actually lives.

So pick your model thoughtfully. Don’t overpay. Build the abstraction layer. Then focus the rest of your energy on the things that actually compound.

Founder Takeaway

List out every distinct type of LLM call your product makes. For each one, ask: does this require genuine reasoning, or is it classification and formatting? Does it touch sensitive data? How fast does the user need a response? That exercise — which takes maybe 30 minutes — will tell you whether you’re using a single model when you should be using two, and whether you’re overpaying for tasks that a cheaper model handles just as well.

Then build a thin abstraction layer — a named tier system — so model names never appear directly in your application logic. This costs an hour now and saves days later when you want to switch, experiment, or route different tasks differently. The best model for your product isn’t always the most capable one. It’s the one that meets your actual requirements at the lowest cost and latency.