Technical Strategy
Your prompts aren’t the problem. Your context is.
Founders spend hours tweaking prompt wording when the real issue is what surrounds the prompt. Context engineering is the discipline most AI products are missing — and it’s not complicated once you see it.
Here’s something I see constantly: a founder has an AI feature that’s producing mediocre outputs. They open the system prompt, change “You are a helpful assistant” to “You are an expert assistant with 20 years of experience,” run it again, and get roughly the same mediocre output. Repeat until frustrated.
The wording of your prompt almost never matters as much as you think. What matters is everything else you put in the context window — the stuff the model actually uses to reason from. That’s context engineering, and it’s where the real leverage is.
First: what is a context window, actually?
When you send a message to an LLM, the model doesn’t see just your message. It sees a chunk of text called the context window — everything you’ve chosen to include in that single API call. The model has no memory outside that window. It can only work with what you give it.
Prompt engineering, in the narrow sense, is about how you word the instructions inside that window. Context engineering is about the whole window — what you put in, what you leave out, how you structure it, and how it changes from one call to the next. The window is a limited resource (measured in tokens, which you pay for). Managing it well is what separates products that feel intelligent from ones that feel dumb.
The context window is everything the model sees. Your prompt is just one layer in it.
Look at that diagram. The user’s current message is one piece. The system prompt is one piece. But there are three other layers that most early-stage AI products handle poorly or ignore entirely: memory, retrieved knowledge, and conversation history. Getting those right is where output quality actually comes from.
The five layers — and where founders go wrong in each one
Layer 1: System prompt. This is the standing instructions you give the model — what it is, how it should respond, what format to use. Most founders nail this eventually. The common mistake is stuffing it with too many edge cases and exceptions until it becomes a wall of text that contradicts itself. Keep it focused. One clear persona, a handful of firm rules, a defined output format. That’s it.
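To make “one clear persona, a handful of firm rules, a defined output format” concrete, here’s a minimal sketch. The product (a billing assistant for a hypothetical “Acme Invoicing”) and every name in it are illustrative, not a prescription:

```python
# A focused system prompt: one persona, a few firm rules, one output
# format. Everything else -- user data, documents, history -- belongs
# in other layers of the context window, not here.
SYSTEM_PROMPT = """\
You are a billing assistant for Acme Invoicing.

Rules:
- Answer only questions about the user's invoices and payments.
- If you don't have the data to answer, say so plainly.
- Never guess amounts or dates.

Output format: a one-sentence answer first, then bullet-point details
only if the user needs them.
"""

# In the standard chat-message shape, this becomes the first message.
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
```

If your system prompt is much longer than this and full of edge-case exceptions, that's usually a sign that knowledge or state is leaking into layer 1 when it belongs in layers 2 or 3.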
Layer 2: Memory and state. Does the model know anything about this specific user? Their preferences, what they’ve done before, what they told you in a previous session? Most V1 products: no. The model starts cold every single time. That’s fine for simple use cases, but the moment your product needs to feel like it knows the user, you need to inject a user-state summary into every call. This doesn’t have to be fancy — even a few sentences about what this user does and what they’ve asked before makes outputs dramatically more relevant.
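Injecting user state can be as simple as a helper that compresses what you know into a few sentences and prepends it to every call. This is a sketch with made-up profile fields (`name`, `plan`) — the shape of your user record will differ:

```python
def build_user_state(profile: dict, recent_actions: list[str]) -> str:
    """Compress what we know about this user into a few sentences."""
    lines = [f"User: {profile['name']}, plan: {profile['plan']}."]
    if recent_actions:
        # Only the last few actions -- this is a summary, not a log dump.
        lines.append("Recent activity: " + "; ".join(recent_actions[-3:]) + ".")
    return " ".join(lines)

state = build_user_state(
    {"name": "Dana", "plan": "Pro"},
    ["created invoice #1042", "asked about late fees"],
)

# Inject the state summary alongside the system prompt on every call.
messages = [
    {"role": "system", "content": "You are a billing assistant."},
    {"role": "system", "content": "What we know about this user: " + state},
    {"role": "user", "content": "What were we discussing last time?"},
]
```

Even this stub version means the model no longer starts cold; swap the hardcoded dict for a real lookup and you have layer 2.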
Layer 3: Retrieved knowledge. This is the RAG layer — facts, documents, data the base model doesn’t know. I covered RAG vs. fine-tuning in the last post, so I won’t relitigate it here. But the context engineering piece that matters: don’t just dump everything you retrieved into the window. Rank it. Trim it. Put the most relevant chunk nearest the user’s question. LLMs exhibit what’s sometimes called “lost in the middle” behavior — they attend better to content at the beginning and end of a long context. Placement matters.
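Here’s what “rank it, trim it, put the most relevant chunk nearest the question” looks like in miniature. The word-overlap scorer is a deliberately crude stand-in — a real system would use embedding similarity or a reranker model:

```python
def select_chunks(query: str, chunks: list[str], score_fn, max_chunks: int = 3) -> list[str]:
    """Rank retrieved chunks, keep the top few, and order them so the
    MOST relevant chunk sits last -- i.e. closest to the user's
    question at the end of the context."""
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    kept = ranked[:max_chunks]
    return list(reversed(kept))  # least relevant first, most relevant last

def overlap(query: str, chunk: str) -> int:
    """Toy relevance score: shared words between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = [
    "Refund policy: refunds within 30 days.",
    "Invoice #1042 was issued on March 3 for $450.",
    "Our office is closed on public holidays.",
]
context = select_chunks("when was invoice 1042 issued", chunks, overlap, max_chunks=2)
# The invoice chunk ends up last, nearest the question.
```

The reversal is the “lost in the middle” countermeasure: whatever you most need the model to attend to goes at the edge of the context, not buried in the middle.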
Layer 4: Conversation history. Multi-turn products — anything with a chat interface — have to decide how much prior conversation to include in each call. The naive approach: include everything. The problem: the history grows without bound, gets expensive fast, and can actually degrade quality as earlier turns become less relevant. A smarter approach: keep a rolling window of the last N turns, and for longer sessions, include a compressed summary of earlier conversation instead of the raw transcript. Summarize the history, don’t just truncate it.
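The rolling-window-plus-summary pattern is a few lines of plumbing. The `naive_summary` function here is a placeholder — in practice the summarizer would itself be a cheap LLM call:

```python
def build_history(turns: list[dict], keep_last: int = 4, summarize_fn=None) -> list[dict]:
    """Keep the last few turns verbatim; compress everything older
    into a single summary message.

    turns: list of {"role": ..., "content": ...} dicts, oldest first.
    """
    recent = turns[-keep_last:]
    older = turns[:-keep_last]
    if not older:
        return recent
    summary = summarize_fn(older)
    return [{"role": "system",
             "content": "Summary of earlier conversation: " + summary}] + recent

def naive_summary(turns: list[dict]) -> str:
    """Placeholder -- a real version would be an LLM summarization call."""
    user_bits = "; ".join(t["content"][:40] for t in turns if t["role"] == "user")
    return f"{len(turns)} earlier turns covering: {user_bits}"

# Eight identical stand-in turns just to show the shape of the output.
history = build_history(
    [{"role": "user", "content": "Hi"}] * 8,
    keep_last=4,
    summarize_fn=naive_summary,
)
```

With eight turns and `keep_last=4`, the model sees five messages: one summary plus the last four turns verbatim, instead of a transcript that grows forever.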
Layer 5: The current input. The user’s actual message. You’d think this one is outside your control — they type what they type. But you can rewrite it before it hits the model. Normalize ambiguous queries. Expand abbreviations. Add implicit context (“the user is asking about their most recent invoice”). Augmenting the raw input before the LLM sees it is one of the most effective and underused techniques at the V1 stage.
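A minimal input-rewriting pass might just expand known abbreviations and attach the implicit context. The abbreviation table and the pronoun-resolution note are illustrative; a production version might add a cheap LLM call to resolve references like “it” or “that one”:

```python
ABBREVIATIONS = {"inv": "invoice", "acct": "account", "pmt": "payment"}

def rewrite_input(raw: str, implicit_context: str) -> str:
    """Normalize the user's raw message before the main model sees it:
    expand abbreviations, then append the implicit context."""
    words = [ABBREVIATIONS.get(w.lower(), w) for w in raw.split()]
    expanded = " ".join(words)
    return f"{expanded}\n\n(Context: {implicit_context})"

msg = rewrite_input(
    "why is my inv late?",
    "the user is asking about their most recent invoice, #1042",
)
```

The model now answers a precise question about a specific invoice instead of guessing what “inv” refers to — no prompt-wording change required.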
The token budget problem
Context windows have gotten enormous. Models that can handle 128k or 200k tokens are common now. Founders see those numbers and conclude: great, I’ll just include everything. Don’t.
Larger contexts are slower and more expensive. But more importantly, quality often degrades as context grows. The model has to attend across more content, and irrelevant noise gets in the way of the relevant signal. A tightly curated 8k-token context usually produces better outputs than a bloated 80k-token one on the same task. You want to maximize signal-to-noise in the window, not tokens.
Think of it like briefing a contractor before a job. More pages don’t help if most of them are irrelevant. A one-page brief with the right information beats a twenty-page brief where they have to go hunting for the key parts.
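One way to enforce curation is to give each layer an explicit token budget and trim to it. The per-layer split below is an illustration for a roughly 8k-token window, not a prescription, and the word-count heuristic is a crude stand-in for counting real tokens with the model’s tokenizer:

```python
# Illustrative per-layer budget for a ~8k-token window. Tune per product.
BUDGET = {
    "system_prompt": 500,
    "user_state": 300,
    "retrieved_knowledge": 3000,
    "history": 3000,
    "current_input": 1200,
}

def fit_to_budget(text: str, max_tokens: int, tokens_per_word: float = 1.3) -> str:
    """Crude truncation by estimated token count. In production, count
    real tokens with the model's tokenizer instead of words."""
    max_words = int(max_tokens / tokens_per_word)
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " [trimmed]"
```

The point isn’t the exact numbers — it’s that every layer has a ceiling, so no single layer can silently bloat the window and drown the signal.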
A practical framework for your first pass
When I look at an early-stage AI product that’s producing mediocre outputs, this is the order I ask about. Not prompt wording — context layers.
Start with layer 2 (memory). Does the model know who it’s talking to? If not, add a user summary — even a stub. Then layer 3 (knowledge): what does the model need to know that it doesn’t? Retrieve that specifically. Then layer 4 (history): how much conversation context are you carrying, and is it the right amount? Last, revisit layer 1 (system prompt) and trim anything that’s ambiguous or contradictory.
Only after all of that do I look at prompt wording. Because usually by then, the output is already noticeably better, and the wording barely matters.
Why this matters more at V1 than later
The reason I push hard on context engineering early: it’s cheap to fix now and expensive to fix later. If you build a context strategy that works at 100 users, it scales gracefully. If you build something that just concatenates everything and hopes for the best, you’ll hit a wall — performance degrades, costs spike, and the fix requires rearchitecting the data layer.
And here’s the other thing: good context engineering is what makes your product feel like it’s paying attention. Not the model — the model is the same for everyone. The product that surfaces the right memory, the right knowledge, the right conversation history at the right moment feels genuinely intelligent. That’s your product doing that. The model is just the inference engine.
Founder Takeaway
Before you spend another hour rewriting your system prompt, do this: print out (or log) exactly what you’re sending to the model on a real user request. The full context window — system prompt, history, retrieved chunks, user message, all of it. Read it like you’re the model. Is there anything in there that would confuse you? Is there anything important that’s missing?
That exercise alone will tell you more about why your outputs are mediocre than any amount of prompt tuning. Context engineering isn’t a new framework or a tool you need to buy — it’s the discipline of being intentional about every token you send to the model. Curate the window. Maximize signal. The model will do the rest.