
Intelligence is cheap. Context is expensive.

After thousands of hours working with frontier LLMs, I’ve come to realize that a model’s ‘raw intelligence’ is no longer what’s holding it back. Not for most tasks, anyway. To go a step further: considering intelligence alone, the top models today (GPT-5-Pro, GPT-5-Thinking, o3, Gemini-2.5-Pro, etc.) are smarter than me. What they lack is context, and I suspect this will continue to be true for a long time.

Most of the focus on LLMs has rightly been on model capabilities: reasoning, factual knowledge, code generation. But I think we’ve reached a point where model intelligence is not the real bottleneck. Performance differences between users aren’t primarily about prompt engineering tricks or having access to the latest model. They’re about something much harder: extracting and encoding the implicit context that determines whether output is actually useful.

The context gap

Here’s a simple test. Ask Claude-4-Opus or GPT-5 to write a product requirements document for a feature at your company. Then ask a colleague to do the same task. The model will give you something that looks professional – proper sections, reasonable user stories, clean formatting. Your colleague will give you something that actually reflects how decisions get made at your company. They know that the CEO hates anything that increases support tickets. They remember that the last time someone proposed a similar feature, it got killed because of data privacy concerns. They understand that “quick win” in your organization means “ships in the current quarter without additional headcount”.

This isn’t because your colleague is smarter than GPT-5. It’s because they have access to context that’s nearly impossible to encode in a prompt without significantly more effort: the unwritten rules, competing priorities, and organizational dynamics that determine what actually gets built.

A simple framework

Performance varies because of three factors:

Performance = Model × Task Clarity × Context

Model capability keeps rising, but it’s well beyond what’s needed for many tasks. The variance in outcomes is increasingly dominated by how well you can specify what you want (Task Clarity) and surface the relevant information the model needs to succeed (Context).

This is why the same person can get dramatically different results from the same model on seemingly similar tasks. It’s not the model that’s inconsistent, it’s the quality of specification and context extraction.
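To make the multiplicative intuition concrete, here’s a toy sketch. The numbers and the scoring function are made up purely for illustration; the point is that because the factors multiply, one weak factor caps the outcome no matter how strong the model is.

```python
# Toy illustration of Performance = Model x Task Clarity x Context.
# All numbers are invented; the point is that the factors multiply,
# so one weak factor caps the outcome regardless of model strength.

def performance(model: float, task_clarity: float, context: float) -> float:
    """Each factor is a score in [0, 1]; performance is their product."""
    return model * task_clarity * context

# A frontier model with a vague ask and thin context...
print(performance(model=0.95, task_clarity=0.4, context=0.3))  # ~0.11

# ...versus a slightly weaker model with a sharp spec and rich context.
print(performance(model=0.85, task_clarity=0.9, context=0.9))  # ~0.69
```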

Implicit context

Part of what makes this so hard is that there are many types of implicit context that would be useful to the model but are hard to define or encode in text. I’ve tried to outline a few, but these are neither comprehensive nor completely distinct (e.g., you might say that taste and format preferences are a form of tacit knowledge). Regardless, consider this a decent starting point for thinking about what you should try to provide to the model for the best performance.

1. Tacit knowledge

The things insiders “just know” but never write down, e.g., “in our company, everyone understands that customers under $10K ARR churn at completely different rates than enterprise clients, and for different reasons, but this isn’t documented anywhere”. If you asked an LLM to analyze your customer retention without specifying this, the analysis would be fundamentally wrong.

2. Higher-order goals

The real objective that wins trade-offs when push comes to shove. You might ask for a marketing plan to “increase brand awareness,” but if your actual constraint is that every initiative needs measurable ROI within 90 days, the model needs to know that. Otherwise you’ll get a beautiful strategy that’s completely impractical.

3. Local constraints

The ‘physics’ of your specific environment: budget limits, data availability, team capacity, regulatory requirements, technical debt. These are typically very locally defined; they’re the specific limitations and resources you and your company are working within right now.

4. Taste and format preferences

How output needs to look and sound to be accepted by your audience. Some managers want one-page memos with clear recommendations. Others prefer detailed analysis. Some organizations use specific terminology or avoid certain jargon. The substance might be identical, but the wrong style or format can render it all irrelevant.

5. Organizational theory of mind

Your live mental model of key stakeholders: their goals, constraints, communication styles, and current priorities. This is often the most important and hardest to articulate. For example: “the head of sales cares about anything that might disrupt Q4 numbers”, “the engineering manager is under pressure to reduce technical debt”, “the board has been annoyed by missing deadlines, so the CEO won’t approve anything overly aggressive”.

A concrete example

Say you’re tasked with evaluating whether to build a new integration with a popular tool. Most people prompt something like:

“Help me analyze whether our software company should build an integration with [Tool X]. Include pros and cons, resource requirements, and a recommendation.”

That gets you a generic analysis that covers obvious points but misses what actually matters. Compare that to including just some of the relevant context:

“Help me analyze whether we should build an integration with [Tool X]. Context: We’re a B2B fintech SaaS company with 50 enterprise customers averaging $25K ARR. Our engineering team of 8 is already committed through Q2. Our biggest competitor launched this integration 6 months ago, and we’ve lost 2 deals specifically citing its absence. Our CEO’s top priority is reaching $2M ARR by year-end, but our Head of Engineering is concerned about technical debt from rushed integrations we built last year. The sales team wants this yesterday, but our product manager thinks we should focus on core platform stability. All else being equal, the Head of Engineering will get the final say, but I report to the product manager, so I need to give them something to work with, because they’re being asked to give a recommendation to the CEO on how to hit the revenue target.”

The second prompt produces analysis that’s actually useful because it optimizes for your real constraints and can weigh the actual trade-offs. The example above is by no means the limit of doing this well; it’s simply an illustration of the types of context you should try to include in your prompts.
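One lightweight way to make this repeatable is to keep a checklist of the implicit-context categories above and fill it in before you prompt. The sketch below is purely illustrative: the field names, the `build_prompt` helper, and the example values are all things I’ve made up here, not any particular library’s API.

```python
# A minimal sketch of a reusable context checklist for prompts.
# The dataclass fields mirror the five categories above; everything here
# (names, helper, example values) is illustrative, not a real library.
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    tacit_knowledge: list[str] = field(default_factory=list)
    higher_order_goals: list[str] = field(default_factory=list)
    local_constraints: list[str] = field(default_factory=list)
    format_preferences: list[str] = field(default_factory=list)
    stakeholders: list[str] = field(default_factory=list)


def build_prompt(task: str, ctx: TaskContext) -> str:
    """Concatenate the task with whichever context fields are filled in."""
    sections = {
        "Tacit knowledge": ctx.tacit_knowledge,
        "Real objectives": ctx.higher_order_goals,
        "Constraints": ctx.local_constraints,
        "Format preferences": ctx.format_preferences,
        "Stakeholders": ctx.stakeholders,
    }
    lines = [task, "", "Context:"]
    for name, items in sections.items():
        for item in items:
            lines.append(f"- {name}: {item}")
    return "\n".join(lines)


prompt = build_prompt(
    "Help me analyze whether we should build an integration with Tool X.",
    TaskContext(
        local_constraints=["Engineering team of 8 is committed through Q2"],
        higher_order_goals=["CEO's top priority is $2M ARR by year-end"],
        stakeholders=["Head of Engineering gets final say; worried about tech debt"],
    ),
)
print(prompt)
```

Filling in even two or three of these fields forces you to surface context you would otherwise leave in your head.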

Agents as context extractors

In some domains, agents are increasingly doing the context-extraction work for us. Systems like Deep Research, Claude Code, or ChatGPT with connectors (e.g., to Google Drive, SharePoint, Dropbox, etc.) represent a shift in how we solve the context problem.

Agents often work by automating context assembly. They iteratively search and plan, alternating between “think → look → compress → think”, improving the context with each pass. They delegate to sub-agents that parallelize discovery: one hunts primary sources, another extracts metrics, a third builds a timeline. Each returns compressed context to an orchestrator that synthesizes everything into a final response.
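As a sketch of that loop (purely schematic: the `plan`, `search`, and `compress` functions are stand-ins for an LLM call, a retrieval step, and a summarization step, not any specific agent framework’s API):

```python
# Schematic sketch of the "think -> look -> compress -> think" loop with
# parallel sub-agents. Every function here is a stub standing in for an LLM
# call, a retrieval step, or a summarization step.
from concurrent.futures import ThreadPoolExecutor


def plan(question: str, notes: list[str]) -> list[str]:
    """'Think': decide what to look for next (stub: fixed queries)."""
    return [f"find sources for: {question}", f"extract metrics for: {question}"]


def search(query: str) -> str:
    """'Look': fetch raw material for one query (stub)."""
    return f"raw documents matching '{query}'"


def compress(raw: str) -> str:
    """'Compress': boil raw material down to the few facts that matter (stub)."""
    return f"summary of {raw}"


def research(question: str, passes: int = 2) -> str:
    notes: list[str] = []
    for _ in range(passes):
        queries = plan(question, notes)            # think
        with ThreadPoolExecutor() as pool:         # delegate to sub-agents
            raw = list(pool.map(search, queries))  # look (in parallel)
        notes.extend(compress(r) for r in raw)     # compress
    return "\n".join(notes)                        # orchestrator synthesizes


print(research("Should we build the Tool X integration?"))
```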

The magic isn’t just the final answer – it’s the system’s ability to go get what the answer depends on. When you point these agents at your Google Drive or SharePoint, performance jumps in part[1] because the model no longer depends on you to manually ferry documents and tribal knowledge into the prompt. The system spends most of its tokens fetching and compressing the right internal context.

The partial solution

This automation solves a significant portion of the context problem. Agents excel at gathering what’s already documented. They can learn your organization’s writing style by reading past proposals, understand format preferences from old presentations, and possibly even infer some tacit knowledge by analyzing patterns across documents, emails, and Slack conversations.

But this is only a partial solution. Much of the most important context – especially organizational dynamics, unstated preferences, and recent informal decisions – isn’t expressed in text anywhere. The head of engineering’s growing frustration with technical debt, the unspoken understanding that certain customer segments aren’t worth pursuing, the CEO’s relationship with and mandate from the board. This context lives in conversations, body language, and institutional memory that is rarely written down.

This creates an interesting dynamic: while agents democratize access to documented knowledge, they also increase the premium on context that can’t be easily extracted. Skilled users who can surface this undocumented context will continue to extract significantly better performance from frontier models.

The capability paradox

As model capabilities expand, the frontier tasks become more context-hungry, not less. Better models don’t just make existing tasks easier, they raise our expectations and make us attempt more complex, nuanced work that requires even richer context.

When GPT-3 could barely write coherent paragraphs, we were impressed by any reasonable output. Now that GPT-5 can write sophisticated analysis, or complete a multi-hour task like building an entire website, we expect it to understand our specific industry, company dynamics, and strategic context. The ceiling rises, but so does the price of admission: a clean specification plus the right evidence.

This means that as models get more capable, the variance between average users and sophisticated users may actually increase. The most skilled users will push these systems to tackle increasingly complex tasks that require deeper context extraction, while the average user trying the same task without the hard work of curating the right context will see underwhelming results.

Why this matters

We’re entering a world where raw intelligence is increasingly commoditized, but the ability to effectively specify tasks and extract relevant context becomes the key differentiator. The people getting the most value from AI aren’t necessarily the most technically sophisticated users; they’re the ones who best understand their domain and can most clearly articulate the implicit context that shapes whether any solution will actually work. They’ve built a mental model of the LLM that helps them see where it succeeds and fails, and how to provide the right harness to extract the best performance.

If you feel like LLMs are underwhelming for complex work tasks, the solution probably isn’t waiting for smarter models. It’s developing the ability to recognize and extract the implicit context that usually stays in your head. Try choosing a difficult task that you’d normally expect a colleague to spend a couple of hours on. Something entirely text-based. Spend some time asking yourself:

  • What do I know about this domain that the model doesn’t?
  • What constraints am I operating under that aren’t obvious?
  • Who are the stakeholders, and what do they actually care about?
  • What would make output useful versus just correct?
  • What are the unwritten rules that determine success in this context?

The models are already smart enough. The question is whether you can give them what they need to be useful.



  1. The other reason performance jumps with agents is simply that they spend far more tokens than a standard chatbot.