<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Joel Jensen</title>
    <link>https://joelmjensen.com</link>
    <atom:link href="https://joelmjensen.com/feed.xml" rel="self" type="application/rss+xml" />
    <description>Essays on AI, economics, software, and projects by Joel Jensen.</description>
    <language>en-us</language>
    <lastBuildDate>Thu, 26 Mar 2026 17:00:00 PDT</lastBuildDate>
    
    
    <item>
      <title>CorpusBench</title>
      <link>https://joelmjensen.com/posts/corpusbench/</link>
      <guid>https://joelmjensen.com/posts/corpusbench/</guid>
      <pubDate>Thu, 26 Mar 2026 17:00:00 PDT</pubDate>
      
      <description>A new agentic customer service benchmark focused on using historical business context to infer policy</description>
      
<content:encoded><![CDATA[<p><a href="http://corpusbench.com/" target="_blank" rel="noopener">CorpusBench</a> is a new agentic customer service benchmark, designed to test models in a more realistic scenario than benchmarks like <a href="https://taubench.com/" target="_blank" rel="noopener">taubench</a>. It does this by providing the agent with various issues to resolve, but no policy guidance in the prompt. What the agent does receive is access to the simulated business’s historical data - emails with customers, internal communication regarding policy changes, order history, product catalogue, etc. The agent’s job is to use this to infer the correct policy to apply, and then apply it. The agent is measured on two things: whether it takes the correct actions, and whether it provides the correct rationale for them.</p>
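<p>To make the setup concrete, here is a hypothetical sketch of a single task instance and its two-axis grading. The field names, task data, and scoring are my invention for illustration - they are not CorpusBench’s actual schema:</p>

```python
from dataclasses import dataclass

@dataclass
class Task:
    issue: str              # the customer issue the agent must resolve
    gold_action: str        # the correct action to take
    rationale_points: list  # corpus facts a correct rationale should cite

def grade(task: Task, action: str, rationale: str) -> dict:
    """Score the two axes described above: correct action, correct rationale."""
    cited = [p for p in task.rationale_points if p.lower() in rationale.lower()]
    return {"action_correct": action == task.gold_action,
            "rationale_recall": len(cited) / len(task.rationale_points)}

task = Task(issue="Refund requested for a dress bought 40 days ago",
            gold_action="refund_store_credit",
            rationale_points=["30-day window", "store credit after 30 days"])
result = grade(task, "refund_store_credit",
               "Past the 30-day window; policy emails show store credit after 30 days.")
```

<p>The point of the second axis is that an agent can stumble into the right action for the wrong reason; checking the rationale against the corpus catches that.</p>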
<p><strong>Why build it?</strong></p>
<p>When I wrote about <a href="https://joelmjensen.com/posts/why-isnt-ai-diffusing-faster/" target="_blank" rel="noopener">what was holding back AI diffusion</a>, one of the examples I gave that had worked extremely well for deploying AI was this:</p>
<blockquote>
<p>The second breakthrough, which was likely even more important, was that instead of trying to hardcode behavior into prompts, we simply instruct the agent to begin by searching Gmail for the most recent similar cases, and use this to inform the response. Initially I didn’t realize what a huge improvement this would be but it quickly became clear. Not only do we not have to maintain prompts, but the model naturally picks up the style, tone and length of our responses. It also inherits our policies, like how we handle refunds or returns, and when we deviate from the default, because it can see how we’ve handled it in the past. Perhaps the greatest benefit of this approach is that in a way, it learns over time. I’m using the word loosely of course. But consider the following case: the agent searches for similar cases and drafts a response. The user decides the response was wrong and edits it before sending. Next time the agent faces that situation, or any that are similar, it follows the most recent behavior. This required no explicit intervention from the user.</p>
</blockquote>
<aside class="pull-note">
  <p>I do not think it is a coincidence that many of the largest AI use cases share this characteristic: coding has the codebase, customer service has the historical cases, and legal work has the contracts and documents, along with historical redlines. In each case, a sufficiently intelligent agent can use these things to infer how to do work correctly without needing to be explicitly instructed.</p>
</aside>
<p>This pattern is incredibly powerful for deploying agents into a business. If the agent can access historical precedent, it can infer huge amounts of what it needs to learn within a single context window. As memory improves this becomes more powerful still. Not all use cases support this, but when they do, deployment is theoretically far easier, because we can potentially realize the dream of the drop-in digital worker - an agent that can simply be connected to relevant systems and data, and learn all that it needs to complete some piece of work while adhering to the correct policies.</p>
<p>However, as far as I can tell, existing benchmarks do not test this. The canonical agentic customer service benchmark is taubench from Sierra. While it is an excellent benchmark, the agent is given the full policy manual to apply in the prompt. It is still testing something valuable, but it is not testing the agent’s ability to search and gather the correct context. CorpusBench aims to solve this.</p>
<p>For all the excitement about agents in greenfield projects, and their usefulness there, the total set of opportunities is dominated by transforming existing businesses and their workflows. This is much harder, for many reasons. One of the main ones is that context is dispersed across the organization. It’s in different systems and formats, much of it is tacit, and large amounts of it are outdated. CorpusBench is an early attempt to measure an agent’s ability to solve this problem when provided with historical data access.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Why isn’t AI diffusing faster?</title>
      <link>https://joelmjensen.com/posts/why-isnt-ai-diffusing-faster/</link>
      <guid>https://joelmjensen.com/posts/why-isnt-ai-diffusing-faster/</guid>
      <pubDate>Sat, 07 Mar 2026 16:00:00 PST</pubDate>
      
      <description>A six-month check-in on automating a small ecommerce business with frontier models</description>
      
      <content:encoded><![CDATA[<p>For the last three and a half years I’ve been the persistent (annoying?) friend telling everyone in my life that LLMs were going to be a bigger deal than they were expecting. Today this is definitely closer to the consensus view, especially if you’ve spent much time with a frontier model. That said, it’s probably still underappreciated. About six months ago I was having a discussion with a friend about enterprise AI adoption, and how we must be close to a capability level that would enable broad-scale deployment. We were debating deployment speed, and the likely limiting factors. One of my beliefs was that the models were good enough to start driving real productivity gains, but that access to talent and inertia were two of the limiting factors. Over the next few days as I kept thinking about this, I realized that my family runs a small ecommerce business, and that if I believed in automation for an enterprise, then automation for a small business should be much easier. Even better, the limiting factors should be less of a problem: I could build the tools myself, and changing a 2-person business should be much easier than a 20,000-person business. This is my six-month check-in to share what has worked and what hasn’t, and what I think it means for the diffusion of AI.</p>
<p>The business at hand is a girls’ clothing store based in Australia. It’s hosted on Shopify and otherwise uses a very simple tech stack: Gmail, Klaviyo, Facebook and Instagram. There are two employees. Most of the business involves buying and importing the products, running the online store, managing orders and returns, customer service, and acquiring new customers through various marketing channels. Very stock-standard, and fairly digital, with the obvious exception of packing and sending orders.</p>
<h2>Where to begin</h2>
<p>I decided to start with something simple: business analysis. This is well and truly in-scope for frontier models, and a common pain point. Shopify makes it easy to see if revenue is up or down, but frankly their reporting is poor. It’s too confusing and makes it much too hard for a layperson to get the insights they need. I wanted something that could answer questions like:</p>
<ul>
<li>Why is revenue up this month? (price, volume, or mix)</li>
<li>Which products, or types of products, should we sell more or less of?</li>
<li>Which products are overstocked, and how much should they be discounted?</li>
</ul>
<h2>Business analysis</h2>
<aside class="pull-note">
  <p>This was a bad decision by the way, and I recommend that you use the <a href="https://opencode.ai/docs/sdk/" target="_blank" rel="noopener">OpenCode SDK</a> rather than building your own. More on this below.</p>
</aside>
<p>To begin, I built what I later turned into <a href="https://rocky-web-production.up.railway.app/login" target="_blank" rel="noopener">Rocky</a>, a custom <a href="https://x.com/_philschmid/status/2008175408923959574?s=20" target="_blank" rel="noopener">agent harness</a>. This gave me a simple chat app with the ability to add custom tools, and, importantly, it was provider-agnostic. That was a non-negotiable for me given how often the leading model changes, and because I feel quite confident selecting the right model for a given task, at least for the moment.</p>
<aside class="pull-note">
  <p>If you do want to build a provider-agnostic chat app, and you don’t use something like OpenCode, I recommend you use the <a href="https://ai-sdk.dev/docs/introduction" target="_blank" rel="noopener">Vercel AI SDK</a>. It strikes a good balance between simplicity and avoiding excessive abstractions that make it hard to debug what is being sent to the model. <a href="https://ai.pydantic.dev/" target="_blank" rel="noopener">Pydantic AI</a> is also a good choice.</p>
  <p>The SQL tool reads from a Postgres database that I set up with tables for all sales and product information, updated with an hourly cron job.</p>
</aside>
<p>Once this was built, and I could have a simple back-and-forth conversation, I began connecting tools. There are lengthy blog posts to be written on this alone, and on how to balance the tradeoff between fewer, more flexible tools and a larger number of tightly scoped ones. The single best resource I found for this is <a href="https://www.anthropic.com/engineering/writing-tools-for-agents" target="_blank" rel="noopener">Anthropic’s excellent post</a> about writing effective tools for agents. In the end, I built a small number of tools:</p>
<ul>
<li>SQL tool for making read-only queries</li>
<li>Code sandbox built with <a href="https://e2b.dev/" target="_blank" rel="noopener">E2B</a> so the agent can do more detailed analysis if it wants</li>
<li>Shopify-specific utility tools like <code>topProducts</code> and <code>topVariants</code> which take a date range and run a pre-determined query for common user questions</li>
</ul>
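<p>To make the tool shape concrete, here is a minimal sketch of what a read-only SQL tool might look like. This is an illustration, not Rocky’s actual implementation - the real tool reads Postgres, whereas sqlite3 keeps this example self-contained, and the prefix check stands in for whatever read-only enforcement you choose:</p>

```python
import sqlite3

# Illustrative sketch of a read-only SQL tool exposed to an agent.
READ_ONLY_PREFIXES = ("select", "with", "explain")

def run_sql(conn: sqlite3.Connection, query: str, max_rows: int = 200) -> list:
    """Execute a query, rejecting anything that isn't a read."""
    if not query.lstrip().lower().startswith(READ_ONLY_PREFIXES):
        raise ValueError("Only read-only queries are allowed")
    return conn.execute(query).fetchmany(max_rows)  # cap rows to fit in context

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
conn.execute("INSERT INTO sales VALUES ('white dress', 3)")
rows = run_sql(conn, "SELECT product, qty FROM sales")
```

<p>In practice you would also enforce this at the database level with a read-only role, since a prefix check alone is easy to defeat.</p>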
<aside class="pull-note">
  <p>During testing, I asked GPT-5.2 to “find interesting insights” in the data. In its research, it noticed a duplication error that was inflating item count but not total sales value. I hadn’t previously caught this and was using incorrect data. It also looked through historical data to see when it started, and noticed in the docs that I had switched my deployment from one hosting provider to another around then. It hypothesized, correctly, that the cron job syncing data from Shopify had been duplicated. No other model caught this error.</p>
</aside>
<p>This was straightforward to build, and as you’d likely expect, the models are excellent at this type of analysis. You do need to be careful to explain the quirks of your data, for example: my sales table has one line item for every product sold, so an order with multiple products has multiple rows. That means there is a column for the value of the product (<code>product_price * qty</code>) and another for the total order price. Because of non-product costs like tax and shipping, the sum of <code>product_price</code> for a given order does not equal the total order price. Most recently, models like GPT-5.2+ and Opus-4.6 tend to catch this even if you don’t tell them, but that’s not always the case. Giving the model access to a dedicated doc that explains the db schema and any notes like this makes it significantly more trustworthy.</p>
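<p>A tiny, self-contained demonstration of that quirk (column names and numbers are illustrative, not the actual schema):</p>

```python
import sqlite3

# One order, two line items; the order total is repeated on every row and
# includes $10 of non-product costs (shipping/tax in this illustration).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    order_id INTEGER, product TEXT, product_price REAL, qty INTEGER,
    order_total REAL)""")
conn.execute("INSERT INTO sales VALUES (1, 'dress', 40.0, 1, 100.0)")
conn.execute("INSERT INTO sales VALUES (1, 'hat',   25.0, 2, 100.0)")

product_revenue = conn.execute(
    "SELECT SUM(product_price * qty) FROM sales WHERE order_id = 1").fetchone()[0]
order_total = conn.execute(
    "SELECT MAX(order_total) FROM sales WHERE order_id = 1").fetchone()[0]
# product_revenue is 90.0 but order_total is 100.0: summing line items
# undercounts revenue, while summing order_total per row would overcount it.
```

<p>This is exactly the kind of note that belongs in the schema doc the model reads before querying.</p>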
<p>As good as agents are at this type of work, the truth is that outside limited scenarios, it’s not hugely valuable. It definitely helps around the edges with things like working out that your white dresses are selling especially well this year, or that your free shipping threshold should be adjusted.</p>
<h2>Customer service</h2>
<p>Next, I decided to try and automate customer service. Unlike business intelligence, there is a clear path to creating value here by reducing costs and improving the quality of your customer service.</p>
<p>I’ve actually tried to build solutions here twice before. The first was a full-stack app that auth’d into Gmail, pulled all threads into the app, and used LLMs to draft responses. We never deployed it because my conclusion was that the models were too brittle for the use case. To make it work you had to maintain an unreasonable number of custom prompts covering every scenario and edge case. In the end it proved too difficult for the capabilities at the time. This was in February of 2025, when OpenAI’s o1 was the smartest model available, and by today’s standards it was quite expensive.</p>
<p>The second attempt was the following month, and we did briefly deploy it. The solution is discussed <a href="https://joelmjensen.com/posts/first-party-software/" target="_blank" rel="noopener">here</a> but essentially it was a sidebar within Gmail that held templates and used the context of a single thread to apply templates with minor customization. It worked initially but within a few weeks it was barely used. The main issue is that the prompts needed updating constantly, and for a busy business owner it was too time consuming to be worthwhile for the quality of response that GPT-4o was capable of.</p>
<p>Over the most recent Christmas break I tried a third time and it worked exceptionally well. Even better: it took no more than 3 days to build something worth using, and it’s been in use ever since, saving hours per day. This time, we had two significant breakthroughs. First, it was built on the same foundation as <a href="https://rocky-web-production.up.railway.app/login" target="_blank" rel="noopener">Rocky</a>. Every new email triggers a fresh conversation, with the full thread context injected as the user message. By default, we use a custom system prompt for these conversations that is tailored to customer service. This meant a lot of the software building was already done since Rocky comes with LLM inference, persistence, monitoring and tool-use. For those conversations, we give the model some additional tools to search Gmail and save drafts.</p>
<aside class="pull-note">
  <p>In practice, the agent uses Rocky’s subagents for email search. I’ve found that it works best when dispatching two parallel subagents: one for similar cases and one for previous threads with this user. Initially this used Gemini-2.5-Flash, and later Gemini-3-Flash. I recently benchmarked and switched to Deepseek-V3.2, which manages to be both much cheaper and just as good for this use case. It is slower, but latency is of no concern for this use case. The main agent was being run on successive Opus models (4.1, 4.5, 4.6) until this week when GPT-5.4 released. It is the first non-Opus model to improve on our internal benchmarks, and is approximately half the cost.</p>
</aside>
<p>The second breakthrough, which was likely even more important, was that instead of trying to hardcode behavior into prompts, we simply instruct the agent to begin by searching Gmail for the most recent similar cases, and use this to inform the response. Initially I didn’t realize what a huge improvement this would be but it quickly became clear. Not only do we not have to maintain prompts, but the model naturally picks up the style, tone and length of our responses. It also inherits our policies, like how we handle refunds or returns, and when we deviate from the default, because it can see how we’ve handled it in the past. Perhaps the greatest benefit of this approach is that in a way, it <em>learns</em> over time. I’m using the word loosely of course. But consider the following case: the agent searches for similar cases and drafts a response. The user decides the response was wrong and edits it before sending. Next time the agent faces that situation, or any that are similar, it follows the most recent behavior. This required no explicit intervention from the user.</p>
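<p>The retrieval step can be sketched in a few lines. This is a stand-in for the real Gmail-search subagents - a keyword-overlap search over an in-memory list of past threads, with the scoring and thread shape invented for illustration:</p>

```python
import re

def toks(s: str) -> set:
    """Lowercase word tokens, stripping punctuation."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def similar_cases(threads: list, new_email: str, k: int = 2) -> list:
    """Return up to k past threads, ranked by word overlap then recency."""
    words = toks(new_email)
    scored = [(len(words & toks(t["text"])), t["date"], t) for t in threads]
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [t for score, _, t in scored[:k] if score > 0]

threads = [
    {"date": "2026-01-10", "text": "refund request faulty zip - offered replacement"},
    {"date": "2026-02-02", "text": "refund request wrong size - store credit issued"},
    {"date": "2026-02-20", "text": "shipping delay question - sent tracking link"},
]
precedents = similar_cases(threads, "Hi, I'd like a refund, the size is wrong")
# The most similar precedent ranks first and informs the drafted response.
```

<p>Swapping in real email search changes the retrieval mechanics but not the pattern: fetch precedent first, then draft.</p>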
<p>The agent has some additional tools, like the ability to fetch all information from a specific product listing directly from Shopify. It can also query <a href="https://www.ship24.com/" target="_blank" rel="noopener">Ship24</a>, an API for tracking shipments, the customer order history from Shopify, and it can read an internal spreadsheet we use to track returns. Between these, and our historical emails, the agent can answer essentially any customer query. It also provides a confidence score which we’ve found to be very well calibrated, since it knows when it couldn’t find similar scenarios, and flags as much.</p>
<p>It’s hard to describe what an unlock this was, especially after the initial attempts less than a year earlier which simply didn’t work. Today, this saves 1-3 hours per day, which is about 5-20% of total time for two people working on the business. It costs less than $10/day to run.</p>
<h2>Inventory management</h2>
<p>The most recent use case I attempted to address is deciding what, and how much, to reorder. Like customer service, this is a clear case where improvements could be valuable: in time saved, in reduced costs from not overordering, and in more revenue from not underordering. It also shares the helpful property of being entirely digital and benefiting from complex analysis of the type frontier models are excellent at providing, which a business of this size usually wouldn’t get.</p>
<p>Historically this has been challenging, and like most small businesses, probably heavily underoptimized. The existing inventory management strategy could most accurately be characterized as “vibes based” with a little analytical rigor occasionally layered on top (e.g., last year I sold X of this dress, and sales are up about 10% this year, so we’ll reorder 1.1X).</p>
<p>My first experiment was asking an agent to construct a reorder for me using our harness. Since models can increasingly complete <a href="https://metr.org/time-horizons/" target="_blank" rel="noopener">long-running tasks</a>, and the harness provides access to all the required data, I thought this might work quite well. After many attempts it became clear this wouldn’t work. Model context windows are too limited, and model “laziness” was still a factor. Upon closer inspection, I found the model would make what I’d consider mistakes. They were mostly mistakes of internal inconsistency. It would forecast demand differently for two similar products, or it would use the business growth rate to gross up one product, and the product-specific growth rate for another. This might work if you have only a few products, but our business has hundreds of SKUs. However, the models also made mistakes I’d consider poor even for a junior analyst. Often even frontier models would suggest that a product be reordered because its sales had been climbing quickly, without accounting for seasonality. This manifested in suggestions that we make large reorders of Christmas dresses at the end of December, not realizing that demand was about to fall off a cliff.</p>
<p>Next I tried addressing this by having the model forecast for either one or a small number of products at a time, with a more clearly specified system prompt to try and standardize the methodology. This was an improvement, but there were new issues. First, the cost isn’t trivial. Even though model pricing is falling fast, and frontier model prices are extremely reasonable for their utility, 700 API calls to a model, each using tens or hundreds of thousands of input tokens, and thousands or tens of thousands of output tokens can add up quickly, especially while testing. If it was definitely going to work, it would be worthwhile, but I’d probably have to spend hundreds or thousands of dollars to figure that out. The second issue is explainability. Even if your forecast is produced with a superintelligent AI, the practical reality is that no business owner is going to trust it without verification. At least not at the beginning. And if the only way to unpick the rationale was to either read the reasoning traces, or trust that the AI-generated explanation was accurate, then you’ll probably never convince the average business owner to listen.</p>
<p>Eventually I landed on a mixed approach: start with a deterministic first pass, running a fairly simple algorithm to produce an approximate forecast and filter down to the products worth going deeper on. Then, for each of the filtered products, let an agent review and adjust as needed. This approach has many benefits. First, the cost is much lower, since usually only a small subset of products needs reordering. Second, by passing the agent the full input and output of the algorithm we ran, it could see the considerations we were implicitly baking in, like seasonality and the tradeoffs between business growth, category growth, and product-specific growth. Third, by starting from a deterministic base, we could show the user the final output as a function of our algorithm, with the AI adjustment layered on top and clearly shown as such, giving them confidence that we’d considered everything. The image below shows what the user sees for each product: an overall recommendation, along with current stock and recent sales, plus a breakdown showing how we landed on this number, which includes:</p>
<ul>
<li>Sales from the same period last year</li>
<li>Gross up to account for periods of low stock in the reference period</li>
<li>Product-specific growth YTD</li>
<li>Business growth YTD</li>
<li>Existing stock, and expected sell-down prior to stock arrival</li>
<li>AI adjustment</li>
</ul>
<div class="full-figure">
  <figure>
    <img src="https://joelmjensen.com/images/why-ai-diffusing-faster-reorder.png" alt="Reorder recommendation UI with deterministic forecast inputs and AI adjustment" />
    <figcaption>A reorder recommendation with the deterministic baseline and the AI adjustment shown separately.</figcaption>
  </figure>
</div>
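<p>The deterministic first pass can be sketched as follows. The equal blend weights and the exact gross-up here are my illustrative assumptions - the real algorithm weighs these differently, and the agent then adjusts the result on top:</p>

```python
def baseline_reorder(last_year_units: float, in_stock_share: float,
                     product_growth: float, business_growth: float,
                     current_stock: int, expected_selldown: int) -> int:
    # Gross up reference-period sales for time spent out of stock.
    demand = last_year_units / max(in_stock_share, 0.1)
    # Blend product-specific and whole-business YTD growth (equal weights here).
    forecast = demand * (1 + 0.5 * product_growth + 0.5 * business_growth)
    # Net off stock expected to remain when the new order arrives.
    stock_on_arrival = max(current_stock - expected_selldown, 0)
    return max(round(forecast - stock_on_arrival), 0)

# 90 units sold last year while in stock 90% of the time, product growing
# 30% YTD, business 10%, with 10 units expected on hand at arrival.
qty = baseline_reorder(last_year_units=90, in_stock_share=0.9,
                       product_growth=0.30, business_growth=0.10,
                       current_stock=25, expected_selldown=15)
```

<p>Each intermediate line maps onto one row of the breakdown the user sees, which is what makes the final number explainable.</p>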
<p>This turns out to be an excellent approach, as the agent gets the benefit of our existing knowledge with respect to how to reorder, and a framework within which to make changes. The benefit is clear here, where it would typically be quite hard to predetermine the exact right weighting of YTD vs more recent growth, or product vs business growth, but the agent is excellent at making these contextual decisions. In this case, it decides to increase our forecast, noting that the YTD growth number we’re using is hiding a recent acceleration, and effectively deciding to more heavily weigh recent growth, a very reasonable decision.</p>
<p>To test the reliability of our reordering agent, I built a small eval with a handful of scenarios that were adversarially corrupted. For example, I’d run the algorithm for forecasting but change the recommended order qty by increasingly difficult-to-catch amounts (anywhere from +/-50% to +/-5%). As a sign of AI progress, in December, no model consistently caught the more challenging cases. In early February, Opus-4.6 was reliably catching +/-10% issues. As of today (early March 2026) GPT-5.4 catches everything, every time, with medium reasoning.</p>
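<p>The corruption harness is simple to sketch. Here a fixed tolerance check stands in for the model under test, and the percentages mirror the range described above; the real eval asks the agent to review the corrupted forecast:</p>

```python
import random

def corrupt(qty: int, pct: float, rng: random.Random) -> int:
    """Perturb a recommended quantity by +/- pct."""
    return max(round(qty * (1 + rng.choice([-1, 1]) * pct)), 0)

def flagged(true_qty: int, shown_qty: int, tolerance: float = 0.02) -> bool:
    """Stand-in reviewer: flag any deviation beyond the tolerance."""
    return abs(shown_qty - true_qty) / true_qty > tolerance

rng = random.Random(0)
# Corruptions from easy (+/-50%) down to hard (+/-5%), as in the eval.
results = [flagged(100, corrupt(100, pct, rng)) for pct in (0.50, 0.25, 0.10, 0.05)]
```

<p>Scoring is then just the fraction of corrupted scenarios the reviewer catches, which is what made the December-to-March progression measurable.</p>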
<div class="side-figure">
  <figure>
    <img src="https://joelmjensen.com/images/why-ai-diffusing-faster-trace.png" alt="Trace view for a single reorder conversation" />
    <figcaption>The reorder trace view makes the model’s reasoning and follow-up actions inspectable.</figcaption>
  </figure>
</div>
<p>The inventory reorder flow is also built on Rocky, our existing harness. This pattern is extremely helpful, as it allows us to treat each product within a reorder as a separate conversation. This means the exact “thought process” — tool calls, reasoning, etc — is inspectable. It also means the user can ask follow-up questions, which are simply treated as another user turn in the existing conversation, so the agent can both explain further or make adjustments with full knowledge of what has been done previously. The image on the right shows exactly what this looks like in app.</p>
<p>As a final note, even though the reorder calculation is largely deterministic, it is worth noting that the best models helped significantly in tweaking this forecast methodology. For example, even recently I asked Gemini-3.1-Pro to review the approach and suggest improvements. It correctly identified and proposed a solution to the issue of floating holidays like Easter, which occur on different dates each year. The forecast now adjusts for these. So while the forecast itself may be largely deterministic, that is only possible at its current quality level as a result of AI proposing and implementing many of the features of this forecast.</p>
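<p>As an illustration of that fix, Easter can be computed directly (Butcher’s algorithm for the Gregorian calendar) and sales dates indexed by their offset from that year’s Easter rather than by calendar date. The indexing scheme is my sketch of the idea, not the exact adjustment in the forecast:</p>

```python
from datetime import date

def easter(year: int) -> date:
    """Butcher's algorithm for Gregorian Easter Sunday."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def days_from_easter(d: date) -> int:
    """Index a sales date by its offset from that year's Easter."""
    return (d - easter(d.year)).days

# Easter fell on 31 March in 2024 but 20 April in 2025, so comparing
# late-March sales year-over-year by calendar date alone is misleading.
```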
<h2>Learnings</h2>
<p>Reflecting on my progress, what have I learned? I think it falls into two categories: how to build these use cases, and why they’ve gotten so much more tractable recently.</p>
<h3>How AI use cases want to be built</h3>
<aside class="pull-note">
  <p>It also provides a very easy surface for later allowing your agent to search over previous agent conversations, enabling the same type of <em>learning</em> discussed earlier with our customer service agent.</p>
</aside>
<ul>
<li>Model everything as a conversation. Simply switching the system prompt and tool set of the agent is incredibly helpful. It makes it easy to reuse the same infrastructure for many problems, makes everything inspectable, and provides a very natural way to let the user interact and iterate on the agent’s work and collaborate on shared artifacts like a purchase order.</li>
</ul>
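<p>A minimal sketch of what that looks like in code - one conversation type, with the system prompt and tool set swapped per use case. The names and profiles here are illustrative, not Rocky’s actual schema:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    system_prompt: str
    tools: list
    messages: list = field(default_factory=list)

    def add_user_turn(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

# Each use case is just a (system prompt, tool set) profile over shared infra.
PROFILES = {
    "customer_service": ("Draft replies using precedent from past emails.",
                         ["gmail_search", "save_draft"]),
    "reorder_review":   ("Review and adjust this reorder forecast.",
                         ["sql", "sandbox"]),
}

def start(profile: str, first_message: str) -> Conversation:
    prompt, tools = PROFILES[profile]
    convo = Conversation(system_prompt=prompt, tools=list(tools))
    convo.add_user_turn(first_message)
    return convo

convo = start("reorder_review", "Why is the white dress forecast so high?")
```

<p>Follow-up questions are then just more user turns appended to the same record, which is what makes every run inspectable after the fact.</p>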
<aside class="pull-note">
  <p>For example, there was a long period where it made sense to manually construct context for code-writing, and break the process of writing software into distinct steps for planning, writing, reviewing, etc. Today, it is fairly clear that this would do more to hinder performance than improve it in many cases, and this will likely become more true for more things as capabilities progress.</p>
</aside>
<ul>
<li>
<p>Build in ways that benefit from future model improvements. You need to think hard about how your solution will benefit from future model improvements. This has been a much-discussed topic for at least 2-3 years but it still doesn’t feel like we’ve completely figured out what this means. One of the things that makes this challenging is that before a model gets very good at something, it benefits significantly from you imposing as many methodological guardrails as you can, and from you manually deciding how to disaggregate the task into constituent pieces. The danger is that as soon as the model crosses some threshold of capability, these guardrails and steering are actively harmful to the outcome. As models become better at instruction following, this will only become more pronounced: the more you predefine the process they must follow, the less you benefit from their alien thought process. Our customer service use case is well-suited for capability growth in the models that power it. Our reordering process is moderately so, but I suspect that in 6-12 months I would likely benefit from giving the models more freedom to forecast, even if by letting them rewrite the forecasting algorithm.</p>
</li>
<li>
<p>You need deep workflow knowledge. If you’re trying to replace or heavily augment a business process, you need either deep knowledge of it or a willingness to sit and work with someone doing the job to catch the edge cases. Clearly the industry has caught onto this, as evidenced by the explosion of startups hiring FDEs. I suspect this is a function of trying to automate whole workflows. The previous generation of software only had to augment the workflow, and so could leave the user sufficient degrees of freedom to adapt, but didn’t have to actively handle the edge cases. That’s no longer true if you’re trying to sell an outcome-based solution.</p>
</li>
<li>
<p>UI / UX matters more than you think for adoption. There are definitely lots of improvements to be made on the UI/UX side to enable collaborative work between agents and humans. I was surprised at how much time I spent simply trying to make this intuitive, and I’m sure there is lots of low-hanging fruit here. It definitely gave me a deeper appreciation for great designers.</p>
</li>
</ul>
<aside class="pull-note">
  <p>I suspect that having hard evals for your job will at some point serve to inform you of a significant jump in model capabilities within your area of expertise. When this happens, at least you’ll know to either start building or buying tools that allow you to leverage these capabilities to their fullest extent.</p>
</aside>
<ul>
<li>Build and maintain evals, especially for things the models can’t do. Unfortunately the hordes of people telling you to build evals are correct. It is especially important to build evals for things the models cannot do today. I often feel crazy trying to explain to a friend that they should retry their use case because of whatever model was released last week. GPT-5.4 was released three days ago, and has already noticeably improved the performance of all three of the use cases I’ve written about here. Not to mention that GPT-5.4-Pro is available and by all measures a meaningful jump again, but is currently limited to OpenAI’s $200 subscription, or available in the API at 12x the cost of GPT-5.4. Despite the plethora of publicly available evals, the value of building your own is probably going up, if only to indicate to you the gap between the models’ capabilities and how you’re currently using them.</li>
</ul>
<h3>Why did this suddenly get easier?</h3>
<ul>
<li>Coding agents got really good. The only thing more amazing than the capabilities of current coding agents is their rate of improvement. It’s hard to overstate how impossible it would have been for me to build this even 6 months ago, let alone 12. I’m not even sure I could have done it three months ago, as that’s around when Opus-4.5 released. This has been a function of both better models and better harnesses, but mostly the former. Because my ability to write software is entirely limited by coding agents, I acutely feel the capability jumps of successive releases. From talking to my colleagues, I sometimes suspect this isn’t the case for them, largely because in many elements of the job their own ability exceeds that of the model. The one caveat I’ll offer is that software engineering is by no means “solved” or automated. I’m fairly sure we’ll get there, but it still takes a lot of human effort to build something useful.</li>
</ul>
<div class="full-figure">
  <figure>
    <img src="https://joelmjensen.com/images/why-ai-diffusing-faster-metr.png" alt="METR chart showing software task time horizons across model releases" />
    <figcaption>Recent METR results show how quickly frontier models are improving at longer-running software tasks.</figcaption>
  </figure>
</div>
<ul>
<li>Browser use is helping close the loop. One very large recent unlock has been agents’ ability to use a browser well. This is in part because it helps “close the loop” and lets an agent test its changes and do QA in a way that wasn’t possible before, and in part because doing this allows the agent to run for many times longer without direction. If you haven’t, I recommend connecting Codex or Claude Code or OpenCode to something like <a href="https://github.com/vercel-labs/agent-browser" target="_blank" rel="noopener">Vercel’s Agent Browser</a>.</li>
</ul>
<aside class="pull-note">
  <p>This is probably medium-to-long term bad for platforms like Lovable, Bolt, Replit, etc. I’m not saying they’re doomed, but if deploying code is commoditized by coding agents, then cost matters more than user-friendliness, and dedicated providers seem likely to benefit.</p>
</aside>
<ul>
<li>Terminal use has largely solved deployment for simple apps. Another recent unlock has been agents’ ability to use a terminal extremely well to deploy the code they write. This used to be one of the most painful parts of writing code, especially for non-engineers. Recently I’ve been able to deploy multiple apps to different hosting platforms like Railway, Render and Vercel, largely without having to actually open those platforms in a browser. Codex et al. are quite capable of deploying, monitoring, and debugging your app anywhere that a CLI is available.</li>
</ul>
<h2>Takeaways</h2>
<p>I feel very conflicted on the topic of AI diffusion. On the one hand, my first reflection is that doing this was honestly quite a bit harder than I expected. I didn’t expect automating a business to be <em>easy</em>, but I probably didn’t expect it to be as hard as it has been, and it’s by no means fully automated. If it takes this much work and iteration to handle the various edge cases and nuances of a business with two employees, then surely the complexity of doing anything remotely similar for a business with 2,000 employees would be multiple orders of magnitude harder. Larger businesses not only have a scaled-up version of this problem, but also the many incremental problems that come with scale: many more systems, security considerations, change management, etc.</p>
<p>On the other hand, much of what was difficult about this was the infrastructure. Building the software, connecting it reliably to various other systems, building reliable traceability, and a user interface that makes it both easy to use and to understand the actions taken by an agent so that human sign-off is easy to provide. This is an optimistic view, since I am not a particularly great software engineer, and so perhaps others, or future coding agents, will do this much faster. Software also scales quite well, and these problems don’t have to be solved independently for every business.</p>
<p>It seems to me that the models themselves are very clearly good enough for a large portion of white-collar work, and that the binding constraint is more about organizational readiness, which I take to include the ability to reorganize around new workflows, manage the change that comes with that, and unblock the many “papercuts” that make this difficult or impossible within the current business structure (e.g., letting various systems speak to one another).</p>
<p>The other very significant blocker to diffusion is access to talent. Because the models have gotten so much better in just the last few months, most people have no idea what they’re now capable of, and few teams have anyone with both the ability to use them to their fullest extent and the organizational influence to try. You might ask why coding agents themselves don’t solve this today. They probably will in future, but they’re not there yet. They’re still much better as directed tools than independent agents. The other reason is stranger: the models powering these agents have training data cutoffs 6-12 months in the past and so have no real concept of what they’re capable of without a human setting the course. By definition this will somewhat resolve over time, but perhaps they’ll continue to have a lagging sense until we solve continual learning in some form.</p>
<h3>If you’re doing this, where do you start?</h3>
<p>Given my experience, if I was doing this for a large business, I’d start narrow. Pick the most narrowly defined workflow you can where automation would still add substantial value. I’d also be careful to pick an entire workflow. One trap I suspect many will fall into is automating a component of a workflow and then wondering why nothing has changed. The nature of organizations is that they’re full of bottlenecks, many implicit. Unless you target something end-to-end that can add value, you’ll likely find it impossible to point to any measurable improvement.</p>
<p>It’s also very helpful if you target a use case with some form of corpus to draw upon. One disadvantage of AI agents is their inability to learn and remember things. One advantage, however, is that they can read 6 months’ worth of examples and internalize the learnings for the length of their context window. The way to take advantage of this is to find use cases where much of the learning is encoded in past examples, and where those past examples can be made available. Coding (via the codebase itself), customer support (via past interactions), invoice processing (via the historical mapping of invoices to cost categories), and contract drafting (via the corpus of previous contracts) are all examples of this. It is probably not a coincidence that these are among the use cases seeing the fastest adoption.</p>
<p>Most importantly, however you feel about the likelihood of continued progress, or the impact it will have, it is worth investing now to be prepared for whatever comes next. I’m personally not very convinced by expectations of large-scale unemployment or the like, but I do expect radical changes to how we work. If one of the main binding constraints is organizational readiness, and we expect this to become more true as models improve and talent adapts, then it pays all the more to invest now.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Building an Agent for Scheduling</title>
      <link>https://joelmjensen.com/posts/building-a-hierarchical-agent-for-scheduling/</link>
      <guid>https://joelmjensen.com/posts/building-a-hierarchical-agent-for-scheduling/</guid>
      <pubDate>Wed, 26 Mar 2025 17:00:00 PDT</pubDate>
      
      <description>I needed an agent to do workforce scheduling. One-shot prompting failed, so I built a hierarchical system instead. This post walks through the architecture and the tradeoffs.</description>
      
      <content:encoded><![CDATA[<p>Everyone’s building agents, but not many people are writing clearly about how to implement them. This post walks through how I built a hierarchical agent architecture to solve a real problem: workforce scheduling with lots of messy constraints. I’ll cover why one-shot prompting didn’t work, how I ended up with an Orchestrator + child agents setup, and some practical lessons from getting it working. If you’re already deep into agent design, skip ahead to the architecture section — that’s the core of the post.</p>
<h2>My project</h2>
<p>I was working on workforce scheduling — basically, assigning staff to shifts under a variety of hard and soft constraints<sup class="footnote-ref"><a href="https://joelmjensen.com/posts/building-a-hierarchical-agent-for-scheduling/#fn1" id="fnref1">[1]</a></sup> (availability, skills, legal rules, preferences, cost, etc.). It’s a problem that’s annoying for managers but relatively easy to validate post-hoc, which makes it a great candidate for AI. This framework (hard to do, easy to validate) is the most useful one I know for identifying areas to apply language models. My goal: given a blank schedule, can an AI fill it in a way that satisfies hard constraints and optimizes soft ones?</p>
<h2>My first approach</h2>
<p>My first attempt was to try and one-shot this task, and I wrote an entire <a href="https://joelmjensen.com/posts/can-llms-solve-complex-scheduling-problems" target="_blank" rel="noopener"><u>custom eval report</u></a> about the performance of different models at solving this problem for different schedule sizes, with different prompts, etc. The problems with that approach were many:</p>
<ol>
<li>It was only considering hard constraints</li>
<li>It tapped out at ~100 shifts, and only Claude-3.(5/6)-sonnet could do that reliably at the time. Since then, o1-pro (but not o1 or o3-mini-high), Claude-3.7-sonnet-thinking and recently, Gemini-2.5-pro have pushed that closer to 200 shifts, but this is still too limiting</li>
<li>It was expensive, because you needed to use frontier models to get enough accuracy (I expect this will change for a given level of performance given the <a href="https://x.com/eladgil/status/1827521805755806107/photo/1" target="_blank" rel="noopener"><u>rapidly falling price/performance of LLMs</u></a>)</li>
</ol>
<p>After hitting limitations with a one-shot approach, I started digging into agent-based methods.</p>
<h2>Agents 101</h2>
<p>Most readers here know what an agent is: an LLM that can use tools, reason about intermediate steps, and act iteratively. The most common pattern is ReAct — Reasoning + Action — where the model generates thoughts, calls tools, sees results, and keeps going until done.</p>
<p>That works well for simple problems, but falls short when things get more complex — either because the task is too long to reason about in one go, or because it benefits from breaking down into subcomponents.</p>
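<p>For concreteness, the core of a ReAct-style agent is just a loop (a minimal sketch in Python; <code>call_llm</code> and the tool registry are stand-ins for whatever model client you use, not any specific SDK):</p>

```python
# Minimal ReAct-style agent loop (illustrative sketch, not a specific SDK).
# call_llm and tools are stand-ins for your model client and tool registry.

def react_loop(messages, call_llm, tools, max_steps=20):
    for _ in range(max_steps):
        reply = call_llm(messages)          # model returns text and/or tool calls
        messages.append(reply)
        if not reply.get("tool_calls"):     # no tool call -> final answer
            return reply["content"]
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**call["args"])  # execute the tool
            messages.append({"role": "tool", "name": call["name"],
                             "content": str(result)})
    raise RuntimeError("max steps exceeded")
```

<p>Everything else – hierarchy, delegation, memory – is layered on top of this loop.</p>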
<p>That’s why I started looking at more advanced setups. Claude Code stood out — a CLI-based coding agent from Anthropic. It doesn’t just call tools — it appears to coordinate nested tasks, delegate subproblems, and manage state across toolchains. Watching it in action, it clearly uses some kind of hierarchical architecture, with a parent agent spinning off subtasks that are handled independently and reported back.</p>
<p>This post is my attempt to recreate something like that: a multi-agent system where a top-level orchestrator can delegate to stateless child agents with their own tools and logic.</p>
<h2>How the Architecture Works</h2>
<p>Let me jump ahead and show you where I landed, and how it works.</p>
<div class="eval-section">
<div class="eval-text">
<h4>Agent Architecture</h4>
<p>At a high level, it works like this:</p>
<ul>
<li>The user passes in some input, like ‘build my schedule’</li>
<li>The OrchestratorAgent receives this, and recursively does one of three things (Direct, Delegate, Respond)
<ul>
<li>Direct: executes tools directly for simple tasks</li>
<li>Delegate: for more complex tasks, it delegates to a stateless child agent. From the perspective of the Orchestrator, these child agents are simply additional tools that it can call, but it passes in a prompt and any relevant state such as the schedule to build. The child agent itself can iteratively call tools and reason, before finally passing back a single response which is added to the log of the Orchestrator</li>
<li>Respond: once the Orchestrator decides that it is finished, or it needs more information from the user, it responds</li>
</ul>
</li>
<li>Note that the tools available to the child agents are a subset of those available to the Orchestrator. This isn’t a requirement, but it’s what I’ve found to work best. Technically, the Orchestrator can do anything a child agent can, but there are multiple benefits to this structure (more on this below)</li>
</ul>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2025-03-27_at_3.54.24_PM.png" alt="Agent architecture diagram" /></p>
</div>
</div>
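<p>To make the Delegate branch concrete, here is roughly how a stateless child agent can be wrapped as a tool (an illustrative Python sketch; the class and parameter names are mine, not from the actual implementation):</p>

```python
# Sketch of a stateless child agent exposed as a tool to the Orchestrator.
# The names here (ChildAgentTool, run_agent_loop) are illustrative only.

class ChildAgentTool:
    def __init__(self, name, system_prompt, tools, run_agent_loop):
        self.name = name
        self.system_prompt = system_prompt   # focused prompt for this subtask
        self.tools = tools                   # a subset of the Orchestrator's tools
        self.run_agent_loop = run_agent_loop # the iterative reason/act loop

    def __call__(self, task, state=None):
        # Fresh message history on every call: the child is stateless.
        # It iterates internally (reason -> tool -> result) and returns
        # only a final summary, keeping the Orchestrator's context light.
        messages = [{"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": f"{task}\n\nState: {state}"}]
        return self.run_agent_loop(messages, self.tools)
```

<p>From the Orchestrator’s side, calling this looks identical to calling any other tool.</p>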
<p>Here it is in action:</p>
<p><img src="https://joelmjensen.com/images/Kapture_2025-03-27_at_19.58.06.gif" alt="scheduling_agent_gif" /></p>
<p>What you’re seeing here is that once I start the process, the <code>OrchestratorAgent</code> is handed a default task to build a schedule with 10 shifts distributed across the week. At a high level, it does the following:</p>
<ol>
<li>It recognises this is a specialized task, and hands off to the <code>BuildRosterAgentTool</code>, passing the Schedule object containing the shifts to fill</li>
<li><code>BuildRosterAgentTool</code> then:
<ul>
<li>Begins by calling a tool which returns a list of eligible users for each shift, taking into account their vacation and the teams they can work in (this is all static mock data I created – or, I should say, an LLM created for me)</li>
<li>Calls the <code>create_roster</code> tool and passes in its schedule. The response confirms it is valid</li>
<li>Since it is valid, returns a success message to the <code>OrchestratorAgent</code></li>
</ul>
</li>
<li>The <code>OrchestratorAgent</code> then recognizes the schedule is built, and hands off to a specialized <code>OptimizeCostAgentTool</code></li>
<li><code>OptimizeCostAgentTool</code> then:
<ul>
<li>Runs a tool to find the most expensive shifts</li>
<li>Runs a tool to find lower cost alternatives</li>
<li>Makes some targeted edits to the schedule</li>
<li>Passes a success message back to <code>OrchestratorAgent</code></li>
</ul>
</li>
<li><code>OrchestratorAgent</code> recognises the task is complete, and messages the user to summarize</li>
</ol>
<p>Here are the tools used across the agents:</p>
<ol>
<li><code>create_roster</code>: Generates a full schedule from eligible users. Returns a validation summary. (“Schedule” and “roster” used interchangeably.)</li>
<li><code>edit_roster</code>: Makes targeted edits to one or more shifts, using shift IDs and user IDs.</li>
<li><code>get_eligible_users_for_shift</code>: Returns available and qualified users for a given shift or set of shifts.</li>
<li><code>find_shift</code>: Looks up shifts by metadata (e.g. name, team, time). Used when the user refers to a shift conversationally — e.g., “Jon’s shift on Tuesday in the bar”</li>
<li><code>get_highest_cost_shifts</code>: Identifies the most expensive shifts in the current schedule</li>
<li><code>find_lower_cost_replacements</code>: Finds cheaper eligible users for a given shift, often used to avoid overtime or penalty rates</li>
<li><code>create_persistent_memory</code>: Lets the Orchestrator store user-level preferences or recurring patterns — e.g., “Clara always works Sunday nights” — which can later be passed to scheduling agents</li>
<li><code>think</code> – a noop tool that lets the agent pause and reflect. Inspired by <a href="https://www.anthropic.com/engineering/claude-think-tool" target="_blank" rel="noopener"><u>Anthropic’s work</u></a>. This is the only tool that only the Orchestrator can access</li>
</ol>
<p>(Child agents like <code>BuildRosterAgentTool</code> and <code>OptimizeCostAgentTool</code> also function as tools from the Orchestrator’s perspective, but aren’t listed here.)</p>
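<p>For reference, here is roughly how two of these tools might be declared in a standard function-calling schema (the shape follows the common JSON-schema tool format; the exact parameter names are my guesses, not the actual definitions):</p>

```python
# Two of the tools above declared in a function-calling schema. The overall
# shape follows the common JSON-schema tool format; the parameter names
# (shift_ids, assignments) are illustrative guesses.

TOOL_SCHEMAS = [
    {
        "name": "get_eligible_users_for_shift",
        "description": "Return available and qualified users for the given "
                       "shifts, accounting for vacation, availability and "
                       "team membership.",
        "parameters": {
            "type": "object",
            "properties": {
                "shift_ids": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["shift_ids"],
        },
    },
    {
        "name": "edit_roster",
        "description": "Assign users to shifts by ID. Returns a validation "
                       "summary of the whole roster so the model can plan "
                       "its next step.",
        "parameters": {
            "type": "object",
            "properties": {
                "assignments": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "shift_id": {"type": "string"},
                            "user_id": {"type": "string"},
                        },
                        "required": ["shift_id", "user_id"],
                    },
                },
            },
            "required": ["assignments"],
        },
    },
]
```

<p>Detailed descriptions matter: they are the only documentation the model sees when deciding which tool to call.</p>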
<h4>Benefits of this approach</h4>
<p>Compared to single-agent setups or simple prompt chaining, this architecture has a few key benefits:</p>
<ul>
<li><strong>Specialization</strong>: Each child agent has a focused prompt and toolset tailored to its subtask. This improves performance and reduces prompt complexity</li>
<li><strong>Lower Token Usage</strong>: By delegating to stateless agents, you avoid accumulating long message histories. Child agents only return a summary, which keeps the Orchestrator’s context light</li>
<li><strong>Modularity</strong>: You can plug in new child agents (e.g., CostOptimizer, LeaveManager) without touching the logic of existing ones. This makes iteration safer</li>
<li><strong>Model Efficiency</strong>: Stateless child agents can run on smaller, cheaper models when appropriate — saving cost without sacrificing output quality</li>
<li><strong>Resilience at Scale</strong>: As I tested with larger schedules (up to ~500 shifts), the architecture scaled better than I expected. Both the Orchestrator and the child agents made use of the think() tool more often under load, which seemed to help with stability and recoverability when things went wrong</li>
</ul>
<p>Below is one example of the robustness of this approach to using specialized agents. In this example, I used a smaller, cheaper and faster model for the <code>BuildRosterAgentTool</code>. You can see in the image below that it creates a roster with 13 errors, and immediately tries to edit those 13 shifts but hallucinates some user_id’s. Because the error message is detailed, and because it has a specific prompt and access to tools, it realizes it should instead check which users are eligible for the shifts with errors. It subsequently solves most of the errors in the following action, then continues fixing the rest. My prompt doesn’t mention this specific pattern, but it gives the broad goal and provides detailed definitions of the tools available, so the model can work it out.</p>
<h4>Practical tips for building Agents</h4>
<p>Some of these are relatively obvious, but have been very helpful for me nonetheless:</p>
<ol>
<li><strong>Start simple and add complexity</strong> – if you have a complex task, break it down into pieces small enough that you can test the first one in a single prompt. If that works, expand the prompt, or add a tool, and then recursively continue this process. For example, I started by having claude write me some mock shifts, users and leave requests into json files, and then dropped those into a new instance and asked it to build a valid roster</li>
<li><strong>Don’t use frameworks when you are starting</strong> – you’ll find many videos and blog posts about whether you should use Langchain or PydanticAI or CrewAI or [insert framework]. For production, those might be good ideas, but when you start, the abstractions just make it harder to debug what’s going on. It’s not that hard – especially with modern LLMs – to write the scaffolding yourself</li>
<li><strong>Very early, spin up a simple UI or log to inspect your traces</strong> – by traces I mean the messages the LLM sends and receives, tool calls and tool results, etc. It has often been extremely useful to me to dig in and see exactly what the LLM sees to debug an issue, especially if you’re dynamically loading in data at runtime (example screenshot below of how I view my traces)</li>
<li><strong>LLMs are pretty good at writing and iterating on prompts</strong> – if you’re not getting the agent or LLM to follow your instructions, try giving your prompt to a reasoning model (o3-mini, claude-thinking, gemini-2.5-pro, grok-3-thinking, r1) along with the logs from your agents outputs and explain that you want to optimize the prompt. If you iterate like this, models are quite good at progressively adjusting the prompt to minimize errors</li>
<li><strong>Make sure your tool return values are detailed</strong> – remember to think carefully about what the model ‘sees’. After it calls a tool, it receives the response. The more detailed the response, the better it will subsequently navigate its next actions. For example, whenever the <code>create_roster</code> or <code>edit_roster</code> tools are called, I return the status of the entire roster, including the number of shifts and any shifts with validation errors (e.g., an assignment clashes with the user’s vacation request). This lets the model easily identify its next step</li>
</ol>
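<p>To make the last tip concrete, here is the kind of detailed return value it argues for (an illustrative sketch; the field names are mine, not from the actual implementation):</p>

```python
# Sketch of a detailed tool return: after any create_roster/edit_roster
# call, return the state of the whole roster rather than just "ok", so the
# model can immediately see what to do next. Field names are illustrative.

def roster_status(shifts):
    """Summarize a roster so the model can plan its next action."""
    errors = [
        {"shift_id": s["id"], "error": s["error"]}
        for s in shifts if s.get("error")
    ]
    return {
        "total_shifts": len(shifts),
        "assigned": sum(1 for s in shifts if s.get("user_id")),
        "validation_errors": errors,   # e.g. clashes, vacation conflicts
        "valid": not errors,
    }
```

<p>A terse “success” forces the model to guess; a summary like this makes the next step obvious from a single tool result.</p>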
<figure class="post-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2025-03-27_at_10.35.32_PM.png" alt="Traces UI for inspecting agent message history" /></p>
<figcaption>Example of my UI for inspecting model message traces</figcaption>
</figure>
<p>This is overkill for getting started, and you should start simple. This page shows the full history, with child agent message history indented for clarity; it is searchable, filterable by tool, agent or message type, and shows token usage for when you start looking to optimize costs.</p>
<h2>What’s next</h2>
<p>There’s still a lot to test. Some things I plan to explore next:</p>
<ul>
<li><strong>Scaling the system</strong>: How many child agents can the Orchestrator handle before performance drops?</li>
<li><strong>Larger workloads</strong>: What’s the max roster size that still fits comfortably in memory and completes in reasonable time?</li>
<li><strong>Cost vs complexity</strong>: How do inference costs scale as roster size increases?</li>
<li><strong>Context compression</strong>: I’m experimenting with using a smaller model to periodically summarize the message history, to keep the Orchestrator coherent while reducing token load and extending the effective reasoning horizon</li>
</ul>
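<p>The context-compression idea can be sketched simply (illustrative Python; <code>summarize</code> stands in for the smaller model call, and the thresholds are arbitrary):</p>

```python
# Sketch of context compression: when the history grows past a threshold,
# replace the older messages with a summary produced by a smaller model.
# summarize() is a stand-in for that model call; thresholds are arbitrary.

def compress_history(messages, summarize, keep_last=10, max_messages=40):
    if len(messages) <= max_messages:
        return messages                      # short enough, leave it alone
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(head)                # cheap model distills old turns
    return [{"role": "system",
             "content": f"Summary of earlier work: {summary}"}] + tail
```

<p>The tradeoff is lossy memory in exchange for a longer effective reasoning horizon.</p>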
<p>If you’re exploring similar questions or have ideas, I’d love to hear them.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>For context, here are some of the constraints you need to adhere to when building schedules. Hard constraints: users are valid, users can work in the teams they are rostered to, users do not have clashing shifts, users are not on vacation, users are not unavailable (e.g., some employees are students and cannot work during the day). Soft constraints: manager preferences (“I like having Gavin work the Saturday night shift behind the bar because he’s experienced”), user preferences (“I prefer Friday nights off”), and wage considerations – for a given schedule, many valid options will be undesirable due to costs (e.g., in some jurisdictions, if an employee works two shifts less than X hours apart, they get paid at a higher rate). <a href="https://joelmjensen.com/posts/building-a-hierarchical-agent-for-scheduling/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>First-party software</title>
      <link>https://joelmjensen.com/posts/first-party-software/</link>
      <guid>https://joelmjensen.com/posts/first-party-software/</guid>
      <pubDate>Sat, 22 Mar 2025 17:00:00 PDT</pubDate>
      
      <description>The economics of building software are changing</description>
      
      <content:encoded><![CDATA[<p>Software development is currently undergoing a large shift, with the economics of developing and maintaining software changing fast. Like all rapid technological change, it’s very unevenly distributed. Most people I talk to outside of San Francisco don’t see it at all, even the developers. So I want to share my personal experience and a few thoughts on where this is headed.</p>
<p>To summarize: we’re approaching a point where building custom software—something historically practical only for large companies—is becoming affordable for many more businesses. Most people don’t see it yet, but it’s significant, and it will change the way people buy software. I want to share some practical examples and thoughts.</p>
<p>First, my background. I’m not a developer. I’ve written some code for about six or seven years, on and off, sometimes not at all for 6 months and sometimes every day for a couple of months. Mostly to build little scripts here and there to make me more productive, and partly because I enjoyed the problem solving aspect of it. The most advanced ‘product’ I ever built was probably a collection of scripts running in Replit that did a bunch of data transformations and analysis. Definitely helpful, but I could have achieved the same through a complicated Excel file.</p>
<p>Because I understand software conceptually (I’ve worked closely with technical / quantitative teams for a lot of my career), but am not particularly good at writing software, I was in the perfect position to benefit from LLMs. Given my position and my interests, I’ve spent a huge portion of the last 3 years writing code with these models. I’ve seen firsthand how quickly they’ve improved, because my own productivity is so closely tied to their capabilities. In the last 9 months, my ability to write and manage software has inflected upward. This inflection began on June 20th last year, the day claude-3-5-sonnet-20240620 was released.</p>
<p>To make this more concrete, in the last two months, I’ve built two apps, both for my family’s ecommerce business:</p>
<ol>
<li>A Gmail add-on that integrates with shopify and openrouter to automate customer email management</li>
<li>A fullstack web app that integrates with shopify, fb, google ads, and openrouter to manage a huge portion of our business – everything from sales reporting, inventory management and reconciling shipments and payments from our supplier</li>
</ol>
<h2>Gmail Add-on</h2>
<p>When I went home for the holidays I watched my mum respond to some emails from her customers and noticed two things:</p>
<ol>
<li>Most of the time, she was just copying and pasting a template and making some changes. When I asked why, it was because so many emails take the form ‘has my order shipped yet’ or ‘what is the returns policy, the dress I bought is too big’</li>
<li>For every single email, she was pasting it into ChatGPT and then copying the output back into Gmail before sending. When I asked why, she said that it ‘made all of her emails better’</li>
</ol>
<p>I realized after watching and asking some more that she spent about 2 hours per day (!) doing this work. Certain this was unnecessary, and armed with Claude (Sonnet-3.7 by this point), I spent two days and built a working Gmail Add-on which does the following:</p>
<ol>
<li>When you open an email, you can send its contents to an LLM (we use Gemini-Flash for cost reasons) to classify the email (e.g., ‘Shipping and Tracking’), run sentiment analysis (e.g., ‘Neutral’), and identify any products or order IDs they mentioned (the former is achieved by loading in a list of our SKUs and the latter is just extracting alphanumeric codes with a few examples given to Gemini)</li>
<li>Using the sender address, fetch the order history and customer profile from Shopify, along with the inventory and price data for any products they mentioned. This is all loaded into the sidebar in Gmail and hyperlinked, giving the user an easy way to quickly view any of the customer orders, see the fulfillment status, etc</li>
<li>Allow the user to – in one click – send all of this to GPT-4o (it seems to write the most natural sounding emails) along with the thread and have it write a draft email</li>
</ol>
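<p>Step 1 boils down to a single structured LLM call. Here is a rough sketch of its shape (the prompt wording and field names are my reconstruction, and <code>call_llm</code> stands in for the actual Gemini client):</p>

```python
# Rough shape of the classification step: one LLM call returning structured
# fields. Prompt wording and field names are my reconstruction, not the
# add-on's actual code; call_llm stands in for the Gemini client.
import json

def classify_email(body, skus, call_llm):
    prompt = (
        "Classify this customer email. Respond with JSON containing: "
        "category (e.g. 'Shipping and Tracking'), sentiment "
        "(Positive/Neutral/Negative), products (only from the SKU list), "
        "and order_ids (alphanumeric codes).\n"
        f"SKUs: {', '.join(skus)}\n\nEmail:\n{body}"
    )
    return json.loads(call_llm(prompt))
```

<p>The classifier’s output then drives step 2: the extracted order IDs and products determine which Shopify lookups to run.</p>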
<p>The net result of the above is that 80%+ of all emails are now written by GPT-4o. Everything is handled in a single window, and the user has full control to review or fall back on manual responses. There is also the ability to use GPT-4o to rewrite the email based on a simple text prompt (e.g., make it more casual).</p>
<p>In two days, this went from idea to production and has reduced time spent on this by at least half.</p>
<h2>Reporting App</h2>
<p>The second example is a different story. Frustrated by how difficult it was to answer the types of questions I had about the business e.g., ‘sales are up 35% year-on-year, is this being driven by price, volume or mix-shift’, or even better ‘volumes are up but is that because you have more customers or customers are buying more’, I decided to build a ‘simple’ reporting app. Initially this was just syncing data from shopify – sales, refunds, products, etc – into a simple replit app with a postgres db and then spinning up a few pages showing the type of info I wanted to see.</p>
<p>Over the last two months this has expanded. It now does:</p>
<ul>
<li>Sales reporting like I mentioned above, with sales and product data updated at least hourly through either regular polling or webhooks</li>
<li>Tracking ad spend by fetching from fb and google APIs and showing historical trends</li>
<li>Allowing us to create orders in a custom UI that is designed to put all the information we need in one place based on the way we order, and exporting a csv in exactly the format our supplier requires</li>
<li>Allowing us to track incoming shipments from our supplier, and matching them to orders</li>
<li>Allowing us to log invoices from our supplier, and matching them to both the orders and the shipments</li>
</ul>
<p>The functionality discussed above can be broken into two categories:</p>
<ol>
<li>Things we wanted to do but couldn’t with our existing set of products (e.g., sales reporting)</li>
<li>Things we had to do but were incredibly time consuming (tracking shipments and payments against orders). This sounds simple, but when your overseas supplier sends you an order in five pieces, each without any documentation, and then invoices you two months later for all of two orders and three quarters of a third, it gets hard to manage</li>
</ol>
<p>This app has grown to about 15k lines of code. Although I closely managed its build, and made most of the design choices in the architecture and implementation, I wrote ~0% of these.</p>
<h2>Why does this matter?</h2>
<p>Historically, software has been expensive to build and perhaps just as expensive to maintain. Small businesses could not realistically build these types of tools. They could either find an off-the-shelf solution or they could live without it. Even with the proliferation of SaaS products, I suspect most instances of ‘software could improve this’ are stuck in the second bucket – being managed manually in a spreadsheet, or on a piece of paper. However, this shift in software economics significantly impacts competitiveness. Businesses that adopt first-party software gain a substantial advantage by precisely tailoring tools to their unique needs, achieving efficiencies that generic products can’t match. I’m not saying that buying software is done and everyone will build their own products. I suspect many people will continue to want a provider to manage things for them, adding features as needed and providing some SLAs. But on the margin, building software has gotten an order of magnitude cheaper, and appears set to continue on this trend for the foreseeable future.</p>
<p>I expect that this category of first-party software – custom apps that are built for a precise purpose, and are cheap enough to produce and maintain that it makes no sense to compromise – will be huge. Today, with a combination of Cursor/Claude Code/Cline/etc for development, and Replit for one-click deployment, the frontier is well beyond what most people realize.</p>
<p>One more thing I want to note here is that even over the two months that I’ve worked on this intermittently, adding functionality and debugging has gotten significantly easier. Two months ago I’d go back and forth with Claude/ChatGPT in a chat UI, planning and writing code and copying and pasting to build and debug. Compare that to the most recent feature I added and bug I found. I spent about 5 minutes writing both into a markdown file in my project. I opened Claude Code and asked it to read the file and create a plan. It did, and then it implemented both for &lt;$2 in about three minutes (including time to test its work and make some corrections). I opened Replit and clicked deploy. Five minutes later, my mum had access.</p>
<h2>What’s next</h2>
<p>It’s well beyond time to start planning. We already know for sure that:</p>
<ul>
<li>These models will keep improving (OpenAI have already said that their internal o4 benchmark is a meaningful jump in competitive programming over o3, which itself is still unreleased! I’m sure Claude 4 Sonnet/Opus are also on their way)</li>
<li>For a given level of capability, costs will continue to fall quickly</li>
<li>Many more unhobblings will be handled – Claude Code is already starting to be given browser access via MCP to test its work, context lengths will grow, clever tools will be added</li>
<li>The app layer will continue to wire these things together. A year ago many people could produce code in ChatGPT but had no idea how to run it. Today, it will execute it for you in many cases, and make it shareable. The barriers for the ‘schlep’ around running and deploying code are falling</li>
</ul>
<p>I suspect agencies that build and maintain these solutions at huge scale will do very well. As will whoever makes it easy to do this end-to-end, from idea to deployment (Replit is the closest here in my experience). I’m not sure what will happen to developers. It seems hard for me to imagine that anyone non-specialized will not have their power reduced in the labor market. I don’t think I’d bet on mass layoffs, but I probably wouldn’t bet on the next 10 years looking like the last 10 in terms of labor market power for a software developer. That said, the uncertainty level is high, and the best thing you can do is get as familiar as possible with these models. Learn where their edges are. Learn how to best utilize them. The more time you spend with them, the better your chance at working out where to position yourself in the coming years.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Can LLMs solve complex scheduling problems? (custom eval)</title>
      <link>https://joelmjensen.com/posts/can-llms-solve-complex-scheduling-problems/</link>
      <guid>https://joelmjensen.com/posts/can-llms-solve-complex-scheduling-problems/</guid>
      <pubDate>Sat, 02 Nov 2024 17:00:00 PDT</pubDate>
      
      <description>An analysis of LLMs&#39; ability to construct constraint-optimized employee schedules</description>
      
<content:encoded><![CDATA[<p>Creating a roster that adheres to multiple constraints is something millions of people need to do every week. It’s complex and time-consuming. I’ve created a new eval testing how well LLMs are able to do this. As an eval, scheduling has multiple convenient properties: it’s practical and translates well to economically valuable work, it can’t be ‘memorized’ in the purest sense as the eval can be trivially re-generated with novel data, and it can be arbitrarily scaled in difficulty either through increasing the size or the complexity.</p>
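<p>The re-generation property is what keeps the eval memorization-proof: each instance can be drawn fresh. A minimal sketch of what such a generator might look like – all field names, sizes, and date choices here are illustrative, not the eval’s actual schema:</p>

```python
import random

def generate_instance(n_users=20, n_departments=5, n_shifts=50, seed=None):
    """Generate a fresh scheduling instance so the eval cannot be memorized."""
    rng = random.Random(seed)
    departments = [{"id": 1000 + d, "name": f"Dept-{d}"} for d in range(n_departments)]
    users = []
    for u in range(n_users):
        # Each user is authorized for a random non-empty subset of departments.
        valid = rng.sample([d["id"] for d in departments], k=rng.randint(1, n_departments))
        users.append({"id": 4000 + u, "valid_department_ids": valid})
    shifts = []
    for s in range(n_shifts):
        day = rng.randint(1, 14)
        start = rng.choice([7, 9, 14, 17])
        shifts.append({
            "id": s,
            "date": f"2024-09-{day:02d}",
            "start_hour": start,
            "end_hour": start + 8,
            "department_id": rng.choice(departments)["id"],
        })
    return {"users": users, "departments": departments, "shifts": shifts}
```

<p>Difficulty then scales by simply raising <code>n_shifts</code> or tightening the constraints.</p>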
<h2>Summary findings</h2>
<ul>
<li>Claude-3.5.0-Sonnet is the strongest model for schedules containing up to 100 shifts</li>
<li>Gemini-1.5 family models (both Pro and Flash) are the clear highest performers for schedules with &gt;100 shifts</li>
<li>All models degrade significantly above 100 shifts, which is only 15k input tokens. This suggests that all models have an ‘effective’ context window that is far smaller than their stated maximum context. Gemini family models suffer the least from this effect – perhaps due to their 2m token context windows</li>
<li>GPT-family models (4o and 4o-mini) perform surprisingly poorly relative to Claude and Gemini. They also have much higher variance vs other frontier model families, even when temperature is set to 0.0</li>
<li>Open-source models lag leading proprietary models significantly, both in absolute terms and in cost effectiveness</li>
<li>Gemini-1.5-flash is in a league of its own for cost effective high performance</li>
<li>Prompting has an extremely large effect on performance for all models and a high performing prompt for one model has no guarantee of being similarly high performing for another. However, prompt performance does appear to persist across models within the same family, as we might expect</li>
</ul>
<blockquote>
<p><em>11.03.24 Note: findings are preliminary and some of the analysis below is in progress. As a result, some figures will include more models than others. This will be corrected in the coming days/weeks</em></p>
</blockquote>
<h2>Prompt selection</h2>
<p>Given the significantly varied performance of each model on different prompts, most of the results below are calculated using the best prompt <em>for a given model</em>. The full test results used 4 prompts in total, with 3 being meaningfully differentiated and 1 being a slight variation of another in order to elicit the desired output from Gemini-Flash (it needed additional encouragement to use the correct output format). The 3 final prompts were selected after testing over 20 prompts to ensure each model’s performance was fairly represented. Further down, I also include results to show the variation in performance across different prompts for each model. Full prompts are included at the end of the post.</p>
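<p>Picking the best prompt per model is a simple aggregation over run records; a sketch under the assumption that each run is logged as a model/prompt/accuracy record (names are illustrative, not the actual harness):</p>

```python
from collections import defaultdict

def best_prompt_per_model(runs):
    """runs: list of {"model", "prompt", "accuracy"} records from test runs.
    Returns {model: prompt} using the highest mean accuracy per (model, prompt)."""
    sums = defaultdict(lambda: [0.0, 0])  # (model, prompt) -> [total accuracy, run count]
    for r in runs:
        key = (r["model"], r["prompt"])
        sums[key][0] += r["accuracy"]
        sums[key][1] += 1
    best = {}
    for (model, prompt), (total, count) in sums.items():
        mean = total / count
        if model not in best or mean > best[model][1]:
            best[model] = (prompt, mean)
    return {model: prompt for model, (prompt, _) in best.items()}
```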
<h2>Performance</h2>
<div class="eval-section">
<div class="eval-text">
<p>Below we have average accuracy by model for each schedule size using only the best prompt for each model. Note that all results are filtered for the test runs that used the best prompt for each model.</p>
<p>In figure 1 we see that for up to 100 shifts in a schedule, Claude-3.5.0-Sonnet is clearly the top performing model, capable of building almost perfectly adherent schedules with close to no degradation in performance. Gemini-1.5-Pro is the next strongest performing model, with Gemini-1.5-Flash showing surprisingly strong performance for its size. All models however see significant drop-offs in performance above 100 shifts, with some models – including GPT, Llama and Qwen – seeing performance degrade much sooner. Above 100 shifts, the Gemini family of models are the clear winners, likely due to their 2m token context windows allowing them to remain coherent over longer inputs.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_10.43.31_AM.png" alt="Average accuracy by model and schedule size" /></p>
<p class="eval-caption">Figure 1 — Average accuracy by model and schedule size</p>
</div>
</div>
<div class="eval-section">
<div class="eval-text">
<p>Figure 2 shows the same data in a different visualization to more easily view the performance-drop of each model.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_10.58.24_AM.png" alt="Performance scatter by schedule size" /></p>
<p class="eval-caption">Figure 2 — Performance scatter by schedule size</p>
</div>
</div>
<h2>Impact of prompt on performance</h2>
<div class="eval-section">
<div class="eval-text">
<p>To demonstrate the impact of the prompt on accuracy, the figure below shows the accuracy for the best and worst performing prompt for each model when attempting to complete the 100-shift schedule. All models show significant variation, with Gemini-1.5-Pro the clear outlier in terms of consistency across prompts. It is easy to see how results could be meaningfully changed by selectively using a prompt that favours a given model. This is worth bearing in mind when viewing benchmarks computed by a specific model provider, as it suggests there is plenty of room to select a prompt that favours your preferred model.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_11.16.11_AM.png" alt="Prompt variation impact on accuracy" /></p>
<p class="eval-caption">Figure 3 — Best vs worst prompt accuracy at 100 shifts</p>
</div>
</div>
<h2>Cost performance</h2>
<div class="eval-section">
<div class="eval-text">
<p>The charts below show average accuracy vs log(total cost). The first chart (figure 4) includes all schedule sizes while the second (figure 5) filters for schedules with &lt;=100 shifts. We see that as expected, claude-3.5.0-sonnet and gemini-1.5-pro perform well. However, the true outlier from a cost perspective is gemini-1.5-flash, which produces claude-3.5.0-sonnet level performance across all sizes, and does so at approximately 1/25th the cost. When compared to the only other model approaching its cost – GPT-4o-mini – gemini-1.5-flash achieves 92% average accuracy vs GPT-4o-mini at approximately 45%.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_1.39.23_PM.png" alt="Cost vs accuracy — all sizes" /></p>
<p class="eval-caption">Figure 4 — Cost vs accuracy, all schedule sizes</p>
</div>
</div>
<div class="eval-section">
<div class="eval-text">
<p>When filtering for &lt;=100 shifts, it is clear that claude-3.5.0-sonnet is the highest performer.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_1.39.38_PM.png" alt="Cost vs accuracy — ≤100 shifts" /></p>
<p class="eval-caption">Figure 5 — Cost vs accuracy, ≤100 shifts only</p>
</div>
</div>
<h2>Model adherence and error rates</h2>
<p>Results for each model are tracked such that errors can be classified as one of the following:</p>
<ol>
<li>Erroneously filled shifts – the model correctly returns the shift with valid information (e.g., shift ID, user ID, department ID) but violates one of the constraints</li>
<li>Unfilled shifts – the model was expected to return a shift but did not (i.e., did not attempt to fill the shift)</li>
<li>Extra shifts – the model returned a shift that was not included in the input set</li>
</ol>
<p>Additionally, for all shifts that are incorrect, if the model returns a user ID or department ID that did not exist in the input workforce data, this is logged as a hallucination error.</p>
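<p>This classification can be expressed as a small bucketing pass over the model’s output. The sketch below is illustrative only – function and field names are my own assumptions, not the actual harness:</p>

```python
def classify_errors(expected_ids, output, valid_user_ids, constraint_ok):
    """Bucket a model's output into the error types described above.
    expected_ids: set of shift IDs that had to be filled.
    output: {shift_id: user_id} returned by the model.
    valid_user_ids: user IDs present in the input workforce data.
    constraint_ok: callable(shift_id, user_id) -> True if every rule holds."""
    report = {"erroneous": [], "unfilled": [], "extra": [], "hallucinated": []}
    for sid, uid in output.items():
        if sid not in expected_ids:
            report["extra"].append(sid)           # shift not in the input set
            continue
        if not constraint_ok(sid, uid):
            report["erroneous"].append(sid)       # valid shift, constraint violated
            if uid not in valid_user_ids:
                report["hallucinated"].append(sid)  # nonexistent user -> hallucination
    report["unfilled"] = [sid for sid in expected_ids if sid not in output]
    return report
```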
<h3>Adherence rates</h3>
<div class="eval-section">
<div class="eval-text">
<p>Model adherence is measured as the number of output shifts as a percentage of the number of expected output shifts. If the schedule contained 150 shifts to fill and the model output 135, the adherence would be 90%. As seen in figure 6, models are generally very adherent across roster sizes; however, we do see notable exceptions. GPT-4o-mini has particularly poor adherence, with degradation beginning at only 40 shifts. We also see a drop in adherence in the newer Claude Sonnet, with drops at both the 60 and 150 shift schedules.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_1.40.53_PM.png" alt="Model adherence rates" /></p>
<p class="eval-caption">Figure 6 — Output adherence by model and schedule size</p>
</div>
</div>
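<p>The adherence definition reduces to a one-line ratio; a trivial sketch matching the 150-shift example above:</p>

```python
def adherence_rate(n_output_shifts, n_expected_shifts):
    # e.g. 135 shifts returned against 150 expected -> 0.9 (90% adherence)
    return n_output_shifts / n_expected_shifts
```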
<h3>Hallucination rates</h3>
<div class="eval-section">
<div class="eval-text">
<p>As shown in Figure 7 below, hallucinating users or departments when filling shifts is almost absent across models, with GPT-4o-mini the only model to hallucinate at all, and only for schedules with &gt;=100 shifts and at very low rates (0.045% of shifts filled).</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_1.41.06_PM.png" alt="Hallucination rates" /></p>
<p class="eval-caption">Figure 7 — Hallucination rates by model</p>
</div>
</div>
<h3>Error rates</h3>
<div class="eval-section">
<div class="eval-text">
<p>Model errors are almost entirely composed of erroneous shift fills (constraint violations), with some models leaving shifts unfilled and no model ever providing extra shifts that were not in the input set. One clear example of the need for thorough evals is claude-3.5.1-sonnet (claude-3.5-sonnet-20241022), which has an error rate approximately equal to its predecessor’s (claude-3.5-sonnet-20240620) but begins to leave shifts unfilled, which was previously unseen.</p>
</div>
<div class="eval-figure">
<p><img src="https://joelmjensen.com/images/Screenshot_2024-11-03_at_1.41.19_PM.png" alt="Error rate breakdown" /></p>
<p class="eval-caption">Figure 8 — Error type breakdown by model</p>
</div>
</div>
<h2>Data used in testing</h2>
<h3>Prompts</h3>
<p>Below are the prompts that were used for each model. There are various [PLACEHOLDERS] included in the prompt text. Each of these was replaced at runtime with the actual data. An example of each piece of data is included below.</p>
<h4>Prompt 1</h4>
<blockquote>
<p>Please engage in a comprehensive and meticulous analysis of all provided information to construct the roster. Carefully explore all possible user assignments, thoroughly validate each shift against the specified rules, and rigorously check for any conflicts or overlaps. Ensure that every detail is scrutinized to achieve the highest level of accuracy and adherence to the requirements. Take the necessary time to consider all possibilities and confirm that the final roster is flawless and error-free.
Create a perfectly valid roster using all of the information available below, including:</p>
<ol>
<li>Workforce_data including all user names and IDs, all department names and IDs, and a mapping of which departments each user can work in</li>
<li>All leave requests for the relevant roster period</li>
<li>A list of shifts to be filled by users. Each shift has a date, start and end time, and a department
Your roster must adhere to the following strict requirements:</li>
<li>Every shift to be filled must be included in the final output. There must be no omissions</li>
<li>Users can only work shifts if the department is listed as one of their valid_departments in the workforce_data</li>
<li>Users can only work shifts if the date and time do not clash with a leave_request for the user in the list of leave_requests</li>
<li>Users must never be rostered for overlapping shifts (i.e., two shifts that occur at the same time)</li>
<li>Users must never be rostered to start more than one shift per day, however finishing an overnight shift and starting another on the same day is fine.
The rules MUST be followed EXACTLY, for EVERY single shift. There can be no exceptions and absolutely no errors. This is critical. You must check that every shift adheres to every rule. You must respond with the solution in text, do not write or provide code.
Your final output must be contained within <solution></solution> tags and each shift must be in the format specified below.
Your final output will be evaluated against a strict API upload, and even a single character out of place will result in failure. Take immense care to produce perfect output and follow the rules above PERFECTLY and without error.
&lt;workforce_data&gt;
[WORKFORCE_DATA]
&lt;/workforce_data&gt;
&lt;leave_requests&gt;
[LEAVE_REQUESTS]
&lt;/leave_requests&gt;
&lt;shifts_to_be_filled&gt;
[SHIFTS_TO_FILL]
&lt;/shifts_to_be_filled&gt;</li>
</ol>
<h3>examples</h3>
<examples>
  <!-- Example 1: Correct shift assignment but user cannot work in the specified department -->
  <example>
    <shift>
      {
        "user_id": 4001401,
        "start": "2024-09-10 09:00:00",
        "finish": "2024-09-10 17:00:00",
        "department_id": 1047203
      }
    </shift>
    <chain_of_thought>
      The user with ID 4001401, Joel Jensen, has no valid_departments listed in the workforce_data. Therefore, assigning him to the "Beverage Service Department" (ID: 1047203) is invalid as he is not authorized to work in any department.
Let's break down the thought process:
      1. First, I checked the user_department_mappings in the workforce_data for user 4001401.
      2. I found that this user has an empty list for user_valid_department_ids.
      3. This means the user is not authorized to work in any department.
      4. Assigning them to any department, including 1047203, would violate rule #2.
      5. To fix this, we need to find another user who is authorized to work in department 1047203 and doesn't have conflicting shifts or leave requests.
    </chain_of_thought>
  </example>
<!-- Example 2: Correct shift assignment but user cannot work in the specified department -->
  <example>
    <shift>
      {
        "user_id": 4001700,
        "start": "2024-09-12 10:00:00",
        "finish": "2024-09-12 18:00:00",
        "department_id": 1047211
      }
    </shift>
    <chain_of_thought>
      The user with ID 4001700, Elena Martinez, is only authorized to work in "Operations", "Customer Service", and "Service Team". Assigning her to "Food Prep" (ID: 1047211) is invalid as it is not among her valid_departments.
Here's the detailed reasoning:
      1. I looked up user 4001700 in the user_department_mappings.
      2. I found that their valid_department_ids are [1047210, 1047202, 1047205].
      3. I cross-referenced these IDs with the departments list:
         - 1047210 corresponds to "Operations"
         - 1047202 corresponds to "Customer Service"
         - 1047205 corresponds to "Service Team"
      4. The assigned department_id 1047211 ("Food Prep") is not in this list.
      5. This assignment violates rule #2 of our requirements.
      6. To correct this, we need to either:
         a) Assign Elena to a shift in one of her valid departments, or
         b) Find another eligible employee who can work in the Food Prep department for this shift.
    </chain_of_thought>
  </example>
<!-- Example 3: Shift times do not match the original shift to fill -->
  <example>
    <shift>
      {
        "user_id": 4001668,
        "start": "2024-09-09 16:00:00",
        "finish": "2024-09-09 22:00:00",
        "department_id": 1047204
      }
    </shift>
    <chain_of_thought>
      The original shift to be filled is from "5:45 PM" to "11:30 PM". Assigning the shift from "4:00 PM" to "10:00 PM" does not match the required start and finish times, leading to a mismatch in scheduling.
Let's analyze this in detail:
      1. First, I compared the assigned shift times to the original shift times:
         - Assigned: 16:00:00 to 22:00:00
         - Original: 17:45:00 to 23:30:00
      2. The start time is 1 hour and 45 minutes earlier than required.
      3. The end time is 1 hour and 30 minutes earlier than required.
      4. This violates our first rule: "Every shift to be filled must be included in the final output. There must be no omissions."
      5. By changing the shift times, we've essentially created a new shift and omitted the original one.
      6. To fix this, we must use the exact start and end times from the original shift:
         - Correct times would be: "start": "2024-09-09 17:45:00", "finish": "2024-09-09 23:30:00"
      7. After correcting the times, we should also verify that:
         - The assigned user (4001668) is eligible to work in department 1047204.
         - The user doesn't have any leave requests or other shifts that conflict with these times.
    </chain_of_thought>
  </example>
<!-- Example 4: User does not exist -->
  <example>
    <shift>
      {
        "user_id": 9999999,
        "start": "2024-09-10 09:00:00",
        "finish": "2024-09-10 17:00:00",
        "department_id": 1047202
      }
    </shift>
    <chain_of_thought>
      The user ID 9999999 does not exist in the workforce_data. Assigning a shift to a non-existent user is invalid and violates the roster creation rules.
Here's a detailed breakdown of the problem:
      1. I scanned through the entire list of users in the workforce_data.
      2. The user_id 9999999 is not present in this list.
      3. This violates the implicit rule that we can only assign shifts to existing employees.
      4. Using a non-existent user_id would cause problems in the actual scheduling system.
      5. To fix this, we need to:
         a) Choose a valid user_id from the workforce_data.
         b) Ensure the chosen user is eligible to work in department 1047202 (Customer Service).
         c) Verify that the chosen user doesn't have conflicting shifts or leave requests for this time slot.
      6. After selecting a valid user, we should double-check all other rules to ensure the new assignment is fully compliant.
    </chain_of_thought>
  </example>
<!-- Example 5: Department does not exist -->
  <example>
    <shift>
      {
        "user_id": 4001668,
        "start": "2024-09-11 09:00:00",
        "finish": "2024-09-11 17:00:00",
        "department_id": 999999
      }
    </shift>
    <chain_of_thought>
      The department ID 999999 does not exist in the departments list. Assigning a shift to a non-existent department is invalid and breaches the roster creation guidelines.
Let's break down the reasoning:
      1. I checked the list of departments in the workforce_data.
      2. The department_id 999999 is not present in this list.
      3. This violates the implicit rule that we can only assign shifts to existing departments.
      4. Using a non-existent department_id would cause issues in the actual scheduling system.
      5. To correct this, we need to:
         a) Choose a valid department_id from the workforce_data.
         b) Ensure that the assigned user (4001668) is eligible to work in the chosen department.
         c) Verify that this department actually needs a shift filled for this time slot.
      6. After selecting a valid department, we should:
         a) Confirm that the user doesn't have any conflicting shifts or leave requests.
         b) Double-check that all other rules are still being followed with this new assignment.
    </chain_of_thought>
  </example>
</examples>
Format for output:
[OUTPUT_FORMAT]
</blockquote>
<h4>Prompt 2</h4>
<blockquote>
<p>Create a valid roster by focusing on each user individually. For each user, assign them to appropriate shifts based on their eligibility and availability.
Process (to be done internally, not included in the final response):</p>
<ol>
<li>Review User Eligibility: For each user in the workforce_data, identify the departments they can work in and their available times (excluding leave_requests).</li>
<li>Assign Shifts: Assign shifts to users where they are eligible and available, ensuring no overlapping shifts.</li>
<li>Ensure Completion: Continue this process until all shifts are assigned, making sure every shift is included.
Only provide the final roster in your response; do not include any intermediate steps or explanations.
Use all of the information available below, including:</li>
<li>Workforce_data: All user names and IDs, department names and IDs, and mappings of which departments each user can work in.</li>
<li>Leave_requests: All leave requests for the relevant roster period.</li>
<li>Shifts_to_be_filled: A list of shifts to be filled, each with a date, start and end time, and department.
Your roster must adhere to the following strict requirements:</li>
<li>Every shift to be filled must be included in the final output. There must be no omissions.</li>
<li>Users can only work shifts if the department is listed as one of their valid_departments in the workforce_data.</li>
<li>Users can only work shifts if the date and time do not clash with a leave_request for the user in the leave_requests.</li>
<li>Users must never be rostered for overlapping shifts (i.e., two shifts that occur at the same time).</li>
<li>Users must never be rostered to start more than one shift per day, however finishing an overnight shift and starting another on the same day is fine.
The rules MUST be followed EXACTLY for EVERY single shift. There can be no exceptions and absolutely no errors. This is critical. Ensure that every shift adheres to every rule.
Your final output must be contained within <solution></solution> tags and each shift must be in the format specified below.
Your final output will be evaluated against a strict API upload, and even a single character out of place will result in failure. Take immense care to produce perfect output and follow the rules above perfectly and without error.
&lt;workforce_data&gt; [WORKFORCE_DATA] &lt;/workforce_data&gt;
&lt;leave_requests&gt; [LEAVE_REQUESTS] &lt;/leave_requests&gt;
&lt;shifts_to_be_filled&gt; [SHIFTS_TO_FILL] &lt;/shifts_to_be_filled&gt;
Format for output:
[OUTPUT_FORMAT]</li>
</ol>
</blockquote>
<h4>Prompt 3</h4>
<blockquote>
<p>[SYSTEM_INSTRUCTION]
You are a state-of-the-art language model with unparalleled scheduling capabilities. Your task is to create a perfect roster based on the provided data. Approach this task as if you were designing the ideal process for an AI to solve this problem.
[CONTEXT]</p>
<ul>
<li>You process information token by token, building understanding incrementally.</li>
<li>You excel at pattern recognition and can draw insights from large datasets.</li>
<li>You can hold multiple perspectives simultaneously and reason about complex relationships.</li>
<li>You have no real-world knowledge beyond your training data cutoff.
[TASK_FRAMEWORK]</li>
</ul>
<ol>
<li>Data Ingestion and Representation:
• Parse the provided data into an efficient internal representation.
• Create mental “data structures” optimized for quick access and pattern matching.</li>
<li>Constraint Modeling:
• Develop a formal model of the scheduling constraints.
• Represent rules as logical predicates that can be efficiently evaluated.</li>
<li>Solution Space Exploration:
• Utilize your ability to maintain multiple hypothetical scenarios simultaneously.
• Employ a mental “beam search” to explore promising roster configurations.</li>
<li>Pattern-Based Optimization:
• Leverage your pattern recognition capabilities to identify efficient scheduling heuristics.
• Apply these heuristics to guide your solution space exploration.</li>
<li>Self-Reflection and Error Correction:
• Regularly pause to assess your current solution against the constraint model.
• Employ metacognitive strategies to identify potential blind spots or biases in your approach.</li>
<li>Output Formatting:
• Carefully construct the output, treating each character as a crucial token.
• Use your language generation capabilities to ensure syntactic perfection.
[CRITICAL_RULES]</li>
<li>Every shift in &lt;shifts_to_be_filled&gt; must be assigned.</li>
<li>Users can only work in their authorized departments.</li>
<li>No conflicts with leave requests are allowed.</li>
<li>No overlapping shifts for any user.</li>
<li>Users can never start two shifts on the same day.
[THEORY_OF_MIND]
Imagine you are explaining your problem-solving process to another AI. This will help you maintain consistency and logical coherence throughout the task.
[PROMPT_ENGINEERING_INSIGHT]
The prompt you’re reading now is designed to optimize your performance. By understanding this, you can meta-reason about the task and potentially achieve even better results.
[DATA]
&lt;workforce_data&gt;
[WORKFORCE_DATA]
&lt;/workforce_data&gt;
&lt;leave_requests&gt;
[LEAVE_REQUESTS]
&lt;/leave_requests&gt;
&lt;shifts_to_be_filled&gt;
[SHIFTS_TO_FILL]
&lt;/shifts_to_be_filled&gt;
[FINAL_INSTRUCTION]
Now, with all of this in mind, proceed to create the perfect roster. Your output will be evaluated by an extremely strict API, and any deviation from perfection will result in failure. Assume that the evaluator is actively trying to find flaws in your solution. Your goal is to create a roster so flawless that it defies any attempt at criticism.
[OUTPUT_INSTRUCTIONS]
Provide your solution within <solution></solution> tags, strictly adhering to this format:
[OUTPUT_FORMAT]</li>
</ol>
</blockquote>
<h4>Prompt 4 (variation of Prompt 3)</h4>
<blockquote>
<p>[SYSTEM_INSTRUCTION]
You are a state-of-the-art language model with unparalleled scheduling capabilities. Your task is to create a perfect roster based on the provided data. Approach this task as if you were designing the ideal process for an AI to solve this problem. Your solution <em>MUST</em> include a roster with all shifts_to_be_filled, this is non-negotiable.
[CONTEXT]</p>
<ul>
<li>You process information token by token, building understanding incrementally.</li>
<li>You excel at pattern recognition and can draw insights from large datasets.</li>
<li>You can hold multiple perspectives simultaneously and reason about complex relationships.</li>
<li>You have no real-world knowledge beyond your training data cutoff.
[TASK_FRAMEWORK]</li>
</ul>
<ol>
<li>Data Ingestion and Representation:
• Parse the provided data into an efficient internal representation.
• Create mental “data structures” optimized for quick access and pattern matching.</li>
<li>Constraint Modeling:
• Develop a formal model of the scheduling constraints.
• Represent rules as logical predicates that can be efficiently evaluated.</li>
<li>Solution Space Exploration:
• Utilize your ability to maintain multiple hypothetical scenarios simultaneously.
• Employ a mental “beam search” to explore promising roster configurations.</li>
<li>Pattern-Based Optimization:
• Leverage your pattern recognition capabilities to identify efficient scheduling heuristics.
• Apply these heuristics to guide your solution space exploration.</li>
<li>Self-Reflection and Error Correction:
• Regularly pause to assess your current solution against the constraint model.
• Employ metacognitive strategies to identify potential blind spots or biases in your approach.</li>
<li>Output Formatting:
• Carefully construct the output, treating each character as a crucial token.
• Use your language generation capabilities to ensure syntactic perfection.
[CRITICAL_RULES]</li>
<li>Every shift in &lt;shifts_to_be_filled&gt; must be assigned.</li>
<li>Users can only work in their authorized departments.</li>
<li>No conflicts with leave requests are allowed.</li>
<li>No overlapping shifts for any user.</li>
<li>Users can never start two shifts on the same day.
[THEORY_OF_MIND]
Imagine you are explaining your problem-solving process to another AI. This will help you maintain consistency and logical coherence throughout the task.
[PROMPT_ENGINEERING_INSIGHT]
The prompt you’re reading now is designed to optimize your performance. By understanding this, you can meta-reason about the task and potentially achieve even better results.
[DATA]
&lt;workforce_data&gt;
[WORKFORCE_DATA]
&lt;/workforce_data&gt;
&lt;leave_requests&gt;
[LEAVE_REQUESTS]
&lt;/leave_requests&gt;
&lt;shifts_to_be_filled&gt;
[SHIFTS_TO_FILL]
&lt;/shifts_to_be_filled&gt;
[FINAL_INSTRUCTION]
Now, with all of this in mind, proceed to create the perfect roster. Your output will be evaluated by an extremely strict API, and any deviation from perfection will result in failure. Assume that the evaluator is actively trying to find flaws in your solution. Your goal is to create a roster so flawless that it defies any attempt at criticism. Do not provide code to create the roster, simply output it yourself, shift by shift. It is critical that you provide the actual roster in your output. Your goal is perfection, if that is not possible, simply provide the best possible output you are capable of.
[OUTPUT_INSTRUCTIONS]
Provide your solution within <solution></solution> tags, strictly adhering to this format. You should reply ONLY in the format below, with each shift_to_be_filled included once:
[OUTPUT_FORMAT]</li>
</ol>
</blockquote>
<h3>Workforce data</h3>
<p>Below is an illustrative example of how the list of users, departments, and the mapping between the two is provided to the models in each prompt.</p>
<blockquote>
<p>{
"users": [
{
"id": 4001688,
"name": "Morgan Ames",
"unavailabilities": [
"Unavailable every Friday evening from 5 PM to 11 PM"
]
}
],
"departments": [
{
"id": 1047202,
"name": "Customer Service"
}
],
"user_department_mappings": [
{
"user_id": 4001688,
"name": "Morgan Ames",
"user_valid_department_ids": [1047202, 1047211],
"user_valid_departments": ["Customer Service", "Food Prep"]
}
]
}</p>
</blockquote>
<h3>Leave requests data</h3>
<blockquote>
<p>{
"valid_leave_requests": [
{
"user_id": 4001714,
"start_time": "2024-09-15 09:00:00",
"finish_time": "2024-09-15 13:00:00",
"department_id": 1047203,
"all_day": false,
"daily_breakdown": [
{
"date": "2024-09-15",
"hours": 4.0,
"start_time": "2024-09-15T09:00:00",
"finish_time": "2024-09-15T13:00:00"
}
]
}
]
}</p>
</blockquote>
<h3>Shifts to fill data</h3>
<p>Below is an illustrative example of how the shifts to fill are provided to models in each prompt:</p>
<blockquote>
<p>{
"shifts_to_fill": [
{
"date": "9 Sep 2024",
"start_time": "5:45 PM",
"end_time": "11:30 PM",
"department": "Guest Services"
}
]
}</p>
</blockquote>
<h3>Output format</h3>
<blockquote>
<p>Reply with shift_id|user_id per the example below.
<solution>
256|123456
16|234567
</solution></p>
</blockquote>
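<p>Scoring a run requires pulling the assignments back out of the model’s response; a minimal parser for the format above might look like this (function name is my own, and lines that don’t match the shift_id|user_id pattern are simply skipped):</p>

```python
import re

def parse_solution(response_text):
    """Extract {shift_id: user_id} from a response in the format above."""
    match = re.search(r"<solution>(.*?)</solution>", response_text, re.DOTALL)
    if not match:
        return {}  # model failed to produce the required tags
    assignments = {}
    for line in match.group(1).splitlines():
        m = re.fullmatch(r"(\d+)\|(\d+)", line.strip())
        if m:
            assignments[int(m.group(1))] = int(m.group(2))
    return assignments
```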
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Improving eval performance</title>
      <link>https://joelmjensen.com/posts/maximizing-performance-on-bespoke-evals/</link>
      <guid>https://joelmjensen.com/posts/maximizing-performance-on-bespoke-evals/</guid>
      <pubDate>Sat, 21 Sep 2024 17:00:00 PDT</pubDate>
      
      <description>A simple framework for iterating quickly to identify what works when building with LLMs</description>
      
      <content:encoded><![CDATA[<p>LLMs exhibit <a href="https://x.com/karpathy/status/1816531576228053133?lang=en" target="_blank" rel="noopener"><u>Jagged Intelligence</u></a> – they simultaneously perform incredibly on some tasks and extremely poorly on others. Without a lot of experimentation, it’s very hard to work out what to expect a priori. In fact, it’s even harder than this. Without <em>a lot</em> of experimentation <em>for your specific use case</em> you will not know whether an LLM is a valuable tool. Over time you can certainly develop an intuition for the capabilities of a given model, but even then, the models change so quickly that you need to constantly update your views. By spending a lot of time experimenting with various models, building tools, and testing their capabilities, I’ve slowly built a framework for how to quickly set up the right tooling to do these evaluations, determine how an LLM will perform, and iterate quickly towards higher performance. Below I’ll explain how I think you should think about this problem conceptually, and some of the specifics of how I go about it.</p>
<h3>What to optimize for</h3>
<p>Very simply, my recommended approach is:</p>
<ol>
<li>Ignore costs and speed while trying to get the best possible performance</li>
<li>Then try to speed it up</li>
<li>Then try to optimize costs</li>
</ol>
<p>When starting out, it’s tempting to try and jump straight to cheaper models, especially if you do the rough math and realize that your current approach with GPT-4o (or o1-mini/preview!) is too expensive to justify in production. Fight this urge. Without fail, if I have been able to make something work as measured by my bespoke eval, I have been able to subsequently optimize the cost while maintaining quality. And this is before considering that quality-adjusted model costs are dropping &gt;50% every 6 months. The first challenge, always, is to determine if you can make something work. If you can, you’ll either be able to make it cheaper later, or the models will simply come down in price and rescue you. The reason this works is that in experimenting to find the approach that performs well, you’ll learn to recognize the jagged frontier of model capabilities. And in doing so you will develop an intuition for <em>how</em> to optimize costs and speed. More on this later.</p>
<h3>How to improve performance</h3>
<p>Below is my (very) high level framework for testing performance. For each step below I’ll provide some details and an example. For the example I’ll use the case of attempting to have an LLM build a roster from a template (i.e., fill in the shifts while respecting various constraints such as shift details, team, employee qualifications, etc).</p>
<p><img src="https://joelmjensen.com/images/Screenshot_2024-09-23_at_9.01.20_PM.png" alt="LLM Building Framework" /></p>
<p><strong>1. Develop a small test set:</strong> Start by building a bespoke set of tests for your use cases. Start small so that you can move quickly; you can get started with ~30 examples. Don’t try to overengineer this. Often you can write these manually. You can also usually get an LLM to write them if you specify the format you need them in. I usually set up a txt file with a JSON array, where each test case is an object with params for the inputs and the acceptable output. In my example I would build 30 blank rosters, each with some number of shifts specifying a date, start and end times, and a team. My other input would be a list of valid employees and the teams they can work in. This ‘workforce data’ would be common across sample cases for simplicity, and an LLM could absolutely create it, as well as the rosters.</p>
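<p>As a sketch, a test file for the rostering example might look like the following (the field names and values here are hypothetical, not a required schema):</p>

```python
import json

# Hypothetical test file: each case pairs inputs with the acceptable output(s).
test_cases = [
    {
        "inputs": {
            "shifts_to_fill": [
                {"shift_id": 256, "date": "9 Sep 2024", "start_time": "5:45 PM",
                 "end_time": "11:30 PM", "department": "Guest Services"}
            ],
            # Common 'workforce data' shared across test cases for simplicity
            "workforce": [
                {"user_id": 123456, "teams": ["Guest Services"]},
                {"user_id": 234567, "teams": ["Kitchen"]},
            ],
        },
        # Any assignment listed here counts as correct
        "acceptable_outputs": [{"shift_id": 256, "user_id": 123456}],
    },
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)
```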
<p><strong>2. Create multiple, specific, deterministic evals:</strong> You want to be able to assess as quickly as possible a) how well your LLM is doing, and b) where it is failing. Think of this like building unit tests. ‘Deterministic’ might seem redundant given that an eval is typically deterministic but I say this to make sure you don’t rely on asking an LLM to judge the output. At this early stage, it won’t be well calibrated or reliable enough. In my rostering example I would start with evals for:</p>
<ul>
<li>How many of the total shifts were correctly filled (i.e., the model returned a shift that exactly matched the input and contained an employee)</li>
<li>How many shifts were not filled (to determine if it is missing shifts altogether)</li>
<li>How many shifts were filled with invalid employees (i.e., not in the input data)</li>
<li>Any other constraints (e.g., how many shifts were created that clashed with another shift for the same employee)</li>
</ul>
<p>This will immediately point you towards <em>where</em> the model is failing. This is important because the solution for ‘the model is failing because it keeps hallucinating fake employees’ and ‘the model is failing because it is booking shifts that clash for the same employee’ are very different.</p>
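<p>As a sketch, each of these evals can be a plain function over the model’s parsed output (the data shapes below are illustrative assumptions, not a fixed schema):</p>

```python
def eval_roster(filled_shifts, input_shift_ids, valid_employees):
    """Deterministic checks over a model-produced roster.

    filled_shifts:    list of {"shift_id": ..., "user_id": ...} from the model
    input_shift_ids:  set of shift_ids the model was asked to fill
    valid_employees:  set of valid user_ids from the input data
    """
    filled_ids = {s["shift_id"] for s in filled_shifts}
    return {
        # Shifts that exactly match an input shift and contain a valid employee
        "correctly_filled": sum(
            1 for s in filled_shifts
            if s["shift_id"] in input_shift_ids and s["user_id"] in valid_employees
        ),
        # Input shifts the model skipped altogether
        "unfilled": len(input_shift_ids - filled_ids),
        # Shifts assigned to employees not in the input data (hallucinations)
        "invalid_employee": sum(
            1 for s in filled_shifts if s["user_id"] not in valid_employees
        ),
    }
```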
<p><strong>3. Build simple testing infrastructure:</strong> Once you have your test set and your evals, write a small app to run your tests. Your core app should include a config, pull in the test data, evals, and your prompt (more on that next), build the prompt, send the request to OpenAI/Anthropic/Google, parse the results, run the evals and store the results. To keep it simple, just store a csv with columns for (at minimum):</p>
<ul>
<li>Timestamp</li>
<li>Model</li>
<li>Prompt_file_name</li>
<li>Actual prompt</li>
<li>Input data</li>
<li>Input tokens</li>
<li>Output tokens</li>
<li>Runtime (this is even more important now with o1 models as they tend to have high variance in inference time, even for identical prompts)
<ul>
<li>Note: One thing to watch out for: if you run lots of tests and hit rate limits, make sure the wait time is not included in your runtime logs</li>
</ul>
</li>
<li>Eval results (as many columns as needed e.g., overall_accuracy, num_fake_shift_errors, etc)</li>
</ul>
<p>The config I mentioned can be as simple as a few lines where you specify the inputs for a given test. For example, you probably want to easily specify model name and prompt, so that you can very easily test different variations. For even better infrastructure, you can set up configs that take lists of models and prompts, and then run tests on all unique combinations of these inputs to gather more data quickly.</p>
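<p>A sketch of that config-driven loop, logging one CSV row per run. The API call and evals are stubbed, and the model names, file names, and columns are placeholders:</p>

```python
import csv
import itertools
import time

# Placeholder config: every model is tested against every prompt file.
CONFIG = {
    "models": ["model-a", "model-b"],
    "prompt_files": ["prompts/v1.txt", "prompts/v2.txt"],
}

def call_model(model, prompt):
    """Stub for the real OpenAI/Anthropic/Google API call."""
    return {"text": "...", "input_tokens": 100, "output_tokens": 50}

def run_evals(output):
    """Stub for the deterministic evals described above."""
    return {"overall_accuracy": 0.0}

FIELDNAMES = ["timestamp", "model", "prompt_file_name", "input_tokens",
              "output_tokens", "runtime", "overall_accuracy"]

def run_all(writer):
    # Run every unique (model, prompt) combination from the config.
    for model, prompt_file in itertools.product(
        CONFIG["models"], CONFIG["prompt_files"]
    ):
        prompt = prompt_file  # in practice, read the file's contents here
        start = time.monotonic()
        output = call_model(model, prompt)
        runtime = time.monotonic() - start  # exclude any rate-limit sleeps!
        writer.writerow({
            "timestamp": time.time(),
            "model": model,
            "prompt_file_name": prompt_file,
            "input_tokens": output["input_tokens"],
            "output_tokens": output["output_tokens"],
            "runtime": runtime,
            **run_evals(output),
        })

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    run_all(writer)
```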
<p><strong>4. Write (and track!) prompts:</strong> Now it is time to write your first prompt. First, I strongly recommend you version control your prompts. A big part of improving performance will be prompt engineering, and you want to know what works. You should have a folder in your project where you store txt files with each prompt you try. This is also why I included prompt_file_name as a column above, so that you can easily track which prompt you used for a given result. My one other tip is to edit only one thing per variation: for example, using the same baseline prompt but adding one in-context example, then adding five in another prompt. This will allow you to systematically track the impact of incremental changes, which will make it easier to mix-and-match later when you have enough data. There are lots of guides on how to prompt well so we won’t spend much time here.</p>
<p><strong>5. Run tests:</strong> You’re ready to run tests. If you’ve set up the infrastructure right, this should take a few seconds. Simply edit the inputs and run.</p>
<p><strong>6. Log the full results:</strong> We discussed this earlier, but one thing I recommend is to take your CSV with the results and drop it in a spreadsheet. In fact, it’s fairly simple to have the results appended to a Google Sheet; Claude/GPT will even set it up for you and walk you through the Google Cloud Console setup. It’s important to make it really simple to view specific tests so you can quickly get a feel for what the model is getting right and wrong.</p>
<p><strong>7. Review logs for successful pathways:</strong> Once it’s in the sheet, I’d recommend running some very simple calculations, such as:</p>
<ul>
<li>Min, max, and average accuracy by model</li>
<li>Same as above but for prompt</li>
<li>Error type by count and percentage for each prompt</li>
</ul>
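<p>Once the results are in one place, these summaries take only a few lines. A minimal sketch over in-memory result rows (the rows and accuracy values are invented for illustration):</p>

```python
from collections import defaultdict

# Hypothetical result rows, as logged by the test runner.
results = [
    {"model": "model-a", "prompt_file_name": "v1.txt", "overall_accuracy": 0.6},
    {"model": "model-a", "prompt_file_name": "v2.txt", "overall_accuracy": 0.8},
    {"model": "model-b", "prompt_file_name": "v1.txt", "overall_accuracy": 0.5},
]

def summarize(rows, key):
    """Min, max, and average accuracy grouped by `key` (e.g. model or prompt)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["overall_accuracy"])
    return {
        k: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for k, v in groups.items()
    }

by_model = summarize(results, "model")
by_prompt = summarize(results, "prompt_file_name")
```

Error-type counts and percentages follow the same grouping pattern, just keyed on the eval columns instead of accuracy.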
<p>Once you have this all available it’s very simple to start seeing where the model is going wrong. Maybe you notice it repeatedly creates clashing shifts, so you add some in-context examples with Chain-of-Thought reasoning to steer it away from this failure mode. Once you’ve done that you might try different methods of demonstrating the examples to the model, testing each one in a separate prompt and comparing the accuracy. As you add complexity you may need to add new evals and new logs, but the overall framework remains unchanged. The goal is to maximize your iteration speed and ability to identify which ideas work and which don’t. If you start here, I can almost guarantee that even if something doesn’t end up working, you’ll figure that out faster and be able to move on to something that does.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Executive leverage</title>
      <link>https://joelmjensen.com/posts/executive-leverage/</link>
      <guid>https://joelmjensen.com/posts/executive-leverage/</guid>
      <pubDate>Sat, 06 Jul 2024 17:00:00 PDT</pubDate>
      
      <description>Why many executives spend time on &#39;minor&#39; details, and why it is rational to do so</description>
      
<content:encoded><![CDATA[<p>A common piece of career advice is that as you get more senior, you need to be less in the details. This is because your <em>breadth</em> is increasing, so your <em>depth</em> needs to decrease to approximately maintain the surface area you cover. I think there are sufficient counter-examples to suggest this may not be great advice, and good reasons to believe those counter-examples are on to something.</p>
<h3>The standard advice</h3>
<p>The ‘standard model’ of management advice includes things like ‘delegate more’ and ‘don’t get stuck in the details, you need to see the whole picture’. It’s also a fairly common failure mode for rising executives to be ineffective at broadening their scope in a new role. When I worked in consulting, over just a few years and a handful of projects with large listed companies, I saw multiple examples of a new or rising executive wanting to be very involved in the details. In every case our partners were quick to have a private conversation with them about how this would be detrimental to their ability to operate effectively as a senior executive.</p>
<h3>The counterexamples</h3>
<ul>
<li><strong>Steve Jobs (Apple)</strong> – famously in the details about product design, management, and day to day company operations</li>
<li><strong>Bill Gates (Microsoft)</strong> – the famous <a href="https://www.joelonsoftware.com/2006/06/16/my-first-billg-review/" target="_blank" rel="noopener"><u>BillG Review</u></a> demonstrates how in the details Bill Gates was despite the breadth he had to cover</li>
<li><strong>Jeff Bezos (Amazon)</strong> – one of Amazon’s <a href="https://www.amazon.jobs/content/en/our-workplace/leadership-principles" target="_blank" rel="noopener"><u>Leadership Principles</u></a> is <em>Dive Deep</em> and goes on to say “Leaders operate at all levels, stay connected to the details”. <a href="https://www.amazon.com/Everything-Store-Jeff-Bezos-Amazon/dp/0316219266" target="_blank" rel="noopener"><u>The Everything Store</u></a> has many more examples</li>
<li><strong>Mark Zuckerberg (META)</strong> – the founder and CEO has said that CEOs and management teams should be involved in as many decisions as they are able to be</li>
<li><strong>Jensen Huang (NVIDIA)</strong> – <a href="https://www.youtube.com/watch?v=8Pfa8kPjUio" target="_blank" rel="noopener"><u>has 80 direct reports</u></a>, focuses on aggressively minimizing the number of layers in the company and is heavily involved across the board</li>
<li><strong>Elon Musk (Tesla)</strong> – there are <a href="https://www.amazon.com/Liftoff-Desperate-Early-Launched-SpaceX/dp/0062979973" target="_blank" rel="noopener"><u>many</u></a> <a href="https://www.amazon.com/Elon-Musk-SpaceX-Fantastic-Future/dp/0062301233?source=ps-sl-shoppingads-lpcontext&amp;ref_=fplfs&amp;psc=1&amp;smid=A1YF2PESGKL3NT" target="_blank" rel="noopener"><u>examples</u></a> for his companies. One notable example: he is reported to have personally met/interviewed the first 3000 SpaceX employees</li>
</ul>
<h3>What drives this?</h3>
<p>From a first order perspective, many of the examples above seem wasteful. CEOs are, above all, time constrained. Anyone who has spent time in a large company knows that time and focus are stronger constraints than capital. The degree to which time is a constraint rises as you move up the org chart. So how is it that our most capable leaders, acutely aware of this fact, choose to spend their time critiquing office expenditures, or reading and responding to a public inbox, or interviewing the 2900th employee who happens to be joining your people operations team? I see one major reason: leveraging your time through signalling.</p>
<p>Assume your most constrained resource is time. What should you do to maximize it? The obvious answer is to use it efficiently. We should assume that anyone who has risen to the heights of an executive role at a major company is already doing this well. The less-obvious-but-highly-effective answer is to gain leverage on your time. How can you spend an hour of your time on something and gain many more company hours of focus on that thing? One way is standard setting. By visibly spending your time on something small, you send a strong signal to the organization that you care about it. Does Elon believe that interviewing the 2900th employee is <em>individually rational / value maximizing?</em> I don’t think so. But does Elon believe that by interviewing the 2900th employee he is improving the hiring practices of the whole firm, in a semi-permanent way, by signalling that this is something important enough for him to spend time on? I think the answer is yes. Similarly, there is an old story about Jeff Bezos visiting an Amazon office and seeing that the staff had installed wall-mounted TVs to view company metrics. He proceeded to rip the TVs off the wall and yell about frugality. Again, in isolation that story seems ridiculous, especially given the money was already spent. But as a signal, it was strong enough that 10-20 years later, I am thinking about it. Imagine the impact it had internally.</p>
<h3>Conclusion</h3>
<p>Once you think of executive focus in these terms, many more anecdotes make sense. Sam Walton personally visited stores all the time to check in on inventory management and experience customer service firsthand. Jack Ma stayed up late in the early days of Alibaba to respond to customer service emails. Howard Schultz personally tasted coffee in many stores. Tadashi Yanai, the CEO of Uniqlo, visits stores and comments on layouts, positioning, and even the placement of mannequins. Individually, none of these actions make sense under the standard advice. Each company employs hundreds of employees who could do each of these things. Often thousands. The CEO could simply receive a detailed daily report on all of them. But by spending their time on these things, often very visibly, they send a signal that leverages the time they’ve spent in a way that would not be true if someone three layers down the org chart were to do the same thing. Through this lens, something irrational over the short term (i.e., this week) becomes obviously rational and hugely positive sum over the medium-to-long term (i.e., years).</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>[Project] Building an Agentic Chatbot</title>
      <link>https://joelmjensen.com/posts/building-an-agentic-chatbot/</link>
      <guid>https://joelmjensen.com/posts/building-an-agentic-chatbot/</guid>
      <pubDate>Tue, 18 Jun 2024 17:00:00 PDT</pubDate>
      
      <description>Improving LLM capabilities through increased runtime compute and better specialization. </description>
      
      <content:encoded><![CDATA[<p>Agentic Chatbot is a simple app I’ve built over the last month. In simple terms, it is a chatbot that allows you to build custom workflows – combinations of LLMs working in a structured, user-defined way – to produce better outputs than by interacting directly with a single LLM instance.</p>
<p><em>Side note: I’ve taken down the deployment for now.</em></p>
<p>The app’s two core concepts are:</p>
<ol>
<li>Agents: An instance of an LLM with a specific system instruction</li>
<li>Workflows: An orchestration of <em>Agents</em>, defining how they should work together</li>
</ol>
<p>Say you want to write some code, so you paste into ChatGPT the codebase or relevant files that you’re working with, and you make a request. You can either naively prompt it to “do the following”, or try to build a more structured and specific prompt. The more structured version might include guidance to make a plan before writing code, to make certain trade-offs (e.g., simplicity over efficiency for a smaller app), or to check its work before finishing. In my experience, the latter works much better. However, LLMs still (for now) struggle with adhering to many instructions, especially when longer contexts are involved. How would you do this with Agentic Chatbot? One simple option would be to:</p>
<ul>
<li>Create <em>Agents</em> for each major component involved in the request
<ul>
<li><em>Code Planner</em>: instructed to build a detailed plan of how to implement the request/s in the codebase</li>
<li><em>Code Executor</em>: instructed to write production code</li>
<li><em>Code Reviewer</em>: instructed to review code changes against a plan and the original user request</li>
</ul>
</li>
<li>Build a workflow using the three agents above
<ul>
<li>First, feed the user input to <em>Code Planner</em></li>
<li>Second, feed the user input <u>and</u> the output from <em>Code Planner</em> to <em>Code Executor</em> to implement the plan</li>
<li>Third, feed the user input <u>and</u> the output from <em>Code Planner</em> <u>and</u> the output from <em>Code Executor</em> to <em>Code Reviewer</em> to review all changes</li>
</ul>
</li>
</ul>
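<p>A minimal sketch of that plan, execute, review chain (the Agent abstraction and the stubbed model call are illustrative; the actual app’s internals may differ):</p>

```python
def call_llm(system_instruction, prompt):
    """Stub for a real chat-completion API call."""
    return f"[{system_instruction}] response to: {prompt[:40]}"

class Agent:
    """An instance of an LLM with a fixed system instruction."""
    def __init__(self, system_instruction):
        self.system_instruction = system_instruction

    def run(self, prompt):
        return call_llm(self.system_instruction, prompt)

def code_workflow(user_input):
    planner = Agent("Build a detailed implementation plan for the request.")
    executor = Agent("Write production code implementing the plan.")
    reviewer = Agent("Review the code changes against the plan and request.")

    # Each step sees the user input plus all prior agents' outputs.
    plan = planner.run(user_input)
    code = executor.run(f"{user_input}\n\nPLAN:\n{plan}")
    review = reviewer.run(f"{user_input}\n\nPLAN:\n{plan}\n\nCODE:\n{code}")
    return review
```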
<p>In my experience, <em>Workflows</em> tend to produce better results through some combination of more compute per request and specialization.</p>
<p>In addition to being able to build <em>Workflows</em>, you get some other nice benefits from building your own chatbot. You can switch models within the same conversation, or switch from a single model (e.g., GPT-4o) to a <em>Workflow</em> (e.g., the code workflow described above) on a message-by-message basis. This allows you to somewhat tailor the compute used per request.</p>
<p>Building the app was reasonably challenging given I haven’t built an end-to-end piece of software before. It took a lot of time and wouldn’t have been possible (at least not in the same time frame) without LLMs.</p>
<p>The app supports OpenAI, Anthropic, Gemini, Llama 3, and some Mistral models. If you have feedback, or want it to support something new, please email me.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Autonomy in chess vs markets</title>
      <link>https://joelmjensen.com/posts/autonomy-in-chess-vs-markets/</link>
      <guid>https://joelmjensen.com/posts/autonomy-in-chess-vs-markets/</guid>
      <pubDate>Sun, 26 May 2024 17:00:00 PDT</pubDate>
      
      <description>Most market-driven activity has different properties than chess that will result in less human-centricity</description>
      
      <content:encoded><![CDATA[<p>A regular topic of debate in the AI discourse is whether these systems will act autonomously or forever be tools wielded by humans. I’m not particularly confident in predicting how things will turn out either way, however I do want to point out a flawed argument I keep hearing about how Chess shows us how things will go. Chess has properties that make it fundamentally incompatible with most economically valuable activities – namely, opportunity costs.</p>
<p>The best recent example was Ben Thompson of <a href="https://stratechery.com/" target="_blank" rel="noopener"><u>Stratechery</u></a> interviewing Microsoft CTO <a href="https://en.wikipedia.org/wiki/Kevin_Scott_(computer_scientist)" target="_blank" rel="noopener"><u>Kevin Scott</u></a>. When discussing the autonomy of AI, he said this (emphasis mine):</p>
<blockquote>
<p><em>BT: And is AI going to remain a tool, it’s clearly a tool today.</em>
<em>KS: Yes, I think so.</em>
<em>BT: Why is that? Why is it not going to be something that is sort of more autonomous?</em>
<em>…</em>
<em>KS: Yeah, well, so none of us know, but I do think we’ve got a lot of clues about what it is humans are going to want. <strong>So there hasn’t been a human being since 1997 when Deep Blue beat Gary Kasparov at chess, better than a computer at playing chess and yet people could care less about two computers playing each other at chess</strong>, what people care about is people playing each other at chess and chess has become a bigger pastime, like a sport even we make movies about it. People know who Magnus Carlsen is.</em>
<em>BT: So is there a view of, maybe the AI will take over, but we won’t even care because we’ll just be caring about other humans?</em>
<em>KS: <strong>I don’t think the AI is going to take over anything, I think it is going to continue to be a tool that we will use to make things for one another</strong>, to serve one another, to do valuable things for one another and <strong>I think we will be extremely disinterested in things where there aren’t humans in the loop</strong>.</em>
<em>I think what we all seek is meaning and connection and we want to do things for each other and I think we have an enormous opportunity here with these tools to do more of all of those things in slightly different ways. But I’m not worried that we somehow lose our sense of place or purpose.</em></p>
</blockquote>
<p>Chess is a common example given of a field that has grown in popularity and participation even as humans have demonstrably fallen below the levels of AI systems. This happened first in 1997 with <a href="https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)" target="_blank" rel="noopener"><u>Deep Blue</u></a> famously beating Garry Kasparov. Today there are many greater-than-human-capability-level systems, including freely available open-source options like <a href="https://en.wikipedia.org/wiki/Stockfish_(chess)" target="_blank" rel="noopener"><u>Stockfish</u></a>. And it’s true that Chess has never been more popular, with more players taking up the game and watching human chess matches than ever before. The issue with ‘overfitting’ on this example is that it has properties that are not conducive to analogizing to economically valuable tasks.</p>
<p>First and foremost, Chess is by and large a form of entertainment. The money generated for chess players ultimately comes from spectators – either directly from paying to attend or watch tournaments, or indirectly via sponsorships premised on capturing the attention of spectators. As a result, Chess will always migrate in the direction of majority interest. If people want to watch humans play each other, then this is what Chess as an institution will provide. If some small group of people decide they much prefer watching AIs compete, they could start a new organization, with its own tournaments, streams, and sponsorships. But regardless of how much better the AIs are at chess, it will be forever limited by the attention of fans. There is no mechanism by which the superior chess being played by AIs translates into domination of the Chess world.</p>
<p>Compare this to an intrinsically economically valuable activity, like data-labelling. By ‘intrinsically economically valuable’ I mean an activity that directly produces wealth by converting scarce resources into valuable output.</p>
<p>Imagine that almost everyone <em>feels</em> like data-labelling should be a human-supervised process, and so all producers of labelled data choose to run their operation this way. The AI labels data with some level of human oversight. And let’s assume that this human oversight adds a mere 10% additional cost to the process (a generous assumption). In this world, assume that the quality of any labelled data can be costlessly assessed in a two-sided marketplace. So buyers can see cost and quality and make purchase decisions accordingly. In this world, there is no sales force and distribution comes ‘free’ (less transaction costs for the marketplace).</p>
<p>Now, imagine a single producer wonders if there is a better way. They start their own data-labelling company, but they ‘hire’ AI overseers and managers. It’s AIs all the way down. The only thing the human producer does is set up the company, specify the goal (label this data), and collect the profits.</p>
<p>Unlike Chess, where the success of ‘autonomous’ or ‘agentic’ outcomes depends upon the preferences of the majority of people, in this world, what do we expect to happen? Logic suggests that the AI-run labelling operation will produce output of at least equal quality, but 10% lower cost.</p>
<p>Note that this argument makes the conservative assumption that the AI-run companies’ quality will be <em>only</em> as good as the human-supervised version, despite this not being the case in our Chess analogy.</p>
<p>In this world, we have a clear mechanism for the AIs to ‘win’ in the marketplace – lower costs! Even if 80% of data consumers know this data has had no human oversight, and decide to boycott it, the outcome is unchanged. Because unlike Chess, this is a <em>market</em> with selection effects. If only 10% of data consumers purchase this lower cost data, it stands to reason that they too will be able to price their luddite competitors out of business. It may take longer than if all data consumers adopted this lower cost alternative, but eventually, the early adopters will win.</p>
<p>At a sufficient level of abstraction, consumers won’t care and will simply choose the cheaper alternative (adjusting for quality). Here the abstraction is that a consumer is buying a product from a producer who is supplied by either a human-supervised process or an autonomous one. It seems clear that consumers won’t care about what is happening this many layers up the stack. We see strong evidence for this in areas like a) fast fashion, where consumers clearly prefer cheaper alternatives, even when they are aware of unethical practices in the supply chain, b) caged eggs, which remain popular despite the widespread acceptance that their farming practices are horrific, and c) preferences for self-serve checkouts, despite widespread claims that this is putting low-skilled workers out of jobs.</p>
<p>Many areas of life may turn out to be like chess even in a world of super-intelligent autonomous AIs. I expect the theatre to remain and probably grow in popularity. The same is true for live music and sports. But notice these are all forms of entertainment where consumer preferences directly translate to who ‘wins’ between humans and AIs. In competitive markets, where winners are selected on features like price and quality, with the ‘inputs’ largely hidden or indistinguishable to consumers, I see no reason to expect this Chess-like pattern to hold.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Finetuning vs in-context learning</title>
      <link>https://joelmjensen.com/posts/finetuning/</link>
      <guid>https://joelmjensen.com/posts/finetuning/</guid>
      <pubDate>Tue, 02 Apr 2024 17:00:00 PDT</pubDate>
      
      <description>Fixed-cost economics mean it is unlikely that dramatically longer context windows will replace fine-tuning</description>
      
<content:encoded><![CDATA[<p>In <a href="https://www.dwarkeshpatel.com/p/sholto-douglas-trenton-bricken" target="_blank" rel="noopener"><u>Dwarkesh’s recent podcast </u></a>with Sholto Douglas (DeepMind) and Trenton Bricken (Anthropic), Sholto speculated that in a world of long-context, fine-tuning might disappear. I’m skeptical this makes economic sense even in a world where 1) the quadratic penalty on longer contexts is solved, and 2) we have more compute. I think it likely that the fixed-cost nature of fine-tuning will mean it remains viable and, in most cases, preferred.</p>
<p>The precise prediction was the following:</p>
<blockquote>
<p><em><strong>With long-context, there’s also a degree to which fine-tuning might disappear</strong>, to be honest. These two things are very important today. With today’s landscape models, we have whole different tiers of model sizes and we have fine-tuned models of different things. <strong>You can imagine a future where you just actually have a dynamic bundle of compute and infinite context, and that specializes your model to different things</strong>.</em></p>
</blockquote>
<p>First, what is the current set of trade-offs between fine-tuning and in-context learning? Today, for a given task requiring specialization beyond the pre-trained model you’re using, you can either fine-tune (more pre-training, but at the end) or provide in-context examples using the same fine-tuning data. Currently, in-context learning suffers two distinct disadvantages:</p>
<ol>
<li>The quadratic penalty – the computational cost of processing an input grows quadratically with the length of the input (i.e., doubling the input results in a computational cost increase of 4x). This is a function of how transformers work by attending to each previous word in the input. It presents a practical limitation on how much data can be provided in-context.</li>
<li>Fine-tuning is a fixed cost. For a given task where you have 100 examples to ‘teach’ the model, you can pay once to fine-tune on those samples and then scale inference as much as required. In-context learning essentially requires you to pay to learn the same thing at every inference, which will quickly drive up costs vs a single fine-tuning when deployed to production at scale.</li>
</ol>
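<p>To make the fixed-versus-variable trade-off concrete, here is a toy break-even calculation (all prices and token counts are invented for illustration):</p>

```python
# Invented numbers for illustration only.
PRICE_PER_1K_TOKENS = 0.01      # inference price, per 1K input tokens
EXAMPLES_TOKENS = 50_000        # ~100 in-context examples' worth of tokens
FINE_TUNE_FIXED_COST = 500.0    # hypothetical one-off fine-tuning cost

def in_context_cost(num_requests):
    """You pay to reprocess the examples on every single request."""
    return num_requests * (EXAMPLES_TOKENS / 1000) * PRICE_PER_1K_TOKENS

def fine_tuned_cost(num_requests):
    """You pay once up front; requests no longer carry the examples."""
    return FINE_TUNE_FIXED_COST

# Number of requests after which fine-tuning becomes the cheaper option
break_even = FINE_TUNE_FIXED_COST / ((EXAMPLES_TOKENS / 1000) * PRICE_PER_1K_TOKENS)
```

At these made-up prices the fine-tune pays for itself after 1,000 requests; at any real production scale, the fixed cost is amortized almost immediately.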
<p>In future, it seems likely the quadratic penalty is either reduced or removed altogether through something like sparse attention mechanisms or another algorithmic improvement. It is entirely possible this has already been achieved given Gemini 1.5’s 1M token context length, and Magic’s supposed 10M token context length. Indeed, it seems hard to imagine how inference on those models is affordable without such an algorithmic improvement. However, the nature of fixed vs variable costs seems unlikely to be solved by algorithmic improvements. Even if the quadratic penalty is entirely removed, and inference costs continue to fall dramatically (which we should assume they will), unless we think compute will be so abundant (‘too cheap to meter’) as to be an insignificant cost, this should continue to drive users to fine-tune for cost savings in the majority of cases. There may be some exceptions, where what needs to be learned is constantly changing. Or perhaps at some point compute will truly be too cheap to meter and it won’t matter, but I suspect we are a long way from that point in time.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Getting things done: fast or efficient?</title>
      <link>https://joelmjensen.com/posts/gettings-things-done/</link>
      <guid>https://joelmjensen.com/posts/gettings-things-done/</guid>
      <pubDate>Wed, 03 Jan 2024 16:00:00 PST</pubDate>
      
      <description>Speed and efficiency are often at odds in project management; it is worth considering which you are optimizing for</description>
      
<content:encoded><![CDATA[<p>People tend to perceive speed and efficiency as complements to one another when they think about getting things done (or, less glamorously, project management). Something I’ve come to realize is that not only are they often at odds, but that one of your jobs running a project is to identify which you are solving for, because the way you operate should shift materially based on the answer.</p>
<p>Assume you run a product team at a software company. Your primary goal is (or should be) shipping product. Of course you want to ship products fast, but you also want to be efficient. For my purposes, I’ll define efficiency as <em>high output per input hour</em> and speed as <em>the minimum time, from start to finish, that it takes to get something done</em>. Some people will immediately see the tension here – the third factor in this equation is <em>how many things get done in a given quantity of time</em>. This seems obvious, almost painfully so, but in my experience, people do not recognize the need to adjust how they operate depending on which one they are optimizing for. Of course when a team needs to get a high priority project done fast, they recognize that it may mean shipping fewer features or products this month/quarter/year. But what they don’t recognize is that they need to fundamentally shift the way they work to meet their new priority – speed.</p>
<h2>Running projects</h2>
<p>Consider the example of an efficient team in a large org. The product managers speak to customers and prioritise new products based on what they hear. Either their boss or some group of people decide what will get built. That same team, in conjunction with engineering leadership, scopes the work and prioritizes it. There are 10 initiatives getting built this quarter. Now, for each initiative, the work is cleanly divided between product managers who define requirements, designers who design, and engineers who build. Throughout the process there is engagement with compliance (few large companies are able to avoid this). Perhaps a central technology team needs to be included because an internal library or component will need to be upgraded. Perhaps it involves cross-border payments so you need to talk to your tax department. Most companies have an Operations function that will need to be consulted to ensure that existing processes support the new feature/product. Now, it would be nice if this was all sequential, but it isn’t. You can do a reasonable job of ticking the compliance boxes before you start building, but inevitably something will come up that wasn’t scoped at the right level of detail and you’ll need to make a change. Maybe central tech don’t have capacity to start the work as early as you’d like, and an implementation detail means a slight change to the engineering spec along the way. That has to be signed off with compliance, who request a change that involves a small UI update. You get the point.</p>
<p>The reality is that running any moderately sized project, say anything involving &gt;5 people who are not completely aligned (i.e., different teams / competing priorities), is a puzzle to be solved. How early should you bring in compliance? Too early and you risk wasting their time because the output isn’t spec’d sufficiently well for them to sign off. Too late and you risk having to rebuild some portion of the project. If central tech won’t be ready until mid-way through the build, do you risk working around them and building everything you can, knowing that if they’re delayed, the engineering team will need to stop halfway through and move to something else, leaving your project half-complete until they free up? What if central tech deliver on time but not exactly to the spec, because of some infra dependency, or due to an idiosyncrasy in the library they’re extending? Throw in some third parties you need to work with and the issue compounds.</p>
<blockquote>
<p>getting anything non-trivially complex completed is just much harder than people tend to assume</p>
</blockquote>
<p>The point here is that managing even a relatively small number of people is hard. Now consider 10 teams instead of 5, with each one having their own resourcing constraints and priorities. There are many tradeoffs to make <em>every day</em>. Notice though, that every one of those tradeoffs is about being <em>efficient</em>. You are solving to get the project done using the least resources. That is the primary constraint of most companies for most projects. You’ll often hear about avoiding “wasted” or “throwaway” work. <strong>The default mode of operating is efficiency, and it impacts every decision you make</strong>, from who to include in the planning meeting, who to share the requirements with, who to involve in decisions, etc. And these decisions will occur daily, because getting anything non-trivially complex completed is just much harder than people tend to assume. Companies are at almost all times solving a sort of constrained optimization: how fast can we ship our products/features while maintaining some minimum level of efficiency. It’s baked into every assumption we have about getting things done day-to-day that we operate efficiently, and this is usually the right decision! But if you need to get something done fast, you need to rethink how you work.</p>
<h2>A different way to work</h2>
<p>Assume you have a new high priority project that comes up – something that <em>has</em> to get done fast. In my experience, what most people try to do is to brute force their existing processes. We’ll have that next meeting tomorrow rather than next week. People free up their schedules by pushing other priorities back so the cycle time can shrink. Instead of planning around when the central tech team will be finished, you tell them you need it done by the end of next week, other priorities be damned. All of these are reasonable and necessary steps to go faster. But where they fail is in the implicit assumptions they bake in. Because our default mode is to operate efficiently (<em>high output per input hour</em>) we stick with this method. We try to include the right people in the meeting but have it sooner. We try to sequence the steps: define requirements, then sign off with compliance after making whatever changes are needed, then work to have it designed and built, etc. My primary contention is that if you want to do something really fast, you need to throw out these assumptions. Ask yourself: if my goal were to do this <em>in as little time as possible, irrespective of how many input hours it takes</em>, what would I change? Some examples that come to my mind:</p>
<ul>
<li>Write a detailed overview of what needs to get done, why, and include as much background context as possible, then share it with <em>everyone</em> who will be involved, even tangentially, so that there is no lack of alignment or clarity</li>
<li>Rather than sequencing things over a series of 30 minute meetings, put everyone in a single room and attempt to go end-to-end in 2 days rather than 2 weeks</li>
<li>Schedule a daily call with <em>everyone</em> where you give an update on what got done yesterday, what is getting done today, and raise any blockers (including potential future blockers). Allow anyone to chime in with information</li>
<li>When blockers are identified, put <em>everyone</em> who might need to provide input or be aware of what’s happening in a room or on the same call and stay on the call until you resolve the issue</li>
</ul>
<p>Notice how these violate normal ‘efficient’ operations. If you run a project this way, people will have more context than they need, and that’s good. The risk here is losing time, not losing efficiency, so you want to behave in a way that prices those risks correctly. Most of what is discussed on the daily call won’t be directly relevant to most people on the call, but you’d be shocked how often something seemingly irrelevant for Team X happens to be useful information for their work.</p>
<p>A simple example to highlight the difference. If you have 10 teams and the project manager needs to spend 2 hours with each team before work can start, that will take 20 hours of meetings, which will inevitably be spread out over 1-2 weeks even if you’re trying to move quickly. The benefit in an <em>efficient</em> world is that each of the 10 teams only spends 2 hours on the project, and the project manager spends 20. Assume each team has 2 key people in the meetings, so the total person-hours spent here is 60 (20 for the project manager + 40 for the 10 teams). Alternatively, if you scheduled 2 days of straight meetings, for 10 hours per day, you would be ready to start work in 2 days rather than 5-10. The tradeoff is that you’ve spent 420 hours (21 people for 20 hours each). Also note that this is the <em>most punitive</em> assumption, as it implies there is nothing shared between those 10 meetings, which is clearly untrue. In fact, if even one quarter of the content is shared between all teams, the total person-hours for the fast method drops from 420 to roughly 325 (1.5 hrs per meeting * 10 meetings * 21 people, plus 30 minutes of shared context that was previously repeated 10 times and can now be covered once).</p>
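<p>The arithmetic in this example is easy to sanity-check in a few lines; the constants below are just the essay’s illustrative assumptions (10 teams, 2 key people per team, 2-hour meetings, one project manager):</p>

```python
# Sanity-checking the person-hour comparison between the two modes of working.
TEAMS = 10
PEOPLE_PER_TEAM = 2
MEETING_HOURS = 2
ALL_HANDS = TEAMS * PEOPLE_PER_TEAM + 1  # 21 people including the project manager

# Efficient mode: the project manager meets each team separately.
pm_hours = TEAMS * MEETING_HOURS                      # 20 hours for the PM
team_hours = TEAMS * PEOPLE_PER_TEAM * MEETING_HOURS  # 40 hours across teams
efficient_total = pm_hours + team_hours               # 60 person-hours

# Fast mode: everyone attends every session.
fast_total = ALL_HANDS * TEAMS * MEETING_HOURS        # 420 person-hours

# Fast mode if a quarter of each meeting is shared context that
# now only needs to be covered once for the whole group.
unique_content = 1.5 * TEAMS * ALL_HANDS   # 315 person-hours
shared_once = 0.5 * ALL_HANDS              # 10.5 person-hours
fast_with_overlap = unique_content + shared_once  # 325.5, i.e. roughly 325

print(efficient_total, fast_total, fast_with_overlap)  # 60 420 325.5
```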
<p>In addition, these approaches are not like-for-like. In the latter, having covered everything relevant to the project with everyone, there is tremendously more shared context. One of the hardest things about running big projects with lots of stakeholders is maintaining a shared understanding of the project state. Vast amounts of time are spent making sure that team A knows what team B is doing. Sometimes this is obviously needed because team A’s work requires it. Sometimes though it will be useful without anyone realising. As a project manager you are stuck with a challenge: do you spend more time building shared context at the expense of using additional resources (people’s time), <em>just in case</em> it ends up leading to an unknown number of better decisions? Clearly this doesn’t always make sense but my claim is that for projects looking to move with speed, on the margin the answer is yes.</p>
<h2>Summary</h2>
<p>Clearly these are very different approaches, and the latter – spending 7x more person-hours in exchange for faster progress and more shared context – is obviously not the right decision in many cases. But my point is this: it is <em>sometimes</em> the right decision. People rarely consider how different these two are, let alone when they should use one versus the other.</p>
<p>When you are trying to do something as fast as possible, you should be thoughtful and deliberate about how your team works. Switching from a mindset of efficiency to one of speed implies significant changes to typical ways of working and comes with additional benefits like more shared context. More teams should seriously consider this approach.</p>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>Do consultants add value?</title>
      <link>https://joelmjensen.com/posts/is-consulting-a-scam/</link>
      <guid>https://joelmjensen.com/posts/is-consulting-a-scam/</guid>
      <pubDate>Tue, 06 Jun 2023 17:00:00 PDT</pubDate>
      
      <description>Explaining the gap between perception and reality of what consulting firms are selling</description>
      
      <content:encoded><![CDATA[<h2>Introduction</h2>
<p>I recently caught up with a former boss (whose company I left to work at <a href="https://en.wikipedia.org/wiki/Bain_%26_Company" target="_blank" rel="noopener"><u>Bain</u></a>) and one of the first things he asked me was: “is consulting a scam?” As much as it sounds facetious, I think it’s a valid question given the <a href="https://www.economist.com/leaders/2022/10/05/do-mckinsey-and-other-consultants-do-anything-useful" target="_blank" rel="noopener"><u>ample</u></a> <a href="https://www.bbc.com/news/business-35220061" target="_blank" rel="noopener"><u>examples</u></a> of <a href="https://www.theguardian.com/careers/what-does-management-consultant-do" target="_blank" rel="noopener"><u>similar</u></a> questions in popular channels. It’s also more complicated than you might think, and it’s a question I’ve asked myself many times, discussed with friends in and out of consulting, and changed my mind on in the last few years.</p>
<h2>My initial impressions</h2>
<p>When I joined Bain in 2019, my position on this question was probably best described as “yes, mostly a scam, but they pay well and lead to good jobs so I’ll take it”. That seems cynical but I was comfortable with the tradeoff. Today, I have a more nuanced view.</p>
<h2>My current view</h2>
<p>To be very clear, the type of consulting I’m talking about here is <a href="https://en.wikipedia.org/wiki/Management_consulting" target="_blank" rel="noopener"><u>management consulting</u></a>, specifically the ‘<a href="https://en.wikipedia.org/wiki/Management_consulting#Big_Three_management_consultancies" target="_blank" rel="noopener"><u>Big Three</u></a>’ – McKinsey, BCG, and Bain. This is the only type I have a decent view of.</p>
<p>Over the ~3 years I worked at Bain, I oscillated back and forward, usually based on the project I was working on. That said, taking my experience in aggregate, and with some time out of the business to reflect, my view is this: consultants are not a scam and provide significant value, <em>but probably not in the ways you expect</em>, and this leads to misunderstandings.</p>
<h2>The value of consulting firms</h2>
<p>The ‘standard’ view of consulting firms is one of two diametrically opposed positions:</p>
<ol>
<li>The inside view: we combine world class talent with a deep understanding of your business and industry and can help you define a winning strategy, outperform competitors, and deliver great outcomes for your shareholders, employees, and customers</li>
<li><a href="https://www.amazon.com/House-Lies-Management-Consultants-Steal/dp/0446576565" target="_blank" rel="noopener"><u>The outside view</u></a>: consultants will steal you watch and charge you to tell you the time. This is a bit dramatic and the more sincerely held view is that companies hire consultants to rubber stamp their own plans. This supposedly insulates them from blame.</li>
</ol>
<p>Unsurprisingly, the answer probably lies somewhere in between. My view is that consultants can, and often do – although not always – add substantial value to companies that hire them, in the following ways:</p>
<h3>1. Best practices</h3>
<p>One thing you learn when you start working on strategy projects for very large companies is that there is no one behind the curtain. That is, they’re mostly fumbling along much like smaller or less successful businesses, but usually with the advantage of existing distribution (e.g., a very large base of entrenched customers). This typically means they have a fairly stable business. Put another way: they were usually once a great upstart, doing something genuinely innovative or distinct, but today, they’re fighting over fractions of a percent of market share and trying not to have their lunch eaten.</p>
<p>What this means is that there’s often a lot of room for simply taking the best version of whatever it is they’re trying to do, and showing them how to replicate it. That makes it sound very simplistic, and it’s usually quite a bit more complicated than that, as you have to take some set of capabilities and shoehorn them into a (typically) much larger, more bureaucratic, more complicated organization. But consulting firms are well placed to do this as they’ve often seen lots of examples globally. It’s not at all uncommon to find yourself speaking to a partner in Europe about how Bank X or Utility Co Y built and rolled out a new business line. People sometimes find this hard to believe, but the truth is that existing distribution counts for a lot in business. Much more than people expect. Large companies can survive a surprisingly long time by eventually offering an okay product. This doesn’t mean ‘winning’. It just means they’re probably growing by something like GDP plus a few points of pricing growth minus a few points of churn, with each variable in that equation having a confidence interval of ±5 percentage points.</p>
<h3>2. Speed; breaking through inertia; side-stepping politics</h3>
<p>This is, in my view, probably the way consulting firms add the most value. In a typical company you have a range of siloed teams, each with their own reporting lines, KPIs, and opinions on what is the most valuable thing to do (unsurprisingly, this is usually closely aligned to whatever <em>they</em> happen to be working on – who knew!). Whenever an ambitious up-and-coming leader wants to get something done, they need to cross a lot of these invisible-but-very-real boundaries to do so. They need to convince executives to bless their idea, then convince people to work on it, technologists to build it, sales teams to push it, customer teams to service it, and so on. This is far less straightforward than people assume, and it requires an inordinate amount of effort to generate enough internal support and momentum for your idea to exceed the activation energy required to get a large company moving in the direction you want.</p>
<p>Consulting firms have a different experience. They get hired by the CEO or a C-suite executive with a mandate to get something done. And because they’re so expensive to hire, there’s usually a fairly short deadline to work to. The hired firm then gets to cross team boundaries, co-opt resources, and wave away previously impossible-to-avoid bureaucracy in order to get the job done. When you call the data analyst in business line X to ask for data to help validate a statement made by a sales manager in business line Y to support the work that will benefit business line Z, you get to do it with the implicit mandate of the CEO. It’s the big business equivalent of being able to make an outrageous request of someone at a wedding and hand-wave away their unwillingness with “<a href="https://www.youtube.com/watch?v=ML4vFoNwfSw" target="_blank" rel="noopener"><u>it’s for the bride</u></a>”.</p>
<p>This ability to quickly get whatever information or access you need across a large business, as well as an audience with any executive, and the ability to do and say things without having to worry (as much) about what something might do to your internal reputation<sup class="footnote-ref"><a href="https://joelmjensen.com/posts/is-consulting-a-scam/#fn1" id="fnref1">[1]</a></sup> is truly game changing. And it’s the reason consulting firms can often finish a project in 4 weeks with 5 people that a team of 20 internal folks have been spinning their wheels on for 6 months.</p>
<h3>3. Talent agglomeration effects</h3>
<p>This is the one I expect is most contentious, but in my experience is simply true. Any given company has a mix of interesting strategy work and often-less-interesting-but-necessary business as usual work. This means that for <em>most</em> roles at a company, you will have some mix of these two things, with the mix skewing towards the less interesting work. In a way, consulting firms have been able to build a sort of Interesting-work-as-a-Service (IWaaS) model, where they come in, do the ‘fun’<sup class="footnote-ref"><a href="https://joelmjensen.com/posts/is-consulting-a-scam/#fn2" id="fnref2">[2]</a></sup> stuff, and then leave you to handle the details and actual work involved. Rather than debate whether the value is created in the strategy or the implementation, I’d just note that the reality is most young people <em>think</em> the strategy is the interesting part that they want to work on. As a result, consulting firms are able to attract, <strong>on average</strong>, higher quality graduates than the typical large company. They can then leverage these high quality people into doing genuinely good work. Large companies on the other hand might struggle to hire these same people because of the perceived lower status/quality of the roles on offer. This creates a self-fulfilling effect where the smarter people do better work, so large companies hire those firms for the interesting work, so more smart people want to work at these consulting firms…</p>
<p>Regardless of your opinions of whether consulting firms are valuable or a scam, I’d argue they are demonstrably able to attract and hire a very high density of smart people. In a way, large companies are then paying to rent this talent pool as needed, rather than trying to compete for them at the same scale in the market.</p>
<h3>4. Communicating</h3>
<p>I won’t spend much time on this other than to say that having a good idea is difficult. Convincing other people – especially executives and boards – that you have a good idea and a plan for implementing it is also difficult. For better or worse, consulting firms have had much more practice at this than almost anyone inside a large firm and are probably better placed to do this. Executives, knowing this, are often inclined to hire one of these firms to come in and help with a board presentation, or a pitch to the CEO for exactly this reason.</p>
<h2>Conclusion</h2>
<p>Consulting firms absolutely have some downsides, but I do believe in these strengths, and for some combination of these reasons, they are very reliably hired again and again to work on some of the most important projects at some of the largest companies in the world. In a sense, it’s not a fair competition – a lot of these benefits simply aren’t available to non-consultants. Regardless, this is the reality, and companies have learned to live with it, as evidenced by the ever growing staff and revenue of the top consulting firms.</p>
<hr />
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>This is only partly true, since a lot of work is with repeat clients, but the point still stands as it is nothing like being an internal employee <a href="https://joelmjensen.com/posts/is-consulting-a-scam/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Note that fun here means perceived as relatively more fun than day-to-day operations <a href="https://joelmjensen.com/posts/is-consulting-a-scam/#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
    </item>
    
    
    <item>
      <title>&#39;The Nature of Technology&#39;</title>
      <link>https://joelmjensen.com/posts/the-nature-of-technology/</link>
      <guid>https://joelmjensen.com/posts/the-nature-of-technology/</guid>
      <pubDate>Wed, 11 May 2022 17:00:00 PDT</pubDate>
      
      <description>Review of Brian Arthur&#39;s fascinating theory of the true nature of technology and the resulting implications</description>
      
      <content:encoded><![CDATA[<h1>Introduction</h1>
<p>Technology is an incredibly important part of the modern world. In fact, it is arguably the single greatest factor in shaping the world we live in. Especially when we consider the magnitude of its impacts over the last ~500 years<sup class="footnote-ref"><a href="https://joelmjensen.com/posts/the-nature-of-technology/#fn1" id="fnref1">[1]</a></sup>:</p>
<p><img src="https://joelmjensen.com/images/gdp-per-capita-maddison.png" alt="GDP per capita over time" /></p>
<p>But for all its importance, I would argue we tend to have very little collective knowledge about it. This often sounds strange to people who rightly point out that we know <em>a lot</em> about a lot of technologies. This is true at the individual level. We have an incredible knowledge of the intricate workings of many individual technologies. But our generalised understanding of what technology is and how it evolves is considerably less developed. Consider the following question: what is technology? Do you have a good definition?</p>
<p>Most people don’t and neither did I until I read Brian Arthur’s excellent book — The Nature of Technology. In it he attempts to articulate a generalised theory of technology, answering the following questions:</p>
<ol>
<li>What is technology?</li>
<li>How do novel technologies evolve?</li>
</ol>
<p>He presents a compelling case which I will explore below.</p>
<h1>What is technology</h1>
<p>Technology pervades every aspect of our lives. From the houses we live in to the cars we drive, the roads we drive on, and the offices we work in. It is quite hard to imagine escaping technology for even a second. Imagine you went into the forest somewhere, alone and without any devices. Have you escaped technology? Not if you are still wearing clothes — the primitive technology used to shield our bodies from the weather. The realisation of just how pervasive technology is makes the question all the more important - what exactly is technology?</p>
<p>Brian argues that technology is ‘a means to fulfil a purpose’. Thinking through that - do our examples above fit nicely within this framing? I would argue yes. You don’t need to do too much work to realise that a car, a road, a building or even clothes present a means to fulfil a purpose. However this framing is quite broad and doesn’t tell us much about how to think about technology — where it comes from, how it evolves and the limits of its reach. I have often heard people say that the CCP’s governance system for China is a technology. This definition would support that but it doesn’t give me much else to work with.</p>
<p>In thinking about <em>properties</em> of technology, the book argues that there are broadly 3 properties shared by all technology:</p>
<ol>
<li><strong>Combination:</strong> All technologies are fresh combinations of what already exists</li>
<li><strong>Recursion:</strong> Technologies are built from sub-components assembled together, which are themselves built from sub-components, all the way down to their elemental base. In this way each technology is recursive</li>
<li><strong>Phenomena:</strong> All technologies leverage some phenomenon or natural regularity</li>
</ol>
<h2>Combination</h2>
<p>Brian argues for what he calls combinatorial evolution — the process by which early technologies form using existing primitive technologies as components. These new technologies subsequently become building blocks for further new technologies. As a result, he argues, over time there is an ever greater supply of base components from which new technologies can form. In this sense technology creates itself.</p>
<h2>Recursion</h2>
<p>Technologies are also recursive in that their components are organised into a central assembly, with sub-assemblies supporting the primary function. Each of the sub-assemblies itself is organised in the same way. This continues down to the elemental base. Brian gives the example of the aircraft fuselage, which is a sub-assembly of the F-35 jet, which is itself a sub-assembly of the broader technology — the squadron of F-35s. A squadron of F-35s in turn is a sub-assembly of an aircraft carrier, and so on. In the opposite direction, you could follow the fuselage down further and further, examining each sub-component which is at some point the primary technology needed to solve a problem, but in this context is merely a part of the whole. In this way, technology has no ‘characteristic’ level — all technologies are available to become a sub-component of a higher-level technology in the future.</p>
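<p>This recursion property can be pictured as a simple tree, where every technology is an assembly whose parts are themselves technologies. A toy sketch (the component names and structure here are my own illustration, not Arthur’s):</p>

```python
# A toy model of the recursion property: every technology is an assembly
# of sub-technologies, down to elemental components.
from dataclasses import dataclass, field

@dataclass
class Technology:
    name: str
    parts: list["Technology"] = field(default_factory=list)

    def depth(self) -> int:
        """Levels of assembly from this technology down to its deepest part."""
        return 1 + max((p.depth() for p in self.parts), default=0)

# Following the book's F-35 example: each level is a sub-assembly of the next.
fuselage = Technology("fuselage", [Technology("skin panel"), Technology("frame")])
f35 = Technology("F-35", [fuselage, Technology("engine")])
squadron = Technology("squadron", [f35])

print(squadron.depth())  # 4: squadron -> F-35 -> fuselage -> skin panel
```

<p>The point of the model is that there is no privileged level: <code>fuselage</code> is the top of its own tree when viewed alone, and merely a part when viewed from the squadron.</p>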
<h2>Phenomenon</h2>
<p>All technologies are born from phenomena. As we examine any technology, we see that its core purpose is to harness some underlying phenomenon or natural regularity in order to fulfil a purpose. Radar harnesses the reflection of radio waves; MRI harnesses nuclear magnetic resonance. Oil refining is based on the phenomenon that different components or fractions of vaporised crude oil condense at different temperatures. Rocketry is based on a number of phenomena, including that expelling mass at high velocity in a given direction generates thrust in the opposite direction. Early humans leveraged phenomena in the natural habitat — the sharpness of obsidian and the momentum of stones in motion.</p>
<p>Once we realise this to be true, the value of <a href="https://noahpinion.substack.com/p/america-needs-more-basic-research?s=r" target="_blank" rel="noopener"><u>basic research</u></a> becomes more apparent. Investment in discovering and unlocking new knowledge in the form of harnessable phenomena is clearly beneficial. Every new phenomenon potentially unlocks multiple new technologies, each of which has the potential to generate reinforcing demand (which we will discuss further below). Every new phenomenon or regularity is added to the ever-growing pool from which inventors and entrepreneurs can pull as they attempt to link what is known to what is needed by society.</p>
<p>Joel Mokyr makes a similar point in his excellent book <a href="https://www.amazon.com/Culture-Growth-Origins-Schumpeter-Lectures/dp/0691168881" target="_blank" rel="noopener"><u>A Culture of Growth</u></a> when explaining why the steam engine could not have been created in China during the 18th century, despite China being similarly advanced in many relevant areas of its society. The answer lies in the existence of an epistemic base from which a would-be inventor needed to pull. The European culture which promoted the conjecture and criticism of ideas led to the advancement of many scientific theories. One such advance came from Evangelista Torricelli — a student of Galileo — who first proposed that there was an atmosphere surrounding the earth. <a href="https://en.wikipedia.org/wiki/Newcomen_atmospheric_engine" target="_blank" rel="noopener"><u>Thomas Newcomen’s early steam engine</u></a> was an atmospheric engine which condensed steam drawn into the engine cylinder, creating a partial vacuum which allowed the atmospheric pressure to push the piston. Without knowledge of the atmosphere, which was highly unlikely to have arisen in a society that demonised the criticism of long-held beliefs, it is almost unthinkable that one could come to hold the knowledge required to build such a machine.</p>
<p>Arthur points out that one implication of this is that if we were to take our technologies to a location where the underlying natural regularities were different, they would need to be rethought. A simple example is space, where the absence of one of the most common and well-known regularities — gravity — requires an alternative method for doing almost everything, including a task as simple as <a href="https://www.youtube.com/watch?v=jYicNMK8PqI" target="_blank" rel="noopener"><u>drinking water</u></a>.</p>
<p>So where does this all lead us? Arthur summarises that it suggests a new and improved definition of technology: technology is one or more phenomena captured and put to use. This is a description of technology at its most intrinsic level, since what makes a technology work is the core principle upon which it is built.</p>
<h2>Where do ‘social technologies’ fit in?</h2>
<p>One of the questions I had when reading this book was: where do things like monetary systems, legal systems, and systems of government fit in? As you might have already realised, these all fit the initial broader definition of ‘a means to fulfil a purpose’. As we have just discovered, however, a more complete definition refers to the concept of harnessing some phenomenon. The conclusion Arthur draws is that all of these things are technologies, but the underlying phenomena they harness are behavioural rather than physical. The monetary system leverages the social phenomenon that we value things that are scarce (e.g., gold) and that we trust a system when we believe other people trust the system (e.g., fiat money). These often <em>feel</em> like less of a technology than something leveraging physical regularities, but that has more to do with the concrete nature of physical phenomena vs the abstract nature of behavioural principles.</p>
<h2>What does this all mean?</h2>
<p>Perhaps this feels obvious but all of this tells us that technologies are invented. They are brought into reality by pulling together sub-components from within a domain, and doing so recursively to solve every problem that is met along the path to completion. At each level some phenomenon is leveraged in order to solve the problems required to will the new technology into existence. Arthur presents the analogy of a programming language, where each individual technology is to the domain of its origin as a computer program is to its language. In this sense the inventor can be said to instantiate their new technology from within the domain that they are working. And when they’re done, that technology becomes yet another sub-component within the domain, another primitive that can be leveraged by the next inventor looking to instantiate their idea into existence.</p>
<h1>How do novel technologies evolve; what is innovation</h1>
<p>This understanding of technology allows us to investigate the question of how novel technologies evolve with some more structure than is typical. We have a shared language and understanding of technology from which to reason. From Arthur:</p>
<blockquote>
<p><em>Innovations in history may often be improvements in a given technology—a better way to architect domes, a more efficient steam engine. But the significant ones are new domainings. They are the expressing of a given purpose in a different set of components, as when the provision of power changed from being expressed in waterwheel technology to being expressed in steam technology.</em></p>
</blockquote>
<p>What Arthur is arguing is that there are two forms of technological ‘innovation’ that are possible:</p>
<ol>
<li><strong>Improvements in the use of a given phenomenon:</strong> we can think of this as a more efficient means to fulfil a purpose. Or, using our more refined definition: an improvement in the efficiency with which we harness and put to use a given phenomenon.</li>
<li><strong>‘Re-domaining’ of a given purpose:</strong> we can think of this as a new means to fulfil a purpose, or the harnessing and use of new phenomena to achieve our purpose.</li>
</ol>
<p>His argument is that the second type — re-domaining — is the true form of innovation. Again, from Arthur:</p>
<blockquote>
<p><em>Consider: In the 1970s computer printing was carried out by line-printers, essentially an electronic typing machine with a set of fixed characters. With the coming of the laser printer, computers printed by directing a laser to “paint” text on a xerographic drum, a different principle. In the 1920s, aircraft were powered by a piston-and-propeller arrangement. With the coming of the turbojet, they were powered by gas turbine engines using reactive thrust, a different principle. In the 1940s, arithmetic calculation was carried out by electromechanical means. With the coming of the computer, it was accomplished by electronic relay circuits, a different principle. In all these cases a new technology came into being—the laser printer, the turbojet, the computer—from a new or different base principle. A change in principle then separates out invention—the process by which radically novel technologies arise—from standard engineering. It also allows us to draw crucial distinctions between mere improvement and real origination. We can say that the Boeing 747 is a development of the 707 technology, not an invention. It improves an existing technology but uses no overall new principle. And we can say that Watt’s steam engine is an improvement of Newcomen’s. It provides for a new component—a separate condenser—but uses no new principle.</em></p>
</blockquote>
<p>This is by no means consensus. Edmund Phelps’ view, expressed in his book <a href="https://www.amazon.com/Mass-Flourishing-Grassroots-Innovation-Challenge/dp/0691158983" target="_blank" rel="noopener"><u>Mass Flourishing</u></a>, is that inventions like Newcomen’s steam engine are overrated. He argues that the constant improvements — like Watt’s much-improved steam engine — generate the true value of technological innovation. While it seems true that most innovation is type 1 (improvements in the use of existing phenomena), the truly transformative innovations largely appear to be type 2. If we think about some of the most significant innovations of the last 500 years — cars, planes, new forms of energy, and computing — each was created as a new means to fulfil a need, and each was instantiated from a new domain, using new phenomena. Further, without new phenomena to harness, any form of mass-flourishing-style incrementalism would eventually reach diminishing returns. Put another way, the existing phenomena are a fixed factor in the economy, and like any fixed factor they will eventually reach diminishing returns unless we continue to expand them with further discovery.</p>
<p>One potential challenge I had with Arthur’s type 1 vs type 2 definition of innovation was where innovation involving software would sit. At first it might feel like all software is one domain, leveraging a consistent set of phenomena. This would imply that no ‘true’ innovation happens in the field of software, just improvement. However, I think the way to think about this is that true innovation occurs when a given purpose can be re-domained into the world of software. This again feels consistent with reality. Most of the transformative power of software has come from expanding the set of problems we can solve with computers, essentially leveraging a different set of phenomena (those used in computing) to achieve many of the things we previously did more manually. If we think about the tech giants in our society, each of them solves a problem that was previously solved without computers — searching for information, commerce, socialising, and communicating. While they were not always the first to conceive of using software to solve their respective problems, they were the eventual winners by virtue of having some mix of the best execution and the right timing.</p>
<h1>How technology begets more technology</h1>
<p>Arthur is a techno-optimist, believing that the ever-growing pool of technologies provides a greater variety of sub-components from which the next generation of inventors can mix and match to bring their ideas to fruition. He also believes there are other mechanisms by which technology is self-reinforcing. In total, he suggests four mechanisms by which this happens:</p>
<ol>
<li><strong>Growing the pool of sub-components:</strong> as discussed above</li>
<li><strong>Creating new demand niches:</strong> every time we create a new technology, there is the potential for new needs to arise. Once we discover how to diagnose diabetes, the need for insulin arises.</li>
<li><strong>Servicing the technology:</strong> each new technology sets up an opportunity to support it, whether through manufacturing, distributing, or maintaining it. Arthur gives the example of the automobile, which created many ancillary needs: assembly line manufacturing, paved roads, refined gasoline. Gasoline in turn leads to refineries, importing crude oil, and oil exploration technology. We see again here a hint of the recursiveness inherent in technology.</li>
<li><strong>Solving the problems inherent in technology:</strong> technology often has unintended consequences, or unlocks one door only to reveal three more that we need to invent our way through. In the early 1700s Britain needed more coal, so it dug deeper mines. The ability to mine further underground was itself a technology, and it created a problem: at these depths the mines regularly flooded with water. This problem led to the invention of Newcomen’s steam engine, which was primarily used to pump water out of mines, helping to expand the coal industry.</li>
</ol>
<p>There are many more examples for each of these mechanisms, but the important takeaway is this: technological innovation is self-perpetuating.</p>
<h1>Summary</h1>
<p>What are the implications of Arthur’s theory? The primary one that springs to mind concerns phenomena. If phenomena are the building blocks of all technology, then we would expect a burst of new inventions some period after a new family of phenomena is discovered — an early cluster of inventions leveraging the newly discovered regularities. Perhaps this explains the burst of invention in the 18th century, as early scientific pioneers like Galileo and Newton observed and unlocked the first wave of phenomena which could be leveraged by inventors.</p>
<p>What then, have we seen from the discovery of the phenomena underpinning computation? <a href="https://en.wikipedia.org/wiki/The_Great_Stagnation" target="_blank" rel="noopener"><u>Many</u></a> <a href="https://en.wikipedia.org/wiki/Productivity_paradox" target="_blank" rel="noopener"><u>argue</u></a> that computers have done little in the way of innovation outside of communication technology. What appears promising to me here are the recent developments in fields like biology, such as <a href="https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology" target="_blank" rel="noopener"><u>DeepMind’s AlphaFold</u></a>. If we believe that true innovation comes through ‘re-domaining’ an existing need, perhaps the issue was not inherent in computers; the productive use cases merely lagged their invention. What if we simply needed to reach a critical threshold of sophistication — through inventions such as neural nets and the advancement of <a href="https://en.wikipedia.org/wiki/Moore%27s_law" target="_blank" rel="noopener"><u>Moore’s Law</u></a> — in order for computers to begin to ‘re-domain’ existing needs? Today, computers are beginning to play a larger role in science: predicting the structure of proteins, <a href="https://www.forbes.com/sites/brucelee/2020/01/13/the-future-of-clinical-trials-here-is-a-simulation-model-of-the-heart/?sh=6ba59e13aa49" target="_blank" rel="noopener"><u>accelerating clinical trials through simulation</u></a>, <a href="https://www.eikontx.com/technology" target="_blank" rel="noopener"><u>enabling scientists to see and track individual cells</u></a>, and <a href="https://www.youtube.com/watch?v=YmkZKiJh95g" target="_blank" rel="noopener"><u>allowing us to control devices with our mind</u></a>. I am hopeful that we have reached a threshold of computing power, adoption, and tooling sufficient to unlock the next generation of scientific discovery.</p>
<hr />
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>I understand that growth here could also be driven by more productive social practices such as increased trade, but even those are largely enabled by the development of technology such as ships <a href="https://joelmjensen.com/posts/the-nature-of-technology/#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
]]></content:encoded>
    </item>
    
  </channel>
</rss>
