Editor's Note: While "more powerful models" have become the default answer in the industry, this article offers a different perspective: what truly creates a 10x, 100x, or even 1000x productivity gap is not the model itself, but the entire system design built around it.
The author of this article is Garry Tan, the current President and CEO of Y Combinator, who has long been deeply involved in AI and the early-stage startup ecosystem. He proposes the "fat skills + thin harness" framework, breaking down AI applications into key components such as skills, execution harness, context routing, task division, and knowledge compression.
Within this system, the model is no longer the entirety of capability but merely an execution unit; what truly determines the output quality is how you organize context, solidify processes, and delineate the boundary between "judgment" and "computation."
More importantly, this method is not just conceptual; it has been validated in real-world scenarios: faced with the task of processing and matching data for thousands of entrepreneurs, a system achieved capabilities close to a human analyst through a "read-organize-judge-write back" loop, and continuously self-optimized without rewriting code. This kind of "learning system" transforms AI from a one-time tool into infrastructure with compound effects.
Thus, the core reminder from the article becomes clear: in the AI era, the efficiency gap no longer depends on whether you use the most advanced model, but on whether you build a system capable of continuously accumulating capabilities and evolving automatically.
Below is the original text:
Steve Yegge says that people using AI programming agents are "10x to 100x more efficient than engineers who only use Cursor and chat tools to write code, roughly 1000x more efficient than Google engineers in 2005."
This is not an exaggeration. I've seen it with my own eyes, and I've experienced it myself. But when people hear such a gap, they often attribute it to the wrong reasons: a stronger model, a smarter Claude, more parameters.
In reality, the person achieving a 2x efficiency boost and the one achieving a 100x boost are using the same model. The difference isn't in "intelligence," but in "architecture," and this architecture is simple enough to be written on a card.
The Harness (Execution Framework) Is the Product Itself.
On March 31, 2026, an accident at Anthropic led to the full source code of Claude Code being published on npm—totaling 512,000 lines. I read through it all. This confirmed what I've been saying at YC (Y Combinator): the real secret isn't the model, but the "layer that wraps the model."
Real-time code repository context, prompt caching, tools designed for specific tasks, compressing redundant context as much as possible, structured session memory, sub-agents running in parallel—none of these make the model smarter. But they give the model the "right context" at the "right time," while avoiding being flooded with irrelevant information.
This layer of "wrapping" is called the harness (execution framework). And the question all AI builders should really ask is: what should go into the harness, and what should stay out?
This question actually has a very specific answer—I call it: thin harness, fat skills.
Five Definitions
The bottleneck has never been the model's intelligence. Models have long known how to reason, synthesize information, and write code.
They fail because they don't understand your data—your schema, your conventions, the specific shape of your problem. And these five definitions are precisely meant to solve this problem.
1. Skill file
A skill file is a reusable markdown document that teaches the model "how to do something." Note, it's not telling it "what to do"—that part is provided by the user. The skill file provides the process.
The key point most people miss is: a skill file is actually like a method call. It can receive parameters. You can call it with different parameters. The same set of processes, because different parameters are passed in, can exhibit completely different capabilities.
For example, there is a skill called /investigate. It contains seven steps, among them: define the data scope, build a timeline, diarize each document, synthesize and summarize, argue both sides, and cite sources. It receives three parameters: TARGET, QUESTION, and DATASET.
If you point it at a security scientist and 2.1 million forensic emails, it becomes a medical research analyst, judging whether a whistleblower was suppressed.
If you point it at a shell company and FEC (Federal Election Commission) filing documents, it becomes a forensic investigator, tracking coordinated political donations.
It's the same skill. The same seven steps. The same markdown file. A skill describes a judgment process, and what grounds it in the real world are the parameters passed during the call.
This isn't prompt engineering; it's software design, except that here markdown is the programming language and human judgment is the runtime environment. In fact, markdown is even better suited to encapsulating capabilities than rigid source code, because it describes processes, judgments, and context, which is precisely the language models "understand" best.
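The "skill as method call" idea can be sketched in a few lines. This is only an illustration under stated assumptions: the skill text is inlined rather than read from a markdown file, the `$NAME` placeholder syntax and the `invoke_skill` helper are hypothetical, and only the steps the article names are listed.

```python
from string import Template

# Hypothetical skill text; a real skill would live on disk as a markdown file.
INVESTIGATE_SKILL = Template("""\
# /investigate
Target: $TARGET
Question: $QUESTION
Dataset: $DATASET

Steps (as named in the article):
- Define the data scope.
- Build a timeline.
- Diarize each document.
- Synthesize and summarize.
- Argue both sides.
- Cite sources.
""")


def invoke_skill(skill: Template, **params: str) -> str:
    """Bind call-time parameters to a skill, like arguments to a method."""
    # substitute() raises KeyError when a parameter is missing, so a
    # half-specified call fails loudly instead of reaching the model.
    return skill.substitute(**params)


whistleblower_prompt = invoke_skill(
    INVESTIGATE_SKILL,
    TARGET="a security scientist",
    QUESTION="Was a whistleblower suppressed?",
    DATASET="2.1 million forensic emails",
)
donations_prompt = invoke_skill(
    INVESTIGATE_SKILL,
    TARGET="a shell company",
    QUESTION="Were political donations coordinated?",
    DATASET="FEC filings",
)
```

Both prompts share the same process text; only the bound parameters differ, which is the point the two examples above make.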
2. Harness (Execution Framework)
The harness is the program layer that drives the LLM's operation. It only does four things: run the model in a loop, read/write your files, manage context, and enforce security constraints.
That's it. This is "thin."
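To make "thin" concrete, here is a minimal sketch of a harness that does only those four things. It is not how Claude Code or any real harness is implemented: the action tuples `("read", path)`, `("write", path, text)`, and `("done", answer)`, the `run_harness` name, and the scripted stand-in model are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def run_harness(model, workdir, task, max_steps=10, max_context=20,
                allow_write=False):
    """Drive `model` until it answers or the step budget runs out."""
    root = Path(workdir).resolve()
    context = [task]
    for _ in range(max_steps):                      # 1. run the model in a loop
        action = model(context)
        kind = action[0]
        if kind == "done":
            return action[1]
        target = (root / action[1]).resolve()
        if not target.is_relative_to(root):         # 4. security: stay in workdir
            note = f"error: {action[1]} is outside the workspace"
        elif kind == "read":                        # 2. read files
            note = target.read_text()
        elif kind == "write" and allow_write:       # 2. write files, if allowed
            target.write_text(action[2])
            note = f"wrote {action[1]}"
        else:
            note = f"error: {kind} is not permitted"
        # 3. manage context: keep the task plus the most recent entries
        context = [task] + (context[1:] + [note])[-max_context:]
    return "stopped: step limit reached"


def scripted_model(context):
    # Stand-in for an LLM: read notes.txt, then answer with what it saw.
    if len(context) == 1:
        return ("read", "notes.txt")
    return ("done", f"summary of: {context[-1]}")


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("ship on Friday")
    result = run_harness(scripted_model, tmp, "Summarize the notes")
```

Everything that requires judgment stays out of this loop; it would live in the skill text handed to the model.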
The anti-pattern is: fat harness, thin skills.
You must have seen this kind of thing: 40+ tool definitions, with descriptions eating up half the context window; an all-powerful God-tool, taking 2 to 5 seconds per MCP round trip; or, wrapping every REST API endpoint as a separate tool. The result is triple the token usage, triple the latency, and triple the failure rate.
The ideal approach is to use purpose-built, fast, and narrowly focused tools.
For example, use a Playwright CLI where each browser operation takes 100 milliseconds, not a Chrome MCP that takes 15 seconds for one screenshot → find → click → wait → read sequence. The former is 75x faster.
There's no need for software to be "over-engineered to the point of bloat" anymore. What you should do is: only build what you truly need, and nothing more.
3. Resolver
A resolver is essentially a context routing table. When task type X appears, prioritize loading document Y. Skills tell the model "how to do"; resolvers tell the model "when to load what."
For example, a developer changes a prompt. Without a resolver, they might just deploy after the change. With a resolver, the model first reads docs/EVALS.md. And this document says: run the evaluation suite first, compare the scores before and after; if accuracy drops by more than 2%, roll back and investigate the cause. This developer might not even have known an evaluation suite existed. The resolver loaded the correct context at the correct moment.
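The rule in that document is simple enough to state as code. A minimal sketch, assuming accuracy is a fraction between 0 and 1 and that "2%" means two percentage points; the function name `should_roll_back` is hypothetical.

```python
def should_roll_back(accuracy_before: float, accuracy_after: float,
                     max_drop: float = 0.02) -> bool:
    """Return True when a prompt change cost more accuracy than allowed."""
    # Round to absorb floating-point noise such as 0.91 - 0.89 != 0.02.
    drop = round(accuracy_before - accuracy_after, 6)
    return drop > max_drop


needs_rollback = should_roll_back(0.91, 0.88)   # a 3-point drop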
Claude Code has a built-in resolver. Each skill has a description field, and the model automatically matches user intent with the skill's description. You don't even need to remember if the /ship skill exists—the description itself is the resolver.
Frankly: my previous CLAUDE.md was a full 20,000 lines. All quirks, all patterns, all lessons I'd ever encountered, all stuffed in. Absurd. The model's attention quality noticeably declined. Claude Code even told me directly to cut it down.
The final fix was about 200 lines—just keeping a few document pointers. When a specific document is truly needed, let the resolver load it at the critical moment. This way, the 20,000 lines of knowledge are still available on demand, but don't pollute the context window.
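A resolver in this sense can be sketched as a small routing table. This is an illustration only: keyword matching stands in for the model's own matching of intent against descriptions, the trigger words are invented, and `docs/SHIPPING.md` is a hypothetical file (only `docs/EVALS.md` appears in the text).

```python
# Each entry: (hypothetical trigger keywords, document to load on a match).
RESOLVER = [
    ({"prompt", "eval"}, "docs/EVALS.md"),
    ({"deploy", "release"}, "docs/SHIPPING.md"),  # hypothetical document
]


def resolve(task: str) -> list[str]:
    """Return the documents worth loading for this task, in table order."""
    words = set(task.lower().split())
    return [doc for triggers, doc in RESOLVER if triggers & words]


docs_for_prompt_change = resolve("Change the prompt for onboarding")
docs_for_typo_fix = resolve("Fix a typo in the README")
```

The typo fix loads nothing, so the context window stays clean; the prompt change pulls in the evaluation instructions at the moment they matter.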
4. Latent & Deterministic
In your system, every step belongs to one category or the other. And confusing these two is the most common error in agent design.
· Latent space is where intelligence resides. The model reads, understands, judges, and makes decisions here. This handles: judgment, synthesis, pattern recognition.
· Deterministic is where reliability resides. Same input, always the same output. SQL queries, compiled code, arithmetic operations belong on this side.
An LLM can help you seat 8 people for a dinner party, considering each person's personality and social relationships. But ask it to seat 800 people, and it will confidently generate a "seemingly reasonable, actually completely wrong" seating chart. Because that's no longer a problem for the latent space, but a deterministic problem—a combinatorial optimization problem—forced into the latent space.
The worst systems put work on the wrong side of this dividing line. The best systems enforce the boundary ruthlessly.
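One practical way to hold that line is to let the model propose and let deterministic code verify. The sketch below checks a proposed seating chart; the function name and the specific checks are assumptions for illustration, not part of the system described here.

```python
from collections import Counter


def seating_errors(guests, tables, seats_per_table):
    """Deterministically check a proposed seating chart; return any problems."""
    errors = []
    counts = Counter(name for table in tables for name in table)
    for name in guests:
        if counts[name] == 0:
            errors.append(f"{name} has no seat")
        elif counts[name] > 1:
            errors.append(f"{name} is seated {counts[name]} times")
    for name in sorted(set(counts) - set(guests)):
        errors.append(f"{name} is not on the guest list")
    for number, table in enumerate(tables, start=1):
        if len(table) > seats_per_table:
            errors.append(f"table {number} is over capacity")
    return errors


problems = seating_errors(
    guests=["Ana", "Bo", "Cy"],
    tables=[["Ana", "Bo"], ["Ana"]],
    seats_per_table=2,
)
```

A check like this runs the same way for 8 guests or 800, which is exactly the property the latent side cannot offer.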
5. Diarization (Document Organization / Topic Profiling)
The diarization step is what truly gives AI value for real knowledge work.
It means: the model reads all materials related to a topic and then writes a structured profile. It condenses the judgments from dozens or even hundreds of documents onto one page.
This is not something an SQL query can produce. This is not something a RAG pipeline can produce. The model must actually read, hold conflicting information in its mind simultaneously, notice what changed and when, and synthesize this into structured intelligence.
This is the difference between a database query and an analyst briefing.
This Architecture
These five concepts can be combined into a very simple three-layer architecture.
· The top layer is fat skills: processes written in markdown, carrying judgment, methodology, and domain knowledge. 90% of the value is in this layer.
· The middle is a thin CLI harness: about 200 lines of code, takes JSON input, outputs text, read-only by default.
· The bottom layer is your application system: QueryDB, ReadDoc, Search, Timeline—these are the deterministic infrastructure.
The core principle is directional: push "intelligence" up into the skills as much as possible; push "execution" down into deterministic tools as much as possible; keep the harness thin and light.
The result is: whenever model capabilities improve, all skills automatically become stronger; while the underlying deterministic system remains stable and reliable.
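The middle layer's contract (JSON in, text out, read-only by default) might look like the sketch below. `ReadDoc` is one of the tool names listed above, but its body here is a stub; `WriteDoc`, the in-memory `DOCS` store, and the request shape are hypothetical.

```python
import json

# Hypothetical in-memory store standing in for the application layer.
DOCS = {"founder-42": "Builds CLI tools in Brooklyn."}


def read_doc(doc_id: str) -> str:
    return DOCS.get(doc_id, "not found")


def write_doc(doc_id: str, text: str) -> str:
    DOCS[doc_id] = text
    return "ok"


# Tool name -> (function, whether it mutates state).
TOOLS = {
    "ReadDoc": (read_doc, False),
    "WriteDoc": (write_doc, True),   # hypothetical mutating tool
}


def dispatch(request_json: str, allow_write: bool = False) -> str:
    """Take a JSON request, run one deterministic tool, return text."""
    request = json.loads(request_json)
    func, mutates = TOOLS[request["tool"]]
    if mutates and not allow_write:
        return f"refused: {request['tool']} needs write access"
    return func(**request["args"])


read_result = dispatch('{"tool": "ReadDoc", "args": {"doc_id": "founder-42"}}')
write_result = dispatch(
    '{"tool": "WriteDoc", "args": {"doc_id": "founder-42", "text": "x"}}'
)
```

The dispatcher contains no judgment at all; it only routes a request to deterministic code and enforces the read-only default.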
The Learning System
Let me use a real system we are building at YC to show how these five definitions work together.
July 2026, Chase Center. Startup School has 6000 founders attending. Everyone has structured application materials, questionnaire responses, transcripts of 1:1 conversations with mentors, and public signals: posts on X, GitHub commit history, Claude Code usage (which can indicate their development speed).
The traditional approach is: a 15-person project team reads applications one by one, makes intuitive judgments, and updates a spreadsheet.
This method works at a scale of 200 people but completely fails at 6000. No human can hold so many profiles in their mind and realize: the three strongest candidates in the AI agent infrastructure direction are a dev tools founder in Lagos, a compliance entrepreneur in Singapore, and a CLI tool developer in Brooklyn—and they described the same pain point using completely different expressions in different 1:1 conversations.
The model can do it. Here's how:
Enrichment
There is a skill called /enrich-founder that pulls all data sources, performs enrichment, diarization, and flags discrepancies between "what the founder says" and "what they actually do."
The underlying deterministic system handles: SQL queries, GitHub data, browser testing of Demo URLs, social signal scraping, CrustData queries, etc. A scheduled task runs daily. 6000 founder profiles are always up to date.
The output of diarization captures information that keyword searches completely miss:
This kind of "stated vs. actual behavior" discrepancy requires simultaneously reading GitHub commit history, application materials, and conversation transcripts, and integrating them mentally. No embedding similarity search can do this, nor can keyword filtering. The model must read completely and then make a judgment. (This is exactly the kind of task that belongs in the latent space!)
Matching
This is where "skill = method call" shows its power.
The same matching skill, called three times, can produce completely different strategies:
/match-breakout: processes 1200 people, clusters by domain, 30 people per group (embedding + deterministic assignment)
/match-lunch: processes 600 people, cross-domain "serendipitous matching," 8 people per table with no repeats—LLM generates themes first, then deterministic algorithm assigns seats
/match-live: processes participants live, in real time, using nearest-neighbor search over embeddings; completes 1-to-1 matching within 200ms, excluding people already met
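The deterministic half of a /match-live style lookup can be sketched as a nearest-neighbor search that skips people already met. The two-dimensional embeddings below are toy values invented for illustration, and a brute-force scan like this makes no claim about meeting the 200ms budget at scale.

```python
import math
from typing import Optional


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_live(person: str, embeddings: dict, already_met: set) -> Optional[str]:
    """Return the most similar participant the person has not met yet."""
    candidates = [
        name for name in embeddings
        if name != person and name not in already_met
    ]
    if not candidates:
        return None
    return max(candidates,
               key=lambda name: cosine(embeddings[person], embeddings[name]))


# Toy embeddings; real ones would come from an embedding model.
toy_embeddings = {
    "Santos": [1.0, 0.0],
    "Oram": [0.9, 0.1],
    "Kim": [0.0, 1.0],
}
first_match = match_live("Santos", toy_embeddings, already_met=set())
second_match = match_live("Santos", toy_embeddings, already_met={"Oram"})
```

Once Oram has been met, the exclusion set forces the next-best candidate, so repeated calls never pair the same two people twice.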
And the model can make judgments that traditional clustering algorithms cannot:
"Santos and Oram are both in AI infrastructure, but not competitors—Santos does cost attribution, Oram does orchestration. Should be in the same group."
"Kim's application said developer tools, but the 1:1 conversation shows they're doing SOC2 compliance automation. Should be re-categorized to FinTech / RegTech."
This re-categorization is something embeddings completely miss. The model must read the entire profile.
Learning Loop
After the event, an /improve skill reads the NPS survey results, diarizes the "just okay" responses (not the bad ones, but the "almost good" ones), and extracts patterns.
Then, it proposes new rules and writes them back into the matching skill:
When a participant says "AI infrastructure," but 80%+ of their code is billing modules:
→ Categorize as FinTech, not AI Infra
When two people in a group already know each other:
→ Reduce matching weight; prioritize introducing new relationships
These rules are written back to the skill file. They take effect automatically on the next run. The skill is "rewriting itself." At the July event, "just okay" ratings were 12%; at the next event, they dropped to 4%.
The skill file learned what "just okay" means, and the system got better without anyone rewriting code.
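The write-back step itself is small. A minimal sketch, assuming learned rules are appended as bullets under a "## Learned rules" heading at the end of the skill file; that heading, the file name, and the `write_back_rule` helper are hypothetical, and the rule text would come from the model rather than being hard-coded.

```python
import tempfile
from pathlib import Path

RULES_HEADING = "## Learned rules"   # hypothetical section name


def write_back_rule(skill_path: Path, condition: str, action: str) -> bool:
    """Append a rule to the skill file; return False if it is already there."""
    text = skill_path.read_text() if skill_path.exists() else ""
    rule = f"- When {condition}: {action}"
    if rule in text:
        return False                 # idempotent: never duplicate a rule
    if RULES_HEADING not in text:
        text = text.rstrip("\n") + f"\n\n{RULES_HEADING}\n"
    skill_path.write_text(text.rstrip("\n") + "\n" + rule + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "match.md"
    skill.write_text("# /match\nCluster by domain.\n")
    condition = "two people in a group already know each other"
    action = "reduce matching weight"
    added = write_back_rule(skill, condition, action)
    repeated = write_back_rule(skill, condition, action)
    final_text = skill.read_text()
```

Because the rule lands in the markdown the model reads on its next run, no application code changes when the system "learns."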
This pattern can be migrated to any domain:
Retrieve → Read → Diarize → Count → Synthesize
Then: Investigate → Survey → Diarize → Rewrite skill
If you ask what the most valuable loop in 2026 is, it's this one. It can be applied to almost all knowledge work scenarios.
Skills Are Permanent Upgrades
I recently posted an instruction for OpenClaw on X, and the response was bigger than expected:
This content received thousands of likes and over two thousand bookmarks. Many thought it was a prompt engineering trick.
Actually, it's not; it's the architecture described earlier. Every skill you write is a permanent upgrade to the system. It doesn't degrade, doesn't forget. It runs automatically at 3 AM. And when the next generation of models is released, all skills instantly become stronger—the latent judgment capabilities improve, while the deterministic parts remain stable and reliable.
This is the source of the 100x efficiency Yegge talks about.
Not a smarter model, but: Fat Skills, Thin Harness, and the discipline to solidify everything into capabilities.
The system grows with compound interest. Build it once, run it long-term.