AI Agent Outputs Garbage? The Problem Is You're Not Willing to Burn Enough Tokens

marsbit · Published 2026-03-23 · Last updated 2026-03-23

Abstract

The core argument is that the quality of an AI Agent's output is directly proportional to the number of tokens invested in the process. More tokens lead to fewer errors, as they allow for deeper reasoning, multiple independent attempts, self-critique from fresh contexts, and verification through testing. This approach can solve problems of scale and complexity but fails when facing novel problems not present in the model's training data. For such novel challenges, human domain knowledge and guidance are essential. Two practical, immediate solutions are proposed: implementing an automatic review cycle (WAIT) for the Agent to repeatedly critique and fix its work, and establishing frequent verification checkpoints (VERIFY) where a separate Agent validates outputs to catch errors early. The key takeaway is that insufficient token investment is often the primary reason for poor Agent performance, not the underlying framework.

Author: Systematic Long Short

Compiled by: Deep Tide TechFlow

Deep Tide Intro: The core argument of this article is just one sentence: The quality of an AI Agent's output is directly proportional to the number of Tokens you invest.

The author isn't speaking in general theoretical terms; instead, they provide two specific methods you can start using today and clearly define the boundary where throwing more Tokens won't help—the "novelty problem."

For readers currently using Agents to write code or run workflows, the information density and practicality are very high.

Introduction

Alright, you have to admit the title is quite eye-catching—but seriously, it's no joke.

In 2023, when we were using LLMs to run production code, people around us were stunned because the common belief at the time was that LLMs could only produce unusable garbage. But we knew something others didn't: the output quality of an Agent is a function of the number of Tokens you invest. It's that simple.

Run a few experiments yourself and you'll see. Have an Agent complete a complex, somewhat niche programming task, for example implementing a constrained convex optimization algorithm from scratch. Run it first at the lowest thinking level, then again at the medium and high levels. After each run, switch to the highest thinking level and have the Agent review its own code to see how many bugs it can find. You'll see it plainly: the number of bugs decreases monotonically as the number of Tokens invested increases.

This isn't hard to understand, right?

More Tokens = Fewer errors. You can take this logic a step further; this is essentially the (simplified) core idea behind code review products. In a completely new context, invest a massive number of Tokens (for example, have it parse the code line by line, judging whether each line has a bug)—this can basically catch the vast majority, if not all, bugs. This process can be repeated ten times, a hundred times, each time examining the codebase from a "different angle," and you can eventually unearth all the bugs.
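A minimal sketch of the "line by line, from different angles" idea, under stated assumptions: each "lens" below is a hypothetical plain-Python check standing in for a fresh-context model call, and the names `review_line_by_line` and `lenses` are illustrative, not taken from any review product.

```python
from typing import Callable, Dict, Set, Tuple

Finding = Tuple[int, str]  # (1-based line number, name of the lens that flagged it)


def review_line_by_line(code: str, lenses: Dict[str, Callable[[str], bool]]) -> Set[Finding]:
    """Run one full pass over the code per lens and merge all findings."""
    findings: Set[Finding] = set()
    for lens_name, flags_line in lenses.items():  # one pass per "angle"
        for number, line in enumerate(code.splitlines(), start=1):
            if flags_line(line):
                findings.add((number, lens_name))
    return findings


# Hypothetical lenses; in practice each would be a separate model call.
lenses = {
    "mutable default": lambda line: line.lstrip().startswith("def ") and "=[]" in line,
    "bare except": lambda line: line.strip() == "except:",
}

sample = "\n".join([
    "def collect(item, bucket=[]):",
    "    try:",
    "        bucket.append(item)",
    "    except:",
    "        pass",
    "    return bucket",
])

findings = sorted(review_line_by_line(sample, lenses))
print(findings)
```

Each added lens costs another full pass over the code, which is exactly the Token-for-coverage trade described above.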

There's another piece of empirical support for the view that "burning more Tokens improves Agent quality": the teams that claim to use Agents to write code from start to finish and push it directly to production are either the foundational model providers themselves or extremely well-funded companies.

So, if you're still struggling to get production-level code from your Agent—to be blunt, the problem lies with you. Or rather, with your wallet.

How to Tell If You're Burning Enough Tokens

I wrote an entire article arguing that the problem definitely isn't your framework (harness) and that "keeping it simple" can still produce excellent results, and I still stand by that view. You read that article and followed the advice, but were still greatly disappointed by the Agent's output. You sent me a DM and saw that I read it but didn't reply.

This article is the reply.

Most of the time, your Agent performs poorly and can't solve the problem simply because you're not burning enough Tokens.

How many Tokens are needed to solve a problem depends entirely on the problem's scale, complexity, and novelty.

"What's 2+2?" doesn't require many Tokens.

"Write me a bot that scans all markets between Polymarket and Kalshi, finds markets that are semantically similar and should settle around the same event, sets no-arbitrage boundaries, and automatically trades with low latency whenever an arbitrage opportunity arises"—this requires burning a huge pile of Tokens.

We found something interesting in practice.

If you invest enough Tokens to handle problems caused by scale and complexity, the Agent *will* solve them, no matter what. In other words, if you want to build something extremely complex, with many components and lines of code, as long as you throw enough Tokens at these problems, they will eventually be completely resolved.

There is one small but important exception.

Your problem cannot be too novel. At this stage, no amount of Tokens can solve the "novelty" problem. Enough Tokens can reduce errors from complexity to zero, but they cannot make an Agent invent something it doesn't know out of thin air.

This conclusion actually came as a relief to us.

We spent enormous effort and burned a lot, a lot, a whole lot of Tokens trying to see if an Agent could reconstruct an institutional investment process with almost no guidance. This was partly to figure out how many years we (as quantitative researchers) have before being completely replaced by AI. It turned out the Agent couldn't get anywhere close to a decent institutional investment process. We believe this is partly because the models have never seen such a thing; that is, institutional investment processes simply don't exist in the training data.

So, if your problem is novel, don't count on solving it by stacking Tokens. You need to guide the exploration process yourself. But once you've defined the implementation plan, you can confidently stack Tokens for execution—no matter how large the codebase or how complex the components, it's not a problem.

Here's a simple heuristic: the Token budget should grow proportionally with the number of lines of code.
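As a rough illustration of that heuristic, the sketch below scales a budget linearly with code size. The constant `tokens_per_line` is a hypothetical placeholder, not a number from this article; you would calibrate it against your own runs.

```python
def token_budget(lines_of_code: int, tokens_per_line: int = 500) -> int:
    """Linear budget: doubling the code size doubles the Tokens allowed.

    tokens_per_line is a hypothetical constant, not a figure from the article.
    """
    if lines_of_code < 0:
        raise ValueError("lines_of_code must be non-negative")
    return lines_of_code * tokens_per_line


small = token_budget(200)     # a single module
large = token_budget(20_000)  # a multi-component codebase
print(small, large)
```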

What Are the Extra Tokens Actually Doing?

In practice, additional Tokens typically improve the Agent's engineering quality in the following ways:

Allowing it to spend more time reasoning in the same attempt, giving it a chance to discover flawed logic itself. Deeper reasoning = better planning = higher probability of success on the first try.

Allowing it to make multiple independent attempts, exploring different solution paths. Some paths are better than others. Allowing more than one attempt lets it choose the best one.

Similarly, more independent planning attempts allow it to abandon weak directions and keep the most promising ones.

More Tokens allow it to critique its previous work with a fresh context, giving it a chance to improve instead of being stuck in a certain "reasoning inertia."

And, of course, my favorite: more Tokens mean it can use tests and tools for verification. Actually running the code to see if it works is the most reliable way to confirm the answer is correct.
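Two of the mechanisms above, independent attempts and verification by running the code, combine naturally into a best-of-N selection. The following is a sketch under assumptions: the candidate strings stand in for separate Agent attempts, and `passed_tests` is a toy scorer invented for this example. Real Agent output should only be executed inside a sandbox.

```python
from typing import Callable, Sequence


def best_of_n(attempts: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate among independent attempts."""
    if not attempts:
        raise ValueError("need at least one attempt")
    return max(attempts, key=score)


def passed_tests(candidate: str) -> float:
    """Toy scorer: actually run the candidate and count passing checks."""
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # only safe here because the inputs are fixed strings
        add = namespace["add"]
        checks = [add(1, 2) == 3, add(-1, 1) == 0, add(0, 0) == 0]
    except Exception:
        return 0.0  # code that does not even run scores zero
    return float(sum(checks))


# Stand-ins for three independent Agent attempts at the same task.
candidates = [
    "def add(a, b): return a - b",  # wrong path, passes one check by luck
    "def add(a, b): return a + b",  # correct
    "def add(a, b) return a + b",   # syntax error
]

best = best_of_n(candidates, passed_tests)
print(best)
```

The scorer is where the Tokens go: the more checks it runs per candidate, the less likely a lucky wrong path is to win.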

This logic works because engineering failures of Agents are not random. They are almost always due to choosing the wrong path too early, not checking if this path actually works (early on), or not having enough budget to recover and backtrack after discovering a mistake.

That's the story. Tokens are literally the decision quality you buy. Think of it like research work: if you ask a person to answer a difficult question on the spot, the quality of the answer decreases as time pressure increases.

Research, at its core, is the work that produces the answer in the first place. Humans spend biological time to produce better answers; Agents spend compute time to produce better answers.

How to Improve Your Agent

You might still be skeptical, but there are many papers supporting this, and honestly, the very existence of the "reasoning" adjustment knob is all the proof you need.

One paper I particularly like: researchers trained on a small, carefully curated set of reasoning examples, then used a method to force the model to keep thinking when it wanted to stop—specifically by appending "Wait" where it wanted to stop. This single change raised a certain benchmark from 50% to 57%.
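The shape of that trick can be sketched at the prompt level. This is not the paper's implementation: `generate` is a hypothetical text-completion callable, and `toy_model` is a scripted stand-in so the example runs without any model.

```python
from typing import Callable


def force_more_thinking(generate: Callable[[str], str], prompt: str, extra_rounds: int = 2) -> str:
    """Each time the model stops, append "Wait" and let it continue."""
    transcript = prompt + generate(prompt)
    for _ in range(extra_rounds):
        transcript += "\nWait"
        transcript += generate(transcript)
    return transcript


def toy_model(text: str) -> str:
    """Scripted stand-in: its answer depends on how often it was told to wait."""
    scripted = [
        " 17 * 3 = 41.",
        ", let me recheck: 17 * 3 = 51.",
        ", 51 confirmed.",
    ]
    waits = text.count("\nWait")
    return scripted[min(waits, len(scripted) - 1)]


result = force_more_thinking(toy_model, "Q: 17 * 3?\n", extra_rounds=2)
print(result)
```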

I want to be as clear as possible: if you've been complaining that the code written by your Agent is mediocre, a single pass at the highest thinking level is likely still not enough for you.

I'll give you two very simple solutions.

Simple Method One: WAIT

The simplest thing you can start doing today: set up an automatic loop—after building, have the Agent review its work N times with a fresh context, fixing any issues found each time.

If you find this simple trick improves your Agent's engineering results, then you at least understand that your problem is just a matter of Token quantity—welcome to the Token burning club.
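A minimal sketch of such a loop, assuming two hypothetical callables: `review`, which sees only the current work (that is what "fresh context" means here), and `fix`, which applies the reported issues. The toy versions below are stand-ins for Agent calls.

```python
from typing import Callable, List

Reviewer = Callable[[str], List[str]]    # fresh-context review: work in, issues out
Fixer = Callable[[str, List[str]], str]  # work plus issues in, revised work out


def wait_loop(work: str, review: Reviewer, fix: Fixer, n_rounds: int = 3) -> str:
    """Review the work n_rounds times, fixing whatever each review finds."""
    for _ in range(n_rounds):
        issues = review(work)  # only the work is passed in, no earlier conversation
        if issues:
            work = fix(work, issues)
    return work


def toy_review(work: str) -> List[str]:
    """Stand-in reviewer that notices at most one problem per pass."""
    return [line for line in work.splitlines() if "BUG" in line][:1]


def toy_fix(work: str, issues: List[str]) -> str:
    """Stand-in fixer that repairs exactly the reported lines."""
    for issue in issues:
        work = work.replace(issue, issue.replace("BUG", "ok"), 1)
    return work


draft = "step one BUG\nstep two BUG\nstep three"
after_one = wait_loop(draft, toy_review, toy_fix, n_rounds=1)
after_three = wait_loop(draft, toy_review, toy_fix, n_rounds=3)
print(after_one)
print(after_three)
```

One round leaves a bug behind and three rounds clear both, because the reviewer only catches part of the problems on any single pass.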

Simple Method Two: VERIFY

Have the Agent verify its own work early and often. Write tests to prove that the chosen path actually works. This is especially useful for highly complex, deeply nested projects—a function might be called by many other downstream functions. Catching errors upstream can save you a lot of subsequent compute time (Tokens). So, if possible, set up "verification checkpoints" throughout the entire build process.

Finished writing a piece? The main Agent says it's done? Have a second Agent verify it. Independent lines of reasoning can compensate for each other's systematic biases.
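Verification checkpoints can be sketched as a build pipeline that refuses to continue past a failed check. Everything here is illustrative: each `verify` callable stands in for a test suite or a second Agent, and the step names are made up for the example.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Step = Tuple[str, Callable[[State], None], Callable[[State], bool]]  # name, build, verify


def build_with_checkpoints(steps: List[Step]) -> Tuple[State, List[str]]:
    """Run each build step, then its verifier; stop at the first failure."""
    state: State = {}
    log: List[str] = []
    for name, build, verify in steps:
        build(state)
        if not verify(state):
            log.append(f"{name}: FAILED")
            break  # stop before downstream steps build on a broken result
        log.append(f"{name}: ok")
    return state, log


def build_parse(state: State) -> None:
    state["nums"] = [1, 2, 3]


def build_total(state: State) -> None:
    state["total"] = sum(state["nums"]) - 1  # deliberate upstream bug


def build_report(state: State) -> None:
    state["report"] = f"total={state['total']}"


steps: List[Step] = [
    ("parse", build_parse, lambda s: len(s["nums"]) == 3),
    ("total", build_total, lambda s: s["total"] == 6),
    ("report", build_report, lambda s: "total=" in s["report"]),
]

state, log = build_with_checkpoints(steps)
print(log)
```

The report step never runs, so no Tokens are spent building on the broken total.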

That's basically it. I could write a lot more on this topic, but I believe just realizing these two things and implementing them well can solve 95% of your problems. I firmly believe in doing simple things extremely well, then adding complexity as needed.

I mentioned that "novelty" is a problem that can't be solved with Tokens, and I want to emphasize it again because you will eventually hit this pitfall and come crying to me saying stacking Tokens didn't work.

When the problem you want to solve isn't in the training set, *you* are the one who really needs to provide the solution. Therefore, domain expertise remains extremely important.

Related Questions

Q: What is the core argument of the article regarding AI Agent output quality?

A: The core argument is that the quality of an AI Agent's output is directly proportional to the number of tokens you are willing to invest in the process.

Q: According to the article, what is the one type of problem that cannot be solved by simply using more tokens?

A: Problems that are 'novel' or not present in the model's training data cannot be solved by any amount of tokens; they require human guidance and domain expertise.

Q: What are the two simple methods suggested in the article to immediately improve an AI Agent's performance?

A: The two simple methods are: 1. WAIT - Set up an automatic loop for the Agent to review its work multiple times with a fresh context and fix any issues found. 2. VERIFY - Have the Agent (or a second one) verify its work early and often by writing tests to prove the chosen path works.

Q: How does the article suggest thinking about the relationship between tokens and decision quality?

A: The article suggests thinking of tokens as literally 'buying' decision quality: just as human researchers spend biological time to produce better answers, an AI Agent spends computational time (tokens) to produce better answers.

Q: What heuristic does the article provide for determining a sufficient token budget for a task?

A: The article provides a simple heuristic: the token budget should grow proportionally with the number of lines of code required for the task.
