Editor's Note: As generative AI rapidly integrates into software engineering, industry sentiment is shifting from "awe at capabilities" to "anxiety about efficiency." The fear of not writing fast enough, not using AI enough, or not automating thoroughly enough creates constant pressure to avoid being left behind. But as coding Agents truly enter production environments, more practical issues emerge: errors are amplified, complexity spirals out of control, systems become increasingly incomprehensible, and efficiency gains do not translate proportionally into quality improvements.
Based on firsthand practice, this article offers a sober reflection on the current "agentic coding" frenzy. The author points out that Agents do not learn from mistakes like humans do; without bottlenecks and feedback mechanisms, minor issues are rapidly magnified. Furthermore, in complex codebases, their local perspective and limited recall capabilities exacerbate the chaos of the system structure. The essence of these problems lies not in the technology itself, but in humans, driven by anxiety, prematurely relinquishing judgment and control.
Therefore, rather than succumbing to the anxiety of "must we fully embrace AI," it's better to recalibrate the relationship between humans and tools: let Agents handle local, controllable tasks, while firmly keeping system architecture, quality control, and key decision-making in our own hands. In this process, "slowing down" becomes a capability—it means you still understand the system, can make trade-offs, and still retain a sense of control over your work.
In an era of constantly evolving tools, what is truly scarce might not be faster generation capabilities, but the judgment to handle complexity and the fortitude to make choices between efficiency and quality.
The original text follows:
About a year ago, coding Agents that could genuinely help you "complete entire projects from start to finish" began to appear. Earlier tools like Aider and the early Cursor existed, but they were more like assistants than "agents." The new generation of tools is extremely attractive, and many people spent much of their free time building all the projects they had always wanted to do but never had time for.
I think that's fine in itself. Working on things in your free time is inherently enjoyable, and most of the time you don't really need to worry about code quality and maintainability. It also gives you a path to learn new tech stacks.
During the Christmas holidays, Anthropic and OpenAI even gave out some "free credits," sucking people in like a slot machine. For many, this was the first real experience of the magic of "Agents writing code." More and more people got involved.
Now, coding Agents are also starting to enter production codebases. Twelve months on, we are beginning to see the consequences of this "progress." Here are my current thoughts.
Everything is Broken
While the evidence is mostly anecdotal, software today feels "fragile and ready to break." 98% availability is becoming the norm rather than the exception, even for large services. User interfaces are filled with outrageous bugs of the kind that QA teams should catch at a glance.
I admit this situation existed before Agents appeared. But now, the problem is clearly accelerating.
We can't see what's happening inside companies, but occasionally information leaks out, like the rumored "AI-induced AWS outage." Amazon Web Services was quick to "correct" the story, but then immediately launched a 90-day remediation plan internally.
Satya Nadella (Microsoft CEO) has also recently emphasized that more and more code within the company is written by AI. While there's no direct evidence, there is a feeling that Windows quality is declining. Even Microsoft's own blog posts seem to tacitly acknowledge this.
Companies that claim "100% of the product code is AI-generated" are almost always shipping the worst products you can imagine. No offense, but memory leaks measured in GB, chaotic UIs, incomplete features, frequent crashes... these are hardly the "quality endorsements" they think they are, let alone positive examples of "letting the Agent do everything for you."
Privately, you hear the same thing more and more often, from large companies and small teams alike: they have been backed into a corner by "Agent-written code." No code reviews, design decisions handed to Agents, features piled on that nobody needs. The outcome is predictably bad.
Why We Shouldn't Use Agents This Way
We have almost abandoned all engineering discipline and subjective judgment, instead falling into an "addictive" way of working: the sole goal is to generate the most code in the shortest time, with no consideration for the consequences.
You're building an orchestration layer to command an army of automated Agents. You install Beads, completely unaware that it is essentially "malware" that is almost impossible to uninstall, just because the internet says "everyone is doing it." If you don't, you're "not gonna make it" (ngmi).
You're consuming yourself in a constant "recursive loop."
Look—Anthropic used a group of Agents to make a C compiler. It has problems now, but the next-gen model will fix it, right?
Look again—Cursor used a large group of Agents to make a browser. It's basically unusable now and needs manual intervention from time to time, but the next-gen model will handle it, right?
"Distributed," "divide and conquer," "autonomous systems," "lights-out factory," "solving software in six months," "SaaS is dead, my grandma just built a Shopify with Claw"...
These narratives sound exciting.
Sure, this approach might "still work" for your side project that almost no one uses (including yourself). Maybe, just maybe, there exists a genius who can use this method to create a non-garbage, actually-used software product. If you are that person, I sincerely admire you.
But at least in my circle of developer acquaintances, I haven't seen a case where this method actually works. Of course, maybe we're all just too incompetent.
Errors Compound Without Learning, Without Bottlenecks, with Delayed Explosions
The problem with Agents is: they make mistakes. That's fine in itself; humans make mistakes too. They might be correctness errors, easy to identify and fix, and adding a regression test makes it more stable. Or they might be code smells that linters can't catch: an unused method here, an unreasonable type there, some duplicate code, etc. Individually, these are harmless; human developers make these minor mistakes too.
But "machines" are not people. After making the same mistake a few times, humans usually learn not to repeat it, either because someone scolds them into awareness or because they genuinely improve their process.
Agents lack this learning capability, at least by default. They will repeat the same mistakes over and over, and might even "create" wonderful combinations of different errors based on training data.
You can certainly try to "train" it: write rules in AGENTS.md telling it not to make this mistake; design a complex memory system for it to query historical errors and best practices. This can work for certain specific types of problems. But the prerequisite is—you must first observe it making this error.
The more critical difference is: humans are a bottleneck, Agents are not.
A human cannot spit out twenty thousand lines of code in a few hours. Even with a non-trivial error rate, only a limited number of errors can be introduced per day, and their accumulation is slow. Usually, when the "pain from errors" accumulates to a certain level, humans (instinctively averse to pain) will stop to fix them. Or the person is replaced, and someone else fixes it. In short, problems get handled.
But when you use a whole orchestrated army of Agents, there is no bottleneck and no "pain sensation." These originally trivial minor errors compound at an unsustainable rate. You have been removed from the loop, unaware that these seemingly harmless small issues have grown into a behemoth. By the time you truly feel the pain, it's often too late.
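The bottleneck argument can be sketched as back-of-the-envelope arithmetic. The model below is deliberately crude and every number in it is hypothetical: it assumes a fixed defect density and a fixed number of defects a human can notice and fix per day, and it only adds defects up linearly. It does not even model defects interacting with each other, which would make the picture worse.

```python
def unfixed_defects(lines_per_day: int, defects_per_kloc: int,
                    fixes_per_day: int, days: int) -> int:
    """Defects still in the codebase after `days`, given a fixed fix capacity."""
    backlog = 0
    for _ in range(days):
        # New defects introduced by today's output.
        backlog += lines_per_day * defects_per_kloc // 1000
        # A human can only notice and fix so many per day.
        backlog = max(0, backlog - fixes_per_day)
    return backlog


# Hypothetical numbers, for illustration only.
human = unfixed_defects(lines_per_day=200, defects_per_kloc=10,
                        fixes_per_day=2, days=30)
agents = unfixed_defects(lines_per_day=20_000, defects_per_kloc=10,
                         fixes_per_day=2, days=30)

print(human)   # 0
print(agents)  # 5940
```

With the same defect density and the same human fix capacity, the only thing that changes between the two runs is the output rate. The human-paced backlog stays at zero, while the agent-paced backlog grows by 198 defects every day.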
Until one day, you want to add a new feature and find the current system architecture (essentially a pile of errors) cannot support the change; or users start complaining frantically because the latest release has problems, or even lost data.
That's when you realize: you can no longer trust this code.
Worse, the thousands of unit tests, snapshot tests, and end-to-end tests you had the Agent generate are also no longer trustworthy. The only way left to determine if "the system is working properly" is manual testing.
Congratulations, you've screwed yourself (and the company).
Purveyors of Complexity
You have completely lost track of what's happening in the system because you handed control to the Agent. And Agents, by nature, are "purveyors of complexity." They have seen tons of terrible architectural decisions in their training data, and these patterns are reinforced during their RL process. Letting them design the system leads to predictable results.
What you end up with is: an extremely complex system, a mishmash of poor imitations of "industry best practices," which you failed to constrain before the problems got out of hand.
But the problem goes further. Your Agents do not share execution context with each other, cannot see the entire codebase, and do not understand the decisions you or other Agents made previously. Therefore, their decisions are always "local."
This directly leads to the problems mentioned earlier: massive code duplication, structures abstracted for abstraction's sake, various inconsistencies. These problems compound, eventually forming an irredeemably complex system.
This is actually very similar to human-written enterprise codebases. Except that kind of complexity is usually the result of years of accumulation: the pain is distributed across many people, no single person reaches the "must fix" breaking point, and the organization itself has high tolerance, so complexity "co-evolves" with the organization.
But in a human + Agent combination, this process is greatly accelerated. Two people, plus a bunch of Agents, can reach this level of complexity in weeks.
Agentic Search Has Low Recall
You might pin your hopes on the Agent to "clean up the mess," to help you refactor, optimize, and clean the system. But the problem is: they can't do it anymore.
Because the codebase is too large, the complexity too high, and they can only ever see locally. This isn't just about the context window being too small, or long-context mechanisms failing against millions of lines of code. The problem is more subtle.
Before the Agent attempts to fix the system, it must first find all the code that needs modification, as well as existing implementations that can be reused. This step we call agentic search.
How the Agent does this depends on the tools you give it: it could be Bash + ripgrep, a queryable code index, an LSP service, a vector database...
But no matter the tool, the essence is the same: the larger the codebase, the lower the recall. And low recall means: the Agent cannot find all relevant code, and therefore cannot make correct modifications.
This is also why those minor "code smell" errors appeared in the first place; it didn't find the existing implementation, so it reinvented the wheel, introducing inconsistency. Eventually, these problems spread and compound, blooming into an extremely complex "flower of rot."
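Recall here has its usual information-retrieval meaning: of all the code that is relevant to a change, what fraction did the search actually surface? A minimal sketch, with hypothetical file names standing in for a real codebase:

```python
def recall(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of the relevant items that the search actually found."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & retrieved) / len(relevant)


# Hypothetical file names, for illustration only.
relevant = {"auth/session.py", "auth/token.py", "api/login.py", "utils/time.py"}
retrieved = {"auth/session.py", "api/login.py", "docs/README.md"}

missed = sorted(relevant - retrieved)

print(recall(relevant, retrieved))  # 0.5
print(missed)                       # ['auth/token.py', 'utils/time.py']
```

Every file in `missed` is a place where the Agent may either leave a stale code path untouched or reimplement something that already exists.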
So how do we avoid all this?
How We Should Collaborate with Agents (For Now)
Coding Agents are like sirens, luring you in with extremely fast code generation speed and that "intermittent yet occasionally stunning" intelligence. They can often complete simple tasks with astonishing speed and high quality. The real problems start when you get the idea—"This is so powerful, computer, do my work for me!"
There's nothing wrong with assigning tasks to Agents per se. Good Agent tasks typically have several characteristics: the scope can be well-defined, not requiring understanding of the entire system; the task is closed-loop, meaning the Agent can evaluate the result itself; the output is not on the critical path, just some temporary tool or internal software, not affecting real users or revenue; or you just need a "rubber duck" to aid thinking—essentially taking your ideas and colliding them with the compressed knowledge of the internet and synthetic data.
If these conditions are met, then it's a task suitable for an Agent, provided that you, the human, remain the final quality gatekeeper.
For example, using Andrej Karpathy's auto-research method to optimize application startup time? Great. But you must be clear that the code it spits out is absolutely not production-ready. Auto-research works because you give it an evaluation function, allowing it to optimize around a specific metric (like startup time or loss). But this evaluation function only covers a very narrow dimension. The Agent will happily ignore every metric the evaluation function does not capture, such as code quality and system complexity, and in some cases even correctness, if the evaluation function itself is flawed.
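To make the narrowness of an evaluation function concrete, here is a minimal sketch. It is not Karpathy's actual setup; the `Candidate` type, the candidate names, and the timings are all hypothetical. The first evaluation function scores startup time only, so a change that breaks correctness wins. The second treats failing tests as disqualifying.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    startup_ms: float    # the only field the naive evaluation looks at
    passes_tests: bool   # invisible to the naive evaluation


def evaluate(c: Candidate) -> float:
    """Lower is better. Measures startup time and nothing else."""
    return c.startup_ms


def evaluate_guarded(c: Candidate) -> float:
    """Same metric, but a candidate that fails the tests can never win."""
    return c.startup_ms if c.passes_tests else float("inf")


# Hypothetical candidates and timings, for illustration only.
candidates = [
    Candidate("baseline", 850.0, True),
    Candidate("lazy-imports", 620.0, True),
    Candidate("skip-config-validation", 410.0, False),
]

best_naive = min(candidates, key=evaluate)
best_guarded = min(candidates, key=evaluate_guarded)

print(best_naive.name)    # skip-config-validation
print(best_guarded.name)  # lazy-imports
```

Even the guarded version only covers what the tests cover. Code quality and complexity remain outside the metric, which is why a human still has to review the result.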
The core idea is simple: let Agents do the boring things that don't teach you anything new, or the exploratory work you never had time to try. Then you evaluate the results, pick out the parts that are actually reasonable and correct, and complete the final implementation. Of course, you can also use an Agent for this final step.
But what I want to emphasize more is: really, slow down a bit.
Give yourself time to think about what you are actually doing and why. Give yourself a chance to say no: "No, we don't need this." Set a clear upper limit for the Agent on how much code it is allowed to generate per day, an amount that should match your actual ability to review it. Everything that determines the "overall shape" of the system, such as architecture and APIs, should be written by hand. You can use autocomplete to keep some of the feel of hand-written code, or pair program with an Agent, but the key is: you must be in the code.
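One way to make the daily limit concrete is to derive it from review capacity rather than from what the Agent can produce. The sketch below is only an illustration: the function names and the numbers (two hours of review, 200 lines per hour) are assumptions, and your own review speed will differ.

```python
def daily_line_budget(review_hours: float, lines_reviewed_per_hour: int) -> int:
    """Lines of Agent output you can realistically review in one day."""
    return int(review_hours * lines_reviewed_per_hour)


def should_pause_agent(lines_generated_today: int, budget: int) -> bool:
    """True once the Agent has produced as much as you can review today."""
    return lines_generated_today >= budget


# Hypothetical numbers, for illustration only.
budget = daily_line_budget(review_hours=2.0, lines_reviewed_per_hour=200)

print(budget)                           # 400
print(should_pause_agent(350, budget))  # False
print(should_pause_agent(400, budget))  # True
```

The point of the budget is that it is set by the human side of the loop. If review capacity drops, the limit drops with it.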
Because, writing code yourself, or watching it being built step by step, brings a sense of "friction." It is precisely this friction that makes you clearer about what you want to do, how the system works, and the overall "feel." This is where experience and "taste" come into play, and this is precisely what the most advanced models currently cannot replace. Slowing down, enduring a bit of friction, is exactly how you learn and grow.
In the end, what you get will be a system that is still maintainable—at least no worse than before Agents appeared. Yes, past systems weren't perfect either. But your users will thank you because your product is "usable," not a pile of slapped-together garbage.
You will build fewer features, but you will build them correctly. Learning to say "no" is a capability in itself. You can also sleep soundly because you at least still know what's happening in the system; you still hold the initiative. It is this understanding that allows you to compensate for the recall problems of agentic search, making the Agent's output more reliable and requiring less patching.
When the system has problems, you can step in and fix it; when the design was unsound from the start, you can understand the issue and refactor it into a better form. Whether there's an Agent or not isn't really that important.
All of this requires discipline. All of this depends on people.







