Dwarkesh Patel, a famous tech podcast host in Silicon Valley, recently posed a question: What will be the next paradigm for AI training?

Dwarkesh Patel is a tech podcast host and writer who has rapidly gained popularity in Silicon Valley in recent years. At just 25 years old, he has already entered the core circles of AI discussion with the Dwarkesh Podcast. His interview subjects include AI and tech luminaries such as Ilya Sutskever, Andrej Karpathy, Dario Amodei, Demis Hassabis, and Mark Zuckerberg. TIME included him in the 2024 TIME100 AI list, stating that his podcast has become essential listening for many AI practitioners.

In his latest podcast episode, he summarized the direction leading AI labs are currently betting on with a single keyword: RLVR, or Reinforcement Learning with Verifiable Rewards.
Simply put, it lets models learn by trial and error across a large number of tasks whose correctness can be judged automatically, training them to plan, correct their own errors, iterate, and execute over long horizons. The rapid progress in fields like coding and mathematics today largely stems from this approach.
But what Dwarkesh really wants to explore is: If the next generation of AI relies solely on this kind of 'verifiable task training,' will it be enough?
His answer: Probably not.
Because a task being 'verifiable' is not sufficient; it must also be 'grindable.'
The key concept here is grindability. In the context of AI training, it refers to whether a task can be practiced over and over, or 'rolled out' at massive scale.
Coding tasks are typical grindable tasks. You can prepare a software repository, a bug to fix, a test case, then replicate the same environment into thousands of copies, letting thousands of agents attempt it simultaneously. Whoever passes the test scores points. This process is parallelizable, reproducible, resettable, and particularly suitable for RLVR.
Math problems are similar. Answers can be verified, and the training environment is easy to replicate.
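To make the idea of a grindable, verifiable task concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything Dwarkesh or a lab describes: the "policy" is just a softmax over three hypothetical strategies, the "environment" is a trivially resettable arithmetic problem, and the update is a plain REINFORCE step with a mean-reward baseline. The rollouts run in a loop here, but since each one is independent they could just as well run in parallel.

```python
import math
import random

# Hypothetical candidate strategies the toy policy can choose between.
STRATEGIES = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def verify(a, b, answer):
    """Verifiable reward: 1.0 if the answer passes the automatic check, else 0.0."""
    return 1.0 if answer == a + b else 0.0


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def train(num_steps=200, rollouts_per_step=64, lr=0.5, seed=0):
    """REINFORCE on a resettable task; returns the final strategy probabilities."""
    rng = random.Random(seed)
    logits = {name: 0.0 for name in STRATEGIES}
    for _ in range(num_steps):
        probs = softmax(logits)
        names = list(probs)
        weights = [probs[n] for n in names]
        samples = []
        # Each rollout gets a fresh copy of the task: cheap to reset and replicate.
        for _ in range(rollouts_per_step):
            # Operands start at 3 so that "mul" never matches "add" by accident.
            a, b = rng.randint(3, 9), rng.randint(3, 9)
            choice = rng.choices(names, weights=weights)[0]
            reward = verify(a, b, STRATEGIES[choice](a, b))
            samples.append((choice, reward))
        baseline = sum(r for _, r in samples) / len(samples)
        grads = {n: 0.0 for n in names}
        for choice, reward in samples:
            advantage = reward - baseline
            for n in names:
                indicator = 1.0 if n == choice else 0.0
                grads[n] += advantage * (indicator - probs[n])
        for n in names:
            logits[n] += lr * grads[n] / rollouts_per_step
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

The point of the sketch is the shape of the loop: reset, attempt, verify, update. Anything that cannot be reset and verified this cheaply does not fit into it directly.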
But Dwarkesh asks a very interesting question: Why has AI progressed more slowly at 'computer use' than at coding and math?
On the surface, computer use is also verifiable: whether an item was purchased, an event venue was booked, or a tax form was submitted can all be checked. The problem is that these tasks are hard to replicate and replay at scale. You cannot have a thousand agents run the same checkout flow on Amazon simultaneously and repeatedly, because real websites detect bots, ban accounts, and change state. You could, of course, clone applications like Slack, Gmail, or Amazon to build simulators, but for now that remains expensive engineering that scales poorly.
Dwarkesh points out: AI progresses quickly in a particular domain not just because answers are verifiable there, but because the domain can be packaged into a replicable, replayable, parallelizable training environment.
This also explains why code, math, and game-like tasks are natural breeding grounds for RLVR, while many real-world tasks struggle to fit directly into this training paradigm.
Next, he pushes the question into the more complex real world.
- What if we want to train an AI to start a company from scratch?
- What if we want to train it to win a lawsuit?
- What if we want to train it to make steady profits in the market, or help a candidate win an election?
These tasks, of course, also have outcomes. Whether a company succeeds, a lawsuit is won, a trade is profitable, or an election is secured can all be judged in the end.
But they share the same problems: feedback is too slow, there are too many variables, the world cannot be reset, and the task cannot be replicated a thousand times in a data center.
A startup may last for years. A political campaign depends on specific districts, candidates, voter sentiment, media environment, and chance events. A legal case also cannot be copied from the same starting point into a thousand parallel universes for different agents to experiment with.
In reinforcement learning terms, such settings resemble reset-free, non-stationary environments: they cannot easily be reset, and the environment itself keeps changing.
Dwarkesh therefore asks: Can agents trained by RLVR in verifiable, grindable environments truly generalize to these real-world tasks?
This is not a question that can be answered with slogans; it's an empirical question.
Optimists would say that if RLVR environments are sufficiently numerous and complex, models will eventually learn general agent capabilities. The planning and trial-and-error abilities honed in code, math, web navigation, and tool use will ultimately transfer to domains like entrepreneurship, organizational management, politics, law, and scientific research.
But Dwarkesh remains skeptical of this.
Because in the real world, the most valuable knowledge often does not appear in clear, verifiable, repeatable forms. It may come from vague customer feedback, a failed meeting, an implicit organizational process, a failure mode that only emerges during real tasks. For models to learn these things, they cannot rely solely on 'grinding problems'; they must possess true sample efficiency.
This leads the discussion to the most crucial point of the entire article: learning back to the weights.
Today's large models are already very good at in-context learning. They can read a lot of material in a long context, understand a project background, and temporarily adapt to a user's or organization's needs. The problem is, this learning mostly stays within the context window. After a session ends, the model doesn't necessarily truly 'remember.'
Dwarkesh believes this is a huge waste.
Because the most valuable training signals for a model actually appear after deployment, when it is used by real users, enters real organizations, participates in real tasks, and makes real mistakes. There it sees how companies actually operate, what people actually do with it, where failures often occur, and which suggestions simply don't work in reality.
But if these experiences cannot be condensed back into the model's weights, then it's just a temporary adaptation within one session, not long-term growth in capability.
He uses human learning as an analogy: People don't become capable by memorizing verbatim everything that happens every day. An employee becomes useful after six months on the job not because they remember every email and meeting note, but because they compress those experiences into judgment, intuition, process understanding, and problem patterns.
Models should be the same.
True continual learning is not infinitely expanding the KV cache, nor stuffing all historical records into the context, but distilling a small amount of truly useful knowledge from real experiences and compressing it into the weights.
This is precisely the problem Dwarkesh believes the next training paradigm must solve.
So how would this actually be done?
He mentions a direction being discussed: on-policy self-distillation, or OPSD.
Roughly understood: Let a model that has already accumulated extensive experience in long sessions act as a 'senior employee' or teacher; then train the base model so that even without this full context, it can make judgments similar to the teacher's.
In other words, distill what the model learned through context during a real task back into the model's own weights.
This is different from ordinary SFT (Supervised Fine-Tuning). The most naive SFT might simply have the model predict tokens that appeared in the session, equivalent to making it recite the entire work log. But that's not effective learning. What's truly important isn't remembering all the details, but extracting the key insights that help the model perform better next time.
The advantage of OPSD is that it doesn't necessarily require an externally verifiable reward. As long as the model can learn useful things within the context, the 'post-learning model' can be used as a teacher, moving the base model closer to it.
Furthermore, compared to ordinary RL which only has a final reward, OPSD can provide denser supervision signals. It can compare the probability distribution differences between teacher and student at the token level, thus compressing the sparse experience from a real task into smaller, more precise weight updates.
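The token-level comparison described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: the teacher distributions are hand-written numbers standing in for "the same model with the full session in context," the student is a bare table of logits rather than a network, and the loss is the forward KL divergence KL(teacher || student), chosen here because its gradient with respect to the student logits is simply `student_probs - teacher_probs`. The source does not say which divergence OPSD uses, and a real on-policy setup would score positions sampled from the student's own rollouts.

```python
import math

# Assumed teacher next-token distributions at two positions (3-token toy vocabulary).
TEACHER_PROBS = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_kl(teacher_probs, student_logits):
    """Sum of per-token KL(teacher || student): one loss term per position."""
    return sum(
        kl(t, softmax(s)) for t, s in zip(teacher_probs, student_logits)
    )


def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient-descent step on the summed per-token forward KL."""
    updated = []
    for s_logits, t_probs in zip(student_logits, teacher_probs):
        s_probs = softmax(s_logits)
        # d/d(logit_i) of KL(teacher || student) is s_probs[i] - t_probs[i].
        updated.append(
            [l - lr * (sp - tp) for l, sp, tp in zip(s_logits, s_probs, t_probs)]
        )
    return updated


if __name__ == "__main__":
    student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # starts out uniform
    print("KL before:", total_kl(TEACHER_PROBS, student))
    for _ in range(200):
        student = distill_step(student, TEACHER_PROBS)
    print("KL after:", total_kl(TEACHER_PROBS, student))
```

Note how every position contributes its own loss term. That is the sense in which the signal is denser than a single end-of-episode reward.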
Besides OPSD, Dwarkesh proposes another direction: dreaming.
Here, 'dreaming' refers to the AI constructing its own simulation environment based on real-world observations, then repeatedly practicing, trying strategies, and reinforcing effective behaviors within it.
This sounds a lot like model-based RL in the reinforcement learning tradition, or like what Sutton has long emphasized: agents accumulating experience through environmental interaction. The difference is that Dwarkesh places it in the context of large models and real deployment.
For example, after an AI observes a certain business process in a real company, it doesn't just write a summary. Instead, it spends significant computation constructing a 'game-like simulation environment' of that process. Then it tests different communication strategies, execution paths, and project approaches inside, seeing what is more likely to succeed. Finally, it compresses the experience gained from these simulated practices back into the model.
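A rough sketch of this loop, in the spirit of tabular model-based RL, might look like the Python below. The log entries, state names, and action names are all invented for illustration, and the "simulator" is nothing more than empirical transition counts; a real system would presumably need a far richer learned environment. The final step of compressing the result back into the weights is not shown.

```python
import itertools
import random
from collections import Counter, defaultdict

# Hypothetical observations of a real process: (state, action, next_state).
REAL_LOGS = (
    [("draft", "email", "review")] * 2
    + [("draft", "email", "rejected")] * 2
    + [("draft", "meeting", "review")] * 4
    + [("draft", "meeting", "rejected")] * 1
    + [("review", "email", "approved")] * 1
    + [("review", "email", "rejected")] * 3
    + [("review", "meeting", "approved")] * 3
    + [("review", "meeting", "rejected")] * 1
)


def learn_model(logs):
    """Build the simulator: outcome counts for every observed (state, action)."""
    counts = defaultdict(Counter)
    for state, action, next_state in logs:
        counts[(state, action)][next_state] += 1
    return counts


def dream_episode(model, policy, rng, start="draft", max_steps=10):
    """Roll out one imagined episode; return 1.0 if it ends in 'approved'."""
    state = start
    for _ in range(max_steps):
        if state not in policy:  # terminal state: no action defined
            break
        outcomes = model[(state, policy[state])]
        state = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return 1.0 if state == "approved" else 0.0


def best_policy_by_dreaming(logs, episodes=2000, seed=0):
    """Try every strategy inside the learned simulator and keep the best one."""
    rng = random.Random(seed)
    model = learn_model(logs)
    states = sorted({s for s, _, _ in logs})
    actions = sorted({a for _, a, _ in logs})
    scores = {}
    for combo in itertools.product(actions, repeat=len(states)):
        policy = dict(zip(states, combo))
        wins = sum(dream_episode(model, policy, rng) for _ in range(episodes))
        scores[combo] = wins / episodes
    best = max(scores, key=scores.get)
    return dict(zip(states, best)), scores


if __name__ == "__main__":
    policy, scores = best_policy_by_dreaming(REAL_LOGS)
    print(policy)
    print(scores)
```

Thirteen real observations are enough to run thousands of imagined episodes here. That asymmetry between scarce real experience and cheap simulated practice is what makes the idea attractive, provided the learned simulator is faithful enough.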
If this approach proves viable, it might become a new scaling axis.
In the past, AI scaling has come primarily from three axes: pretraining, RL, and inference-time compute. Dwarkesh envisions a fourth axis emerging in the future: test-time training, or dreaming. Models would not just reason; while reasoning and executing tasks, they would construct simulation environments for specific users, organizations, or projects, and train themselves inside them.
This is also why someone in the comments mentioned David Silver and Richard Sutton's 'Welcome to the Era of Experience': that article similarly emphasizes that AI cannot rely forever on human data, and the next phase's key will be agents gaining experience from their own interactions with the environment.

Dwarkesh applies this high-level view to the concrete problem of training today's large models: RLVR is an important transitional phase that lets models develop agent capabilities on verifiable tasks; but to enter the more complex real world, models must learn continually from real deployment and write that experience back into the weights.
In Dwarkesh's envisioned 2027 or 2028, the training process might look like this:
- First, RLVR trains a basically competent agent. This agent is thrown into an unfamiliar problem and can at least figure out the situation, try different strategies, and continue iterating after encountering obstacles.
- Then, this agent is deployed into the real world to start doing real work. It might work continuously with a user for a week on a project outside the original training distribution.
- At the end of the week, the user gives it a thumbs up or thumbs down, or even writes a work evaluation. If the result is positive, the model distills what it learned during this task back into the base model. This process might use OPSD, dreaming, or some new technology not yet invented.
Once this path is established, AI's capability boundaries are no longer limited by those initial 'verifiable tasks.'
It can first learn coding, math, web tasks, and tool use through RLVR; then learn organizational management, business processes, and complex collaboration through real deployment; then, starting from these experiences, continue expanding into adjacent domains.
This also implies that the main source of AI progress may change.
In the past, a model was trained before release, and users simply used it. The next generation of models might be: train a basic agent before release, then continue learning through massive real tasks after release. Every interaction with a user, every real project execution, every failure and correction could become material for the next round of capability improvement.
So what Dwarkesh calls the 'next generation training paradigm' is not simply a claim that models need to be bigger, data more plentiful, or RL stronger.
What it really points to is a shift: from pre-deployment training to post-deployment learning; from human data to environmental experience; from temporary adaptation in context to long-term capability in weights.
The most important AI training data in the future may no longer be just the text already on the internet, nor just well-constructed verifiable tasks in labs, but the experience that AI accumulates itself while completing real tasks in the real world.
References:
https://x.com/dwarkesh_sp/status/2070551894674555081
This article is from the WeChat public account 'Almost Human' (ID: almosthuman2014), author: Focus on AI Training






