Original Author: Malika Aubakirova, Matt Bornstein, a16z crypto
Original Compilation: Deep Tide TechFlow
In Christopher Nolan's "Memento," the main character Leonard Shelby lives in a fragmented present. Brain damage has left him with anterograde amnesia, unable to form new memories. Every few minutes, his world resets, trapping him in an eternal "now," unable to remember what just happened or what will happen next. To survive, he tattoos words on his body and takes Polaroids, relying on these external props to replace the memory functions his brain can no longer perform.
Large language models live in a similar eternal present. After training ends, vast amounts of knowledge are frozen in their parameters; the model cannot form new memories or update its parameters based on new experiences. To compensate for this defect, we build a bunch of scaffolding for it: chat history acts as short-term sticky notes, retrieval systems serve as external notebooks, and system prompts are like tattoos on the body. But the model itself never truly internalizes this new information.
More and more researchers believe this is not enough. In-context learning (ICL) can solve problems, provided the answer (or fragments of the answer) already exists somewhere in the world. But for problems that require true discovery (like novel mathematical proofs), adversarial scenarios (like security attacks and defenses), or knowledge that is too implicit to be expressed in language, there is a strong argument that models need a way to directly write new knowledge and experience into their parameters after deployment.
In-context learning is temporary. True learning requires compression. Until we allow models to continuously compress, we might be stuck in the eternal present of "Memento." Conversely, if we can train models to learn their own memory architecture, rather than relying on external custom tools, we might unlock a whole new dimension of scaling.
This field of research is called continual learning. This concept is not new (see McCloskey and Cohen's 1989 paper), but we believe it is one of the most important research directions in AI today. The explosive growth of model capabilities over the past two to three years has made the gap between what models "know" and what they "can know" increasingly apparent. The purpose of this article is to share what we have learned from top researchers in this field, help clarify the different paths of continual learning, and promote the development of this topic within the startup ecosystem.
Note: This article was shaped by in-depth discussions with a group of excellent researchers, PhD students, and entrepreneurs who generously shared their work and insights in the field of continual learning. From theoretical foundations to the engineering realities of post-deployment learning, their insights have made this article much more solid than anything we could have written alone. Thank you for your time and ideas!
First, Let's Talk About Context
Before defending parameter-level learning (i.e., learning that updates model weights), it's necessary to acknowledge a fact: in-context learning does work. And there is a strong argument that it will continue to win.
The essence of a Transformer is a sequence-based next-token predictor conditioned on the input. Give it the right sequence, and you can get surprisingly rich behavior without ever touching the weights. This is why methods like context management, prompt engineering, instruction fine-tuning, and few-shot examples are so powerful. Intelligence is encapsulated in static parameters, and the manifested capabilities change dramatically based on what you feed into the context.
A recent in-depth article by Cursor on the scaling of autonomous programming agents is a good example: the model weights are fixed; what really makes the system run is the careful orchestration of context—what to put in, when to summarize, how to maintain a coherent state over hours of autonomous operation.
OpenClaw is another good example. It went viral not because of special model access (the underlying model is available to everyone), but because it extremely efficiently converted context and tools into a working state: tracking what you're doing, structuring intermediate outputs, deciding when to re-inject prompts, maintaining persistent memory of previous work. OpenClaw elevated the "shell design" of agents to the level of an independent discipline.
When prompt engineering first emerged, many researchers were skeptical that "just prompts" could become a serious interface. It seemed like a hack. But it is a native product of the Transformer architecture, requires no retraining, and automatically upgrades as models improve. As models get stronger, prompts get stronger. "Crude but native" interfaces often win because they are coupled directly to the underlying system, not fighting against it. So far, the trajectory of LLM development has followed this pattern.
State Space Models: Context on Steroids
As mainstream workflows shift from raw LLM calls to agent loops, in-context learning models are under increasing pressure. In the past, it was relatively rare for the context window to be completely filled. This usually happened when an LLM was asked to perform a long series of discrete tasks, and the application layer could trim and compress chat history in a straightforward way.
But for agents, a single task can consume a large portion of the total available context. Each step of an agent loop relies on the context passed from previous iterations. And they often fail after 20 to 100 steps because they "lose the thread": the context gets filled, coherence degrades, and they fail to converge.
Therefore, major AI labs are now investing significant resources (i.e., large-scale training runs) to develop models with ultra-long context windows. This is a natural path because it builds on what already works (in-context learning) and aligns with the industry's broader shift towards inference-time computation. The most common architecture involves interleaving fixed memory layers between standard attention heads, namely State Space Models (SSMs) and linear attention variants (collectively referred to as SSMs below). SSMs offer fundamentally better scaling curves in long-context scenarios.
Figure Caption: Scaling comparison of SSM vs. traditional attention mechanism
The goal is to help agents increase the number of coherent run steps by several orders of magnitude, from about 20 steps to about 20,000 steps, without losing the broad skills and knowledge provided by traditional Transformers. If successful, this would be a major breakthrough for long-running agents.
You could even view this approach as a form of continual learning: although the model weights aren't updated, an external memory layer that rarely needs resetting is introduced.
So, these non-parametric methods are real and powerful. Any evaluation of continual learning must start here. The question isn't whether today's context systems work—they do. The question is: have we already seen the ceiling, and can new methods take us further?
What Context Omits: The "Filing Cabinet Fallacy"
"What happened with AGI and pre-training is that, in a sense, they overshot... Humans are not AGI. Yes, humans do have a skill base, but humans lack a vast amount of knowledge. We rely on continual learning.
If I create a super-smart 15-year-old, he knows nothing. A good student, very eager to learn. You could say, go be a programmer, go be a doctor. Deployment itself would involve a process of learning, trial and error. It's a process, not throwing the finished product out there. — Ilya Sutskever"
Imagine a system with infinite storage space. The world's largest filing cabinet, every fact perfectly indexed, instantly retrievable. It can look up anything. Has it learned?
No. It was never forced to compress.
This is the core of our argument, referencing a point previously made by Ilya Sutskever: LLMs are essentially compression algorithms. During training, they compress the internet into parameters. Compression is lossy, and it is this lossiness that makes it powerful. Compression forces the model to find structure, generalize, and build representations that transfer across contexts. A model that memorizes all training samples is inferior to one that extracts underlying patterns. Lossy compression is learning itself.
Ironically, the mechanism that makes LLMs so powerful during training (compressing raw data into compact, transferable representations) is precisely what we stop them from doing after deployment. We halt compression at the moment of release, substituting it with external memory.
Of course, most agent shells compress context in some custom way. But doesn't the bitter lesson tell us that the model itself should learn this compression, directly and at scale?
Yu Sun shared an example to illustrate this debate: mathematics. Consider Fermat's Last Theorem. For over 350 years, no mathematician could prove it, not because they lacked the right literature, but because the solution was highly novel. The conceptual distance between existing mathematical knowledge and the final answer was too great.
When Andrew Wiles finally cracked it in the 1990s, he spent seven years working in near isolation, having to invent entirely new techniques to reach the answer. His proof relied on successfully bridging two different branches: elliptic curves and modular forms. Although Ken Ribet had previously shown that establishing this connection would automatically solve Fermat's Last Theorem, no one before Wiles possessed the theoretical tools to actually build that bridge. A similar argument can be made for Grigori Perelman's proof of the Poincaré conjecture.
The core question is: Do these examples prove that LLMs are missing something, some ability to update priors and engage in truly creative thinking? Or does this story恰恰证明恰恰相反——all human knowledge is just data available for training and recombination, and Wiles and Perelman merely demonstrate what LLMs could also do at a larger scale?
This question is empirical, and the answer is still uncertain. But we do know that there are many categories of problems where in-context learning fails today, and parameter-level learning could be useful. For example:
Figure Caption: Problem categories where in-context learning fails and parameter learning might succeed
More importantly, in-context learning can only handle things that can be expressed in language, while weights can encode concepts that prompts cannot convey in words. Some patterns are too high-dimensional, too implicit, too deeply structured to fit into context. For instance, the visual texture that distinguishes a benign artifact from a tumor in a medical scan, or the subtle audio fluctuations that define a speaker's unique rhythm—these patterns are not easily broken down into precise vocabulary.
Language can only approximate them. No prompt, no matter how long, can transmit these things; this kind of knowledge can only live in the weights. They reside in the latent space of learned representations, not in words. No matter how large the context window grows, there will always be knowledge that text cannot describe, knowledge that can only be carried by parameters.
This might explain why explicit "the robot remembers you" features (like ChatGPT's memory) often make users feel discomfort rather than delight. What users really want is not "recall," but "capability." A model that has internalized your behavioral patterns can generalize to new scenarios; a model that merely recalls your history cannot. The gap between "Here's what you wrote last time you replied to this email" (verbatim repetition) and "I understand your way of thinking well enough to anticipate what you need" is the gap between retrieval and learning.
Continual Learning Primer
There are multiple paths to continual learning. The dividing line is not "whether there is memory function," but: Where does compression happen? These paths exist on a spectrum, from no compression (pure retrieval, frozen weights), to full internal compression (weight-level learning, the model gets smarter), with an important middle ground (modules).
Figure Caption: Three paths of continual learning—Context, Modules, Weights
Context
On the context end, teams build smarter retrieval pipelines, agent shells, and prompt orchestration. This is the most mature category: infrastructure is proven, deployment paths are clear. The limitation is depth: context length.
A notable new direction: multi-agent architectures as a scaling strategy for context itself. If a single model is limited to a 128K token window, a coordinated group of agents—each holding its own context, focusing on a slice of the problem, communicating results—can approximate infinite working memory as a whole. Each agent does in-context learning within its own window; the system does aggregation. Karpathy's recent autoresearch project and Cursor's example of building a web browser are early cases. This is a purely non-parametric approach (no weight changes), but it significantly raises the ceiling of what context systems can do.
Modules
In the module space, teams build pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) that allow general models to specialize without retraining. An 8B model with the right module can match the performance of a 109B model on a target task, with a fraction of the memory footprint. The appeal is its compatibility with existing Transformer infrastructure.
Weights
On the weight update end, researchers are pursuing true parameter-level learning: sparse memory layers that update only relevant parameter segments, reinforcement learning loops that optimize the model from feedback, test-time training that compresses context into weights during inference. These are the deepest methods, and the hardest to deploy, but they truly allow the model to fully internalize new information or skills.
There are various specific mechanisms for parameter updates. Listing a few research directions:
Figure Caption: Overview of research directions in weight-level learning
Weight-level research covers multiple parallel tracks. Regularization and weight space methods have the longest history: EWC (Kirkpatrick et al., 2017) penalizes parameter changes based on their importance to previous tasks; weight interpolation (Kozal et al., 2024) mixes old and new weight configurations in parameter space, but both are relatively fragile at scale.
Test-time training, pioneered by Sun et al. (2020) and later developed into architectural primitives (TTT layers, TTT-E2E, TTT-Discover), takes a截然不同的 approach: perform gradient descent on test data, compressing new information into parameters at the moment it's needed.
Meta-learning asks: Can we train models that know "how to learn"? From MAML's few-shot-friendly parameter initialization (Finn et al., 2017) to Behrouz et al.'s Nested Learning (2025), which structures the model as a hierarchical optimization problem with modules operating on different time scales for fast adaptation and slow updates, inspired by biological memory consolidation.
Distillation retains knowledge of previous tasks by having a student model match frozen teacher checkpoints. LoRD (Liu et al., 2025) makes distillation efficient enough for continuous operation by simultaneously pruning the model and the replay buffer. Self-distillation (SDFT, Shenfeld et al., 2026) flips the source, using the model's own outputs under expert conditions as the training signal, bypassing the catastrophic forgetting of sequential fine-tuning.
Recursive self-improvement operates on similar lines: STaR (Zelikman et al., 2022) bootstraps reasoning能力 from self-generated reasoning chains; AlphaEvolve (DeepMind, 2025) discovered algorithmic optimizations that had gone unimproved for decades; Silver and Sutton's "Age of Experience" (2025) defines agent learning as a never-ending stream of continuous experience.
These research directions are converging. TTT-Discover has already融合 test-time training and RL-driven exploration. HOPE nests fast and slow learning loops within a single architecture. SDFT turns distillation into a fundamental operation for self-improvement. The boundaries between columns are blurring. The next generation of continual learning systems will likely combine multiple strategies: regularization for stability, meta-learning for speed, self-improvement for compound growth. A growing number of startups are betting on different layers of this tech stack.
Continual Learning Startup Landscape
The non-parametric end of the spectrum is the most well-known. Shell companies (Letta, mem0, Subconscious) build orchestration layers and scaffolding, managing what goes into the context window. External storage and RAG infrastructure (e.g., Pinecone, xmemory) provide the retrieval backbone. The data exists; the challenge is getting the right slice in front of the model at the right time. As context windows expand, the design space for these companies grows, especially on the shell side, where a new wave of startups is emerging to manage increasingly complex context strategies.
The parametric end is earlier and more diverse. Companies here are experimenting with some version of "post-deployment compression," allowing models to internalize new information in their weights. The paths roughly correspond to different bets on *how* models should learn after release.
Partial Compression: Learning Without Retraining. Some teams are building pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) that allow general models to specialize without touching the core weights. The common argument is: you get meaningful compression (not just retrieval), while keeping the stability-plasticity trade-off manageable because learning is isolated, not spread throughout the parameter space. An 8B model with the right module can match the performance of much larger models on target task. The advantage is composability: modules can be plugged and played with existing Transformer architectures, can be swapped or updated independently, with much lower experimentation cost than retraining.
RL and Feedback Loops: Learning from Signals. Other teams bet that the richest signal for post-deployment learning already exists in the deployment loop itself—user corrections, task success/failure, reward signals from real-world outcomes. The core idea is that the model should treat every interaction as a potential training signal, not just an inference request. This is highly analogous to how humans improve at their jobs: do work, get feedback, internalize what works. The engineering challenge is converting sparse, noisy, sometimes adversarial feedback into stable weight updates without catastrophic forgetting. But a model that can truly learn from deployment compounds value in ways context systems cannot.
Data-Centric: Learning from the Right Signals. A related but distinct bet is that the bottleneck is not the learning algorithm, but the training data and surrounding systems. These teams focus on curating, generating, or synthesizing the *right* data to drive continuous updates: the premise is that a model with high-quality, well-structured learning signals needs far fewer gradient steps to improve meaningfully. This dovetails naturally with feedback loop companies but emphasizes the upstream question: it's one thing if the model *can* learn, another what it *should* learn from and to what extent.
New Architectures: Designing Learning Capability from the Ground Up. The most radical bet argues that the Transformer architecture itself is the bottleneck, and continual learning requires fundamentally different computational primitives: architectures with continuous-time dynamics and built-in memory mechanisms. The argument here is structural: if you want a continually learning system, you should embed the learning mechanism into the underlying foundation.
Figure Caption: Continual Learning Startup Landscape
All major labs are also actively working within these categories. Some are exploring better context management and chain-of-thought reasoning, others are experimenting with external memory modules or sleep-time compute pipelines, and several stealth companies are pursuing new architectures. The field is early enough that no single approach has won yet, and given the breadth of use cases, there shouldn't be just one winner.
Why Naive Weight Updates Fail
Updating model parameters in a production environment triggers a cascade of failure modes that are not yet resolved at scale.
Figure Caption: Failure modes of naive weight updates
The engineering problems are well-documented. Catastrophic forgetting means a model sensitive enough to learn from new data will destroy existing representations—the stability-plasticity dilemma. Temporal decoupling refers to the fact that invariant rules and mutable state are compressed into the same set of weights; updating one corrupts the other. Logical integration fails because fact updates don't propagate to their corollaries: changes are confined to the token sequence level, not the semantic concept level. Unlearning is still impossible: there is no differentiable subtraction operation, so there is no precise surgical removal method for false or toxic knowledge.
There is a second class of problems that receives less attention. The current separation between training and deployment is not just an engineering convenience; it is a boundary for safety, auditability, and governance. Opening this boundary causes multiple things to go wrong simultaneously. Safety alignment can degrade unpredictably: even narrow fine-tuning on benign data can produce widespread misaligned behavior.
Continuous updates create an attack surface for data poisoning—a slow, persistent version of prompt injection, but it lives in the weights. Auditability collapses because a continuously updated model is a moving target, making version control, regression testing, or one-time certification impossible. Privacy risks intensify when user interactions are compressed into parameters, baking sensitive information into representations that are harder to filter than information in a retrieved context.
These are open problems, not fundamental impossibilities. Solving them is part of the continual learning research agenda, just like solving the core architectural challenges.
From "Memento" to True Memory
Leonard's tragedy in "Memento" is not that he can't function—in any given scene, he is resourceful, even brilliant. His tragedy is that he can never compound. Every experience remains external—a Polaroid, a tattoo, a note in someone else's handwriting. He can retrieve, but he cannot compress new knowledge.
As Leonard navigates this self-constructed maze, the line between truth and belief begins to blur. His condition doesn't just deprive him of memory; it forces him to constantly reconstruct meaning, making him both the detective and the unreliable narrator of his own story.
Today's AI operates under the same constraints. We have built very powerful retrieval systems: longer context windows, smarter shells, coordinated multi-agent swarms, and they work. But retrieval is not learning. A system that can look up any fact is not forced to find structure. It is not forced to generalize. The lossy compression that made training so powerful—the mechanism that turns raw data into transferable representations—is precisely what we turn off the moment we deploy.
The path forward is likely not a single breakthrough, but a layered system. In-context learning will remain the first line of adaptive defense: it is native, proven, and improving. Module mechanisms can handle the middle ground of personalization and domain specialization.
But for those truly difficult problems—discovery, adversarial adaptation, implicit knowledge that cannot be put into words—we may need to let models continue to compress experience into parameters after training. This means advances in sparse architectures, meta-learning objectives, and self-improvement loops. It might also require us to redefine what a "model" is: not a fixed set of weights, but an evolving system comprising its memory, its update algorithm, and its ability to abstract from its own experience.
The filing cabinet is getting bigger. But a bigger filing cabinet is still a filing cabinet. The breakthrough is to let the model do after deployment what made it powerful during training: compress, abstract, learn. We stand at the turning point from amnesiac models to models with a glimmer of experience. Otherwise, we'll be stuck in our own "Memento."













