Google’s Next ‘Attention is All You Need’ Moment


Google Research has published a paper arguing that many of today’s assumptions about deep learning are incomplete or misleading. 

The authors introduced a new framework called Nested Learning (NL), which reframes how neural networks store information, learn from data and adapt over time. They claim this approach may explain why current AI systems hit limits and how future models could move beyond them.

Deep learning models are usually described as layers stacked on top of each other, each performing its own transformation. Google’s researchers argue that this picture hides what is actually happening inside these systems. 

According to the paper, a neural network is better viewed as a collection of multiple optimisation processes, each with its own internal memory, learning behaviour and update rate.

They write that NL “coherently represents a model with a set of nested, multi-level, and/or parallel optimisation problems”, each receiving and compressing its own flow of context. In other words, instead of one big learner, a model is a system of smaller learners operating at different speeds and levels.
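To make that concrete, here is a minimal sketch of what “smaller learners operating at different speeds” could look like in code. It is an illustrative toy in PyTorch, not the paper’s algorithm: the split into one fast and one slow block, the update periods and the toy objective are all assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fast = nn.Linear(16, 16)   # inner learner: updated every step
slow = nn.Linear(16, 16)   # outer learner: updated every 8 steps

opt_fast = torch.optim.SGD(fast.parameters(), lr=1e-2)
opt_slow = torch.optim.SGD(slow.parameters(), lr=1e-3)

for step in range(64):
    x = torch.randn(4, 16)                    # toy stream of context
    y = torch.randn(4, 16)
    loss = ((slow(fast(x)) - y) ** 2).mean()
    loss.backward()

    opt_fast.step()                           # fast level: acts every step
    opt_fast.zero_grad()
    if (step + 1) % 8 == 0:                   # slow level: its gradients
        opt_slow.step()                       # accumulate over 8 steps,
        opt_slow.zero_grad()                  # then one consolidated update
```

Both levels see the same data stream, but each compresses it on its own timescale: the slow block’s accumulated gradient is, in effect, a summary of the previous eight steps of context.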

The paper has already sparked strong reactions in the AI community.

“AI is constantly improving and the speed is accelerating. Google just dropped ‘Attention Is All You Need (V2)’ — and it might finally fix catastrophic forgetting,” posted Pawel Czech, founder of lablab.ai, on X.

Why Transformers Hit a Wall

Transformers can process vast amounts of data and generate strong outputs, but the paper says they hit a core limit: they stop learning once pre-training ends. After training, their long-term memory is frozen in their weights, and new information cannot be absorbed unless the model is retrained.

This is in line with what Safe Superintelligence Inc. (SSI) co-founder Ilya Sutskever said in his recent podcast appearance with Dwarkesh Patel, where he argued that the traditional scaling recipe is approaching its limits. “At some point, pre-training will run out of data. The data is very clearly finite,” he said. He added that while increasing compute would certainly help, it would not fundamentally solve the bottleneck.

This is why, in his view, the industry is transitioning into a new phase.
“So it’s back to the age of research again, just with big computers,” Sutskever said.

Coming back to Google’s paper, the authors compared the limitation of pre-training to anterograde amnesia, a condition in which a person cannot form new long-term memories after a certain event or injury. In the same way, the model operates only with what fits inside its context window and whatever information was stored in its weights before training ended. Anything learned during a conversation or new task disappears as soon as the context resets. 

The researchers argue that this creates a rigid system that cannot adapt or continually learn, regardless of how many layers or parameters are added.

But theory is only one part of the story. Mohammed Arsalan, generative AI consultant at T-Systems ICT India, told AIM that Nested Learning can run on today’s infrastructure, but it won’t be plug-and-play. “It will need extra engineering effort and smarter compute management,” he said, adding that the multi-level update frequencies make it more complex than standard training.

Self-Modifying Models and HOPE

Building on NL, the team introduces a self-modifying sequence model that “learns how to modify itself by learning its own update algorithm.” They then combine this idea with a Continuum Memory System, which assigns different MLP blocks to different update frequencies.
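As a rough illustration of what assigning MLP blocks different update frequencies might mean, the sketch below gives each block in a small stack its own update period. The block count, periods and stand-in objective are assumptions for illustration; the paper’s Continuum Memory System is considerably more involved.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 32

# A stack of MLP blocks, each assigned its own update clock below.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(3)]
)
periods = [1, 4, 16]   # steps between updates: fast memory -> slow memory
opts = [torch.optim.Adam(b.parameters(), lr=1e-3) for b in blocks]

for step in range(64):
    h = torch.randn(8, dim)            # toy input batch
    for block in blocks:
        h = block(h)
    loss = (h ** 2).mean()             # stand-in objective
    loss.backward()
    for period, opt in zip(periods, opts):
        if (step + 1) % period == 0:   # each block runs on its own clock
            opt.step()
            opt.zero_grad()
```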

The resulting architecture, HOPE, updates parts of itself at different rates and incorporates a deeper memory structure than Transformers. On common-sense reasoning and language modelling benchmarks, HOPE shows improvements over Transformer++, RetNet, DeltaNet and Titans at certain scales.

“The HOPE model is still [in the] proof-of-concept stage, so while it doesn’t need a completely new stack, real-world deployment will definitely require upgrades to handle the continuum memory systems efficiently,” said Arsalan. According to him, it will take around two to three years to become practical.

Similarly, Adithya S Kolavi, research fellow at Microsoft, told AIM that in terms of infrastructure, it should fit into current training stacks with new scheduling logic added on top. “Something like an integration into the Hugging Face trainer, or TRL, seems realistic since inference itself does not change,” he said.

Haiyu Wu, PhD, research scientist at Altos Labs, said in a post on X that the Nested Learning paper “elegantly reformulates the training paradigm” and highlights two core ideas. The first is treating the optimiser’s momentum as its own learning task — one that adapts based on a “local surprise signal,” or the gap between the learned momentum and the actual gradient.

Second, he explained that using different weight-update frequencies allows the model to build both long-term and short-term memory, a structure inspired by how the human brain functions.
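Read this way, even classical momentum looks like a tiny inner learner. The sketch below makes the “local surprise” interpretation explicit on a one-dimensional toy loss; the objective and step sizes are illustrative assumptions, not values from the paper.

```python
w = 5.0              # model parameter, trained on the toy loss w**2
m = 0.0              # momentum buffer, viewed as a tiny learner
lr, beta = 0.1, 0.5  # outer and inner step sizes (assumed for illustration)

for step in range(10):
    g = 2 * w                 # true gradient of the toy loss
    surprise = g - m          # gap between momentum's "prediction" and g
    m += beta * surprise      # inner step on the surprise; algebraically the
                              # usual EMA: m = (1 - beta) * m + beta * g
    w -= lr * m               # outer step: the standard parameter update
    print(f"step {step}: w={w:+.4f}, surprise={surprise:+.4f}")
```

The second idea maps onto the multi-frequency sketch above: blocks updated on fast clocks behave like short-term memory, while slowly updated ones consolidate into long-term memory.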

Whether NL becomes practical soon or not, it marks an important step in exploring new directions for AI systems, especially as the field looks beyond scaling and begins searching for more adaptive, memory-rich architectures.
