Why Everyone is Rushing to Build Reinforcement Learning Environments

Building reinforcement learning (RL) environments is quickly emerging as the next big thing in AI. OpenAI co-founder Andrej Karpathy recently noted in his post on X that the evolution of AI training can be broken down into three distinct eras—pretraining, supervised finetuning and, now, reinforcement learning environments.

“In the era of pretraining, what mattered was internet text,” Karpathy explained. The priority then was to gather a large, diverse and high-quality collection of online documents to train models.

With supervised finetuning, the focus shifted to conversations. “Contract workers were hired to create answers for questions, a bit like what you’d see on Stack Overflow or Quora, but geared towards LLM use cases,” he said.

According to Karpathy, neither of these approaches is going away. Instead, the current era adds something—environments. 

Unlike static text or curated conversations, environments allow models to interact, take actions, see outcomes and improve. This creates opportunities to move “beyond statistical expert imitation.” Environments can be used for both training and evaluation, but just like before, the challenge is assembling a large, diverse, high-quality set.

This is not the first time Karpathy has spoken about RL environments. Earlier this year, he asked friends working in open source to help build a highly diverse set of RL environments designed to elicit cognitive strategies in LLMs.

Industry Moves Into RL Environments

In a recent podcast with Nikhil Kamath, Deedy Das of Menlo Ventures stated that he considers RL as a service and RL environments to be a booming area. 

Companies like OpenAI, Google and Anthropic are already working in this space. For instance, Google DeepMind recently launched Genie 3, a world model that allows AI systems to use their understanding of the world to simulate aspects of it and predict both how an environment will evolve and how their actions will impact it.

OpenAI’s first project, Gym, launched in 2016 as an open-source toolkit for RL. It created a framework for standardised environments, offering a common way to develop and test algorithms across tasks ranging from Atari games, like Pong, to robotic control.
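
For readers unfamiliar with the pattern Gym standardised, the agent-environment loop looks roughly like the sketch below. It uses the built-in CartPole task rather than Pong (the Atari games need an extra install) and the classic Gym API; newer Gymnasium releases return slightly different tuples from reset and step.

```python
import gym

# Classic Gym interaction loop: reset the task, act, observe, collect reward.
env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0

while not done:
    action = env.action_space.sample()          # stand-in for a real policy
    obs, reward, done, info = env.step(action)  # environment returns the outcome
    total_reward += reward

env.close()
print(f"Episode return: {total_reward}")
```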

In 2023, OpenAI also acquired Global Illumination to build creative tools, infrastructure and digital experiences for AI agents.

So far, RL in LLMs has performed best in domains with clear, verifiable rewards, such as coding and math, where success is easy to measure. In contrast, models have struggled in areas where reward functions are less defined.

Das explained that RL differs from traditional data collection because it requires an environment where models learn by pursuing rewards. These rewards can be straightforward, such as solving a math problem, or more nuanced, like text generation, where an LLM can act as the judge. 
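
As a rough illustration of that distinction, a verifiable reward can be a simple programmatic check, while a nuanced reward hands scoring to another model. The sketch below assumes a hypothetical `judge` callable that wraps an LLM call and returns a number.

```python
def math_reward(model_answer: str, reference: str) -> float:
    """Verifiable reward: exact match against a known correct answer."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def judged_reward(prompt: str, model_output: str, judge) -> float:
    """Nuanced reward: another LLM grades the output on a 0-to-1 scale.

    `judge` is a hypothetical callable wrapping an LLM API call.
    """
    score = judge(
        f"On a scale of 0 to 1, rate this response to the prompt '{prompt}':\n"
        f"{model_output}\nReply with a single number."
    )
    return float(score)
```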

For him, RL is less about generating synthetic data and more about building environments where models learn by pursuing rewards, much like AlphaGo, where the reward was simply winning the game.

Sachin Dharashivkar, chief executive officer at AthenaAgent, told AIM that RL environments can approximate aspects of real life, allowing models to specialise in different domains. “Right now, everybody is asking which LLM is going to win. I don’t think that’s going to happen, because as we get newer environments, new domains will open up,” he said.

According to him, this will lead to the development of domain-specific AI systems. “If you want a companion, most of the time you don’t need Einstein-level intelligence. You want to create an environment that fosters engagement. Coding is one example, but accounting can be another, where you want extremely precise instructions to follow. In insurance, it might be something very different.”

When asked how RL environments differ from fine-tuning, he said, “In fine-tuning, we are changing the parameters of the model with supervised examples. But in RL…you don’t care how it is solved; you care about the end result. The simulator enables that.”
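
In code, the contrast he describes might look something like the sketch below, where `cross_entropy`, `generate` and `score` are hypothetical methods used only to show where the learning signal comes from in each setup.

```python
# Supervised fine-tuning: the target is a reference solution, so the model is
# nudged towards reproducing *how* the problem was solved, token by token.
def sft_loss(model, prompt, reference_solution):
    return model.cross_entropy(prompt, reference_solution)


# RL against an environment: the model produces its own attempt by whatever
# path it likes, and only the end result is scored; that reward drives the update.
def rl_reward(model, env, prompt):
    attempt = model.generate(prompt)
    return env.score(attempt)  # e.g. 1.0 if the final answer checks out
```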

Reinforcement as a Service 

Today, several companies are offering reinforcement as a service. One of them is Prime Intellect, which recently launched Environments Hub. The company argues that RL environments are the key bottleneck to the next wave of AI progress, but big labs are keeping them closed.

Similarly, Shyamal Anadkat of OpenAI believes RL for large language models is still in its early stages, but will soon gain traction as companies look to adapt models for their own domains. “We’ll see many organisations customising or optimising domain-specific intelligence models with RL—once platforms and evaluations improve and make it easier to justify the activation energy,” he said. 

According to Banghua Zhu, assistant professor at the University of Washington and principal research scientist at NVIDIA, the role of environment engineer is emerging. These are specialists who design high-quality RL environments with verifiable rewards.

In reinforcement learning from human feedback (RLHF), people gave feedback to train models. In reinforcement learning with verifiable rewards (RLVR), the feedback comes directly from the environment through verifiable rewards. This makes the job about creating challenging tasks, understanding what LLMs can or can’t do, and carefully engineering reliable environments that give clear signals. Zhu said doing so requires an understanding of LLM limits, creativity in task design and engineering rigour to build reliable environments and reward systems.
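
A toy version of what such an environment engineer might build is sketched below: a single-turn task whose reward comes from a programmatic verifier rather than a human rater. The class and task names are illustrative, not tied to any particular platform.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected: str


class VerifiableEnv:
    """Minimal RLVR-style environment: the reward comes from a checker, not a human."""

    def __init__(self, tasks):
        self.tasks = tasks
        self.idx = 0

    def reset(self) -> str:
        """Return the prompt for the current task."""
        return self.tasks[self.idx % len(self.tasks)].prompt

    def step(self, answer: str):
        """Score the model's answer, then advance to the next task."""
        task = self.tasks[self.idx % len(self.tasks)]
        reward = 1.0 if answer.strip() == task.expected.strip() else 0.0
        self.idx += 1
        return reward, True  # single-turn task, so the episode ends here


# Usage: env = VerifiableEnv([Task("What is 12 * 7?", "84")])
```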

The range of possible environments is vast: terminals, operating systems, enterprise workflows, even text and video games. Startups are already entering this space, which some see as a critical pathway towards developing generalist agents.

“Setting up the environment to have the right reward function is one thing, but another aspect is engineering it well. Creating scalable, robust environments is a key technical challenge,” said SemiAnalysis in its blog post.

“RL environment specifications are among the most consequential things we can write as AI researchers. A relatively short specification (eg, less than 1,000 words of instructions saying what problems to create and how to grade them) often gets expanded either by humans or via synthetic methods into thousands of datapoints,” Jason Wei of Meta said on X.

Evaluation and the Future

Building effective RL environments, however, is not a straightforward task. “RL environment is not just about creating the interface. The knowledge of the environment comes from the domain…You need domain experts,” Dharashivkar said.

He added that horizontal RL environment startups may provide some value, but domain-specific expertise will be critical for production use cases. “You can always reach 70-80% accuracy by covering common cases. The real challenge is in corner cases. Demos work well, but production is where systems fail.”

Similarly, Ross Taylor, CEO at General Reasoning, said in a recent post on X that there are hardly any high-quality RL environments and evaluations available. “Most agentic environments and evaluations are flawed when you look at the details. It’s a crisis, and no one is talking about it because they’re being hoodwinked by labs marketing their models on flawed evaluations,” he said.

On evaluating RL environments, Dharashivkar argued that current benchmarks may not be sufficient. “We have so many professions and people—one brain does not solve for everything. Just like in human work, specialisation matters. I personally believe we will have thousands or millions of models for separate use cases.”
