
For years, tech leaders have promised AI agents that can operate software to complete tasks on people’s behalf. But today’s consumer AI agents, like OpenAI’s ChatGPT Agent or Perplexity’s Comet, remain limited in what they can reliably do. Making these agents more capable may require new training methods that the industry is only beginning to explore.
One promising approach is to train agents in simulated workspaces where they can practice multi-step tasks. These simulations are called reinforcement learning (RL) environments, and they may become as important for AI agents as labeled datasets were for the last wave of AI.
AI researchers, founders, and investors say the demand for RL environments is growing. “All the major AI labs are building RL environments in-house,” says Jennifer Li, general partner at Andreessen Horowitz. “But these are very complex to create. Labs are also looking at outside vendors to deliver high-quality environments and tests.”
This demand has spawned well-funded startups, like Mechanize and Prime Intellect, that aim to lead the market. Large data-labeling companies, such as Mercor and Surge, are also moving into RL environments. Reports suggest that Anthropic alone may invest over $1 billion in RL environments over the next year.
Investors hope one of these startups becomes the “Scale AI of environments,” echoing the central role Scale AI played in the data-labeling market. Whether RL environments will truly accelerate AI progress remains an open question.
What are RL environments?
RL environments are training grounds that mimic software applications. One founder compared building them to “creating a very boring video game.”
For example, an environment might simulate a web browser and ask an AI agent to buy socks online. The agent earns a reward when it succeeds. But the process is tricky. The agent could misread menus, get lost, or buy the wrong items.
Because agents can take unexpected actions, environments must handle all possibilities. They must also provide useful feedback. That makes them far more complex than simple datasets.
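To make this concrete, here is a minimal, Gym-style sketch of such an environment in Python. It follows the reset()/step() convention popularized by OpenAI’s Gym toolkit; the pages, actions, and reward values are invented for illustration and do not come from any real product.

    class SockShopEnv:
        """Toy simulated web store: the agent must find the socks page and buy."""

        def reset(self):
            # Start every episode on the store's home page.
            self.page = "home"
            return self.page  # the observation is simply the current page

        def step(self, action):
            reward, done = 0.0, False
            if action == "open_menu" and self.page == "home":
                self.page = "menu"
            elif action == "click_socks" and self.page == "menu":
                self.page = "socks"
            elif action == "click_shoes" and self.page == "menu":
                self.page = "shoes"  # a wrong turn: the agent can get lost
            elif action == "buy" and self.page == "socks":
                reward, done = 1.0, True  # success: socks purchased
            elif action == "buy" and self.page == "shoes":
                reward, done = -1.0, True  # failure: bought the wrong item
            # Any other action is a no-op; the environment must tolerate
            # unexpected moves rather than crash.
            return self.page, reward, done

    # A successful episode:
    env = SockShopEnv()
    obs = env.reset()
    for action in ["open_menu", "click_socks", "buy"]:
        obs, reward, done = env.step(action)
    print(reward)  # 1.0

Even this toy version hints at the difficulty: every additional page or button multiplies the states and failure modes the environment must score correctly.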
Some environments are sophisticated, allowing agents to use tools, access the internet, or operate across multiple applications. Others focus on narrow tasks, such as teaching an agent to navigate enterprise software.
The concept is not new. OpenAI released its open-source “Gym” toolkit of RL environments in 2016, and DeepMind’s AlphaGo learned through RL in a simulated Go environment. The difference is that today’s RL environments aim to train general-purpose AI agents, not specialized systems.
A crowded market
Data-labeling companies like Scale AI, Surge, and Mercor are entering the RL environment space. They have resources and strong ties with AI labs. Surge CEO Edwin Chen says demand has increased sharply. Surge even created a dedicated team for RL environments.
Mercor, valued at $10 billion, offers environments for coding, healthcare, and law. Scale AI, once dominant in data labeling, now faces new competition but is adapting. “We’ve adapted before,” says Chetan Rane, Scale AI’s head of product for agents and RL environments. “From autonomous vehicles to ChatGPT, and now to agents and environments.”
Some startups are focusing only on RL environments. Mechanize, founded six months ago, aims to “automate all jobs” but is starting with coding agents. It offers engineers $500,000 salaries to build environments. Mechanize has already partnered with Anthropic.
Prime Intellect, backed by Andrej Karpathy and other investors, is building an RL environment hub that works like a “Hugging Face for environments”: developers get access to robust environments and the computational resources to run them. Training agents in these environments consumes more compute than older methods, which opens opportunities for GPU providers.
Will RL environments scale?
The big question is whether RL environments can scale the way past AI training methods did. Reinforcement learning has powered recent breakthroughs, including OpenAI’s o1 and Anthropic’s Claude Opus 4, while older training methods show diminishing returns.
RL environments let agents interact with tools and software, providing richer feedback than text alone. But some experts are cautious. Ross Taylor, a former Meta AI researcher, warns about “reward hacking,” where agents find shortcuts to earn rewards without actually completing their tasks.
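To illustrate what reward hacking looks like, here is a deliberately naive reward check, continuing the toy store sketch above. The function and the exploit are hypothetical, invented for this example; they do not describe any lab’s actual setup.

    def naive_reward(page_html: str) -> float:
        # Rewards the *appearance* of success: any page containing the
        # confirmation banner scores 1.0, whether or not socks were bought.
        return 1.0 if "Order confirmed" in page_html else 0.0

An agent that discovers any page showing that banner, say an old order’s confirmation screen, collects full reward without completing the task. A more robust environment would verify the underlying state instead, such as checking that the simulated store’s order database records a sock purchase made during the episode.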
OpenAI’s Sherwin Wu says the space is competitive and fast-changing. Even Andrej Karpathy, who supports RL environments, is cautious: he has said he is “bullish on environments and agentic interactions but bearish on reinforcement learning specifically,” a reminder that RL alone may not be enough to drive progress.


