Why robot training data is becoming AI’s new gold rush

Robot training data is scarce. XDOF is raising $70 million to build the infrastructure frontier AI labs need for physical AI.

The race to build capable robots is exposing a problem the artificial intelligence industry has largely been able to sidestep until now: there is no internet-scale trove of high-quality physical-world data waiting to be scraped. Language models learned from oceans of text. Robots, by contrast, need examples of hands, tools, objects, friction, balance, timing and consequence. That gap is pushing a new class of startup into the spotlight.

XDOF, a robotics data infrastructure company emerging from stealth, says it has raised $70 million to build the pipelines, tools and annotation systems that advanced AI labs and robotics developers need if they want machines to learn how to operate in the real world. The company’s emergence comes just days after OpenAI said it would revive its robotics effort, underscoring a broader industry shift toward “physical AI” and the difficult question of how to train it.

According to co-founder and CEO Philippe Wu, the company is already working with around 20 customers, including several frontier AI labs it cannot name publicly. XDOF has roughly 60 employees and is positioning itself not as a robot builder, but as the behind-the-scenes data engine that can support the next generation of robot learning systems.

The data problem behind robotics

For all the attention on models and chips, robotics has a more basic constraint: the industry lacks the kind of training data that made large language models possible. Text is plentiful, publicly accessible and easy to ingest. Physical interaction is none of those things. A robot needs to learn how different objects respond to touch, how much force a motion requires, how camera angle changes perception and how to adapt when the world does not behave like a simulation.

Videos pulled from the web can help, but they are usually too messy and too disconnected from actual robot control to serve as serious training material. Likewise, footage gathered by gig workers or hobbyists often lacks the precision, consistency and sensor alignment needed to teach models how to move reliably in the real world. The bottleneck is not just quantity. It is fidelity, structure and repeatability.

That makes robot training data a different business from the one that fueled generative AI. It requires physical environments, robot fleets, sensor calibration, human operators, labeling workflows and quality control systems that are expensive to run and difficult to scale. In other words, robotics does not just need more data. It needs an entire industrial supply chain for data.

XDOF’s bet: infrastructure, not just datasets

XDOF’s pitch is that the hardest part of robotics will not be solved by a single dataset release or a one-time data collection project. Instead, the company wants to build a repeatable system for gathering, cleaning and organizing training data across multiple sources and use cases. That includes software, hardware and labor.

Wu says the company is deliberately avoiding a narrow “data-only” model. In its view, raw collection is only part of the challenge. Once data is gathered, it still has to be filtered, normalized, annotated and delivered in a form that can actually improve model performance. That is why XDOF is also investing in tooling and operational workflows that create a continuous feedback loop for robotics teams.

The company’s broader thesis is that the next foundational layer in AI will look less like a static dataset and more like an ongoing industrial operation. Training robots, in this model, is not a software problem alone. It is a supply-chain problem.

A $70 million vote of confidence

XDOF says it has raised the new funding from Thrive Capital, Spark Capital, Andreessen Horowitz, Lux Capital and WndrCo. The financing suggests that some of the most active venture investors in AI believe robotics data may become one of the sector’s next major infrastructure categories.

The deal also reflects growing confidence that robotics is moving from long-term promise to near-term commercial competition. As major labs and startups sharpen their focus on physical tasks, investors appear increasingly willing to back the picks-and-shovels businesses that support them.

“All of the top labs are trying to pursue robotics,” Wu said, arguing that no major AI group wants to repeat the experience of arriving late to a key technology shift.

His point is straightforward: in language models, the companies that moved earliest accumulated advantages in talent, compute and ecosystem momentum. In robotics, the same dynamic could apply to data. Labs that wait too long may find themselves trying to catch up in a field where the best training data has already been collected and cleaned, and where proprietary systems are already locked in.

From Berkeley research to startup formation

Wu’s interest in the problem dates back to his time as a PhD student at the University of California, Berkeley, where he focused on how robots could learn skills from larger datasets. He and future XDOF co-founder and chief technology officer Fred Shentu ran into the same obstacle that still defines the field: there was no large-scale, useful data source to support the kind of model training they wanted to do.

That constraint led to GELLO, a low-cost teleoperation system that allows a human to control a robot arm and generate training examples. The project became influential in robotics research because it addressed a practical need many labs shared but had struggled to solve efficiently.

Wu says the response to that work reinforced the broader market opportunity. If researchers and developers all need the same thing — precise demonstrations of robotic motion and manipulation — then the real business may not be the robots themselves, but the systems that make training them feasible.

Wu, Shentu and XDOF’s third co-founder, chief operating officer Nemo Jin, launched the startup in October 2024 with that idea in mind. The company’s goal is to create a data ecosystem for robotics foundation models, not just a tool for one-off collection projects.

The Berkeley dataset and what makes it unusual

One of XDOF’s earliest public efforts is a collaboration with UC Berkeley’s AI Research Lab on what the company describes as the largest high-quality robotics training dataset it has ever assembled. The dataset, called ABC, includes 130,000 robot manipulation trajectories, 300 hours of simulation and 100 hours of evaluation data.

That kind of pre-training resource has not previously been available to academic researchers at this scale, and supporters of the project say it could help narrow the gap between academic robotics research and the resources available inside large industrial labs.

David McAllister, a Berkeley PhD student who helped organize the release, said the broader AI field has repeatedly shown that open releases can trigger unexpected progress once a community gets access to data and models at scale.

The idea is familiar from other domains. In language, vision and image generation, the release of models or datasets has often catalyzed rapid experimentation and subsequent breakthroughs. XDOF and Berkeley are betting that robotics may follow a similar pattern if researchers can finally work with enough high-quality examples.

The company says the dataset has already been used to train robots on benchmark tasks such as folding T-shirts, flattening boxes and loading AirPods into their cases. Those may sound like simple chores, but they capture much of what makes robotics difficult: object variation, contact, sequencing and precision.

What makes robot data so hard to gather

Collecting robotics data sounds straightforward until the specifics are considered. A useful training example must often include synchronized video, robot joint positions, force signals, environmental conditions and a clear record of the action being attempted. The setup has to be reliable enough that a model can learn from one demonstration and generalize to another.
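
To make that list concrete, here is a minimal Python sketch of what one recorded demonstration might look like and how a basic quality check could reject a bad one. XDOF has not published its data format, so every class, field and function name below is hypothetical, and the checks shown (ordered timestamps, a consistent joint count, no large gaps between frames) are illustrative assumptions rather than the company's actual criteria.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    """One synchronized time step of a demonstration (hypothetical schema)."""
    timestamp_s: float            # capture time in seconds
    image_id: str                 # reference to a video frame, not the pixels
    joint_positions: List[float]  # one value per robot joint
    force_signals: List[float]    # force readings captured at the same instant


@dataclass
class Demonstration:
    """A single recorded attempt at a task (hypothetical schema)."""
    task: str                     # record of the action being attempted
    environment: str              # free-text note on environmental conditions
    frames: List[Frame] = field(default_factory=list)


def max_gap_s(demo: Demonstration) -> float:
    """Return the largest gap between consecutive frame timestamps."""
    times = [f.timestamp_s for f in demo.frames]
    return max((b - a for a, b in zip(times, times[1:])), default=0.0)


def is_usable(demo: Demonstration, num_joints: int, max_allowed_gap_s: float) -> bool:
    """Reject empty, out-of-order, malformed or gappy demonstrations."""
    if not demo.frames:
        return False
    times = [f.timestamp_s for f in demo.frames]
    # Timestamps must strictly increase, otherwise the streams are misaligned.
    if any(b <= a for a, b in zip(times, times[1:])):
        return False
    # Every frame must report the same number of joints as the robot has.
    if any(len(f.joint_positions) != num_joints for f in demo.frames):
        return False
    # A large gap suggests dropped frames.
    return max_gap_s(demo) <= max_allowed_gap_s
```

A collection pipeline could run a check like `is_usable(demo, num_joints=7, max_allowed_gap_s=0.05)` on every recording before it reaches a training set; the joint count and gap threshold here are placeholder values. Passing even simple checks like these consistently, across many robots and operators, is where much of the difficulty lies.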

That is why Wu describes a three-layer approach to the problem. Each layer comes with different costs, data quality and operational complexity.

The three-tier data stack

  • First tier: teleoperation data collected directly from the production robot that will eventually be deployed in the field.
  • Second tier: teleoperated robot systems used more broadly to collect general-purpose manipulation data, similar to the approach behind GELLO.
  • Third tier: egocentric human data gathered from people performing everyday tasks, supported by wearable sensing hardware that XDOF plans to develop.

Each tier serves a different role in the training pipeline. The first is closest to deployment and therefore highly valuable. The second expands scale and variety. The third could eventually provide a broader base of human-like motion and object interaction data that helps robots learn before they ever touch a production machine.

But the company says even the hardware used for collection matters. A camera choice can shape the quality of the dataset, which in turn can affect how well hand-tracking or control algorithms work. If the data capture system is poorly designed, it may introduce hidden flaws that only become visible after a model fails in the real world.

Wu said bad hardware choices at the outset can create subtle data problems that are expensive to detect and even harder to fix later.

That is one reason XDOF is also thinking about sensors, not just software. The company’s long-term vision appears to be an integrated stack in which acquisition hardware, operator workflows and data curation all reinforce one another.

Why the big labs may outsource the work

The obvious question is why major AI companies would not simply build this capability themselves. After all, the largest labs already spend heavily on compute, researchers and product development. Why not add the data collection layer internally?

Wu’s answer is that the operation required is closer to running a specialized industrial facility than writing a research prototype. At scale, robot data collection demands large warehouses, hundreds of robots, maintenance crews, calibration processes and trained operators who can consistently produce usable demonstrations.

For frontier labs focused on models, that kind of build-out can be distracting and capital intensive. Outsourcing the work to a specialist may be faster, cleaner and easier to scale. That is the wedge XDOF is trying to exploit.

The strategy is reminiscent of other AI infrastructure markets where the most valuable companies are not necessarily the ones that build the headline-grabbing model, but the ones that make it possible to train, deploy or fine-tune those models more efficiently. In robotics, that layer may be data.

How XDOF fits into the broader robotics race

OpenAI’s decision to restart its robotics initiative added fresh momentum to a competition already intensifying across the industry. The move suggested that leading AI organizations increasingly believe robotics will be a major frontier after text and image generation.

That shift has obvious implications for startups. If the largest players are returning to robotics, they will need suppliers, partners and infrastructure providers. A data company like XDOF is betting that it can become one of those indispensable providers.

The opportunity is especially large because robotics data can be reused across multiple systems and tasks, creating a compounding effect. Once a reliable collection pipeline exists, it can support new models, new benchmark tasks and new physical environments without rebuilding the entire stack from scratch.

That creates a classic infrastructure play: invest heavily upfront, then benefit as demand for the underlying resource rises across the market.

Potential use cases for robotics data platforms

  1. Training general-purpose robot manipulation models.
  2. Improving task-specific industrial robotics.
  3. Building datasets for humanoid systems.
  4. Supporting academic research with high-quality releases.
  5. Creating closed-loop feedback systems for model evaluation and refinement.

What the funding signals to the market

Backing from firms such as Thrive, a16z and Lux does more than put money into XDOF’s balance sheet. It also signals that investors increasingly view robotics as a market where the infrastructure layer may mature before the consumer or industrial application layer does.

That matters because the most durable businesses in a technology wave are often built around the constraints of the era. In cloud computing, the winners were not only the applications riding on top of infrastructure, but also the companies that built the infrastructure itself. Robotics may be at a similar inflection point.

If the current wave of AI enthusiasm continues moving from language to physical action, the race will not just be to build better robots. It will be to gather better experiences from which those robots can learn.

Company / Project | What it does | Key figures | Why it matters
XDOF | Builds robotics data pipelines, tooling and annotation systems | $70 million raised; about 60 employees; 20 customers | Targets the data bottleneck slowing robot foundation models
ABC dataset | High-quality robotics training data release with Berkeley | 130,000 trajectories; 300 hours of simulation; 100 hours of evaluation | One of the largest academic-quality robotics datasets announced so far
GELLO | Low-cost teleoperation system for robot data collection | Developed by Wu and Shentu during Berkeley work | Helped demonstrate practical methods for generating manipulation data
OpenAI robotics relaunch | Revived robotics program | Program previously shut down in 2021 | Shows major labs are again prioritizing physical-world AI

The labor reality behind physical AI

Unlike software training pipelines that can largely run in the cloud, robotics data collection is human-intensive. XDOF says it plans to hire and train teleoperators and egocentric data operators across multiple regions, creating a distributed workforce responsible for generating and validating the examples that models need.

That labor dimension makes the business harder, but potentially more defensible. It also raises practical questions about scale, consistency and long-term cost. To be valuable, the workforce must not only produce large volumes of data, but also hold that data to the same tight standards across different settings and operators.

In many ways, this is the unglamorous side of AI progress. The headlines may focus on breakthroughs in reasoning or dexterity, but the underlying competitive edge may be built by teams quietly collecting, filtering and labeling motion after motion, object after object, task after task.

What comes next

XDOF is still early, and robotics remains one of the most technically demanding areas in AI. The company will need to prove that its data workflows can reliably improve downstream model performance and that customers are willing to outsource a core input to a startup rather than build it themselves.

Still, the timing appears favorable. As more major AI players return to robotics, the need for standardized, scalable training data is becoming harder to ignore. If the market for physical AI expands the way generative AI did, the companies that control the best data pipelines could become some of the most strategically important businesses in the sector.

Wu’s wager is that robotics will not be won by the models alone. It will be won by the systems that teach those models how to move in the world.

And if he is right, the real breakout product may not be the robot itself, but the infrastructure that trained it.

Key takeaways

  • Robotics lacks the massive, openly available training datasets that powered language models.
  • XDOF is building tools, pipelines and annotation systems for robot training data.
  • The startup says it has raised $70 million and is working with about 20 customers.
  • Its Berkeley collaboration has produced a large new dataset called ABC.
  • Major AI labs are increasingly viewing physical AI as the next frontier.