What if robots could simulate an interactive 3D world from a single image, in the wild, in real time?
Introducing PointWorld-1B: a large pre-trained 3D world model that predicts environment dynamics given an image and robot actions.
https://lnkd.in/ecA5NQpP
by Stanford University & NVIDIA
Actions are embodied, and deeply tied to the robot’s own geometry in space.
To move the world, we don’t cast spells, click objects, or turn our heads—we move our hands. We consider our “interaction geometry” in 3D. PointWorld represents both state and action as 3D point flows: action from known robot surface, state from cameras.
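To make "action as 3D point flow" concrete, here is a minimal sketch of how points sampled on a known robot surface could be carried through a motion. It assumes a single rigid link and a sequence of 4x4 link-to-world poses; the function name `robot_point_flow` and the toy gripper are hypothetical, and a real robot would chain forward kinematics over many links.

```python
import numpy as np

def robot_point_flow(surface_points, poses):
    """Move sampled surface points through a sequence of rigid poses.

    surface_points: (N, 3) points in the link's local frame.
    poses: (T, 4, 4) homogeneous transforms, link frame -> world frame.
    Returns (T, N, 3) world-frame positions, one slice per timestep.
    """
    rotations = poses[:, :3, :3]    # (T, 3, 3)
    translations = poses[:, :3, 3]  # (T, 3)
    # Rotate every point at every timestep, then add that timestep's translation.
    rotated = np.einsum("tij,nj->tni", rotations, surface_points)
    return rotated + translations[:, None, :]

# Toy example (hypothetical): a gripper link sliding 1 cm per step along +x.
points = np.array([[0.0, 0.0, 0.0], [0.0, 0.05, 0.0]])
poses = np.tile(np.eye(4), (3, 1, 1))
poses[:, 0, 3] = [0.0, 0.01, 0.02]
flow = robot_point_flow(points, poses)
print(flow.shape)  # (3, 2, 3)
```

The output has one 3D position per surface point per timestep, which is the same shape of signal the scene state uses.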
Physics lives in space and time, so should the world model.
PointWorld is a transformer that takes in a partial scene point cloud and robot point flows to predict how each scene point moves under the robot action.
Trained with a simple L2 loss but pixel-level supervision, across embodiments, tasks, and trajectories. It is akin to “next-token prediction”, but over interaction in space and time rather than 1D tokens, as a way to learn the single source of truth of the physical world.
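As a rough illustration of an L2 objective on point flows, the sketch below averages squared per-point error over the points that have a label. The `valid` mask and the choice of squared (rather than unsquared) distance are assumptions for illustration, not details confirmed here.

```python
import numpy as np

def point_flow_l2_loss(pred, target, valid):
    """Mean squared L2 error over labeled points.

    pred, target: (T, N, 3) predicted and ground-truth point positions.
    valid: (T, N) boolean mask, True where a ground-truth label exists.
    """
    sq_err = np.sum((pred - target) ** 2, axis=-1)  # (T, N)
    num_valid = max(int(valid.sum()), 1)            # avoid division by zero
    return float(np.sum(sq_err * valid) / num_valid)

# Toy batch: 2 timesteps, 3 scene points, every prediction off by 0.1 m per axis.
target = np.zeros((2, 3, 3))
pred = target + 0.1
valid = np.ones((2, 3), dtype=bool)
valid[1, 2] = False  # pretend this point has no label at the last step
loss = point_flow_l2_loss(pred, target, valid)
print(loss)
```

Because every scene point gets its own target, supervision is per point rather than per trajectory.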
How do we get the data? Our observation: 3D vision is maturing for 3D world models.
We curated a large-scale dataset and labeled it with accurate 3D annotations, spanning single-arm, bimanual, whole-body, and mobile manipulation.
With data at this scale, we can finally ask: what does it take to make 3D world models work?
The metric is simple and telling: L2 error on point trajectories. With a large eval set, the error bars are on the order of micrometers (several times thinner than a human hair).
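Why do error bars get this tight? A small sketch with synthetic data: the standard error of the mean shrinks roughly with the square root of the number of evaluated points. The 1 mm noise level and the sample sizes are made up for illustration, and real per-point errors are correlated within a scene, so this is an idealized picture.

```python
import numpy as np

def mean_l2_error_with_sem(pred, target):
    """pred, target: (..., 3) positions in meters. Returns (mean error, standard error)."""
    errors = np.linalg.norm(pred - target, axis=-1).ravel()
    sem = errors.std(ddof=1) / np.sqrt(errors.size)
    return float(errors.mean()), float(sem)

rng = np.random.default_rng(0)
for num_points in (100, 100_000):
    target = rng.uniform(-0.5, 0.5, size=(num_points, 3))
    # Synthetic predictions: ground truth plus 1 mm Gaussian noise per axis.
    pred = target + rng.normal(scale=1e-3, size=(num_points, 3))
    mean_err, sem = mean_l2_error_with_sem(pred, target)
    print(f"{num_points:>7} points: mean {mean_err * 1e3:.3f} mm, SEM {sem * 1e6:.2f} um")
```

The mean error stays about the same as the eval set grows, while the uncertainty around it keeps shrinking.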
We distilled what worked into a roadmap: modernizing backbone -> tuning training objective -> leveraging image features -> scaling up the model.
And we see a clear scaling law: more data or larger model, better generalization.
With “human-hair” statistical precision, we then studied the model’s properties.
My favorite ones:
- PointWorld generalizes zero-shot, and surpasses specialists when finetuned.
- Real-world data is irreplaceable, but real-sim co-training also helps.
- It captures uncertainty in objects’ physical properties, without labels.
- It enables positive transfer across distinct robots (single-arm, bimanual humanoid).
So, how can this be used on a physical robot?
Among the many applications, what excites me most is that a world model lets a robot achieve novel goals at test time by imagination, like Doctor Strange, but for robots.
All using a pre-trained model. No demonstrations. No finetuning.
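One common way to "imagine" toward a goal with a world model is sampling-based planning: propose candidate actions, predict the outcome of each, and keep the one that lands closest to the goal. The sketch below is only an illustration of that idea. `toy_world_model` is a hypothetical stand-in (every point simply moves with the action), and nothing here should be read as the planner actually used in this work.

```python
import numpy as np

def toy_world_model(scene_points, action):
    """Hypothetical stand-in for a learned model: all points shift by the action."""
    return scene_points + action

def plan_by_sampling(scene_points, goal_points, world_model, num_samples=256, seed=0):
    """Sample candidate 3D displacements and return the lowest-cost one."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-0.1, 0.1, size=(num_samples, 3))
    costs = np.array([
        # Cost: mean distance between imagined and goal point positions.
        np.mean(np.linalg.norm(world_model(scene_points, a) - goal_points, axis=-1))
        for a in candidates
    ])
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

# Toy goal: move three object points 5 cm along +x.
scene = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
goal = scene + np.array([0.05, 0.0, 0.0])
best_action, best_cost = plan_by_sampling(scene, goal, toy_world_model)
print(best_action, best_cost)
```

Swapping the stand-in for a pre-trained model is what would make this work without demonstrations or finetuning: the goal is specified only through the cost.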
“What I cannot create, I do not understand.” — One pinnacle goal of spatial intelligence is to re-create the unique, physical, interactive 3D world we live in, understand it, and use it to enable the next generation of robots.
We’re just getting started, and I’m thrilled about the initial findings in this work.
The project has been 1.5 years in the making. None of this would have been possible without the support of Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Fei-Fei Li.
#Robotics #WorldModels #SpatialIntelligence