Wenlong Huang's Recent LinkedIn Posts

Wenlong Huang

@wenlongh

CS PhD Student at Stanford (AI / Robotics)

2 posts · LinkedIn

Posts

Wenlong Huang

Tech & AI

2mo

What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow.

🌐 dream2flow.github.io by Stanford University

🔹 Robot manipulation is about inducing changes in an environment through actions. We observe that video models (e.g., Veo) excel at producing plausible object motions from an in-the-wild image and a language instruction. Intriguingly, these motions are more physically realistic when the actor is a human rather than a robot, likely because the internet contains far more human interaction data than robot data.

🔹 But how do we turn those generated videos into low-level robot actions? This is a nuanced question that goes beyond simple retargeting, because strategies taken by a human may not work on a robot.

🔹 We propose Dream2Flow, which uses 3D object flow to separate what should happen in the scene from how a robot should realize it. We extract this flow from generated videos using off-the-shelf vision models, then use it as a shared objective for both trajectory optimization and reinforcement learning.

🔹 Dream2Flow can perform a range of in-the-wild tasks zero-shot with trajectory optimization, including manipulation of rigid, articulated, and deformable objects. The robot plans by asking a counterfactual question using a dynamics model (either heuristics-based or learned): if I take this action, will the scene evolve toward the desired 3D flow?

🔹 Used as a reward for RL, Dream2Flow enables different embodiments to discover emergent behaviors that achieve the same effect (e.g., base motion of the robot dog). Dream2Flow unifies these behaviors through a shared task interface and unifies model-free and model-based methods around a shared tracking goal.

🔹 By leveraging purely off-the-shelf video models, Dream2Flow also generalizes to different object instances, backgrounds, and camera viewpoints. It is also surprisingly steerable: different language instructions in the same scene can induce different desired behaviors.

🔹 World modeling encodes rich priors about not only environment dynamics but also the behaviors within it. It is immensely useful for robotics, yet we are only scratching the surface of understanding it.

The project was led by Karthik Dharmarajan and has been a year in the making, together with the rest of the team: Jiajun Wu, Fei-Fei Li, and Ruohan Zhang. Karthik Dharmarajan will also be joining UC Berkeley as a PhD student this fall!

Website: dream2flow.github.io
Paper: https://lnkd.in/gpwP2hkT
Code: https://lnkd.in/gvJZTxaP
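To make the "counterfactual question" concrete, here is a minimal sketch of planning against a desired 3D object flow. It is not the Dream2Flow implementation: the random-shooting planner, the mean-squared tracking cost, and every name below (`flow_tracking_cost`, `toy_dynamics`, `plan_random_shooting`) are assumptions made for illustration. The toy dynamics model simply translates all object points by the action and stands in for the heuristics-based or learned model mentioned in the post.

```python
import random


def flow_tracking_cost(predicted, desired):
    """Mean squared 3D distance between predicted and desired points.

    Both arguments are lists over timesteps of lists of (x, y, z) points.
    """
    total, count = 0.0, 0
    for pred_t, des_t in zip(predicted, desired):
        for p, d in zip(pred_t, des_t):
            total += sum((a - b) ** 2 for a, b in zip(p, d))
            count += 1
    return total / count


def toy_dynamics(points, action):
    """Hypothetical stand-in model: the action rigidly translates every point."""
    return [tuple(c + a for c, a in zip(p, action)) for p in points]


def rollout(points, actions, dynamics):
    """Predict the object points after each action in the sequence."""
    trajectory = []
    for action in actions:
        points = dynamics(points, action)
        trajectory.append(points)
    return trajectory


def plan_random_shooting(points, desired_flow, dynamics,
                         num_samples=256, max_step=0.05, seed=0):
    """Sample action sequences and keep the one that best tracks the flow."""
    rng = random.Random(seed)
    horizon = len(desired_flow)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_samples):
        actions = [tuple(rng.uniform(-max_step, max_step) for _ in range(3))
                   for _ in range(horizon)]
        cost = flow_tracking_cost(rollout(points, actions, dynamics),
                                  desired_flow)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


# Usage: the "desired flow" moves two object points 3 cm along x per step.
start = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
desired_flow = rollout(start, [(0.03, 0.0, 0.0)] * 3, toy_dynamics)
best_actions, best_cost = plan_random_shooting(start, desired_flow, toy_dynamics)
```

The same `flow_tracking_cost`, negated, could serve as a per-episode reward in an RL setup, which is one way to read the "shared tracking goal" above. A real system would replace random shooting with a stronger optimizer and the toy model with one that accounts for contact.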
197

Wenlong Huang

Tech & AI

5mo

What if robots can simulate an interactive 3D world, from a single image, in the wild, in real-time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts environment dynamics given an image and robot actions.

https://lnkd.in/ecA5NQpP by Stanford University & NVIDIA

Actions are embodied and deeply tied to the robot’s own spatial geometry. To move the world, we don’t cast spells, click objects, or turn our heads. We move our hands. We consider our “interaction geometry” in 3D. PointWorld represents both state and action as 3D point flows: action from the known robot surface, state from cameras.

Physics lives in space and time, and so should the world model. PointWorld is a transformer that takes in a partial scene point cloud and robot point flows to predict how each scene point moves under the robot action. It is trained with a simple L2 loss but with pixel-level supervision, across embodiments, tasks, and trajectories. This is akin to “next token prediction”, not for 1D tokens but for interaction in space and time, for learning the single source of truth of the physical world.

How do we get the data? Our observation: 3D vision is maturing for 3D world models. We curated a large-scale dataset and labeled it with accurate 3D annotations, spanning single-arm, bimanual, whole-body, and mobile manipulation.

With data at this scale, we can finally ask: how do we make 3D world models work? The metric is simple and telling: L2 error on point trajectories. With a large eval set, error bars are micrometers (several times thinner than a human hair). We distilled what worked into a roadmap: modernizing the backbone -> tuning the training objective -> leveraging image features -> scaling up the model. And we see a clear scaling law: more data or a larger model gives better generalization.

With “human-hair” statistical precision, we then studied its properties. My favorite ones:
- PointWorld generalizes zero-shot and surpasses specialists if finetuned.
- Real-world data is irreplaceable, but real-sim co-training also helps.
- It captures uncertainty in an object’s physical properties, without labels.
- It enables positive transfer across distinct robots (single arm, bimanual humanoid).

So, how can this be used on a physical robot? Despite many applications, what excites me the most is that a world model enables achieving novel goals at test time through imagination, like Doctor Strange, but for robots. All using a pre-trained model. No demonstrations. No finetuning.

“What I cannot create, I do not understand.” One pinnacle goal of spatial intelligence is to re-create the unique, physical, interactive 3D world we live in, understand it, and use it to enable the next generation of robots. We’re just getting started, and I’m thrilled about the initial findings in this work.

The project has been 1.5 years in the making. None of this would have been possible without the support from Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Fei-Fei Li.

#Robotics #WorldModels #SpatialIntelligence
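For readers who want the metric spelled out, here is a minimal sketch of an L2 error on point trajectories, plus the standard error that explains why a large eval set yields very tight error bars. The post does not specify the exact averaging, so the per-point Euclidean distance used here and the function names (`mean_l2_error`, `standard_error`) are assumptions, not the PointWorld evaluation code.

```python
import math


def mean_l2_error(predicted, ground_truth):
    """Mean Euclidean distance between predicted and true 3D points.

    Both arguments are lists over timesteps of lists of (x, y, z) points.
    """
    total, count = 0.0, 0
    for pred_t, gt_t in zip(predicted, ground_truth):
        for p, g in zip(pred_t, gt_t):
            total += math.dist(p, g)
            count += 1
    return total / count


def standard_error(values):
    """Standard error of the mean; it shrinks as 1 / sqrt(len(values))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


# Usage: one timestep, two points, predictions off by 3 mm and 4 mm.
predicted = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
ground_truth = [[(0.0, 0.0, 0.003), (1.0, 0.004, 0.0)]]
error = mean_l2_error(predicted, ground_truth)
```

Computing `mean_l2_error` once per evaluation trajectory and passing the resulting list to `standard_error` gives an error bar on the average; with many trajectories the `1 / sqrt(n)` factor is what drives it toward the micrometer range described above.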
1.1K