What would a World Model look like if we start from a real embodied agent acting in the real world?
It has to have: 1) a real, physically grounded, and complex action space, not just abstract control signals; 2) diverse, real-life scenarios and activities.
Or in short:
It has to be annoyingly complex—in both the action and vision space—to even get close to real life.
We made an initial attempt: Whole-Body Conditioned Egocentric Video Prediction.
In collaboration with @dans_t123, @_amirbar, @ylecun, @trevordarrell, and @JitendraMalikCV.
(For more details, see arxiv.org/abs/2506.21552)
What we did is very simple: Predict Egocentric Video from human Actions (PEVA). Given the past video and a future action represented as a relative 3D body pose, PEVA predicts how the world looks next from the first-person view.
By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, it learns how physical actions shape perception.
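
To make "relative 3D body pose, structured by the joint hierarchy" a bit more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not PEVA's actual code: the 5-joint skeleton and the names JOINT_NAMES, PARENTS, parent_relative, and action_vector are all hypothetical, and it uses only joint position offsets, which may differ from the representation in the paper.

```python
# Toy 5-joint skeleton (hypothetical; NOT the skeleton used in the paper).
# PARENTS[i] is the index of joint i's parent; the root joint has parent -1.
JOINT_NAMES = ["pelvis", "spine", "head", "left_hand", "right_hand"]
PARENTS = [-1, 0, 1, 1, 1]


def parent_relative(pose):
    """Convert global 3D joint positions into offsets from each joint's parent.

    The root joint keeps its global position.
    """
    rel = []
    for j, p in enumerate(PARENTS):
        if p < 0:
            rel.append(tuple(pose[j]))
        else:
            rel.append(tuple(a - b for a, b in zip(pose[j], pose[p])))
    return rel


def action_vector(pose_now, pose_next):
    """Flatten the change in parent-relative pose between two frames."""
    now = parent_relative(pose_now)
    nxt = parent_relative(pose_next)
    return [b - a for jn, jx in zip(now, nxt) for a, b in zip(jn, jx)]


# Example: the whole body steps 0.25 along x while the right hand lifts 0.5 in z.
pose_now = [
    (0.0, 0.0, 1.0),    # pelvis (root)
    (0.0, 0.0, 1.5),    # spine
    (0.0, 0.0, 1.75),   # head
    (-0.5, 0.0, 1.25),  # left_hand
    (0.5, 0.0, 1.25),   # right_hand
]
pose_next = [(x + 0.25, y, z) for (x, y, z) in pose_now]
pose_next[4] = (0.75, 0.0, 1.75)  # right hand also moves up

action = action_vector(pose_now, pose_next)
# Only the root translation and the right-hand offset are non-zero:
# walking forward does not change the spine, head, or left-hand entries.
```

The point of the hierarchy shows up in the example: a global step moves every joint in world coordinates, but in the parent-relative encoding it appears only in the root entry, so limb motion and whole-body motion stay separated in the conditioning signal.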