The Decoupled Teacher

Teaching a humanoid robot to manipulate objects while walking seems like it should benefit from mixed training data — human demonstrations and robot trajectories together. More data, more generalization. The intuition is wrong.

Psi-0 decouples the learning: pre-train a vision-language backbone on egocentric human video to learn visual-action representations, then post-train a flow-based action expert on humanoid robot data to learn precise joint control. The two stages use different data for different purposes. The human videos teach what manipulation looks like from the agent’s perspective. The robot data teaches how this particular body executes movements.
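As a rough illustration of what post-training a flow-based action expert could involve, the sketch below builds a flow-matching training target in plain Python. The text above only says the expert is flow-based. The straight-line path from noise to action, the Gaussian noise, and every name here (`flow_matching_example`, `flow_matching_loss`, `sample_action`, `predict_velocity`, `features`) are assumptions made for illustration, not Psi-0's actual implementation.

```python
import random


def flow_matching_example(action, rng):
    """Build one training example from a robot action (a list of floats).

    Assumption: a straight-line path from Gaussian noise to the action,
    x_t = (1 - t) * noise + t * action, whose velocity is action - noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity


def flow_matching_loss(predict_velocity, features, action, rng):
    """Mean squared error between predicted and target velocity.

    `features` stands in for the output of the pre-trained backbone;
    `predict_velocity(features, x_t, t)` is a hypothetical action expert.
    """
    x_t, t, target = flow_matching_example(action, rng)
    predicted = predict_velocity(features, x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def sample_action(predict_velocity, features, dim, rng, steps=10):
    """Generate an action by Euler-integrating the velocity field from noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        velocity = predict_velocity(features, x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x
```

In this sketch the robot data enters only through `action`, while anything learned from human video would reach the expert only through `features`. That is one way to picture the two stages using different data for different purposes.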

With 800 hours of human video and 30 hours of real robot data, Psi-0 outperforms baselines pre-trained on ten times as much data, beating them by over 40% in overall success rate across manipulation tasks.

The through-claim: the bottleneck in cross-embodiment transfer is not data quantity but data alignment. Human bodies and humanoid robots have different kinematics: a human’s shoulder joint and a robot’s actuator follow different dynamics even when performing the same task. Co-training on mixed data forces the model to reconcile these differences during learning, which contaminates both the visual representation and the motor representation. Decoupling the two, by extracting visual understanding from human video and motor control from robot data, respects the structural difference between knowing what to do and knowing how this body does it. The factor-of-ten data efficiency gain comes not from better algorithms but from not mixing things that shouldn’t be mixed.

