Ψ₀: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Accepted to RSS 2026

Songlin Wei¹*, Hongyi Jing¹*, Boqian Li¹*, Zhenyu Zhao¹*, Jiageng Mao¹, Zhenhao Ni¹, Sicheng He¹, Jie Liu¹, Xiawei Liu¹, Kaidi Kang¹, Sheng Zang¹, Weiduo Yuan¹, Marco Pavone², Di Huang³, Yue Wang¹†

* Equal Contribution † Corresponding Author

¹ USC Physical Superintelligence (PSI) Lab ² NVIDIA ³ WorldEngine

Abstract

We introduce Ψ₀ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to the fundamental kinematic and motion disparities between humans and humanoid robots. Therefore, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, Ψ₀ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: first, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations; then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that Ψ₀ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10x as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.

Demos

Whole-body tasks that combine navigation, pickup, carrying, and placement.

Navigation

Push Cart

Sustains contact-rich cart pushing over a longer forward path.

Handoff

Pick Up Basket, Walk to Person

Collects a basket, traverses the scene, and finishes with a person-facing handoff.

Serving

Push Cart, Serve Food

Maintains stable locomotion while steering a cart through a serving motion.

Placement

Pick Up Lunchbox, Put on Desk

Transitions from grasp to desk placement without losing the object pose.

Method

TWO DATA SOURCES

Foundations for Ψ₀

To maximize the utility of heterogeneous data sources, Ψ₀ decouples the learning process. Human video provides large-scale, high-quality, and diverse manipulation observations, while humanoid data provides domain-specific real-world trajectories for embodied control.

EgoDex: Egocentric Human Videos

829 hours of human egocentric video capturing diverse dexterous manipulation behaviors. EgoDex provides broad visual-semantic coverage and scalable manipulation priors for long-horizon tasks.

Humanoid Everyday: Humanoid Robot Data

31 hours of humanoid data covering 260 diverse tasks in 7 categories. Humanoid Everyday grounds video priors in whole-body robot actions and embodiment-specific execution.

STAGED TRAINING

A Three-Stage Recipe That Bridges Video Priors and Robot Actions

We present an efficient training recipe for learning humanoid loco-manipulation skills from both human videos and real robot data. The overall training procedure consists of three stages: first, pre-training the VLM backbone on large-scale, high-quality, and diverse human egocentric videos while incorporating humanoid data to mitigate the visual gap; second, post-training the flow-based action expert on cross-task real humanoid data; and third, fine-tuning the action expert using a small amount of in-domain task data, which enables rapid adaptation to new tasks.

Learn Broad Manipulation Priors from Human Video

Training a humanoid foundation model faces a significant data scarcity bottleneck, so we pre-train the VLM backbone on EgoDex, a large-scale egocentric human video dataset. To further mitigate the visual gap between human videos and robotic observations, we incorporate Humanoid Everyday during this stage.

Ground Those Priors with Real Humanoid Data

After the VLM backbone is trained, we freeze its parameters and train the action expert from scratch. We use the Humanoid Everyday dataset for this task-agnostic post-training stage. Conditioned on hidden features from the VLM backbone, the action expert predicts future whole-body action chunks directly in joint space and learns a strong prior for embodied control.
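The page only states that the action expert is "flow-based", so the following is a minimal sketch of one common formulation (flow matching along a straight noise-to-action path), not the authors' implementation. The function names, the `features` argument standing in for frozen VLM hidden features, and the oracle velocity field used in place of a trained network are all hypothetical.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action chunk (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, features, rng):
    """One-sample MSE between the predicted velocity and the path velocity (action - noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t, features)
    v_true = [a - n for n, a in zip(noise, action)]
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)


def sample_chunk(predict_velocity, features, dim, rng, steps=10):
    """Euler-integrate the velocity field from noise to a flattened action chunk."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt, features)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x_t, t, features):
    """Stand-in for a trained expert: here `features` simply holds the target actions."""
    return [(a - x) / (1.0 - t) for a, x in zip(features, x_t)]


rng = random.Random(0)
target = [0.1, -0.4, 0.25]  # made-up joint-space targets for illustration
loss = flow_matching_loss(oracle_velocity, target, target, rng)
chunk = sample_chunk(oracle_velocity, target, len(target), rng)
print(loss, chunk)
```

In a real setup the oracle would be replaced by a network conditioned on the backbone's hidden features, and only that network's parameters would receive gradients while the backbone stays frozen.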

Adapt to Target Tasks with In-Domain Teleoperation

With the pre-trained VLM and the post-trained action expert, our model can be fine-tuned further end-to-end using a small amount of in-domain data and rapidly learn long-horizon, dexterous loco-manipulation tasks.

DATA COLLECTION

Whole-Body Teleoperation System for Robot-Specific Data

Efficiently learning a long-horizon loco-manipulation task critically depends on the quality of in-domain data for fine-tuning. To address the limitations of prior systems, we propose a tailored teleoperation framework that explicitly separates upper-body pose tracking, dexterous manipulation, and locomotion commands. Using a small set of wearable trackers and separating locomotion from in-place whole-body actions, the framework lets a single operator control the whole body with improved locomotion stability across diverse task scenarios.

DEPLOYMENT

Real-Time Chunking for Deployment

Humanoid robots require smooth and reactive control, particularly when executing long-horizon, dexterous manipulation tasks. However, our model comprises over 2.5 billion parameters, with a single forward pass taking approximately 160 ms. To enable smooth policy rollout despite this latency, we adopt training-time real-time chunking. With RTC, each action prediction is conditioned on the previously committed action chunk and outputs a consistent chunk of future actions, while inference runs asynchronously with execution to avoid interruptions between chunks.
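To make the latency argument concrete, the sketch below counts how many control ticks elapse during one 160 ms forward pass and then simulates a rollout in which each new chunk is conditioned on the actions already committed for that window. The 30 Hz control rate, the chunk horizon, and the `predict_chunk` interface are assumptions for illustration, and the asynchronous inference is modeled synchronously for simplicity.

```python
import math


def inference_delay_steps(latency_s, control_hz):
    """Control ticks that pass while one forward pass is running."""
    return math.ceil(latency_s * control_hz)


def rollout(predict_chunk, horizon, delay, total_steps):
    """Execute one action per tick while the next chunk is computed in the background.

    predict_chunk(committed, horizon) is a hypothetical policy call that returns
    `horizon` actions whose first len(committed) entries reproduce `committed`.
    Requires horizon >= 2 * delay so a full committed prefix is always available.
    """
    chunk = predict_chunk([], horizon)
    cursor = 0
    executed = []
    while len(executed) < total_steps:
        # Inference starts now; these actions run before the new chunk is ready.
        committed = chunk[cursor:cursor + delay]
        next_chunk = predict_chunk(committed, horizon)
        executed.extend(committed)
        # Switch to the new chunk right after its committed prefix.
        chunk, cursor = next_chunk, delay
    return executed[:total_steps]


def toy_policy(committed, horizon):
    """Toy policy: continues an integer sequence so gaps or repeats are easy to spot."""
    start = committed[-1] + 1 if committed else 0
    return list(committed) + list(range(start, start + horizon - len(committed)))


delay = inference_delay_steps(0.160, 30.0)  # 30 Hz is an assumed control rate
trajectory = rollout(toy_policy, horizon=16, delay=delay, total_steps=12)
print(delay, trajectory)
```

Because every new chunk starts with the actions that were executed while it was being computed, the executed sequence has no jump at the chunk boundary, which is the property the paragraph above describes.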

SIMULATION

Fast Evaluation in Simulation

Although our primary goal is to deploy Ψ₀ in the real world, we believe that simulation is very valuable for accelerating experimental iteration and enabling unified, standardized evaluation. We introduce a large-scale humanoid loco-manipulation benchmark in simulation with automated task generation across 50 indoor scenes, imported rigid objects, and randomized episode conditions, giving Ψ₀ a fast evaluation loop before the more expensive hardware experiments.

REAL-WORLD

Real-World Task Setup

We evaluate Ψ₀ on eight diverse long-horizon dexterous loco-manipulation tasks involving manipulation, whole-body motion, and locomotion. The tasks range from simple interactions, such as pick-and-place, pushing, and wiping, to more challenging dexterous manipulations requiring precise finger-object coordination, including turning a faucet and pulling out a chip tray.

Experiments

Real-World Benchmark

Comparisons to Baselines

As illustrated in the following figure, our model outperforms all baselines by a large margin and exhibits the most stable performance across all eight long-horizon dexterous loco-manipulation tasks. These results highlight the effectiveness of our training paradigm, despite using a relatively small amount of robotic data in both the pre-training and post-training stages.

Figure: Evaluation results of policies across our eight tasks, showing task-wise success rates (%) (left) and aggregated skill-level success rates (%) (right).

Task (successes out of 10 trials)	ACT	Intern-M1	EgoVLA	H-RDT	Pi0.5	GR00T N1.6	Ψ₀
Remove the lid, turn on the faucet, and fill with water	0/10	0/10	0/10	0/10	2/10	2/10	6/10
Spray the bowl with water, wipe clean, and fold it up	1/10	0/10	0/10	0/10	3/10	4/10	7/10
Pick the bottle, turn around, and pour into cup	0/10	0/10	1/10	0/10	2/10	4/10	8/10
Grab the can, turn and pour onto plate, push the cart forward	0/10	0/10	0/10	0/10	1/10	3/10	7/10
Put the toy into the basket, turn around, and hand it over	0/10	1/10	0/10	0/10	5/10	0/10	9/10
Push the cart, grab the grapes, and place on the plate	0/10	5/10	0/10	0/10	3/10	4/10	6/10
Hold the lunch bag and squat down to place on the table	5/10	0/10	2/10	6/10	2/10	5/10	9/10
Pull out the tray and turn to throw the chip can into the trash	0/10	1/10	0/10	0/10	1/10	1/10	5/10
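As a quick check of the table, the snippet below pools all trials per method into a single success rate and reports the gap to the strongest baseline. Pooling the eight tasks with equal weight is an assumption about how an overall rate could be computed; the counts themselves are copied from the table, and the key `Psi-0` is just an ASCII label for Ψ₀.

```python
# Successes out of 10 trials per task, in the same task order as the table.
SUCCESSES = {
    "ACT":        [0, 1, 0, 0, 0, 0, 5, 0],
    "Intern-M1":  [0, 0, 0, 0, 1, 5, 0, 1],
    "EgoVLA":     [0, 0, 1, 0, 0, 0, 2, 0],
    "H-RDT":      [0, 0, 0, 0, 0, 0, 6, 0],
    "Pi0.5":      [2, 3, 2, 1, 5, 3, 2, 1],
    "GR00T N1.6": [2, 4, 4, 3, 0, 4, 5, 1],
    "Psi-0":      [6, 7, 8, 7, 9, 6, 9, 5],
}
TRIALS_PER_TASK = 10


def overall_rate(successes):
    """Fraction of successful trials when all tasks are pooled."""
    return sum(successes) / (TRIALS_PER_TASK * len(successes))


rates = {name: overall_rate(s) for name, s in SUCCESSES.items()}
best_baseline = max((n for n in rates if n != "Psi-0"), key=rates.get)
margin = rates["Psi-0"] - rates[best_baseline]

for name, rate in rates.items():
    print(f"{name:<11} {rate:.2%}")
print(f"margin over {best_baseline}: {margin:.1%}")
```

Under this pooling, Ψ₀ succeeds in 57 of 80 trials and the strongest baseline in 23 of 80, a gap of 42.5 percentage points.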

Ablation Studies

The Role of Pre-Training and Post-Training

We study the effects of pre-training, post-training, and real-time chunking on a dual-arm long-horizon task that consists of three steps: right-arm pick-and-place, left-arm pick-and-place, and dual-arm lift.

EgoDex	HE	Post-Training (On HE)	Real-Time Chunking	MM-DiT Action Head	Naive DiT Action Head	Right-Arm Pick-n-Place	Left-Arm Pick-n-Place	Dual-Arm Carry	Overall Success Rate
✗	✗	✗	✗	✗	✓	1/10	1/10	1/10	0/10
✗	✗	✗	✗	✓	✗	9/10	2/10	3/10	2/10
✓	✗	✗	✗	✓	✗	8/10	6/10	6/10	6/10
✓	✓	✗	✗	✓	✗	8/10	8/10	9/10	8/10
✓	✓	✓	✗	✓	✗	9/10	9/10	10/10	9/10
✓	✓	✓	✓	✓	✗	9/10	9/10	9/10	9/10
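Reading the overall column row by row gives the change in successes as each component is added. The short sketch below does that subtraction. The row labels are paraphrases of the check marks, with HE read as Humanoid Everyday used during pre-training (an assumption), and with only 10 trials per row the differences should be read as indicative rather than precise.

```python
# (component added relative to the previous row, overall successes out of 10)
ABLATION = [
    ("naive DiT head, no pre-training", 0),
    ("MM-DiT head", 2),
    ("+ EgoDex pre-training", 6),
    ("+ HE in pre-training", 8),
    ("+ post-training on HE", 9),
    ("+ real-time chunking", 9),
]


def deltas(rows):
    """Change in overall successes contributed by each added component."""
    return [(rows[i][0], rows[i][1] - rows[i - 1][1]) for i in range(1, len(rows))]


for label, gain in deltas(ABLATION):
    print(f"{label:<24} {gain:+d}/10")
```

In this reading, EgoDex pre-training accounts for the largest single step, while real-time chunking leaves the overall count unchanged on this task.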

Pre-Training on 10% EgoDex

Pre-training on only 10% of EgoDex performs worse than the baseline Ψ₀ in both experiments (1/10 vs. 8/10 and 6/10 vs. 7/10), which indicates the benefit of pre-training on the full EgoDex dataset.

Setting	Exp. 1 Overall	Exp. 2 Overall
Baseline (Ψ₀)	8/10	7/10
Variant (10% EgoDex)	1/10	6/10

Pre-Training on HE Only

The HE-only variant achieves high performance on tasks that do not require fine-grained manipulation, but still lags behind the baseline on subtasks requiring more precise manipulation.

Setting	Exp. 1 Overall	Exp. 2 Overall
Baseline (Ψ₀)	8/10	7/10
Variant (HE)	4/10	4/10

Multi-Task Fine-Tuning

We also explore the effect of multi-task fine-tuning and observe that the performance for each individual task drops compared with single-task fine-tuning. We hypothesize that multi-task training disperses the model's learning objective and causes underfitting.

Cite

@article{wei2026psi0,
  title={{$\Psi_0$}: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation},
  author={Wei, Songlin and Jing, Hongyi and Li, Boqian and Zhao, Zhenyu and Mao, Jiageng and Ni, Zhenhao and He, Sicheng and Liu, Jie and Liu, Xiawei and Kang, Kaidi and others},
  journal={arXiv preprint arXiv:2603.12263},
  year={2026}
}

@article{wei2026simple,
  title={SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation},
  author={Wei, Songlin and Ni, Zhenhao and Liu, Jie and Zhao, Zhenyu and Ye, Junjie and Jing, Hongyi and Xia, Junkai and Liu, Xiawei and Leong, Michael and Heng, Liang and Huang, Di and Wang, Yue},
  journal={arXiv preprint arXiv:2606.08278},
  year={2026}
}

Demos

Push Cart

Pick Up Basket, Walk to Person

Push Cart, Serve Food

Pick Up Lunchbox, Put on Desk

Rotate and Pour Water

Pour Can and Push Cart

Pull Tray and Throw Away Can

Close Fridge

Take Out Coffee

Spray Water, Wipe Bowl

Wipe Desk, Place Bottle

Throw Bottle, Then Mop

Turn On Faucet

Walk, Pull Chair