Abstract
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation, but reinforcement learning (RL) fine-tuning often degrades their generalization under spatial distribution shifts. We analyze flow-matching VLA policies and identify the collapse of spatial inductive bias as a key factor limiting robust transfer. To address this, we propose SA-VLA, which explicitly grounds VLA policies in spatial structure by integrating implicit spatial representations, spatially-aware step-level dense rewards, and SCAN, a spatially-conditioned exploration strategy tailored to flow-matching policies. Aligning the policy with spatial structure in this way mitigates over-specialization and preserves zero-shot generalization to more complex tasks. Experiments on challenging multi-object and cluttered benchmarks show that SA-VLA enables stable RL fine-tuning and yields substantially more robust, transferable behaviors.
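For readers unfamiliar with flow-matching action heads, the sketch below shows the standard sampling loop such policies use at inference time: an action chunk is drawn from Gaussian noise and transported toward the action distribution by Euler-integrating a learned velocity field. This is a generic illustration of flow-matching sampling, not SA-VLA's implementation; `velocity_net`, `obs_emb`, and all shapes are assumptions.

```python
import torch

def sample_action_chunk(velocity_net, obs_emb, action_dim=7, horizon=8, steps=10):
    """Integrate a learned velocity field from noise to an action chunk.

    Assumes velocity_net(a, t, obs_emb) -> da/dt. A standard flow-matching
    sampler for illustration only; not SA-VLA's exact implementation.
    """
    a = torch.randn(horizon, action_dim)          # a_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)              # current flow time in [0, 1)
        a = a + dt * velocity_net(a, t, obs_emb)  # Euler step along the flow
    return a                                      # a_1: executable action chunk
```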
Training dynamics on the LIBERO-PLUS spatial-perturbation subset. Success rates are evaluated with SDE-based sampling using policy checkpoints saved every 10 training steps. Solid curves denote few-shot RL, and dashed curves denote zero-shot evaluation. Zero-shot evaluation uses 8 environments with a global batch size of 384, while few-shot RL uses 64 environments with a global batch size of 2048.
Training dynamics under limited spatial coverage. Dense rewards stabilize RL optimization, while combining dense rewards with SCAN further improves final success rate. Shaded regions denote one standard deviation over two seeds.
Few-shot evaluation on LIBERO-PLUS comparing SDE-based and learned exploration noise. Across both sparse and dense reward settings, learned noise consistently outperforms SDE-based noise with lower variance and more stable performance, highlighting the robustness of policy-dependent exploration.
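As a rough illustration of the two exploration schemes compared above, the sketch below contrasts an SDE-style Euler-Maruyama step, which perturbs the flow with a fixed noise scale, against a step whose noise scale is predicted by a small state-conditioned network. The module names (`velocity_net`, `noise_net`) and the constant `sigma` are hypothetical; the paper's actual parameterization may differ.

```python
import torch

def sde_step(a, t, dt, velocity_net, obs_emb, sigma=0.1):
    """SDE-based exploration: fixed Gaussian noise at each Euler step.

    A generic Euler-Maruyama-style perturbation; sigma is an illustrative
    constant, not a schedule taken from the paper.
    """
    drift = velocity_net(a, t, obs_emb)
    return a + dt * drift + sigma * (dt ** 0.5) * torch.randn_like(a)

def learned_noise_step(a, t, dt, velocity_net, noise_net, obs_emb):
    """Policy-dependent exploration: a hypothetical noise_net predicts a
    per-dimension noise scale from the current state, so exploration can
    adapt to the observation instead of following a fixed schedule."""
    drift = velocity_net(a, t, obs_emb)
    std = torch.nn.functional.softplus(noise_net(a, t, obs_emb))  # keep std > 0
    return a + dt * drift + std * (dt ** 0.5) * torch.randn_like(a)
```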
Phase-wise dense reward visualization. Shown are changes in d_ro, d_od, gripper opening angle, and the corresponding dense reward throughout task execution.
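Assuming, as the subscripts suggest, that d_ro is the gripper-to-object distance and d_od the object-to-goal distance, a phase-wise dense reward of the kind visualized above might be sketched as follows. The phase split and coefficients are illustrative assumptions, not the paper's exact reward.

```python
def dense_reward(d_ro: float, d_od: float, gripper_angle: float, grasped: bool) -> float:
    """Phase-wise shaping reward built from the quantities in the figure.

    d_ro: gripper-to-object distance; d_od: object-to-goal distance;
    gripper_angle: gripper opening angle. Phase logic and coefficients
    are illustrative, not the paper's exact reward.
    """
    if not grasped:
        # Reaching phase: move toward the object while keeping the gripper open.
        return -d_ro + 0.1 * gripper_angle
    # Transport phase: carry the grasped object toward the goal.
    return 1.0 - d_od
```

Splitting the reward by phase keeps each term well-shaped: before the grasp, d_od is uninformative, and after it, d_ro is nearly constant, so gating on the grasp event avoids conflicting gradients in the shaping signal.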
BibTeX
@misc{pan2026savlaspatiallyawareflowmatchingvisionlanguageaction,
  title={SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning},
  author={Xu Pan and Zhenglin Wan and Xingrui Yu and Xianwei Zheng and Youkai Ke and Ming Sun and Rui Wang and Ziwei Wang and Ivor Tsang},
  year={2026},
  eprint={2602.00743},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.00743},
}