SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching Vision-Language-Action Models

1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University
2 Centre for Frontier AI Research (CFAR), Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)
3 Department of Computer Science, National University of Singapore
4 Institute of Information Engineering, Chinese Academy of Sciences
5 School of Electrical and Electronic Engineering, Nanyang Technological University
Under Review

*Corresponding Author

LEFT: Illustration of spatial inductive bias collapse under naive RL fine-tuning (left panel) versus preserved spatial grounding with SA-VLA (right panel), on the same task and under identical spatial perturbations. For each method, end-effector poses from three temporal phases of a single execution trajectory are rendered as semi-transparent red, green, and blue masks and overlaid to visualize how spatial behavior evolves over time.

RIGHT: Overview of SA-VLA. Visual and spatial tokens are fused into geometry-aware embeddings, which are optimized via step-level dense rewards and spatially-conditioned exploration (SCAN) for robust RL adaptation.
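
To make the token fusion concrete, below is a minimal PyTorch sketch of fusing visual and spatial tokens into geometry-aware embeddings. The cross-attention design, the module name GeometryAwareFusion, and all shapes are illustrative assumptions; this page does not specify the actual architecture.

import torch
import torch.nn as nn

class GeometryAwareFusion(nn.Module):
    """Fuse visual tokens with spatial tokens via cross-attention.
    A hypothetical sketch, not the authors' released implementation."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, spatial_tokens):
        # visual_tokens:  (B, N_v, dim), e.g. image patch embeddings
        # spatial_tokens: (B, N_s, dim), e.g. depth / pose / 3D-position embeddings
        fused, _ = self.attn(query=visual_tokens,
                             key=spatial_tokens,
                             value=spatial_tokens)
        # Residual connection keeps the original visual content intact
        # while letting spatial tokens reshape it.
        return self.norm(visual_tokens + fused)

The residual path preserves the visual stream while spatial tokens inject geometric structure, which matches the intent of the geometry-aware embeddings described above.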

Abstract

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation, but reinforcement learning (RL) fine-tuning often degrades their generalization under spatial distribution shifts. We analyze flow-matching VLA policies and identify the collapse of spatial inductive bias as a key factor limiting robust transfer. To address this, we propose SA-VLA, which explicitly grounds VLA policies in spatial structure by integrating implicit spatial representations, spatially-aware step-level dense rewards, and SCAN, a spatially-conditioned exploration strategy tailored for flow-matching policies. This alignment mitigates policy over-specialization and preserves zero-shot generalization to more complex tasks. Experiments on challenging multi-object and cluttered benchmarks demonstrate that SA-VLA enables stable RL fine-tuning and yields substantially more robust, transferable behaviors.
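
As a rough illustration of how spatially-conditioned exploration could interact with a flow-matching action head, the sketch below scales the exploration noise injected into a Euler sampler by a state-dependent factor predicted from spatial features. The noise-scale head, the sampler, and every name here (SpatiallyConditionedNoise, sample_actions, velocity_fn) are hypothetical stand-ins; SCAN's actual formulation is given in the paper, not on this page.

import torch
import torch.nn as nn

class SpatiallyConditionedNoise(nn.Module):
    """Predict a per-dimension exploration scale from spatial features."""

    def __init__(self, spatial_dim: int, action_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(spatial_dim, action_dim),
            nn.Softplus(),  # keep the scale strictly positive
        )

    def forward(self, spatial_feat):
        return self.head(spatial_feat)  # (B, action_dim)

def sample_actions(velocity_fn, noise_scale, spatial_feat, action_dim,
                   steps: int = 10):
    """Euler integration of the flow ODE from noise (t=0) toward actions
    (t=1), with spatially-scaled exploration noise added at each step."""
    B = spatial_feat.shape[0]
    a = torch.randn(B, action_dim)       # start from the Gaussian prior
    sigma = noise_scale(spatial_feat)    # state-dependent exploration scale
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B, 1), i * dt)
        a = a + velocity_fn(a, t, spatial_feat) * dt      # learned flow step
        a = a + sigma * torch.randn_like(a) * dt ** 0.5   # scaled exploration
    return a

Under this reading, the policy explores more in states whose spatial features warrant it and stays close to the learned flow elsewhere, which is one plausible interpretation of "spatially-conditioned exploration."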

Representative task executions under spatial perturbations. Despite variations in observation geometry, the policy consistently completes the task.

BibTeX

@misc{pan2026savlaspatiallyawareflowmatchingvisionlanguageaction,
      title={SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning}, 
      author={Xu Pan and Zhenglin Wan and Xingrui Yu and Xianwei Zheng and Youkai Ke and Ming Sun and Rui Wang and Ziwei Wang and Ivor Tsang},
      year={2026},
      eprint={2602.00743},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.00743}, 
}