Reinforcement learning (RL) fine-tuning of flow-matching Vision-Language-Action (VLA) policies often improves in-distribution performance but degrades spatial generalization under distribution shifts, leading to inconsistent geometric behavior across viewpoints and scene layouts commonly encountered in robotic deployment. We identify this failure as a coupled breakdown of spatial inductive bias during online optimization, caused by representation drift, sparse step-level supervision, and spatially unstructured exploration. These factors jointly bias policies toward short-horizon visual correlations and undermine geometric consistency. We propose SFM, a unified framework that restores spatial inductive bias by aligning representation, reward, and exploration in a shared geometric latent space. It integrates implicit spatial representations for stable grounding, a step-level spatially grounded reward (SGR) for geometric credit assignment, and SCAE, a spatially-conditioned annealed exploration strategy. Across cluttered manipulation benchmarks under spatial distribution shifts, SFM improves spatial robustness, achieving 5.05% gains under viewpoint perturbations while maintaining in-distribution performance.
Publications
SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching Vision-Language-Action Models
Xu Pan, Zhenglin Wan, Xingrui Yu*, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) CCF A · h5i 450
2026
Reinforcement learning (RL) fine-tuning of flow-matching Vision-Language-Action (VLA) policies often improves in-distribution performance but degrades spatial generalization under distribution shifts, leading to inconsistent geometric behavior across viewpoints and scene layouts. We identify this failure as a coupled breakdown of spatial inductive bias during online optimization, caused by representation drift, sparse step-level supervision, and spatially unstructured exploration. These factors jointly bias policies toward short-horizon visual correlations and undermine geometric consistency. We propose SA-VLA, a unified framework that restores spatial inductive bias by jointly aligning representation, reward, and exploration in a shared geometric latent space. It integrates implicit spatial representations for stable grounding, step-level dense rewards for geometric credit assignment, and SCAN, a spatially-conditioned annealed exploration strategy. Across cluttered manipulation benchmarks under spatial distribution shifts, SA-VLA improves robustness while maintaining in-distribution performance.
Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present GuardVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which the trigger projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that GuardVLA enables reliable ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.
Matching images with significant scale differences remains a persistent challenge in photogrammetry and remote sensing. The scale discrepancy often degrades appearance consistency and introduces uncertainty in keypoint localization. Existing methods address scale variation through scale pyramids or scale-aware training, yet the problem remains open. We address it by detecting co-visible regions between image pairs and propose SCoDe (Scale-aware Co-visible region Detector), which both identifies co-visible regions and aligns their scales for robust, hierarchical point correspondence matching. Specifically, SCoDe employs a novel Scale Head Attention mechanism to map and correlate features across multiple scale subspaces, and uses a learnable query to aggregate scale-aware information from both images for co-visible region detection. Correspondences can then be established in a coarse-to-fine hierarchy, mitigating semantic and localization uncertainties. Extensive experiments on three challenging datasets demonstrate that SCoDe outperforms state-of-the-art methods, improving the precision of a modern local feature matcher by 8.41%. Notably, SCoDe shows a clear advantage when handling images with drastic scale variations.
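As a rough illustration of the scale-alignment idea only, the sketch below assumes the co-visible regions are already available as axis-aligned boxes and estimates the relative scale from their areas. It does not reproduce SCoDe's detector or its Scale Head Attention; the function names, the `(x0, y0, x1, y1)` box format, and the `target_long_side` parameter are all hypothetical.

```python
import math


def box_area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def relative_scale(box_a, box_b):
    """Linear scale of region B relative to region A (square root of the area ratio)."""
    area_a, area_b = box_area(box_a), box_area(box_b)
    if area_a <= 0 or area_b <= 0:
        raise ValueError("co-visible boxes must have positive area")
    return math.sqrt(area_b / area_a)


def aligned_crop_sizes(box_a, box_b, target_long_side=512):
    """Output (width, height) for each crop so both share the same long side.

    Assumption: resizing both co-visible crops to a common long side is one
    simple way to bring them to a comparable scale before point matching.
    """
    sizes = []
    for x0, y0, x1, y1 in (box_a, box_b):
        width, height = x1 - x0, y1 - y0
        factor = target_long_side / max(width, height)
        sizes.append((round(width * factor), round(height * factor)))
    return sizes


if __name__ == "__main__":
    region_a = (0, 0, 100, 50)       # small co-visible region in image A
    region_b = (10, 10, 410, 210)    # the same region seen 4x larger in image B
    print(relative_scale(region_a, region_b))
    print(aligned_crop_sizes(region_a, region_b))
```

After such a resize, a point matcher would operate on two crops of similar apparent scale, which is the intuition behind aligning scales before fine matching.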
Reliable correspondence estimation is a fundamental problem in image processing, underpinning a wide range of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved the representation capability of local features, most methods still operate primarily at the pixel or patch level. As a result, they lack explicit mechanisms to model regions that are jointly visible across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns vary between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than directly establishing point-wise correspondences from local appearance, SAMatcher first predicts consistent co-visible region masks and bounding boxes within a shared cross-view representation space, serving as structured priors to guide and regularize matching. The framework builds upon the Segment Anything Model (SAM) and introduces a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and the discovery of jointly supported regions. To jointly optimize region segmentation and geometric localization, we introduce a unified supervision scheme that combines point-sampled mask learning with box regression and mask-box consistency constraints, enforcing cross-view coherence during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves robustness under large-scale geometric and viewpoint variations. These results suggest that monocular visual foundation models can be systematically extended to multi-view correspondence estimation through explicit co-visibility modeling, providing a new perspective on structured representation learning for image matching.
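One plausible reading of a mask-box consistency constraint is that the tight box around a predicted mask should agree with the separately predicted box. The sketch below measures that agreement as `1 - IoU`. This is an assumption for illustration, not SAMatcher's actual loss: the function names are hypothetical, masks are plain nested lists, and the measure is not differentiable as written.

```python
def mask_to_box(mask):
    """Tight (x0, y0, x1, y1) box around nonzero cells of a 2D mask.

    The max coordinates are exclusive. Returns None for an empty mask.
    """
    if not mask or not mask[0]:
        return None
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_box_consistency(mask, pred_box):
    """Penalty in [0, 1]: 0 when the mask-derived box equals the predicted box."""
    derived = mask_to_box(mask)
    if derived is None:
        return 1.0  # assumption: an empty mask is treated as fully inconsistent
    return 1.0 - box_iou(derived, pred_box)


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_box_consistency(mask, (1, 1, 3, 3)))  # boxes coincide
    print(mask_box_consistency(mask, (1, 1, 3, 5)))  # predicted box too tall
```

A training setup would need a differentiable surrogate for this quantity; the version here only shows what the constraint compares.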
Public trust in generative artificial intelligence exhibits increasingly divergent patterns across national contexts, yet prevailing research largely overlooks the macro-structural forces underlying this divergence. This study argues that trust in AI is not merely a technical response to performance but a product of institutional refraction. We propose an ‘Institutional Prism’ framework to demonstrate how institutional trust shapes user trust in domestic (DeepSeek) and global (ChatGPT) large language models. Drawing on Cognitive-Affective Trust Theory, we distinguish between cognitive and affective dimensions of trust and analyze survey data from 405 Chinese users. The findings show that higher institutional trust is positively associated with stronger affective trust in domestic AI models and shifts cognitive evaluations in a more favorable direction. Under lower institutional trust, however, this domestic advantage weakens. These findings reveal that institutional trust has emerged as a core dimension of AI trust formation. By linking micro-level psychological judgments with macro-level governance, this research contributes a new perspective to human-machine communication.
Research on Co-Visibility Prior-Guided Image Matching under Large-Scale Disparity
Xu Pan
Master's Thesis 2026