Publications

Lead Publications (5)

SFM: Spatially-Aware Flow Matching for Embodied Reinforcement Learning

Xu Pan, Zhenglin Wan, Xingrui Yu*, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang

Under Review

Reinforcement learning fine-tuning of flow-matching Vision-Language-Action policies can improve in-distribution performance while degrading spatial generalization. SFM addresses this coupled failure by aligning representation, reward, and exploration in a shared geometric latent space. It combines implicit spatial token fusion, Spatially Grounded Reward for phase-aware geometric credit assignment, and Spatially Conditioned Annealed Exploration for geometry-aware stochastic exploration. Across LIBERO and LIBERO-Plus, SFM improves robustness under spatial distribution shifts while retaining in-distribution task performance.

SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching Vision-Language-Action Models

Xu Pan, Zhenglin Wan, Xingrui Yu*, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang

IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) · CCF A · h5i 450

CODE HF SITE DOI ARXIV

Reinforcement learning (RL) fine-tuning of flow-matching Vision-Language-Action (VLA) policies often improves in-distribution performance but degrades spatial generalization under distribution shifts, leading to inconsistent geometric behavior across viewpoints and scene layouts. We identify this failure as a coupled breakdown of spatial inductive bias during online optimization, caused by representation drift, sparse step-level supervision, and spatially unstructured exploration. These factors jointly bias policies toward short-horizon visual correlations and undermine geometric consistency. We propose SA-VLA, a unified framework that restores spatial inductive bias by jointly aligning representation, reward, and exploration in a shared geometric latent space. It integrates implicit spatial representations for stable grounding, step-level dense rewards for geometric credit assignment, and SCAN, a spatially-conditioned annealed exploration strategy. Across cluttered manipulation benchmarks under spatial distribution shifts, SA-VLA improves robustness while maintaining in-distribution performance.

Scale-aware Co-visible Region Detection for Image Matching

Xu Pan, Zimin Xia, Xianwei Zheng*

ISPRS Journal of Photogrammetry and Remote Sensing · JCR Q1 · IF 12.2

CODE HF SITE DOI PAPER

Matching images with significant scale differences remains a persistent challenge in photogrammetry and remote sensing. The scale discrepancy often degrades appearance consistency and introduces uncertainty in keypoint localization. While existing methods address scale variation through scale pyramids or scale-aware training, matching under significant scale differences remains an open challenge. To overcome this, we address the scale difference issue by detecting co-visible regions between image pairs and propose SCoDe (Scale-aware Co-visible region Detector), which both identifies co-visible regions and aligns their scales for highly robust, hierarchical point correspondence matching. Specifically, SCoDe employs a novel Scale Head Attention mechanism to map and correlate features across multiple scale subspaces, and uses a learnable query to aggregate scale-aware information of both images for co-visible region detection. In this way, correspondences can be established in a coarse-to-fine hierarchy, thereby mitigating semantic and localization uncertainties. Extensive experiments on three challenging datasets demonstrate that SCoDe outperforms state-of-the-art methods, improving the precision of a modern local feature matcher by 8.41%. Notably, SCoDe shows a clear advantage when handling images with drastic scale variations.
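The scale-alignment idea in the abstract above can be illustrated with a minimal sketch. It assumes the co-visible regions are already available as axis-aligned boxes `(x0, y0, x1, y1)` and estimates the relative scale from their areas. The function names (`relative_scale`, `aligned_crop_sizes`) and the area-based estimate are hypothetical choices made for illustration, not SCoDe's actual procedure.

```python
import math


def box_area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def relative_scale(box_a, box_b):
    """Estimate how much larger the co-visible content appears in image A than in image B.

    Assumption: both boxes cover the same scene content, so the square root
    of the area ratio approximates the linear scale factor.
    """
    area_a, area_b = box_area(box_a), box_area(box_b)
    if area_a == 0 or area_b == 0:
        raise ValueError("co-visible boxes must have positive area")
    return math.sqrt(area_a / area_b)


def aligned_crop_sizes(box_a, box_b):
    """Return target (width, height) for each crop so both show content at a similar scale.

    The crop in which the content appears smaller is upsampled; the other is left unchanged.
    """
    s = relative_scale(box_a, box_b)
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    if s >= 1.0:
        return (round(wa), round(ha)), (round(wb * s), round(hb * s))
    return (round(wa / s), round(ha / s)), (round(wb), round(hb))


if __name__ == "__main__":
    box_a = (0, 0, 400, 200)    # co-visible region in the close-up image
    box_b = (10, 10, 110, 60)   # same region in the wide-angle image
    print(relative_scale(box_a, box_b))
    print(aligned_crop_sizes(box_a, box_b))
```

After the crops are resampled to these sizes, any local feature matcher can be run on the pair, and the resulting keypoints mapped back to the original image coordinates through the known crop offsets and resize factors.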

SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

Xu Pan, Zhen Pang, Qiyuan Ma, He Chen, Wei Ji, Shuhan Shen, Xianwei Zheng*

CoRR abs/2606.03406

CODE HF SITE DOI ARXIV

Reliable correspondence estimation is a fundamental problem in image processing, underpinning a wide range of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved the representation capability of local features, most methods still operate primarily at the pixel or patch level. As a result, they lack explicit mechanisms to model regions that are jointly visible across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns vary between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than directly establishing point-wise correspondences from local appearance, SAMatcher first predicts consistent co-visible region masks and bounding boxes within a shared cross-view representation space, serving as structured priors to guide and regularize matching. The framework builds upon the Segment Anything Model (SAM) and introduces a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and the discovery of jointly supported regions. To jointly optimize region segmentation and geometric localization, we introduce a unified supervision scheme that combines point-sampled mask learning with box regression and mask–box consistency constraints, enforcing cross-view coherence during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves robustness under large-scale geometric and viewpoint variations. These results suggest that monocular visual foundation models can be systematically extended to multi-view correspondence estimation through explicit co-visibility modeling, providing a new perspective on structured representation learning for image matching.
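As a rough illustration of what a mask–box consistency constraint measures, the sketch below derives the tight bounding box of a binary mask and penalizes its disagreement with a predicted box using `1 - IoU`. This is an assumption made for illustration: the abstract does not specify the form of the constraint, a training loss would need to be differentiable, and the names `mask_to_box`, `box_iou`, and `mask_box_consistency` are hypothetical.

```python
def mask_to_box(mask):
    """Tight box (x0, y0, x1, y1) around the nonzero cells of a 2D list mask, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    # The max edges are exclusive, so a single pixel yields a 1x1 box.
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def mask_box_consistency(mask, pred_box):
    """Penalty in [0, 1]: 0 when the predicted box equals the mask's tight box."""
    box = mask_to_box(mask)
    if box is None:
        return 1.0  # illustrative choice: an empty mask is treated as fully inconsistent
    return 1.0 - box_iou(box, pred_box)


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_to_box(mask))
    print(mask_box_consistency(mask, (1, 1, 3, 3)))
    print(mask_box_consistency(mask, (0, 0, 4, 4)))
```

The penalty grows as the predicted box drifts away from the region the mask actually covers, which is the kind of agreement between the two outputs that a consistency term encourages.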

Research on Co-Visibility Prior-Guided Image Matching under Large-Scale Disparity

Xu Pan

Master's Thesis

Establishing precise and stable large-scale image matching relationships is a necessary prerequisite for breaking the limitations of "narrow baseline" acquisition and for constructing large-scale, high-fidelity 3D city models from multi-source, multi-view imagery acquired from spaceborne, aerial, and ground platforms. With the widespread application of satellite constellations, unmanned aerial vehicle (UAV) remote sensing, multi-platform oblique photography, and mobile mapping systems, spatial data acquisition is evolving from single-platform, single-scale approaches toward multi-source, heterogeneous, multi-scale collaboration with highly dynamic updates. In this process, large-scale image data exhibits significant scale differences. Specifically, the projected scale of the same ground structure may differ by several times, or even an order of magnitude, across sensors, flight altitudes, and resolutions. At the same time, perspective changes, occlusion, and variations in imaging conditions pose significant challenges for image matching and severely restrict subsequent 3D reconstruction, making in-depth research urgent. Although conventional image matching has been extensively studied, existing methods still face difficulties in large-scale image matching. On the one hand, scale variations cause local features to shift significantly between views, making it difficult for descriptors based on local similarity to remain consistent. On the other hand, in real-world scenes the co-visible regions often account for only a small part of each image, while existing methods typically perform unconstrained matching over the entire image, introducing substantial interference from regions that are not co-visible and generating visually similar but geometrically inconsistent false correspondences.
Such mismatches are particularly severe in complex urban and natural scenes, and their errors accumulate during subsequent geometric optimization and 3D reconstruction, significantly reducing the overall stability and accuracy of the system. Improving feature expressiveness alone is therefore no longer sufficient to meet complex application requirements, and explicit prior constraints must be introduced at the level of the spatial structure of matching. To address these issues, this thesis studies large-scale image matching methods guided by co-visibility priors. Its main contributions are: (1) a scale-aware co-visible region detection method, SCoDe, which introduces scale-head attention within a Transformer architecture to explicitly model cross-scale feature relationships in multi-scale subspaces, enabling robust and consistent co-visible region estimation while effectively constraining the matching search space; and (2) a semantic-consistency-driven cross-view matching refinement framework, SAMatcher, which explicitly models cross-view dependencies via symmetric feature interaction, introduces a prompt-based mask decoding mechanism with joint bounding-box prediction, and further incorporates mask–box consistency constraints to jointly optimize semantic and geometric representations, thereby improving matching accuracy and robustness under large viewpoint and scale variations. Extensive experiments demonstrate that the proposed methods consistently outperform existing approaches across multiple matching paradigms, particularly under large-scale variations and partial overlap scenarios. Cross-dataset evaluations further validate their strong generalization ability.
Overall, by introducing explicit co-visibility priors, this work reformulates conventional point-wise matching into a unified region-constrained and semantics-consistent framework, providing a systematic solution for robust cross-view correspondence estimation in challenging large-scale disparity settings, with significant implications for high-precision 3D reconstruction and spatial intelligence applications.

Collaborative Publications (2)

Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

Ming Sun, Rui Wang, Xingrui Yu*, Lihua Jing, Hangyu Du, Zhenglin Wan, Xu Pan, Ivor Tsang

Under Review

DOI ARXIV

Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present GuardVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which the trigger projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that GuardVLA enables reliable ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.

Institutional Trust and the Domestic AI Advantage: Evidence from DeepSeek and ChatGPT Users in China

Jiashen Huang*, Yu Jia, Xu Pan

Chinese Journal of Communication (Under Review) · CiteScore Q1 · IF 1.9

DOI ARXIV

Public trust in generative artificial intelligence exhibits increasingly divergent patterns across national contexts, yet prevailing research largely overlooks the macro-structural forces underlying this divergence. This study argues that trust in AI is not merely a technical response to performance but a product of institutional refraction. We propose an ‘Institutional Prism’ framework to demonstrate how institutional trust shapes user trust in domestic (DeepSeek) and global (ChatGPT) large language models. Drawing on Cognitive-Affective Trust Theory, we distinguish between cognitive and affective dimensions of trust and analyze survey data from 405 Chinese users. The findings show that higher institutional trust is positively associated with stronger affective trust in domestic AI models and shifts cognitive evaluations in a more favorable direction. Under lower institutional trust, however, this domestic advantage weakens. These findings reveal that institutional trust has emerged as a core dimension of AI trust formation. By linking micro-level psychological judgments with macro-level governance, this research contributes a new perspective to human-machine communication.