Hello World!

Hi, I am a M.Sc. student at Wuhan University, working at the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), under the guidance of Prof. Xianwei Zheng. My research focuses on embodied intelligence and 3D visual perception, with an emphasis on how spatial representations support generalizable decision-making and agent-centric policy learning.

My current work lies at the intersection of computer vision, reinforcement learning, and generative modeling, where I study how 2D and 3D representations can be unified to enable robust perception-action coupling. I am particularly interested in structure-aware visual representations that support cross-view understanding, generalization across environments, and interaction-driven learning in embodied settings.

Previously, I explored generative AI for image and video synthesis during my internship at Baidu. I am currently a remote research assistant at the Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), supervised by Dr. Xingrui Yu, where I work on generalizable reinforcement learning for embodied agents, with a focus on agent-centric formulations and transferable policies grounded in implicit spatial representations that support generalization across tasks and scenes.

More broadly, my goal is to develop spatially grounded learning frameworks that bridge perception, geometry, and control, advancing the next generation of embodied systems that can reason about and act within complex real-world environments.

Find me on

News

Mar 2026
Our work SAMatcher is now available! [Project] [GitHub]
Jan 2026
Our work SA-VLA is now available! [Project] [GitHub] [arXiv]
Nov 2025
Successfully defended my Master's thesis proposal.
Aug 2025
The first paper accepted by the ISPRS Journal of Photogrammetry and Remote Sensing. [Paper]
Aug 2025
Began a remote research internship at CFAR, A*STAR, supervised by Dr. Xingrui Yu and in collaboration with Zhenglin Wan.
Jul 2025
Attended the 2025 Annual Academic Conference on Photogrammetry and Remote Sensing, CSGPC in Kunming, China.
Dec 2024
Began a research internship at Baidu in Shenzhen, supervised by Dr. Yan Zhang, exploring frontier text-to-image and text-to-video generation.
Jul 2024
Began collaboration on the SCoDe project under the guidance of Dr. Zimin Xia.
Sep 2023
Enrolled in the Master's program at the State Key Lab. LIESMARS, Wuhan University, as a recommended exemption student, under the supervision of Prof. Xianwei Zheng.
...

Experiences

Wuhan University

School of Remote Sensing and Information Engineering

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS)

International Technology R&D Department, Baidu, Inc.

Centre for Frontier AI Research (CFAR),
Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)

Acknowledgements:
I’m grateful to my collaborators and mentors for their guidance and support, especially
Prof. Xianwei Zheng (WHU), Prof. Hanjiang Xiong (WHU), Dr. Xingrui Yu (A*STAR), Dr. Zimin Xia (EPFL), Dr. Yan Zhang (Baidu),
and my colleagues/peers including
Zhenglin Wan (NUS), Jiashen Huang (NTU), Ziqong Lu (HKU), Qiyuan Ma (WHU), Jintao Zhang (WHU), Chenyu Zhao (WHU), He Chen (WHU),
and others I’ve had the pleasure to work with.

Selected Publications

VIEW ALL

SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching Vision-Language-Action Models

Xu Pan, Zhenglin Wan, Xingrui Yu*, Xianwei Zheng, Youkai Ke, Ming Sun, Rui Wang, Ziwei Wang, Ivor Tsang

(Under Review) Citations 1 2026

CODE HF SITE DOI ARXIV

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation, but reinforcement learning (RL) fine-tuning often degrades generalization under spatial distribution shifts. We analyze flow-matching VLA policies and identify the collapse of spatial inductive bias as a key factor limiting robust transfer. To address this, we propose SA-VLA, which explicitly grounds VLA policies in spatial structure by integrating implicit spatial representations, spatially-aware step-level dense rewards, and SCAN, a spatially-conditioned exploration strategy tailored for flow-matching policies. This principled alignment mitigates policy over-specialization and preserves zero-shot generalization to more complex tasks. Experiments on challenging multi-object and cluttered benchmarks demonstrate that SA-VLA enables stable RL fine-tuning and substantially more robust, transferable behaviors.

Scale-aware Co-visible Region Detection for Image Matching

Xu Pan, Zimin Xia, Xianwei Zheng*

ISPRS Journal of Photogrammetry and Remote Sensing JCR Q1 · IF 12.2 Citations 0 2025

CODE HF SITE DOI PAPER

Matching images with significant scale differences remains a persistent challenge in photogrammetry and remote sensing. The scale discrepancy often degrades appearance consistency and introduces uncertainty in keypoint localization. While existing methods address scale variation through scale pyramids or scale-aware training, matching under significant scale differences remains an open challenge. To overcome this, we address the scale difference issue by detecting co-visible regions between image pairs and propose SCoDe (Scale-aware Co-visible region Detector), which both identifies co-visible regions and aligns their scales for highly robust, hierarchical point correspondence matching. Specifically, SCoDe employs a novel Scale Head Attention mechanism to map and correlate features across multiple scale subspaces, and uses a learnable query to aggregate scale-aware information of both images for co-visible region detection. In this way, correspondences can be established in a coarse-to-fine hierarchy, thereby mitigating semantic and localization uncertainties. Extensive experiments on three challenging datasets demonstrate that SCoDe outperforms state-of-the-art methods, improving the precision of a modern local feature matcher by 8.41%. Notably, SCoDe shows a clear advantage when handling images with drastic scale variations.

SleeperVLA: Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

Ming Sun, Rui Wang, Xingrui Yu*, Lihua Jing, Hangyu Du, Zhenglin Wan, Xu Pan, Ivor Tsang

(Under Review) 2026

Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present SleeperVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. SleeperVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which a trigger-aware projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that SleeperVLA enables reliable and unique ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.

SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

Xu Pan, Qiyuan Ma, Mingyue Dong, He Chen, Wei Ji, Xianwei Zheng*

IEEE Transactions on Geoscience and Remote Sensing (Under Review) JCR Q1 · IF 8.6 2026

CODE HF SITE

Reliable correspondence estimation is a long-standing problem in computer vision and a critical component of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved local feature descriptiveness, most methods still rely on implicit assumptions about shared visual content across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns diverge between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than directly establishing point-wise matching from local appearance, SAMatcher first predicts consistent region masks and bounding boxes within a shared cross-view semantic space, serve as structured priors to guide and regularize correspondence estimation. SAMatcher employs a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and selective reinforcement of jointly supported regions. Based on this formulation, a reliability-aware supervision strategy jointly constrains region segmentation and geometric localization, enforcing cross-view consistency during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves correspondence robustness under large-scale and viewpoint variations. Beyond quantitative gains, our results indicate that monocular visual foundation models can be systematically extended to multi-view correspondence estimation when co-visibility is explicitly modeled, offering new insights for fusion-based visual understanding.

Projects

SA-VLA

2026

A research project on robust RL adaptation of flow-matching–based VLA models for robotic manipulation, focusing on generalization under distribution shifts in challenging benchmarks.

Vision-Language-Action Model Robotic Manipulation Flow-Matching Reinforcement Learning

TwSphinx54/SA-VLA SITE PAPER

Co-visibility Guided Image Matching

2025

A research project on robust image matching in robot vision, photogrammetry and remote sensing, using explicit co-visibility modeling to handle extreme scale and viewpoint variations.

Co-visibility Image Matching 3D Vision Segmentation Photogrammetry SCoDe SAMatcher

Geo-Tell/SCoDe SITE PAPER

GNDAS

2022

The GNDASystem (Global Natural Disaster Assessment System) is a web-based geographic information system application designed for the analysis and assessment of natural disasters.

Natural Disasters Geographic Information System (GIS)

TwSphinx54/GNDAS

I2RSI

2022

The I2RSI System (Intelligent Interpretation of Remote Sensing Images) is a web-based application for remote sensing image interpretation, powered by the Baidu PaddlePaddle deep learning framework.

Remote Sensing Interpretation Deep Learning

TwSphinx54/I2RSI

Academic Service

Member	ISPRS Student Consortium (ISPRS SC)	2024
Volunteer	2023 International Graduate Workshop on Geo-Informatics (IGWG'23)	2023
Volunteer	2022 International Graduate Workshop on Geo-Informatics (IGWG'22)	2022

Personal Philosophy

I follow Stoic philosophy. Life is a joyful ascent: a true mountaineer delights in the climb itself, not just the summit.

Thou sufferest this justly: for thou choosest rather to become good to-morrow than to be good to-day.

— Marcus Aurelius, Meditations 8.22

I also resonate with the spirit of Slow Science.

We live in an age tyrannized by efficiency, outcomes, and speed, to the point that nothing lasts and nothing leaves a deep impression. In the midst of noisy bubbles and short-lived hype, I hope to take time to think carefully, to doubt, to refine, and to do research that is genuinely meaningful and worth remembering.