SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University
IEEE Transactions on Geoscience and Remote Sensing (Under Review)
Introduction

Overview of the motivation behind SAMatcher. (a) An image pair exhibiting large scale variation in co-visible regions, where local appearance-based matching becomes unreliable. (b) Co-visible region segmentation in SAMatcher, which identifies jointly visible areas across views while suppressing non-overlapping content. (c) Illustration of pixel confusion caused by scale inconsistency and its mitigation by constraining matching to co-visible regions, resulting in more geometrically consistent correspondences.
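The mitigation illustrated in (c) can be sketched in a few lines: given candidate point correspondences and a predicted co-visible mask for each image, any match whose endpoint falls outside either mask is discarded before geometric verification. The helper name and mask layout below are illustrative assumptions, not SAMatcher's actual interface.

```python
import numpy as np

def filter_by_covisibility(pts_a, pts_b, mask_a, mask_b):
    """Keep only correspondences whose endpoints lie inside the
    predicted co-visible masks of BOTH images.

    pts_a, pts_b : (N, 2) integer pixel coordinates as (row, col).
    mask_a, mask_b : boolean (H, W) co-visible region masks.
    """
    keep_a = mask_a[pts_a[:, 0], pts_a[:, 1]]  # endpoint visible in view A
    keep_b = mask_b[pts_b[:, 0], pts_b[:, 1]]  # endpoint visible in view B
    keep = keep_a & keep_b
    return pts_a[keep], pts_b[keep]

# Toy example: a 4x4 image pair where only the top-left 2x2 block
# is co-visible in both views.
mask_a = np.zeros((4, 4), dtype=bool); mask_a[:2, :2] = True
mask_b = np.zeros((4, 4), dtype=bool); mask_b[:2, :2] = True
pts_a = np.array([[0, 0], [1, 1], [3, 3]])
pts_b = np.array([[0, 1], [1, 0], [3, 2]])
kept_a, kept_b = filter_by_covisibility(pts_a, pts_b, mask_a, mask_b)
# The (3,3)-(3,2) candidate lies outside both co-visible regions
# and is removed, leaving the two geometrically plausible matches.
```

In practice such filtering runs on soft mask probabilities rather than hard booleans, but the effect is the same: pixels with no cross-view support never enter the matching stage.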

Overview

Overview of the proposed SAMatcher framework. Given an image pair, SAMatcher extracts high-level visual representations using a shared encoder. A cross-view symmetric fusion module aligns semantic information across views and highlights potentially co-visible content. Based on the fused features, a prompt-driven mask decoder predicts co-visible region masks, while a dedicated box decoder estimates corresponding bounding boxes. These region-level predictions provide semantic and geometric priors that guide correspondence estimation, improving robustness under occlusion, background clutter, and large viewpoint or scale variations.
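The cross-view symmetric fusion step can be illustrated as bidirectional cross-attention between the two images' token sequences: each view's tokens query the other view, and both updates are applied symmetrically. This is a minimal single-head numpy sketch under assumed token shapes, not the actual module.

```python
import numpy as np

def cross_attention(q_tokens, kv_tokens):
    """Single-head scaled dot-product cross-attention: every query
    token aggregates features from all tokens of the other view."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over kv tokens
    return weights @ kv_tokens

def symmetric_fusion(tokens_a, tokens_b):
    """Bidirectional fusion: each view is updated (residually) with
    information from the other view, in a symmetric fashion."""
    fused_a = tokens_a + cross_attention(tokens_a, tokens_b)
    fused_b = tokens_b + cross_attention(tokens_b, tokens_a)
    return fused_a, fused_b

rng = np.random.default_rng(0)
tokens_a = rng.standard_normal((16, 32))  # 16 tokens, 32-dim features
tokens_b = rng.standard_normal((20, 32))  # views may differ in token count
fused_a, fused_b = symmetric_fusion(tokens_a, tokens_b)
```

The symmetry matters: because both views are fused with the same mechanism, regions supported by both images are reinforced in both token sequences, which is what allows a single decoder to predict consistent co-visible masks for either view.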

Abstract

Reliable correspondence estimation is a long-standing problem in computer vision and a critical component of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved local feature descriptiveness, most methods still rely on implicit assumptions about shared visual content across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns diverge between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than establishing point-wise matches directly from local appearance, SAMatcher first predicts consistent region masks and bounding boxes within a shared cross-view semantic space, which serve as structured priors to guide and regularize correspondence estimation. SAMatcher employs a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and selective reinforcement of jointly supported regions. Building on this formulation, a reliability-aware supervision strategy jointly constrains region segmentation and geometric localization, enforcing cross-view consistency during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves correspondence robustness under large scale and viewpoint variations. Beyond quantitative gains, our results indicate that monocular visual foundation models can be systematically extended to multi-view correspondence estimation when co-visibility is explicitly modeled, offering new insights for fusion-based visual understanding.