SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University
2 National Key Laboratory of Space Target Awareness, Space Engineering University
IEEE Transactions on Image Processing (Under Review)
*Corresponding Author
[Figure: introduction]

Overview of the motivation behind SAMatcher. (a) An image pair exhibiting large scale variation in co-visible regions, where local appearance-based matching becomes unreliable. (b) Co-visible region segmentation in SAMatcher, which identifies jointly visible areas across views while suppressing non-overlapping content. (c) Illustration of pixel confusion caused by scale inconsistency and its mitigation by constraining matching to co-visible regions, resulting in more geometrically consistent correspondences.
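The idea in panel (c), restricting matches to co-visible regions, can be illustrated with a minimal sketch. It assumes binary co-visibility masks are already available for both views and that candidate matches come from some upstream matcher. The names `filter_covisible`, `in_mask`, `Match`, and `Mask` are hypothetical and used only for illustration. This is a simple post-hoc filter, not necessarily how SAMatcher integrates its region priors.

```python
from typing import List, Sequence, Tuple

# Hypothetical types for illustration only.
Match = Tuple[Tuple[int, int], Tuple[int, int]]  # ((x_a, y_a), (x_b, y_b))
Mask = Sequence[Sequence[bool]]                  # mask[y][x], True = co-visible


def in_mask(mask: Mask, x: int, y: int) -> bool:
    """Return True if pixel (x, y) is inside the image and marked co-visible."""
    return 0 <= y < len(mask) and 0 <= x < len(mask[0]) and bool(mask[y][x])


def filter_covisible(matches: List[Match], mask_a: Mask, mask_b: Mask) -> List[Match]:
    """Keep only matches whose endpoints lie in the co-visible region of both views."""
    return [
        (pa, pb)
        for pa, pb in matches
        if in_mask(mask_a, *pa) and in_mask(mask_b, *pb)
    ]


# Toy 4x4 masks: the right half of view A and the top half of view B are co-visible.
mask_a = [[x >= 2 for x in range(4)] for _ in range(4)]
mask_b = [[y < 2 for _ in range(4)] for y in range(4)]

# Three candidate matches; only the first has both endpoints in co-visible areas.
matches = [((3, 1), (0, 0)), ((0, 0), (1, 1)), ((2, 3), (3, 3))]
kept = filter_covisible(matches, mask_a, mask_b)
```

In this toy case the second match is rejected because its endpoint in view A lies outside the mask, and the third because its endpoint in view B does.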

[Figure: overview]

Overview of the proposed SAMatcher framework. Given an image pair, SAMatcher extracts high-level visual representations using a shared encoder. A cross-view symmetric fusion module aligns semantic information across views and highlights potentially co-visible content. Based on the fused features, a prompt-driven mask decoder predicts co-visible region masks, while a dedicated box decoder estimates corresponding bounding boxes. These region-level predictions provide semantic and geometric priors that guide correspondence estimation, improving robustness under occlusion, background clutter, and large viewpoint or scale variations.
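As a rough illustration of what a symmetric cross-view fusion step could look like, the sketch below applies one round of cross-attention in both directions. It makes strong simplifying assumptions: a single head, identity query/key/value projections, no learned weights, and a plain residual update. The function names `cross_attend` and `symmetric_fusion` are hypothetical, and the module used in SAMatcher may differ substantially.

```python
import math
from typing import List, Tuple

Tokens = List[List[float]]  # one feature vector per token


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Tokens, context: Tokens) -> Tokens:
    """Residual update: each query adds an attention-weighted average of context tokens."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product similarity between this query and every context token.
        weights = softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context])
        mixed = [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused


def symmetric_fusion(tokens_a: Tokens, tokens_b: Tokens) -> Tuple[Tokens, Tokens]:
    """Apply the same operator in both directions, each from the unfused inputs."""
    return cross_attend(tokens_a, tokens_b), cross_attend(tokens_b, tokens_a)


# Toy token sequences for two views (3 tokens and 2 tokens, 2 channels each).
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.5, 0.5], [1.0, 0.0]]
fused_a, fused_b = symmetric_fusion(view_a, view_b)
```

Because both directions use the same operator on the unfused inputs, swapping the two views simply swaps the two outputs, which is the symmetry property the sketch is meant to show.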

Abstract

Reliable correspondence estimation is a fundamental problem in image processing, underpinning a wide range of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved the representation capability of local features, most methods still operate primarily at the pixel or patch level. As a result, they lack explicit mechanisms to model regions that are jointly visible across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns vary between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than directly establishing point-wise correspondences from local appearance, SAMatcher first predicts consistent co-visible region masks and bounding boxes within a shared cross-view representation space; these predictions serve as structured priors that guide and regularize matching. The framework builds upon the Segment Anything Model (SAM) and introduces a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and the discovery of jointly supported regions. To jointly optimize region segmentation and geometric localization, we introduce a unified supervision scheme that combines point-sampled mask learning with box regression and mask-box consistency constraints, enforcing cross-view coherence during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves robustness under large geometric and viewpoint variations. These results suggest that monocular visual foundation models can be systematically extended to multi-view correspondence estimation through explicit co-visibility modeling, providing a new perspective on structured representation learning for image matching.
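To make the three supervision terms named in the abstract concrete, the sketch below combines a point-sampled mask loss, a box regression loss, and a mask-box consistency term for a single view. The abstract does not give the loss definitions, so every choice here is an assumption: binary cross-entropy on sampled pixels, an L1 box loss, a consistency term that compares the predicted box with the tight box around the thresholded mask, and the weights `w_box` and `w_cons`. All function names are hypothetical. The thresholded `mask_to_box` is not differentiable, so a training implementation would need a soft surrogate.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized to [0, 1]


def point_sampled_bce(prob_mask, gt_mask, points: Sequence[Tuple[int, int]]) -> float:
    """Binary cross-entropy averaged over the sampled (x, y) pixels only."""
    eps = 1e-7
    total = 0.0
    for x, y in points:
        p = min(max(prob_mask[y][x], eps), 1.0 - eps)  # clamp to avoid log(0)
        t = 1.0 if gt_mask[y][x] else 0.0
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(points)


def box_l1(pred: Box, target: Box) -> float:
    """Mean absolute difference over the four box coordinates."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


def mask_to_box(prob_mask, threshold: float = 0.5) -> Box:
    """Tight normalized box around pixels above threshold; all zeros if there are none."""
    h, w = len(prob_mask), len(prob_mask[0])
    hits = [(x, y) for y in range(h) for x in range(w) if prob_mask[y][x] > threshold]
    if not hits:
        return (0.0, 0.0, 0.0, 0.0)
    xs = [x for x, _ in hits]
    ys = [y for _, y in hits]
    return (min(xs) / w, min(ys) / h, (max(xs) + 1) / w, (max(ys) + 1) / h)


def unified_loss(prob_mask, gt_mask, points, pred_box: Box, gt_box: Box,
                 w_box: float = 1.0, w_cons: float = 1.0) -> float:
    """Sum of mask, box, and mask-box consistency terms (weights are assumptions)."""
    l_mask = point_sampled_bce(prob_mask, gt_mask, points)
    l_box = box_l1(pred_box, gt_box)
    l_cons = box_l1(pred_box, mask_to_box(prob_mask))
    return l_mask + w_box * l_box + w_cons * l_cons


# Toy 4x4 example: the central 2x2 block is co-visible.
inside = lambda x, y: 1 <= x <= 2 and 1 <= y <= 2
prob = [[0.9 if inside(x, y) else 0.1 for x in range(4)] for y in range(4)]
gt = [[inside(x, y) for x in range(4)] for y in range(4)]
gt_box = (0.25, 0.25, 0.75, 0.75)
points = [(1, 1), (0, 0)]  # one sampled pixel inside the region, one outside

loss_good = unified_loss(prob, gt, points, gt_box, gt_box)
loss_shifted = unified_loss(prob, gt, points, (0.0, 0.0, 0.5, 0.5), gt_box)
```

In this toy setup a predicted box that agrees with both the ground-truth box and the mask incurs only the mask term, while a shifted box is penalized by both the regression and the consistency terms.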