SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

Pan, Xu; Ma, Qiyuan; Zhang, Jintao; Chen, He; Zheng, Xianwei

Xu Pan¹, Qiyuan Ma¹, Mingyue Dong¹, He Chen¹, Wei Ji², Xianwei Zheng¹,*

¹State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University
²National Key Laboratory of Space Target Awareness, Space Engineering University
IEEE Transactions on Image Processing (Under Review)
*Corresponding author

Overview of the motivation behind SAMatcher. (a) An image pair exhibiting large scale variation in co-visible regions, where local appearance-based matching becomes unreliable. (b) Co-visible region segmentation in SAMatcher, which identifies jointly visible areas across views while suppressing non-overlapping content. (c) Illustration of pixel confusion caused by scale inconsistency and its mitigation by constraining matching to co-visible regions, resulting in more geometrically consistent correspondences.
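The scale issue in panels (a) and (c) can be made concrete with a small sketch. It assumes, hypothetically, that each co-visible region is summarized by an axis-aligned box `(x0, y0, x1, y1)`, and that each region is cropped and resized to a common resolution before matching. This crop-and-resize step is one common way to use region priors, not something this page confirms for SAMatcher. The function names are illustrative only.

```python
import math
from typing import Tuple

# Hypothetical box convention: (x0, y0, x1, y1) in full-image pixel coordinates.
Box = Tuple[float, float, float, float]


def relative_scale(box_a: Box, box_b: Box) -> float:
    """Linear scale ratio of region A to region B, estimated from box areas."""
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    if area_a <= 0 or area_b <= 0:
        raise ValueError("boxes must have positive area")
    return math.sqrt(area_a / area_b)


def crop_to_image(
    point: Tuple[float, float], box: Box, crop_size: Tuple[int, int]
) -> Tuple[float, float]:
    """Map a point from a resized crop back to full-image coordinates."""
    x0, y0, x1, y1 = box
    crop_w, crop_h = crop_size
    return (
        x0 + point[0] * (x1 - x0) / crop_w,
        y0 + point[1] * (y1 - y0) / crop_h,
    )


# Toy boxes (made up for illustration): the same content spans 200 px in
# view A and 800 px in view B.
box_a = (100.0, 50.0, 300.0, 250.0)
box_b = (10.0, 20.0, 810.0, 820.0)
scale = relative_scale(box_a, box_b)

# A keypoint found at the center of a 400x400 crop of region A,
# expressed again in view A's original pixel grid.
center = crop_to_image((200.0, 200.0), box_a, (400, 400))
```

After both regions are resized to the same crop resolution, the matcher sees the shared content at a comparable scale, and `crop_to_image` restores the original coordinates for pose estimation.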

Overview of the proposed SAMatcher framework. Given an image pair, SAMatcher extracts high-level visual representations using a shared encoder. A cross-view symmetric fusion module aligns semantic information across views and highlights potentially co-visible content. Based on the fused features, a prompt-driven mask decoder predicts co-visible region masks, while a dedicated box decoder estimates corresponding bounding boxes. These region-level predictions provide semantic and geometric priors that guide correspondence estimation, improving robustness under occlusion, background clutter, and large viewpoint or scale variations.
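One simple way a predicted co-visible mask could act as a prior for matching is to discard keypoints that fall outside it. The sketch below illustrates that idea on plain Python lists. It is an assumption about how the region predictions might be consumed, not the pipeline described in the paper, and `filter_keypoints_by_mask` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def filter_keypoints_by_mask(
    keypoints: Sequence[Point], mask: Sequence[Sequence[int]]
) -> List[int]:
    """Return indices of keypoints that land on a non-zero mask pixel."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    kept = []
    for idx, (x, y) in enumerate(keypoints):
        # floor (not int) so that slightly negative coordinates stay out of bounds
        col, row = math.floor(x), math.floor(y)
        if 0 <= row < height and 0 <= col < width and mask[row][col]:
            kept.append(idx)
    return kept


# Toy 4x4 mask: only the right half of the image is co-visible.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
keypoints = [(0.5, 0.5), (2.2, 1.7), (3.9, 3.1), (5.0, 1.0)]
kept = filter_keypoints_by_mask(keypoints, mask)
```

Only the keypoints at indices 1 and 2 survive. The first lies in the non-overlapping half and the last falls outside the image.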

Abstract

Reliable correspondence estimation is a fundamental problem in image processing, underpinning a wide range of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved the representation capability of local features, most methods still operate primarily at the pixel or patch level. As a result, they lack explicit mechanisms to model regions that are jointly visible across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns vary between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than directly establishing point-wise correspondences from local appearance, SAMatcher first predicts consistent co-visible region masks and bounding boxes within a shared cross-view representation space, serving as structured priors to guide and regularize matching. The framework builds upon the Segment Anything Model (SAM) and introduces a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and the discovery of jointly supported regions. To jointly optimize region segmentation and geometric localization, we introduce a unified supervision scheme that combines point-sampled mask learning with box regression and mask-box consistency constraints, enforcing cross-view coherence during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves robustness under large-scale geometric and viewpoint variations. These results suggest that monocular visual foundation models can be systematically extended to multi-view correspondence estimation through explicit co-visibility modeling, providing a new perspective on structured representation learning for image matching.
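The abstract names three supervision terms but does not give their form on this page. The sketch below shows one plausible shape for such an objective. Everything in it is an assumption made for illustration: binary cross-entropy at randomly sampled pixels for the mask term, a mean L1 distance for box regression, an L1 distance between the predicted box and the tight box around the thresholded mask for the consistency term, and the equal default weights. It uses plain Python numbers, so it shows the arithmetic only. A trainable version would need a differentiable replacement for the thresholding step.

```python
import math
import random


def point_sampled_bce(prob_mask, gt_mask, num_points, rng):
    """Mean binary cross-entropy evaluated only at randomly sampled pixels."""
    height, width = len(gt_mask), len(gt_mask[0])
    eps = 1e-7  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for _ in range(num_points):
        r, c = rng.randrange(height), rng.randrange(width)
        p = min(max(prob_mask[r][c], eps), 1.0 - eps)
        t = gt_mask[r][c]
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / num_points


def box_l1(box_a, box_b):
    """Mean absolute difference between two (x0, y0, x1, y1) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b)) / 4.0


def mask_to_box(prob_mask, threshold=0.5):
    """Tight box around pixels above threshold, or None if there are none."""
    rows = [r for r, row in enumerate(prob_mask) if any(p > threshold for p in row)]
    cols = [
        c for c in range(len(prob_mask[0]))
        if any(row[c] > threshold for row in prob_mask)
    ]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def total_loss(prob_mask, gt_mask, pred_box, gt_box, rng,
               num_points=32, w_mask=1.0, w_box=1.0, w_cons=1.0):
    """Weighted sum of the three assumed terms (weights are placeholders)."""
    mask_term = point_sampled_bce(prob_mask, gt_mask, num_points, rng)
    box_term = box_l1(pred_box, gt_box)
    implied = mask_to_box(prob_mask)
    # With an empty predicted mask there is no implied box; skip the term.
    cons_term = box_l1(pred_box, implied) if implied is not None else 0.0
    return w_mask * mask_term + w_box * box_term + w_cons * cons_term


# Toy 4x4 example: the co-visible region is the central 2x2 block.
gt_mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
gt_box = (1, 1, 3, 3)
good_probs = [[0.9 if v else 0.1 for v in row] for row in gt_mask]
bad_probs = [[0.1 if v else 0.9 for v in row] for row in gt_mask]

rng = random.Random(0)
good = total_loss(good_probs, gt_mask, (1, 1, 3, 3), gt_box, rng)
bad = total_loss(bad_probs, gt_mask, (0, 0, 4, 4), gt_box, rng)
```

In this toy case the accurate prediction incurs only the small mask term, while the inverted mask with an oversized box is penalized by both the mask and box terms.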

SAMatcher architecture and co-visibility modeling pipeline

Architecture of the proposed symmetric cross-view feature interaction module. Features from the source and target views are interleaved and processed by a stack of symmetric interaction blocks, enabling bidirectional token-level communication across views. Window-based attention with positional encoding facilitates efficient local interaction while preserving view identity. Subsequent single-view refinement further enhances view-specific structural details, producing cross-view aligned yet discriminative representations for downstream co-visible region prediction and correspondence estimation.
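The interleaving idea in this caption can be illustrated on a one-dimensional token sequence. The sketch below is a simplification under stated assumptions: tokens alternate strictly between the two views, each token carries a view tag that stands in for the view-preserving positional encoding, and windows are fixed-size chunks. The actual layout and attention computation of the module are not specified on this page, and the attention step itself is omitted here.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (view tag, payload); the tag keeps view identity


def interleave(src: Sequence[str], tgt: Sequence[str]) -> List[Token]:
    """Alternate source and target tokens: s0, t0, s1, t1, ..."""
    if len(src) != len(tgt):
        raise ValueError("both views must provide the same number of tokens")
    mixed: List[Token] = []
    for s, t in zip(src, tgt):
        mixed.append(("src", s))
        mixed.append(("tgt", t))
    return mixed


def windows(tokens: Sequence[Token], size: int) -> List[List[Token]]:
    """Split the sequence into consecutive windows of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def deinterleave(tokens: Sequence[Token]) -> Tuple[List[str], List[str]]:
    """Recover the per-view sequences, e.g. for single-view refinement."""
    src = [value for view, value in tokens if view == "src"]
    tgt = [value for view, value in tokens if view == "tgt"]
    return src, tgt


src_tokens = ["s0", "s1", "s2", "s3"]
tgt_tokens = ["t0", "t1", "t2", "t3"]
mixed = interleave(src_tokens, tgt_tokens)
local_windows = windows(mixed, size=4)
restored = deinterleave(mixed)
```

Because the views alternate, every window of even size contains tokens from both images, so attention restricted to a window can still exchange information across views. The view tags allow the sequence to be split back into per-view streams afterwards.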

Qualitative matching results on challenging viewpoint changes

Evaluation on MegaDepth under large scale differences. Each row corresponds to a complete matching pipeline. Specifically, we consider combinations of feature extractors and matchers, including SuperPoint (SP), DISK, D2-Net (D2), ContextDesc (CON), R2D2, and LoFTR. The extractors are paired with either Nearest Neighbor (NN) or SuperGlue (SG) for matching; LoFTR is the exception, as it is an end-to-end dense matching framework.
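For readers unfamiliar with the NN baseline, the sketch below shows nearest-neighbor descriptor matching with a mutual check, using dot-product similarity on plain Python sequences. The mutual check and the similarity measure are common choices and are assumptions here. The page does not state which NN variant the evaluation uses.

```python
from typing import List, Sequence, Tuple


def mutual_nearest_neighbors(
    desc_a: Sequence[Sequence[float]], desc_b: Sequence[Sequence[float]]
) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where a[i] and b[j] are each other's best match."""
    if not desc_a or not desc_b:
        return []

    def dot(u: Sequence[float], v: Sequence[float]) -> float:
        return sum(x * y for x, y in zip(u, v))

    # sim[i][j] is the similarity between descriptor i of A and j of B.
    sim = [[dot(a, b) for b in desc_b] for a in desc_a]

    # Best match in B for each descriptor in A, and the reverse direction.
    best_b = [max(range(len(desc_b)), key=lambda j: row[j]) for row in sim]
    best_a = [
        max(range(len(desc_a)), key=lambda i: sim[i][j])
        for j in range(len(desc_b))
    ]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]


# Toy 2-D descriptors (made up for illustration).
desc_a = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
desc_b = [(0.0, 1.0), (1.0, 0.0)]
matches = mutual_nearest_neighbors(desc_a, desc_b)
```

The third descriptor in `desc_a` is ambiguous between both candidates and is rejected by the mutual check, which leaves the two unambiguous pairs.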

Performance comparison on MegaDepth and ScanNet benchmarks

Ridge-style visualization of relative performance gains on MegaDepth. The plot summarizes the improvements brought by OETR (dark shading) and SAMatcher (light shading) over their respective base pipelines across different matching configurations. This visualization highlights the consistency of performance gains across metrics and methods.

Ablation study and co-visibility visualization

Qualitative comparison of co-visible region detection. For each image pair, we show OETR box-only predictions and SAMatcher mask predictions, with masks overlaid as semi-transparent purple regions. While OETR provides coarse bounding boxes, SAMatcher produces accurate and consistent co-visible regions across views, even under large viewpoint changes and partial overlap.

Region-guided correspondence comparison. SP+SG, +OETR, and +SAMatcher. Green lines denote correct matches, red lines incorrect ones. Under large scale variation, OETR often predicts inaccurate or missing regions, while SAMatcher identifies valid co-visible regions and yields more reliable correspondences.

Complementarity of mask and box predictions. Masks (magenta) provide high recall but coarse coverage, while boxes (red) offer precise localization. Constraining masks with boxes yields refined co-visible regions (green), improving correspondence reliability.
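A minimal reading of "constraining masks with boxes" is a pixel-wise intersection: keep a mask pixel only if it also lies inside the predicted box. The sketch below implements that reading on nested lists. The intersection rule and the half-open box convention `(x0, y0, x1, y1)` are assumptions, since the caption does not specify the exact operation.

```python
from typing import List, Sequence, Tuple


def constrain_mask_with_box(
    mask: Sequence[Sequence[int]], box: Tuple[int, int, int, int]
) -> List[List[int]]:
    """Zero out mask pixels outside the box (x0, y0 inclusive; x1, y1 exclusive)."""
    x0, y0, x1, y1 = box
    return [
        [
            1 if value and x0 <= col < x1 and y0 <= row < y1 else 0
            for col, value in enumerate(mask_row)
        ]
        for row, mask_row in enumerate(mask)
    ]


# Toy example: a generous mask and a tighter box.
coarse_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
box = (1, 0, 3, 3)  # columns 1-2, rows 0-2
refined = constrain_mask_with_box(coarse_mask, box)
```

The refined region keeps the mask's shape where the two predictions agree and drops the over-segmented pixels outside the box.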

Zero-shot generalization on unseen datasets. Top: GL3D (outdoor aerial scenes). Bottom: ScanNet (indoor environments). Predicted co-visible regions are overlaid as semi-transparent magenta masks. SAMatcher consistently captures mutually observable regions while suppressing non-overlapping content under domain shifts.
