SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching
Abstract
Reliable correspondence estimation is a long-standing problem in computer vision and a critical component of applications such as Structure from Motion, visual localization, and image registration. While recent learning-based approaches have substantially improved local feature descriptiveness, most methods still rely on implicit assumptions about shared visual content across views, leading to brittle behavior when spatial support, semantic context, or visibility patterns diverge between images. We propose SAMatcher, a novel feature matching framework that formulates correspondence estimation through explicit co-visibility modeling. Rather than directly establishing point-wise matches from local appearance, SAMatcher first predicts consistent region masks and bounding boxes within a shared cross-view semantic space, which serve as structured priors to guide and regularize correspondence estimation. SAMatcher employs a symmetric cross-view interaction mechanism that treats paired images as interacting token sequences, enabling bidirectional semantic alignment and selective reinforcement of jointly supported regions. Based on this formulation, a reliability-aware supervision strategy jointly constrains region segmentation and geometric localization, enforcing cross-view consistency during training. Extensive experiments on challenging benchmarks demonstrate that SAMatcher significantly improves correspondence robustness under large scale and viewpoint variations. Beyond quantitative gains, our results indicate that monocular visual foundation models can be systematically extended to multi-view correspondence estimation when co-visibility is explicitly modeled, offering new insights for fusion-based visual understanding.
Architecture of the proposed symmetric cross-view feature interaction module. Features from the source and target views are interleaved and processed by a stack of symmetric interaction blocks, enabling bidirectional token-level communication across views. Window-based attention with positional encoding facilitates efficient local interaction while preserving view identity. Subsequent single-view refinement further enhances view-specific structural details, producing cross-view aligned yet discriminative representations for downstream co-visible region prediction and correspondence estimation.
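The caption above describes a stack of symmetric interaction blocks in which the two views exchange information bidirectionally through shared attention weights. A minimal PyTorch sketch of one such block is given below; the class and variable names are ours, the block omits the window partitioning and positional encoding for brevity, and it should be read as an illustration of the symmetric weight-sharing idea rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class SymmetricInteractionBlock(nn.Module):
    """Illustrative bidirectional cross-view block: the SAME attention and
    MLP weights process both attention directions (src->tgt and tgt->src),
    which is what makes the interaction symmetric."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        # Each view queries the other with identical parameters.
        xn, yn = self.norm(x), self.norm(y)
        x = x + self.cross_attn(xn, yn, yn)[0]
        y = y + self.cross_attn(yn, xn, xn)[0]
        # Shared feed-forward refinement, residual connection per view.
        x = x + self.mlp(self.norm(x))
        y = y + self.mlp(self.norm(y))
        return x, y


# Toy usage: 64 tokens of dimension 128 per view (batch size 1).
tokens_src = torch.randn(1, 64, 128)
tokens_tgt = torch.randn(1, 64, 128)
block = SymmetricInteractionBlock(dim=128)
out_src, out_tgt = block(tokens_src, tokens_tgt)
```

Because both directions reuse one set of weights, swapping the source and target inputs yields the swapped outputs, which matches the view-symmetric design the caption describes.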
Evaluation on MegaDepth under larger scale differences. Each row corresponds to a complete matching pipeline, i.e., a combination of a feature extractor and a matcher. The extractors include SuperPoint (SP), DISK, D2-Net (D2), ContextDesc (CON), R2D2, and LoFTR; each is paired with either Nearest Neighbor (NN) or SuperGlue (SG) for matching, except LoFTR, which is an end-to-end dense matching framework.
Ridge-style visualization of relative performance gains on MegaDepth. The plot summarizes the improvements brought by OETR (dark shading) and SAMatcher (light shading) over their respective base pipelines across different matching configurations. This visualization highlights the consistency of performance gains across metrics and methods.
Qualitative visualization of co-visible region detection. For each image pair, the predicted co-visible regions are overlaid as semi-transparent purple masks on both source and target images. Despite large viewpoint changes and partial overlap, SAMatcher consistently highlights regions that are mutually observable across views while suppressing non-overlapping content.
Region-guided correspondence visualization. Predicted co-visible masks and bounding boxes are overlaid together with correspondence results. Green lines denote correct matches that satisfy the confidence threshold, while red lines indicate incorrect correspondences. The proposed region guidance effectively suppresses spurious matches outside co-visible regions and concentrates correspondences within geometrically meaningful areas.
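The region guidance described in this caption amounts to rejecting correspondences whose endpoints fall outside the predicted co-visible masks in either view. A small NumPy sketch of this filtering step is shown below; the function name, argument shapes, and toy data are illustrative assumptions, not the paper's code.

```python
import numpy as np


def filter_by_covisibility(kpts_src, kpts_tgt, mask_src, mask_tgt):
    """Keep only matches whose endpoints lie inside BOTH co-visible masks.

    kpts_src, kpts_tgt : (N, 2) integer (x, y) keypoint coordinates
    mask_src, mask_tgt : (H, W) boolean co-visibility masks
    """
    keep = (mask_src[kpts_src[:, 1], kpts_src[:, 0]]
            & mask_tgt[kpts_tgt[:, 1], kpts_tgt[:, 0]])
    return kpts_src[keep], kpts_tgt[keep]


# Toy example: a 4x4 image pair where only the left half is co-visible.
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
src = np.array([[0, 0], [3, 3]])  # first match co-visible, second is not
tgt = np.array([[1, 1], [3, 0]])
kept_src, kept_tgt = filter_by_covisibility(src, tgt, mask, mask)
# kept_src -> [[0, 0]], kept_tgt -> [[1, 1]]
```

In a full pipeline the same mask test would be applied before scoring, so that confidence thresholds operate only on geometrically plausible candidates.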