First review! A Comprehensive Survey of the Segment Anything Model (SAM)


Reprinted from: Heart of the Machine

As the first study to comprehensively introduce the progress of SAM-based models, this paper focuses on the application of SAM to various tasks and data types, and discusses its historical development, recent progress, and far-reaching impact on a wide range of applications.

Artificial intelligence (AI) is developing toward artificial general intelligence (AGI), meaning AI systems that can perform a wide range of tasks and exhibit a level of intelligence similar to that of humans. Narrow AI stands in contrast to this, since specialized AI is designed to perform specific tasks efficiently. It is therefore urgent to design general-purpose foundation models. A foundation model is trained on broad data and can thus be adapted to a variety of downstream tasks. The Segment Anything Model (SAM) recently proposed by Meta breaks through segmentation boundaries and greatly advances the development of foundation models for computer vision.

SAM is a promptable model trained on over 1 billion masks from 11 million images, achieving strong zero-shot generalization. Many researchers believe that “this is the GPT-3 moment for CV, because SAM has learned the general concept of what an object is, even for unknown objects, unfamiliar scenes (such as underwater or cell microscopy imagery), and ambiguous cases,” and that it shows great potential as a foundation model for CV.

In order to fully understand SAM, researchers from the Hong Kong University of Science and Technology (Guangzhou), Shanghai Jiao Tong University, and other institutions conducted in-depth research on it and jointly published the paper “A Comprehensive Survey on Segment Anything Model for Vision and Beyond”.


Paper: https://arxiv.org/abs/2305.08196


This paper first introduces the background and terminology of foundation models, including SAM, together with state-of-the-art methods that are important for segmentation tasks;

Then, the study analyzes and summarizes the advantages and limitations of SAM in various image processing applications, including software scenarios, real-world scenarios, and complex scenarios, and, importantly, draws insights to guide future research toward more versatile foundation models and improved SAM architectures;

Finally, the study summarizes the applications of SAM in vision and beyond.

Let’s take a look at the specific content of the paper.

SAM Model Overview

SAM originates from Meta's 2023 Segment Anything (SA) project. Observing the strong performance of the foundation models emerging in NLP and CV, the researchers tried to build a similar model to unify the entire image segmentation task. However, available data in the segmentation field is scarce, unlike in the domains those models were designed for. Therefore, as shown in Figure 1, the researchers divide the path into three steps: task, model, and data.


The SAM architecture is shown below and mainly consists of three parts: an image encoder, a prompt encoder, and a mask decoder.

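To make these three components concrete, below is a minimal usage sketch with the open-source `segment_anything` package released alongside SAM. The checkpoint path, image file, and click coordinates are placeholders.

```python
# Minimal point-prompt sketch with the segment_anything package.
# The checkpoint path, image file, and click coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Image encoder: a heavy ViT backbone that embeds the image once.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # runs the image encoder

# Prompt encoder + mask decoder: a single foreground click yields candidate masks.
point = np.array([[500, 375]])   # (x, y) pixel coordinates of the click
label = np.array([1])            # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    multimask_output=True,       # return several masks to resolve ambiguity
)
best_mask = masks[np.argmax(scores)]
```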

After a preliminary understanding of SAM, the study then introduces SAM for image processing.

SAM for image processing

This part is organized by scenario: software scenarios, real-world scenarios, and complex scenarios.

Software scenarios

Software scenarios involve image editing and restoration operations such as removing, filling, and replacing objects. However, existing inpainting works such as [99], [100], [101], [102] require fine-grained annotations of each mask to achieve good performance, which is labor-intensive. SAM [20] can generate accurate masks from simple prompts such as points or boxes, which can help assist image editing scenarios.

Inpaint Anything (IA) [39] devised a pipeline to address inpainting-related problems by combining the strengths of SAM, state-of-the-art image inpainters [99], and AI-generated content models [103]. This process is shown in Figure 3. For object removal, the pipeline consists of SAM and a state-of-the-art inpainter such as LaMa [99]. The user's click is used as a prompt for SAM to generate a mask of the object region, which, after erosion and dilation operations, is filled by LaMa. For object filling and replacement, the second step uses an AI-generated content model such as Stable Diffusion (SD) [103] to fill the selected object with a newly generated object via text prompts.
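A rough sketch of this click-to-remove / click-to-replace idea is given below. It is not the Inpaint Anything implementation: `cv2.inpaint` stands in for LaMa to keep the example self-contained, the Stable Diffusion inpainting call uses the `diffusers` library, and the checkpoint path, model identifier, and dilation kernel size are illustrative.

```python
# Sketch of an Inpaint-Anything-style pipeline: click -> SAM mask -> dilate -> fill.
# cv2.inpaint stands in for a LaMa-style inpainter to keep the sketch self-contained.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
from segment_anything import SamPredictor, sam_model_registry

predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))

def mask_from_click(image_rgb, click_xy):
    """Turn a single user click into a binary object mask via SAM."""
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click_xy]), point_labels=np.array([1]),
        multimask_output=True)
    mask = masks[np.argmax(scores)].astype(np.uint8) * 255
    # Dilate so the fill region fully covers the object boundary.
    return cv2.dilate(mask, np.ones((15, 15), np.uint8))

def remove_object(image_rgb, click_xy):
    """Object removal: fill the SAM mask with a classical inpainter (LaMa stand-in)."""
    mask = mask_from_click(image_rgb, click_xy)
    return cv2.inpaint(image_rgb, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)

def replace_object(image_rgb, click_xy, prompt):
    """Object replacement: fill the SAM mask with Stable Diffusion guided by text."""
    mask = mask_from_click(image_rgb, click_xy)
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting")
    if torch.cuda.is_available():
        pipe = pipe.to("cuda")
    return pipe(prompt=prompt,
                image=Image.fromarray(image_rgb).resize((512, 512)),
                mask_image=Image.fromarray(mask).resize((512, 512))).images[0]
```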


A similar idea can also be seen in Edit Everything [40], shown in Figure 4, which allows users to edit images using simple text commands.


Real-world scenarios

The researchers show that SAM can assist in many real-world scenarios, such as real-world object detection, object counting, and moving object detection. Recently, [108] evaluated the performance of SAM in a variety of real-world segmentation scenarios (e.g., natural imagery, agriculture, manufacturing, remote sensing, and healthcare). The paper finds that SAM generalizes well to common scenes such as natural images, but performs poorly in low-contrast scenes and requires strong prior knowledge in complex scenes.

For example, in civil infrastructure defect assessment, [42] utilized SAM to detect cracks in concrete structures and compared its performance with the baseline U-Net [109]. The crack detection process is shown in Figure 6. The results show that SAM outperforms U-Net in detecting longitudinal cracks, which are more likely to appear in normal scenes similar to the training images, while it underperforms U-Net in less common scenarios, namely spalling cracks.


Procedure for crack detection using SAM and U-Net. The figure is taken from the original paper [42].

Unlike the complex imagery in crack detection, crater shapes are mostly circular or elliptical, which makes SAM a more suitable detection tool for crater detection. Craters are among the most important morphological features in planetary exploration, and detecting and counting them is an important but time-consuming task in planetary science. Although existing machine learning and computer vision work successfully addresses some specific problems in crater detection, these methods rely on particular types of data and therefore do not work well across different data sources.

In [110], the researchers proposed a general crater detection scheme that uses SAM for zero-shot generalization to unfamiliar objects. The pipeline uses SAM to segment the input image with no restrictions on data type or resolution, then applies a circle/ellipse index to filter out segmentation masks that are not circular or elliptical, and finally uses a post-processing filter to remove duplicates, artifacts, and false positives. The pipeline shows great potential as a general tool in this field, and the authors also discuss its limitation of only recognizing specific shapes.
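This kind of pipeline can be illustrated with a simplified sketch: SAM's automatic mask generator segments everything, a circularity index (4πA/P², used here as a stand-in for the paper's circle/ellipse index) filters out non-round masks, and an IoU check removes duplicates. The thresholds are illustrative, not taken from [110].

```python
# Sketch of a crater-detection-style pipeline:
# segment everything, keep near-circular masks, drop duplicates.
import cv2
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def circularity(mask):
    """4*pi*area / perimeter^2 -> 1.0 for a perfect circle."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    c = max(contours, key=cv2.contourArea)
    area, perim = cv2.contourArea(c), cv2.arcLength(c, True)
    return 4 * np.pi * area / (perim ** 2 + 1e-6)

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / (union + 1e-6)

def detect_craters(image_rgb, circ_thresh=0.8, dedup_iou=0.8):
    # Step 1: segment everything with SAM's automatic mask generator.
    candidates = [r["segmentation"] for r in mask_generator.generate(image_rgb)]
    # Step 2: keep only near-circular / elliptical masks.
    round_masks = [m for m in candidates if circularity(m) > circ_thresh]
    # Step 3: drop near-duplicate masks, keeping the larger ones first.
    kept = []
    for m in sorted(round_masks, key=lambda m: m.sum(), reverse=True):
        if all(iou(m, k) < dedup_iou for k in kept):
            kept.append(m)
    return kept
```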

Complex scenarios

In addition to the conventional scenarios above, whether SAM can solve segmentation problems in complex scenes (such as low-contrast scenes) is also a meaningful question, as it would expand its range of applications. To explore the generalization ability of SAM in more complex scenarios, Ji et al. [22] quantitatively compared it with state-of-the-art models in three scenarios: camouflaged animals, industrial defects, and medical lesions. They conduct experiments on three camouflaged object segmentation (COS) datasets, namely CAMO [116] with 250 samples, COD10K [117] with 2026 samples, and NC4K [118] with 4121 samples, and compare SAM with the Transformer-based models CamoFormer-P/S [119] and HitNet [120]. The results show that SAM performs poorly in concealed scenes, and the authors point out that potential solutions may rely on support from domain-specific prior knowledge. The same conclusion is drawn in [29], where the authors compare SAM with 22 state-of-the-art camouflaged object detection methods on the same three datasets.

Cao et al. [115] proposed a new framework named Segment Any Anomaly+ (SAA+) for zero-shot anomaly segmentation, as shown in Figure 7. The framework uses hybrid prompt regularization to improve the adaptability of modern foundation models, leading to more accurate anomaly segmentation without domain-specific fine-tuning. The authors conduct detailed experiments on four anomaly segmentation benchmarks, namely VisA [122], MVTec AD [123], MTD [124], and KSDD2 [125], and achieve state-of-the-art performance.


He et al. [126] proposed the first approach, WSSAM, which uses SAM for weakly supervised concealed object segmentation, addressing the challenge of segmenting objects that blend into their surroundings from sparsely annotated data (see Figure 8). WSSAM includes SAM-based pseudo-labeling and multi-scale feature grouping to improve model learning and to distinguish concealed objects from the background. The authors found that, using only scribble supervision [127], SAM can generate segmentation masks good enough to train a segmenter.
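The pseudo-labeling step can be pictured with the short sketch below: points sampled from a foreground scribble are fed to SAM as prompts, and the highest-scoring mask becomes the pseudo ground truth for training a separate segmenter. This is a simplified illustration rather than the authors' code, and the multi-scale feature grouping component is omitted.

```python
# Sketch of SAM-based pseudo-labeling from scribble supervision.
# Foreground scribble pixels become point prompts; the returned mask is the pseudo label.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))

def pseudo_label_from_scribble(image_rgb, scribble_mask, n_points=10, seed=0):
    """scribble_mask: boolean array marking the sparse foreground scribble."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(scribble_mask)
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)      # SAM expects (x, y)
    labels = np.ones(len(points), dtype=np.int64)      # all foreground

    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=points, point_labels=labels, multimask_output=True)
    return masks[np.argmax(scores)]   # pseudo ground truth for training a segmenter
```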


More models and applications: vision and beyond

Vision-related applications

The first is medical imaging. The purpose of medical image segmentation is to display the anatomical or pathological structure of the corresponding tissue, which can be used for computer-aided diagnosis and intelligent clinical surgery.

Figure 10 below gives an overview of SAM for medical images, covering computed tomography (CT) images, magnetic resonance imaging (MRI) images, colonoscopy images, multi-format images, H&E-stained tissue section images, and more.


Next is video. In computer vision, video object tracking (VOT) and video segmentation are considered crucial and indispensable tasks. VOT involves locating a specific object in a video frame and then tracking it throughout the rest of the video, which gives it many practical applications such as surveillance and robotics.

SAM has made outstanding contributions in the field of VOT. The Track Anything Model (TAM) introduced in [46] efficiently achieves excellent interactive tracking and segmentation in videos. Figure 11 below is the TAM pipeline.


In addition, another tracking model is SAMTrack; see reference [172] for details. SAMTrack is a video segmentation framework that enables object tracking and segmentation through both interactive and automatic methods. Figure 12 below shows the SAMTrack pipeline.


Figure 13 below shows a lightweight SAM-guided refinement module (SEEM), which is used to improve the performance of existing methods.


Next comes data annotation. SAMText [180] is a scalable pipeline for scene text mask annotation in videos. It leverages SAM to generate mask annotations for a large dataset, SAMText-9M, which contains over 2,400 video clips and over 9 million mask annotations.

In addition, reference [143] uses existing remote sensing object detection datasets together with SAM, following a data-centric machine learning approach, to construct a large-scale remote sensing image segmentation dataset, SAMRS, which contains object category, location, and instance information and can be used for semantic segmentation, instance segmentation, and object detection research.

Beyond Vision

The first is 3D reconstruction. Besides achieving fine-grained 3D segmentation, SA3D [183] can be used for 3D reconstruction. Using a 3D mask grid, researchers can determine the space an object occupies in 3D and reconstruct it in various ways. Figure 14 below shows the overall pipeline of SA3D.


Reference [186] proposes a new object removal pipeline, ORNeRF, which removes objects from 3D scenes using point or text prompts on a single view. By rapidly propagating user annotations to all views with a point projection strategy, the method achieves better performance in less time than previous work. Figure 15 below shows the ORNeRF framework.


Next are non-Euclidean domains. To handle different feature dimensions for different tasks, the SNA method shown in Figure 16 below introduces a specialized reducible graph convolution layer. This layer can dynamically activate or deactivate channels according to the input feature dimension.


Next is robotics. Figure 17 below shows the overall flow of Instruct2Act [190]. In the perception part, predefined APIs are used to access several foundation models: SAM [20] accurately localizes candidate objects, and CLIP [13] classifies them. The framework leverages the expertise of foundation models and robotic capabilities to translate complex high-level instructions into precise policy code.
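The perception step can be sketched roughly as follows: SAM proposes object masks, each proposal's crop is scored against candidate object names with OpenAI's `clip` package, and the best-matching name is attached to the mask. This follows the general SAM-plus-CLIP recipe rather than the Instruct2Act APIs; the model choices and prompt template are assumptions.

```python
# Sketch of a SAM + CLIP perception step: SAM proposes masks, CLIP names them.
import clip
import numpy as np
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def perceive(image_rgb, object_names):
    """Return (name, mask) pairs for every SAM proposal, named by CLIP."""
    text = clip.tokenize([f"a photo of a {n}" for n in object_names]).to(device)
    results = []
    for record in mask_generator.generate(image_rgb):
        x, y, w, h = [int(v) for v in record["bbox"]]   # crop the proposal (XYWH box)
        crop = Image.fromarray(image_rgb[y:y + h, x:x + w])
        image_input = preprocess(crop).unsqueeze(0).to(device)
        with torch.no_grad():
            logits_per_image, _ = clip_model(image_input, text)
            probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]
        results.append((object_names[int(np.argmax(probs))], record["segmentation"]))
    return results
```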


Next is video text localization. Figure 18 below shows SAMText [180], a scalable and efficient solution for generating mask annotations for the video text localization task. By applying SAM to existing bounding box annotations, it can generate mask annotations for large-scale video text datasets.
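The core box-to-mask step can be sketched as follows: each existing bounding-box annotation is passed to SAM as a box prompt and the returned mask is stored. This is a simplified illustration of the idea, not the SAMText code; the box format and function names are assumptions.

```python
# Sketch of converting existing bounding-box annotations into mask annotations with SAM.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))

def boxes_to_masks(frame_rgb, boxes_xyxy):
    """boxes_xyxy: list of [x0, y0, x1, y1] text boxes for one video frame."""
    predictor.set_image(frame_rgb)
    annotations = []
    for box in boxes_xyxy:
        masks, scores, _ = predictor.predict(
            box=np.array(box), multimask_output=False)  # one mask per box prompt
        annotations.append(masks[0])
    return annotations  # binary mask annotations aligned with the input boxes
```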


Next is image captioning. Wang et al. [44] proposed Caption Anything (CAT), a method for controllable image captioning. As shown in Figure 20 below, the CAT framework introduces multimodal control into image captioning, producing captions with a variety of visual focuses and language styles aligned with human intent.


Audio-visual tasks are also covered. The audio-visual localization and segmentation method of reference [45], AV-SAM, learns cross-modal representations that align audio and visual information, as shown in Figure 21 below. AV-SAM fuses audio and visual features from pretrained audio and image encoders at the pixel level to aggregate cross-modal representations. The aggregated cross-modal features are then fed into the prompt encoder and mask decoder to generate the final audio-visual segmentation mask.


Finally, there is multimodal vision and open-vocabulary interactive segmentation. The method of reference [44], shown in Figure 22 below, aims to replace manual points entirely with a text-only CLIP strategy. This approach produces pixel-level results from text input that can easily be converted into point prompts for SAM.
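One simple way to picture the text-to-points idea is sketched below: CLIP scores sliding-window crops against the text query, and the center of the best-scoring window becomes a point prompt for SAM. The referenced method computes finer pixel-level CLIP responses; this coarse sliding-window version is only an illustrative stand-in, and the window size and stride are arbitrary.

```python
# Coarse sketch: use CLIP to turn a text query into a point prompt for SAM.
import clip
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth"))

def text_to_mask(image_rgb, query, window=224, stride=112):
    """Score sliding windows with CLIP, then prompt SAM at the best window's center."""
    text = clip.tokenize([query]).to(device)
    h, w, _ = image_rgb.shape
    best_score, best_center = -1e9, (w // 2, h // 2)
    for y in range(0, max(h - window, 1), stride):
        for x in range(0, max(w - window, 1), stride):
            crop = Image.fromarray(image_rgb[y:y + window, x:x + window])
            image_input = preprocess(crop).unsqueeze(0).to(device)
            with torch.no_grad():
                score = clip_model(image_input, text)[0].item()  # image-text logit
            if score > best_score:
                best_score = score
                best_center = (x + window // 2, y + window // 2)

    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([best_center]), point_labels=np.array([1]),
        multimask_output=True)
    return masks[np.argmax(scores)]
```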


Conclusion

This paper presents the first comprehensive review of research progress on SAM-based models in computer vision and beyond. It first summarizes the development history of foundation models (large language models, large vision models, and multimodal large models) and the basic terminology of SAM, then focuses on the applications of SAM across various tasks and data types, including concurrent work and its follow-ups. The researchers also discuss the great potential of SAM in a wide range of image processing applications, including software scenarios, real-world scenarios, and complex scenarios.

Furthermore, the researchers analyze and summarize the advantages and limitations of SAM in various applications. These observations can provide insights for the future development of more powerful foundation models and for further improving the robustness and generalization of SAM. The article concludes with a wide range of other promising applications of SAM in vision and beyond.
