1. Introduction
Code address: https://github.com/luca-medeiros/lang-segment-anything
https://github.com/IDEA-Research/GroundingDINO
lang-segment-anything is an algorithm for segmenting objects in images from natural-language text prompts. It combines two major models, GroundingDINO and segment-anything, and works well for semi-automatic annotation: it can detect and segment objects without any training, matching text phrases to the corresponding objects in the image.
Note: in testing, lang-segment-anything deployed successfully on Ubuntu 18.04 but failed on Windows, so deployment on Ubuntu is recommended.
2. Local deployment
2.1 Download the code and related files
In addition to the lang-segment-anything code itself, you also need to download some configuration files and weights for GroundingDINO and segment-anything.
First, download all of the lang-segment-anything files from the official GitHub repository, either via git clone or as a zip package.
Then download the weight file required by segment-anything. There are three options, vit_l, vit_b, and vit_h; only vit_h, which performs best, is downloaded here.
You also need to download the GroundingDINO weight file and its corresponding configuration file; there are two options, swinb and swint.
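The download steps above can be sketched as a short shell script. The URLs below are assumptions based on the projects' official release pages, so verify them before use; the network commands are printed as a dry run, and removing the leading echo executes them for real:

```shell
# Sketch of fetching the code and weights into the directory layout used
# later in this post. URLs are assumptions -- confirm them on the official
# segment-anything and GroundingDINO release pages before downloading.
mkdir -p sam_weight grounding_weight/config

SAM_URL="https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
DINO_URL="https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth"

# Printed as a dry run; remove 'echo' to actually execute.
echo git clone https://github.com/luca-medeiros/lang-segment-anything.git
echo wget -c "$SAM_URL" -P sam_weight
echo wget -c "$DINO_URL" -P grounding_weight
```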
GroundingDINO uses a BERT model. Because downloading the model over the network often fails, it is best to download it locally in advance: only the five files of the bert-base-uncased model are needed (typically the config, vocabulary, tokenizer files, and the PyTorch weights).
The above files can be obtained from the Baidu Netdisk link:
https://pan.baidu.com/s/1iqFjmTdJrja1ilSoxnWw6w?pwd=ek6i
After the download is complete, copy the three sets of files into the lang-segment-anything-main directory.
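Since every weight and config must be present locally, a quick sanity check before launching can save a confusing startup failure. This is a sketch; the paths mirror the layout used in this post (sam_weight/, grounding_weight/, a local bert-base-uncased directory) and may differ in your own tree:

```python
import os

# Paths follow the layout used in this post; adjust to your own tree.
REQUIRED_FILES = [
    "sam_weight/sam_vit_h_4b8939.pth",
    "grounding_weight/groundingdino_swinb_cogcoor.pth",
    "grounding_weight/config/GroundingDINO_SwinB_cfg.py",
]
REQUIRED_DIRS = [
    "bert-base-uncased",  # local BERT files for offline loading
]

def missing_assets(files=REQUIRED_FILES, dirs=REQUIRED_DIRS):
    """Return the list of required weight/config paths that are absent."""
    missing = [f for f in files if not os.path.isfile(f)]
    missing += [d for d in dirs if not os.path.isdir(d)]
    return missing

if __name__ == "__main__":
    gone = missing_assets()
    if gone:
        print("Missing assets:", gone)
    else:
        print("All weights and configs found.")
```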
2.2 Environment Configuration
Basically follow the instructions on GitHub. Use conda to create a virtual environment from the yml file: conda env create -f environment.yml. This mainly installs torch, segment-anything, and groundingdino. Note that you need to change groundingdino in environment.yml to groundingdino-py, and that the lang-sam package cannot be found on PyPI and does not need to be installed. torch can be installed with the following command:
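The environment.yml change can look like the following. This is a hypothetical excerpt, not the full file; the surrounding entries and version pins will differ by release:

```yaml
# environment.yml (excerpt, hypothetical) -- replace the unresolvable
# 'groundingdino' entry with the PyPI package 'groundingdino-py',
# and drop 'lang-sam', which is not published on PyPI.
dependencies:
  - python=3.10
  - pip
  - pip:
      - groundingdino-py   # was: groundingdino
      - segment-anything   # or install from the facebookresearch GitHub repo
      # - lang-sam         # removed: not installable from PyPI
```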
pip install torch torchvision torchmetrics --index-url https://download.pytorch.org/whl/cu118
2.3 Code modification
First modify the lang_sam.py file under lang_sam, mainly adding build_groundingdino_local and load_model_local methods that load the GroundingDINO model from local files. Note that the weight file and configuration file must correspond to each other (SwinB weights with the SwinB config).
class LangSAM():
    def __init__(self, sam_type="vit_h", ckpt_path="sam_weight/sam_vit_h_4b8939.pth"):
        self.sam_type = sam_type
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.build_groundingdino_local()
        self.build_sam(ckpt_path)

    def build_groundingdino_local(self):
        # The weight file and config file must match (SwinB with SwinB cfg).
        ckpt_filename = "grounding_weight/groundingdino_swinb_cogcoor.pth"
        ckpt_config_filename = "grounding_weight/config/GroundingDINO_SwinB_cfg.py"
        self.groundingdino = load_model_local(ckpt_config_filename, ckpt_filename)
def load_model_local(model_config_path, model_checkpoint_path, device="cuda"):
    try:
        args = SLConfig.fromfile(model_config_path)
        args.device = device
        model = build_model(args)
        checkpoint = torch.load(model_checkpoint_path, map_location="cpu")
        model.load_state_dict(clean_state_dict(checkpoint["model"]), strict=False)
        model.eval()
        model.to(device)
    except Exception as e:
        print(str(e))
        raise  # re-raise so callers never receive an undefined model
    return model
In addition, you need to edit the groundingdino.util.get_tokenlizer module installed by pip (the misspelling is in the library itself), changing the loading of the tokenizer and BertModel to offline loading. In an IDE you can type groundingdino.util.get_tokenlizer in the code, then hold Ctrl and left-click get_tokenlizer to jump straight to its source.
from transformers import AutoTokenizer, BertModel, BertTokenizer, RobertaModel, RobertaTokenizerFast
import os

def get_tokenlizer(text_encoder_type):
    if not isinstance(text_encoder_type, str):
        # print("text_encoder_type is not a str")
        if hasattr(text_encoder_type, "text_encoder_type"):
            text_encoder_type = text_encoder_type.text_encoder_type
        elif text_encoder_type.get("text_encoder_type", False):
            text_encoder_type = text_encoder_type.get("text_encoder_type")
        elif os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type):
            pass
        else:
            raise ValueError(
                "Unknown type of text_encoder_type: {}".format(type(text_encoder_type))
            )
    print("final text_encoder_type: {}".format(text_encoder_type))
    # tokenizer = AutoTokenizer.from_pretrained(text_encoder_type)
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    return tokenizer

def get_pretrained_language_model(text_encoder_type):
    if text_encoder_type == "bert-base-uncased" or (os.path.isdir(text_encoder_type) and os.path.exists(text_encoder_type)):
        # return BertModel.from_pretrained(text_encoder_type)
        return BertModel.from_pretrained('bert-base-uncased')
    if text_encoder_type == "roberta-base":
        return RobertaModel.from_pretrained(text_encoder_type)
    raise ValueError("Unknown text_encoder_type {}".format(text_encoder_type))
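The only behavioral change above is the argument passed to from_pretrained (a local path instead of a hub name); the branch that resolves text_encoder_type is untouched. To see what that branch accepts, here is a standalone re-implementation of just the resolution step (the function name is mine, for illustration; it has no transformers dependency):

```python
def resolve_text_encoder_type(text_encoder_type):
    """Mirror of the resolution logic in get_tokenlizer(): accept a
    plain string, an object with a .text_encoder_type attribute, or a
    dict-like with a 'text_encoder_type' key."""
    if not isinstance(text_encoder_type, str):
        if hasattr(text_encoder_type, "text_encoder_type"):
            return text_encoder_type.text_encoder_type
        if text_encoder_type.get("text_encoder_type", False):
            return text_encoder_type.get("text_encoder_type")
        raise ValueError(
            "Unknown type of text_encoder_type: {}".format(type(text_encoder_type))
        )
    return text_encoder_type
```

For example, a plain string passes through unchanged, while a config dict like {"text_encoder_type": "bert-base-uncased"} resolves to the same string.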
2.4 Demo demonstration
When you run the following command, you will first be asked whether to update lightning. Select No; the updated version does not work.
lightning run app app.py
After the command runs, a window opens in the browser. Upload an image and you can segment it with text prompts; for example, entering the prompt person segments every person in the image.
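Besides the web demo, the model can also be called from Python. The snippet below is a sketch based on the repository's documented API (LangSAM().predict(image, prompt)); the exact return signature may differ between versions, so treat it as a guide rather than a definitive implementation:

```python
def segment_with_prompt(image_path, text_prompt="person"):
    """Run text-prompted segmentation. Imports happen inside the
    function so this file can be imported without the heavy deps."""
    from PIL import Image
    from lang_sam import LangSAM  # API per the project README; may vary by version

    model = LangSAM()  # loads the SAM and GroundingDINO weights
    image_pil = Image.open(image_path).convert("RGB")
    # One mask/box/phrase/logit per detected instance.
    masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
    return masks, boxes, phrases, logits
```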