
A First Look at the Multimodal Model Grounding DINO

news 2025/7/9 16:12:09

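Introduction

Grounding DINO is an advanced zero-shot object detection model developed by IDEA Research. It combines the Transformer-based detector DINO with grounded pre-training, which lets it detect arbitrary objects from human input such as category names or referring expressions.

For example, without any training, you can tell Grounding DINO to find where the people are in an image, and it will mark their coordinates, as shown below:

Demo workflow: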

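Basic Principles

In Grounding DINO, the authors set out to detect objects of arbitrary categories from human text input, a task known as open-set object detection.

The key to open-set object detection is introducing language information into the general feature representation of objects. For example, GLIP uses contrastive learning to link object detection with text phrases, and it performs well on both closed-set and open-set datasets. Even so, GLIP is built on a traditional one-stage detector architecture, which imposes some limitations.

Inspired by much earlier work (GLIP, DINO, and others), the authors proposed Grounding DINO, which has the following advantages over GLIP:

The Transformer structure of Grounding DINO is closer to NLP models, so it handles images and text together more easily;
Transformer-based detectors have proven advantageous when working with large datasets;
As a DETR variant, DINO can be trained end to end and needs no extra post-processing such as NMS.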

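Network architecture:

The overall structure of Grounding DINO is shown in the figure above. It is a dual-encoder, single-decoder architecture that contains:

an image backbone for extracting image features
a text backbone for extracting text features
a feature enhancer for fusing image and text features
a language-guided query selection module for query initialization
a cross-modality decoder for bounding box prediction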
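
The article does not spell out how the language-guided query selection module picks its queries. As a rough illustration only, the sketch below assumes the scheme described in the Grounding DINO paper: each image token is scored by its highest dot-product similarity to any text token, and the top-scoring tokens initialize the decoder queries. The function name `select_queries` and the toy feature vectors are hypothetical; the real model works on batched tensors.

```python
def select_queries(image_feats, text_feats, num_query):
    """Return indices of the num_query image tokens most similar to any text token."""
    scores = []
    for img_vec in image_feats:
        # Dot-product similarity between this image token and every text token.
        sims = [sum(a * b for a, b in zip(img_vec, txt_vec)) for txt_vec in text_feats]
        # An image token is scored by its best-matching text token.
        scores.append(max(sims))
    # Rank image tokens by score, highest first, and keep the first num_query.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_query]


# Toy features (hand-written, not model output): 4 image tokens, 2 text tokens.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 2.0]]
print(select_queries(image_feats, text_feats, 2))
# -> [1, 0]
```

The point of the design is that the text prompt decides which image regions the decoder looks at first, instead of using fixed learned queries.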

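Features and Advantages

Zero-shot detection: Grounding DINO can detect unseen categories from text prompts without annotations from the target dataset. For example, it reaches 52.5 AP on the COCO zero-shot detection benchmark.
Strong cross-modal fusion: through deep fusion of the vision and language modalities, the model performs well on open-set object detection and referring expression comprehension.
End-to-end optimization: the Transformer-based architecture allows Grounding DINO to be optimized end to end, with no hand-designed modules.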

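Application Scenarios

Grounding DINO can be applied widely in scenarios that need flexible object detection:
Autonomous driving: detecting specific traffic signs or obstacles from natural language descriptions.
Robot vision: recognizing and manipulating objects according to instructions.
Image annotation and content understanding: automatically identifying objects in an image and generating descriptions.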

Installation

Reference: https://blog.csdn.net/weixin_44151034/article/details/139362032

1. Create a virtual environment

conda create -n dino python=3.10 -y
conda activate dino

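2. Install PyTorch

conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

If you have a GPU, find out the highest CUDA version your driver supports before installing PyTorch. Installing PyTorch also installs CUDA, cuDNN, and related modules, so make sure that CUDA version is within the range the driver supports.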

nvidia-smi
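
The check can also be scripted. The sketch below assumes that the `nvidia-smi` header contains a field of the form `CUDA Version: X.Y`, as it does on recent drivers, and compares it with the CUDA version requested in the install command. The helper names are hypothetical, and the sample string is a hand-written stand-in rather than real tool output.

```python
import re


def parse_cuda_version(smi_output):
    """Extract (major, minor) from a 'CUDA Version: X.Y' field, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(smi_output, wanted):
    """True if the driver's maximum CUDA version is at least `wanted`, e.g. (12, 4)."""
    supported = parse_cuda_version(smi_output)
    return supported is not None and supported >= wanted


# Hand-written stand-in for the relevant part of the nvidia-smi header.
sample = "CUDA Version: 12.4"
print(driver_supports(sample, (12, 4)))
# -> True
```

On a real machine the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`. If the check fails, pick a lower `pytorch-cuda` version or update the driver.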

3. Download and install the project

git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -e . -i https://pypi.tuna.tsinghua.edu.cn/simple

4. Download the weights

Pretrained model: groundingdino_swint_ogc.pth. Depending on your network, you may need a proxy to reach GitHub.

mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
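
Because `wget -q` prints nothing, a failed download is easy to miss. The following sketch checks that the config file and the weights sit where the scripts later in this article expect them, relative to the project root. `REQUIRED_FILES` and `missing_files` are hypothetical names introduced here, not part of the GroundingDINO package.

```python
from pathlib import Path

# Relative paths used by the scripts in this article.
REQUIRED_FILES = [
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
]


def missing_files(project_root="."):
    """Return the required files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Config and weights found.")
```

Run it from the GroundingDINO directory. An empty result means both relative paths resolve from the current working directory.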

Grounding DINO is now installed. Next, let's use it.

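Running

Grounding DINO can understand a text description and find the objects it describes in an image. For example, you can tell it to find where the people are in an image.

Create a test.py file in the project root directory. The location matters because the config file path is resolved relative to it. The model takes an image and a piece of text as input. Prepare an image and a text prompt that lists the objects to detect, separated by periods padded with spaces, for example: "chair . person . cell . flower"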
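
When the class names already live in a Python list, a prompt in that format can be assembled programmatically. This is a minimal sketch; `build_prompt` is a hypothetical helper, not a function from the GroundingDINO package, and lowercasing the names is an assumption made here for consistency.

```python
def build_prompt(class_names):
    """Join class names into a lowercase prompt separated by ' . '."""
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [name.strip().lower() for name in class_names if name.strip()]
    return " . ".join(cleaned)


print(build_prompt(["chair", "person", "cell", "flower"]))
# -> chair . person . cell . flower
```

The test.py script itself follows.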

from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# Load the model (config path and weights path are relative to the project root)
model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
# Path of the image to run detection on
IMAGE_PATH = "1.jpeg"
# Class prompt; multiple classes are separated by periods
TEXT_PROMPT = "chair . person . cell . flower"
BOX_THRESHOLD = 0.35
TEXT_THRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

# Note: load_model and predict default to device="cuda"; pass device="cpu" to both on a machine without a GPU
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_THRESHOLD,
    text_threshold=TEXT_THRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
# Save the annotated image as annotated_image.jpg in the current directory
cv2.imwrite("annotated_image.jpg", annotated_frame)
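
If pixel coordinates are needed rather than an annotated image, the boxes have to be converted first. The sketch below assumes that `predict` returns boxes as normalized (cx, cy, w, h) values, which is the format the repository's `annotate` helper converts internally. `cxcywh_to_xyxy` is a hypothetical helper written in plain Python for illustration.

```python
def cxcywh_to_xyxy(box, image_width, image_height):
    """Convert one normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - w / 2) * image_width   # left edge
    y1 = (cy - h / 2) * image_height  # top edge
    x2 = (cx + w / 2) * image_width   # right edge
    y2 = (cy + h / 2) * image_height  # bottom edge
    return x1, y1, x2, y2


# Hand-written example box, not model output: centered, half the image in each dimension.
print(cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.5), 640, 480))
# -> (160.0, 120.0, 480.0, 360.0)
```

With the variables from test.py, this could be applied as `h, w, _ = image_source.shape` followed by `[cxcywh_to_xyxy(b, w, h) for b in boxes.tolist()]`.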

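Run the code:

The resulting annotated image:

Built-in Web Page

Besides inference from code, Grounding DINO also ships a more convenient web-based inference page. gradio_demo.py in the demo directory is a front-end inference page implemented in Python.

When I wrote this article, running it as shipped did not work, so a small change is needed. The main change is in the load_model_hf function, where I modified how the model file is loaded. It originally fetched the file from Hugging Face, but the model file is no longer there, so it now uses the weights downloaded above.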

import argparse
from functools import partial
import cv2
import requests
import os
from io import BytesIO
from PIL import Image
import numpy as np
from pathlib import Path

import warnings

import torch

# prepare the environment
os.system("python setup.py build develop --user")
os.system("pip install packaging==21.3")
os.system("pip install gradio==3.50.2")

warnings.filterwarnings("ignore")

import gradio as gr

from groundingdino.models import build_model
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict
from groundingdino.util.inference import annotate, load_image, predict
import groundingdino.datasets.transforms as T

from huggingface_hub import hf_hub_download

# Use this command for evaluate the Grounding DINO model
config_file = "groundingdino/config/GroundingDINO_SwinT_OGC.py"
ckpt_repo_id = "ShilongLiu/GroundingDINO"
ckpt_filename = "weights/groundingdino_swint_ogc.pth"


def load_model_hf(model_config_path, repo_id, filename, device='cpu'):
    args = SLConfig.fromfile(model_config_path)
    model = build_model(args)
    args.device = device

    # cache_file = hf_hub_download(repo_id=repo_id, filename=filename)
    # checkpoint = torch.load(cache_file, map_location='cpu')
    checkpoint = torch.load(filename, map_location='cpu')
    log = model.load_state_dict(clean_state_dict(checkpoint['model']), strict=False)
    # print("Model loaded from {} \n => {}".format(cache_file, log))
    _ = model.eval()
    return model


def image_transform_grounding(init_image):
    transform = T.Compose([
        T.RandomResize([800], max_size=1333),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    image, _ = transform(init_image, None)  # 3, h, w
    return init_image, image


def image_transform_grounding_for_vis(init_image):
    transform = T.Compose([
        T.RandomResize([800], max_size=1333),
    ])
    image, _ = transform(init_image, None)  # 3, h, w
    return image


model = load_model_hf(config_file, ckpt_repo_id, ckpt_filename)


def run_grounding(input_image, grounding_caption, box_threshold, text_threshold):
    init_image = input_image.convert("RGB")
    original_size = init_image.size

    _, image_tensor = image_transform_grounding(init_image)
    image_pil: Image = image_transform_grounding_for_vis(init_image)

    # run grounding
    boxes, logits, phrases = predict(model, image_tensor, grounding_caption, box_threshold, text_threshold, device='cpu')
    annotated_frame = annotate(image_source=np.asarray(image_pil), boxes=boxes, logits=logits, phrases=phrases)
    image_with_box = Image.fromarray(cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB))

    return image_with_box


if __name__ == "__main__":
    parser = argparse.ArgumentParser("Grounding DINO demo", add_help=True)
    parser.add_argument("--debug", action="store_true", help="using debug mode")
    parser.add_argument("--share", action="store_true", help="share the app")
    args = parser.parse_args()

    block = gr.Blocks().queue()
    with block:
        gr.Markdown("# [Grounding DINO](https://github.com/IDEA-Research/GroundingDINO)")
        gr.Markdown("### Open-World Detection with Grounding DINO")

        with gr.Row():
            with gr.Column():
                input_image = gr.Image(source='upload', type="pil")
                grounding_caption = gr.Textbox(label="Detection Prompt")
                run_button = gr.Button(label="Run")
                with gr.Accordion("Advanced options", open=False):
                    box_threshold = gr.Slider(label="Box Threshold", minimum=0.0, maximum=1.0, value=0.25, step=0.001)
                    text_threshold = gr.Slider(label="Text Threshold", minimum=0.0, maximum=1.0, value=0.25, step=0.001)

            with gr.Column():
                gallery = gr.outputs.Image(
                    type="pil",
                    # label="grounding results"
                ).style(full_width=True, full_height=True)
                # gallery = gr.Gallery(label="Generated images", show_label=False).style(
                #         grid=[1], height="auto", container=True, full_width=True, full_height=True)

        run_button.click(fn=run_grounding, inputs=[input_image, grounding_caption, box_threshold, text_threshold], outputs=[gallery])

    block.launch(server_name='0.0.0.0', server_port=7579, debug=args.debug, share=args.share)

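Run python demo/gradio_demo.py from the project root. The working directory matters because the config and weight paths are relative to it. The script first installs some libraries and then starts: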

hon3.10/site-packages (from anyio->httpx->gradio==3.50.2) (1.3.1)
final text_encoder_type: bert-base-uncased
Running on local URL:  http://0.0.0.0:7579
To create a public link, set `share=True` in `launch()`.
IMPORTANT: You are using gradio version 3.50.2, however version 4.44.1 is available, please upgrade.