
A Learning Journey with the Multimodal Large Model Qwen2.5-VL

news 2025/10/21 23:18:49

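Qwen-VL is a large vision-language model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input and produces text and bounding boxes as output. The Qwen-VL series performs strongly, supports multilingual dialogue and interleaved multi-image dialogue, and supports open-domain grounding in Chinese as well as fine-grained image recognition and understanding.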

https://github.com/QwenLM/Qwen2.5-VL

Installation

pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]

Model hardware requirements (memory footprint by precision):

Precision	Qwen2.5-VL-3B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
FP32	11.5 GB	26.34 GB	266.21 GB
BF16	5.75 GB	13.17 GB	133.11 GB
INT8	2.87 GB	6.59 GB	66.5 GB
INT4	1.44 GB	3.29 GB	33.28 GB
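
The rows of the table look consistent with memory scaling linearly with the number of bits per weight. A minimal sketch of that arithmetic follows, taking the FP32 column as the baseline. The dictionary names and the `estimate_memory_gb` helper are illustrative, and the linear-scaling rule is an assumption that ignores activations, KV cache, and quantization overhead.

```python
# FP32 figures copied from the table above (GB).
FP32_GB = {
    "Qwen2.5-VL-3B": 11.5,
    "Qwen2.5-VL-7B": 26.34,
    "Qwen2.5-VL-72B": 266.21,
}
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "INT8": 8, "INT4": 4}


def estimate_memory_gb(model: str, precision: str) -> float:
    """Scale the FP32 footprint by bit width (assumed linear scaling)."""
    return FP32_GB[model] * BITS_PER_WEIGHT[precision] / 32


if __name__ == "__main__":
    for model in FP32_GB:
        row = ", ".join(
            f"{p}: {estimate_memory_gb(model, p):.2f} GB" for p in BITS_PER_WEIGHT
        )
        print(f"{model} -> {row}")
```

The estimates land within a few hundredths of a GB of the table entries, so the table can be extended to other bit widths the same way, with the caveat that real usage during inference will be higher than the weights alone.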

Model features

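Powerful document parsing: text recognition is upgraded to full-document parsing, handling multi-scene, multilingual documents that contain various embedded elements (handwriting, tables, charts, chemical formulas, and sheet music).
Precise object localization across formats: improved accuracy in detecting, pointing to, and counting objects, with support for absolute coordinates and JSON output to enable advanced spatial reasoning.
Ultra-long video understanding and fine-grained video grounding: native dynamic resolution is extended to the temporal dimension, improving understanding of videos that last for hours while extracting event segments at second-level granularity.
Enhanced agent capabilities on computers and mobile devices: advanced grounding, reasoning, and decision-making give the model stronger agent capabilities on smartphones and computers.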

Usage examples

Basic image-text Q&A

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and the processor that matches it
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Pass in text, images, or video
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output, then drop the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Multi-image input

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]
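
When the number of images varies, the message can be assembled programmatically. The sketch below builds the same structure from a list of local paths; `build_image_messages` is a hypothetical helper name, not part of `qwen_vl_utils`, and it assumes the `file://` URI form used in the message above.

```python
from pathlib import Path


def build_image_messages(image_paths, prompt):
    """Build a single-turn user message with several local images and one text query."""
    content = [
        # absolute() is needed because as_uri() rejects relative paths
        {"type": "image", "image": Path(p).absolute().as_uri()}
        for p in image_paths
    ]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_image_messages(
    ["/path/to/image1.jpg", "/path/to/image2.jpg"],
    "Identify the similarities between these images.",
)
```

The resulting `messages` list can then go through `processor.apply_chat_template` and `process_vision_info` exactly as in the basic example.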

Video understanding

Messages containing a list of images as a video and a text query

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

Messages containing a local video path and a text query

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

Messages containing a video url and a text query

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
                "min_pixels": 4 * 28 * 28,
                "max_pixels": 256 * 28 * 28,
                "total_pixels": 20480 * 28 * 28,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
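
The pixel limits above are written as multiples of 28 * 28, which suggests thinking of each 28x28 block as roughly one visual token. The sketch below shows how a frame might be resized to fit such a budget and how many tokens that implies per frame. It is a simplified stand-in written for illustration, not the resizing code in `qwen_vl_utils`; the function names are hypothetical and temporal merging across frames is ignored.

```python
import math

FACTOR = 28  # pixel unit that the limits above are expressed in


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=FACTOR):
    """Resize (height, width) to multiples of `factor` inside the pixel budget."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink both sides by the same ratio, rounding down to stay under budget.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge both sides by the same ratio, rounding up to reach the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


def approx_tokens(height, width, factor=FACTOR):
    """Approximate visual tokens for one frame: one per factor x factor block."""
    return (height // factor) * (width // factor)


if __name__ == "__main__":
    h, w = fit_to_pixel_budget(1080, 1920, 4 * 28 * 28, 256 * 28 * 28)
    print(h, w, approx_tokens(h, w))
```

Under this assumption, `max_pixels = 256 * 28 * 28` caps each frame at about 256 tokens, and `total_pixels` plays the same role for the video as a whole.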

Object detection

Locate the brown cake in the top-right corner and output its bbox coordinates in JSON format.


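Output the bbox coordinates and names of all objects in the image in JSON format, then answer the following question based on the detection results: how many objects are in the image?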
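
Replies to prompts like these arrive as text, so the JSON has to be extracted before the boxes can be counted or drawn. The sketch below is one way to do that. The `bbox_2d` and `label` keys and the sample reply are assumptions made for illustration; check them against what the model actually returns.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence that chat models often wrap JSON in


def parse_json_response(response):
    """Parse JSON from a model reply, with or without a surrounding code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    payload = match.group(1) if match else response
    return json.loads(payload.strip())


# Hypothetical reply; the keys and the coordinates are made up for the example.
sample_reply = (
    FENCE + "json\n"
    '[{"bbox_2d": [10, 20, 110, 220], "label": "cake"},'
    ' {"bbox_2d": [130, 40, 230, 240], "label": "cake"}]\n'
    + FENCE
)

detections = parse_json_response(sample_reply)
for det in detections:
    x1, y1, x2, y2 = det["bbox_2d"]  # absolute pixel coordinates (assumed order)
    print(det["label"], (x2 - x1) * (y2 - y1))
print("object count:", len(detections))
```

Counting the parsed list is a useful cross-check on the number the model states in its own answer.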


Document parsing and OCR

Please recognize all the text in the image.


Spotting all the text in the image with line-level, and output in JSON format.


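Extract the following fields from the image and output them in JSON format: ['发票代码', '发票号码', '到站', '燃油费', '票价', '乘车日期', '开车时间', '车次', '座号'] (invoice code, invoice number, arrival station, fuel surcharge, fare, travel date, departure time, train number, seat number).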
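
For key-value extraction it helps to check the reply against the requested field list, since a model may omit or rename keys. A minimal sketch, assuming the reply is a flat JSON object keyed by the requested field names; `check_fields` is a hypothetical helper and the sample values are placeholders, not real ticket data.

```python
import json

REQUESTED_FIELDS = [
    "发票代码", "发票号码", "到站", "燃油费", "票价",
    "乘车日期", "开车时间", "车次", "座号",
]


def check_fields(response_json, requested=REQUESTED_FIELDS):
    """Return (data, missing, extra) for a JSON object reply."""
    data = json.loads(response_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by field name")
    # A field counts as missing if it is absent or has an empty value.
    missing = [k for k in requested if k not in data or data[k] in ("", None)]
    extra = [k for k in data if k not in requested]
    return data, missing, extra


# Placeholder reply used only to exercise the helper.
sample_reply = '{"发票代码": "<code>", "车次": "<train>", "座号": ""}'
data, missing, extra = check_fields(sample_reply)
print("missing:", missing)
print("unexpected:", extra)
```

Fields reported as missing can be requested again in a follow-up prompt or flagged for manual review.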


Agent & Computer Use

The user query: In Hema (盒马), open the shopping cart and check out (reaching the payment page is enough) (You have done the following operation on the current device):

