
[Radxa "Orion" O6 Review] Deploying a face parsing model on the NPU

news 2025/10/22 3:24:37


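Preface

The Radxa Orion O6 (瑞莎星睿 O6) has an NPU (Neural Processing Unit) with up to 28.8 TOPS of compute and supports acceleration for the INT4 / INT8 / INT16 / FP16 / BF16 and TF32 data types. This post deploys FaceParsingBiSeNet on it using the official toolchain.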

1. FaceParsingBiSeNet ONNX inference

First, download the open-source model face_parsing_512x512.onnx from Baidu Netdisk (extraction code: 8gin).
Then write the ONNX inference script, as follows:

import os
import sys
import time
import argparse

import cv2
import numpy as np
import onnxruntime
from PIL import Image


def letterbox(image, new_shape=(640, 640), color=(114, 114, 114), auto=False, scaleFill=False, scaleup=True):
    """Letterbox an image: resize it while keeping the aspect ratio, then pad to the target size.

    :param image: input image as a numpy array (height, width, channels)
    :param new_shape: target size as (height, width)
    :param color: padding color, default (114, 114, 114)
    :param auto: pad only to the minimal rectangle (padding taken modulo 64), default False
    :param scaleFill: stretch without keeping the aspect ratio (accepted but unused here), default False
    :param scaleup: if False, only shrink and never enlarge the image, default True
    :return: processed image, scale ratio, padding size (dw, dh) per side
    """
    shape = image.shape[:2]  # current height and width
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, never up
        r = min(r, 1.0)
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))  # (width, height)
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # total padding
    if auto:  # minimal rectangle
        dw, dh = np.mod(dw, 64), np.mod(dh, 64)  # padding modulo 64
    dw /= 2  # split the padding between both sides
    dh /= 2
    if shape[::-1] != new_unpad:  # resize the image
        image = cv2.resize(image, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    image = cv2.copyMakeBorder(image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add padding
    scale_ratio = r
    pad_size = (dw, dh)
    return image, scale_ratio, pad_size


def preprocess_image(image, shape, bgr2rgb=True):
    """Image preprocessing: letterbox, optional BGR to RGB, HWC to CHW, float32."""
    img, scale_ratio, pad_size = letterbox(image, new_shape=shape)
    if bgr2rgb:
        img = img[:, :, ::-1]
    img = img.transpose(2, 0, 1)  # HWC to CHW
    img = np.ascontiguousarray(img, dtype=np.float32)
    return img, scale_ratio, pad_size


def generate_mask(img, seg, outpath, scale=0.4):
    """Visualize the segmentation result by blending class colors over the image."""
    color = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 0, 85], [255, 0, 170],
             [0, 255, 0], [85, 255, 0], [170, 255, 0], [0, 255, 85], [0, 255, 170],
             [0, 0, 255], [85, 0, 255], [170, 0, 255], [0, 85, 255], [0, 170, 255],
             [255, 255, 0], [255, 255, 85], [255, 255, 170], [255, 0, 255], [255, 85, 255]]
    img = img.transpose(1, 2, 0)  # CHW to HWC
    minidx = int(seg.min())
    maxidx = int(seg.max())
    color_img = np.zeros_like(img)
    for i in range(minidx, maxidx + 1):  # include the largest class index
        if i <= 0:  # skip the background class
            continue
        color_img[seg == i] = color[i]
    showimg = scale * img + (1 - scale) * color_img
    Image.fromarray(showimg.astype(np.uint8)).save(outpath)


if __name__ == '__main__':
    # define command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--image-path', type=str, help='path of the input image (a file)')
    parser.add_argument('--output-path', type=str, help='path for saving the segmentation visualization (a file)')
    parser.add_argument('--model-path', type=str, help='path of the ONNX model')
    args = parser.parse_args()

    # check input arguments
    if not os.path.exists(args.image_path):
        print('Cannot find the input image: {0}'.format(args.image_path))
        sys.exit(1)
    if not os.path.exists(args.model_path):
        print('Cannot find the ONNX model: {0}'.format(args.model_path))
        sys.exit(1)

    ref_size = [512, 512]

    # read image
    im = cv2.imread(args.image_path)
    img, scale_ratio, pad_size = preprocess_image(im, ref_size)
    showimg = img.copy()  # RGB, CHW copy for visualization (PIL expects RGB)

    mean = np.asarray([0.485, 0.456, 0.406])
    scale = np.asarray([0.229, 0.224, 0.225])
    mean = mean.reshape((3, 1, 1))
    scale = scale.reshape((3, 1, 1))
    # Note: torchvision-style normalization would divide by these values; this script multiplies.
    img = (img / 255 - mean) * scale
    im = img[None].astype(np.float32)
    # save the preprocessed input as calibration data for NPU PTQ
    np.save("models/ComputeVision/Semantic_Segmentation/onnx_faceparse/datasets/calibration_data.npy", im)

    # initialize the session and get a prediction (this first run also serves as warm-up)
    session = onnxruntime.InferenceSession(args.model_path, None)
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    output = session.run([output_name], {input_name: im})

    start_time = time.perf_counter()
    for _ in range(5):
        output = session.run([output_name], {input_name: im})
    end_time = time.perf_counter()
    use_time = (end_time - start_time) * 1000  # total wall time of the 5 runs in ms, not a per-run average
    fps = 1000 / use_time  # based on the 5-run total as well
    print(f"Inference time: {use_time:.2f} ms, fps: {fps:.2f}")

    # per-pixel argmax over the class axis gives the label map
    seg = np.argmax(output[0], axis=1).squeeze()
    generate_mask(showimg, seg, args.output_path)

Run inference:

python models/ComputeVision/Semantic_Segmentation/onnx_faceparse/inference_onnx.py --image-path models/ComputeVision/Semantic_Segmentation/onnx_faceparse/test_data/test_lite_face_parsing.png --output-path output/face_parsering.jpg --model-path asserts/models/bisenet/face_parsing_512x512.onnx

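Printed output:

Inference time: 1544.36 ms, fps: 0.65

This figure is the total wall time of the 5 timed runs (the script does not divide by the number of runs), so a single inference takes about 308.9 ms, or about 3.24 fps.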
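For comparing backends it is easier to work with a per-inference figure. The following is a minimal sketch of a timing helper with a warm-up phase; the name `benchmark` and its parameters are hypothetical and not part of the original scripts.

```python
import time


def benchmark(fn, warmup=1, runs=5):
    """Call fn repeatedly and return (average milliseconds per call, calls per second)."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    avg_ms = elapsed * 1000 / runs
    fps = runs / elapsed
    return avg_ms, fps
```

With the ONNX session from the script, a call could look like `benchmark(lambda: session.run([output_name], {input_name: im}))`.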

Visualization result:
[Image: segmentation overlay from ONNX inference]
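The per-class loop in generate_mask can also be written with NumPy integer indexing. This is a small sketch under the assumption that row i of a palette array holds the RGB color of class i, with row 0 (background) left black; `PALETTE`, `colorize` and `overlay` are illustrative names, and the palette is shortened to four classes.

```python
import numpy as np

# Shortened, illustrative palette: row i is the RGB color of class i, row 0 is background.
PALETTE = np.array([[0, 0, 0], [255, 85, 0], [255, 170, 0], [255, 0, 85]], dtype=np.uint8)


def colorize(seg, palette=PALETTE):
    """Map an (H, W) array of class ids to an (H, W, 3) RGB image."""
    return palette[seg]


def overlay(image_rgb, seg, alpha=0.4, palette=PALETTE):
    """Blend an (H, W, 3) RGB image with the class colors of seg."""
    color_img = colorize(seg, palette).astype(np.float32)
    blended = alpha * image_rgb.astype(np.float32) + (1 - alpha) * color_img
    return blended.astype(np.uint8)
```

Indexing the palette with the label map colors every pixel in one step, which avoids looping over class ids.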

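Code explanation

np.save("models/ComputeVision/Semantic_Segmentation/onnx_faceparse/datasets/calibration_data.npy", im)

This saves the preprocessed input tensor so that it can be used as calibration data for post-training quantization (PTQ) on the NPU.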
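The script stores a single sample of shape (1, 3, 512, 512). As a hedged sketch, several preprocessed images could be stacked into one calibration array in the same way. This assumes the toolchain's NumpyDataset treats the first axis as the sample axis, which the source does not confirm; `to_model_input` and `build_calibration_array` are hypothetical helper names.

```python
import numpy as np

MEAN = np.asarray([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
SCALE = np.asarray([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)


def to_model_input(rgb_hwc):
    """Apply the same normalization as the inference script to one RGB image (H, W, 3)."""
    chw = rgb_hwc.transpose(2, 0, 1).astype(np.float32)
    return (chw / 255 - MEAN) * SCALE


def build_calibration_array(images):
    """Stack preprocessed images into one float32 array of shape (N, 3, H, W)."""
    return np.stack([to_model_input(img) for img in images]).astype(np.float32)


if __name__ == "__main__":
    # Random stand-ins for real, already letterboxed 512x512 face images.
    rng = np.random.default_rng(0)
    fake_images = [rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8) for _ in range(4)]
    calib = build_calibration_array(fake_images)
    print(calib.shape, calib.dtype)
    # np.save("calibration_data.npy", calib)  # uncomment to write the file
```

Calibration generally benefits from samples that resemble real inputs, so the random images here only demonstrate the array layout.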

2. FaceParsingBiSeNet NPU inference

Create the cfg configuration file as follows.

[Common]
mode = build

[Parser]
model_type = onnx
model_name = face_parsing_512x512
detection_postprocess =
model_domain = image_segmentation
input_model = /home/5_radxa/ai_model_hub/asserts/models/bisenet/face_parsing_512x512.onnx
input = input
input_shape = [1, 3, 512, 512]
output = out
output_dir = ./

[Optimizer]
output_dir = ./
calibration_data = ./datasets/calibration_data.npy
calibration_batch_size = 1
metric_batch_size = 1
dataset = NumpyDataset
quantize_method_for_weight = per_channel_symmetric_restricted_range
quantize_method_for_activation = per_tensor_asymmetric
save_statistic_info = True

[GBuilder]
outputs = bisenet.cix
target = X2_1204MP3
profile = True
tiling = fps
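Before running the build, the cfg can be given a quick sanity check. The sketch below uses Python's standard configparser, which happens to read this INI-like syntax; it is an assumption that this mirrors how cixbuild itself parses the file, and `check_cfg` is a hypothetical helper. The embedded cfg text is shortened for illustration.

```python
import ast
import configparser

# Shortened copy of the cfg above, for illustration only.
CFG_TEXT = """
[Common]
mode = build

[Parser]
model_type = onnx
input = input
input_shape = [1, 3, 512, 512]
output = out

[Optimizer]
calibration_data = ./datasets/calibration_data.npy

[GBuilder]
outputs = bisenet.cix
"""

REQUIRED_SECTIONS = ("Common", "Parser", "Optimizer", "GBuilder")


def check_cfg(text):
    """Parse the cfg text and return the Parser settings that most often cause build errors."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [s for s in REQUIRED_SECTIONS if not parser.has_section(s)]
    if missing:
        raise ValueError(f"missing sections: {missing}")
    return {
        "input": parser["Parser"]["input"],
        "output": parser["Parser"]["output"],
        "input_shape": ast.literal_eval(parser["Parser"]["input_shape"]),
    }


if __name__ == "__main__":
    print(check_cfg(CFG_TEXT))
```

A section header that ends up fused onto the previous line (for example `mode = build[Parser]`) shows up here as a missing section.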

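Note: input and output in [Parser] are the names of the model's input and output tensors; you can look them up by opening the ONNX model in Netron. If a name does not match the model, the build fails with an error like this: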

[I] Build with version 6.1.3119
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model face_parsing_512x512...
2025-04-18 11:13:53.104146: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/miniconda3/envs/radxa/lib/python3.8/site-packages/cv2/../../lib64:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-18 11:13:53.104217: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-04-18 11:13:54.266791: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/miniconda3/envs/radxa/lib/python3.8/site-packages/cv2/../../lib64:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-18 11:13:54.266893: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2025-04-18 11:13:54.266959: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (chenjun): /proc/driver/nvidia/version does not exist
[E] [Parser]: Graph does not contain such a node/tensor name:output
[E] [Parser]: Parser Failed!
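The tensor names can also be read programmatically with the same onnxruntime calls the inference script uses (get_inputs and get_outputs). A minimal sketch follows; `io_names` and `load_and_list` are hypothetical helper names.

```python
def io_names(session):
    """Return (input names, output names) of an onnxruntime InferenceSession."""
    inputs = [i.name for i in session.get_inputs()]
    outputs = [o.name for o in session.get_outputs()]
    return inputs, outputs


def load_and_list(model_path):
    """Open an ONNX model on the CPU and list its input and output tensor names."""
    import onnxruntime  # imported lazily so io_names works without onnxruntime installed

    session = onnxruntime.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return io_names(session)
```

The names returned for face_parsing_512x512.onnx are the values to put into input and output in the [Parser] section.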

Compiling the model
Compile the model on an x86 host:

cd models/ComputeVision/Semantic_Segmentation/onnx_faceparse
cixbuild ./cfg/onnx_bisenet.cfg

This fails with: ImportError: libaipu_simulator_x2.so: cannot open shared object file: No such file or directory

Solution

Locate the library with find / -name libaipu_simulator_x2.so
Add the directory that contains it to the environment with export LD_LIBRARY_PATH=/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib:$LD_LIBRARY_PATH (LD_LIBRARY_PATH takes directories, not the path of the .so file itself)
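Since LD_LIBRARY_PATH holds directories, what matters is the directory containing libaipu_simulator_x2.so. The following sketch searches a directory tree for the library and prints a matching export line; `find_library_dirs` and `export_line` are hypothetical helpers, and the search root is an assumption to adapt to the local environment.

```python
import os


def find_library_dirs(root, filename="libaipu_simulator_x2.so"):
    """Return every directory under root that contains filename."""
    return [dirpath for dirpath, _dirs, files in os.walk(root) if filename in files]


def export_line(directory):
    """Build the shell command that prepends directory to LD_LIBRARY_PATH."""
    return f"export LD_LIBRARY_PATH={directory}:$LD_LIBRARY_PATH"


if __name__ == "__main__":
    # Assumed search root; point this at the Python environment that holds AIPUBuilder.
    for directory in find_library_dirs(os.path.expanduser("~")):
        print(export_line(directory))
```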

Output of a successful build:

[I] Build with version 6.1.3119
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model face_parsing_512x512...
2025-04-18 11:20:16.726111: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/miniconda3/envs/radxa/lib/python3.8/site-packages/cv2/../../lib64:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-18 11:20:16.726199: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-04-18 11:20:17.908202: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /root/miniconda3/envs/radxa/lib/python3.8/site-packages/cv2/../../lib64:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/root/miniconda3/envs/radxa/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-18 11:20:17.908267: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2025-04-18 11:20:17.908283: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (chenjun): /proc/driver/nvidia/version does not exist
[W] [Parser]: The output name out is not a node but a tensor. However, we will use the node Resize_161 as output node.
2025-04-18 11:20:21.229063: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I] [Parser]: The input tensor(s) is/are: input_0
[I] [Parser]: Input input from cfg is shown as tensor input_0 in IR!
[I] [Parser]: Output out from cfg is shown as tensor Resize_161_post_transpose_0 in IR!
[I] [Parser]: 0 error(s), 1 warning(s) generated.
[I] [Parser]: Parser done!
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/5_radxa/ai_model_hub/models/ComputeVision/Semantic_Segmentation/onnx_faceparse/internal/face_parsing_512x512.txt
[I] [IRChecker] model_name: face_parsing_512x512
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/5_radxa/ai_model_hub/models/ComputeVision/Semantic_Segmentation/onnx_faceparse/./internal/face_parsing_512x512.bin size: 0x322454c
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
OpType.Eltwise:   -3
OpType.Mul:   +3
OpType.Tile:   -3
------------------------------------------------------------------------
(output omitted)
[I] [builder.cpp:1939]     Read and Write:80.21MB
[I] [builder.cpp:1080] Reduce constants memory size: 3.477MB
[I] [builder.cpp:2411] memory statistics for this graph (face_parsing_512x512)
[I] [builder.cpp: 585] Total memory     :       0x00d52b98 Bytes ( 13.323MB)
[I] [builder.cpp: 585] Text      section:       0x00042200 Bytes (  0.258MB)
[I] [builder.cpp: 585] RO        section:       0x00006d00 Bytes (  0.027MB)
[I] [builder.cpp: 585] Desc      section:       0x0002ea00 Bytes (  0.182MB)
[I] [builder.cpp: 585] Data      section:       0x00c8b360 Bytes ( 12.544MB)
[I] [builder.cpp: 585] BSS       section:       0x0000fb38 Bytes (  0.061MB)
[I] [builder.cpp: 585] Stack            :       0x00040400 Bytes (  0.251MB)
[I] [builder.cpp: 585] Workspace(BSS)   :       0x004c0000 Bytes (  4.750MB)
[I] [builder.cpp:2427]
[I] [tools.cpp :1181]  -  compile time: 20.726 s
[I] [tools.cpp :1087] With GM optimization, DDR Footprint stastic(estimation):
[I] [tools.cpp :1094]     Read and Write:92.67MB
[I] [tools.cpp :1137]  -  draw graph time: 0.03 s
[I] [tools.cpp :1954] remove global cwd: /tmp/af3c1da8ea81cc1cf85dba1587ff72126ee96222bb098b52633050918b4c7
build success.......
Total errors: 0,  warnings: 15

NPU inference and visualization
Write the NPU inference script, which visualizes the result and measures the inference time:

import os
import sys
import time
import argparse

import cv2
import numpy as np
from PIL import Image

# Absolute path of the ai_model_hub repository root, which contains the utils package
_abs_path = "/home/radxa/1_AI_models/ai_model_hub"
# Append it to the system path so that the utils package can be imported
sys.path.append(_abs_path)
from utils.tools import get_file_list
from utils.NOE_Engine import EngineInfer


def letterbox(image, new_shape=(640, 640), color=(114, 114, 114), auto=False, scaleFill=False, scaleup=True):
    """Letterbox an image: resize it while keeping the aspect ratio, then pad to the target size.

    :param image: input image as a numpy array (height, width, channels)
    :param new_shape: target size as (height, width)
    :param color: padding color, default (114, 114, 114)
    :param auto: pad only to the minimal rectangle (padding taken modulo 64), default False
    :param scaleFill: stretch without keeping the aspect ratio (accepted but unused here), default False
    :param scaleup: if False, only shrink and never enlarge the image, default True
    :return: processed image, scale ratio, padding size (dw, dh) per side
    """
    shape = image.shape[:2]  # current height and width
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, never up
        r = min(r, 1.0)
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))  # (width, height)
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # total padding
    if auto:  # minimal rectangle
        dw, dh = np.mod(dw, 64), np.mod(dh, 64)  # padding modulo 64
    dw /= 2  # split the padding between both sides
    dh /= 2
    if shape[::-1] != new_unpad:  # resize the image
        image = cv2.resize(image, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    image = cv2.copyMakeBorder(image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add padding
    scale_ratio = r
    pad_size = (dw, dh)
    return image, scale_ratio, pad_size


def preprocess_image(image, shape, bgr2rgb=True):
    """Image preprocessing: letterbox, optional BGR to RGB, HWC to CHW, float32."""
    img, scale_ratio, pad_size = letterbox(image, new_shape=shape)
    if bgr2rgb:
        img = img[:, :, ::-1]
    img = img.transpose(2, 0, 1)  # HWC to CHW
    img = np.ascontiguousarray(img, dtype=np.float32)
    return img, scale_ratio, pad_size


def generate_mask(img, seg, outpath, scale=0.4):
    """Visualize the segmentation result by blending class colors over the image."""
    color = [[255, 0, 0], [255, 85, 0], [255, 170, 0], [255, 0, 85], [255, 0, 170],
             [0, 255, 0], [85, 255, 0], [170, 255, 0], [0, 255, 85], [0, 255, 170],
             [0, 0, 255], [85, 0, 255], [170, 0, 255], [0, 85, 255], [0, 170, 255],
             [255, 255, 0], [255, 255, 85], [255, 255, 170], [255, 0, 255], [255, 85, 255]]
    img = img.transpose(1, 2, 0)  # CHW to HWC
    minidx = int(seg.min())
    maxidx = int(seg.max())
    color_img = np.zeros_like(img)
    for i in range(minidx, maxidx + 1):  # include the largest class index
        if i <= 0:  # skip the background class
            continue
        color_img[seg == i] = color[i]
    showimg = scale * img + (1 - scale) * color_img
    Image.fromarray(showimg.astype(np.uint8)).save(outpath)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--image-path', type=str, help='path of the input image (a file)')
    parser.add_argument('--output-path', type=str, help='path for saving the segmentation visualization (a file)')
    parser.add_argument('--model-path', type=str, help='path of the compiled NPU model (.cix)')
    args = parser.parse_args()

    model = EngineInfer(args.model_path)
    ref_size = [512, 512]

    # read image
    im = cv2.imread(args.image_path)
    img, scale_ratio, pad_size = preprocess_image(im, ref_size)
    showimg = img.copy()  # RGB, CHW copy for visualization (PIL expects RGB)

    mean = np.asarray([0.485, 0.456, 0.406])
    scale = np.asarray([0.229, 0.224, 0.225])
    mean = mean.reshape((3, 1, 1))
    scale = scale.reshape((3, 1, 1))
    # Note: torchvision-style normalization would divide by these values; this script multiplies.
    img = (img / 255 - mean) * scale
    im = img[None].astype(np.float32)

    # inference
    input_data = [im]
    # output = model.forward(input_data)[0]
    N = 5
    start_time = time.perf_counter()
    for _ in range(N):
        output = model.forward(input_data)[0]
    end_time = time.perf_counter()
    use_time = (end_time - start_time) * 1000 / N  # average ms per run
    fps = N / (end_time - start_time)
    print(f"Inference time including input quantization and output dequantization: {use_time:.2f} ms, fps: {fps:.2f}")
    fps = model.get_ave_fps()
    use_time2 = 1000 / fps
    print(f"NPU compute time: {use_time2:.2f} ms, fps: {fps:.2f}")

    # reshape the flat output to (batch, classes, H, W) and take the per-pixel argmax
    output = np.reshape(output, (1, 19, 512, 512))
    seg = np.argmax(output, axis=1).squeeze()
    generate_mask(showimg, seg, args.output_path)

    # release the model
    model.clean()

Run inference:

source /home/radxa/1_AI_models/ai_model_hub/.venv/bin/activate
python models/ComputeVision/Semantic_Segmentation/onnx_faceparse/inference_npu.py --image-path models/ComputeVision/Semantic_Segmentation/onnx_faceparse/test_data/test_lite_face_parsing.png --output-path output/face_parsering.jpg --model-path models/ComputeVision/Semantic_Segmentation/onnx_faceparse/bisenet.cix

Inference time:

npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
Inference time including input quantization and output dequantization: 379.63 ms, fps: 2.63
NPU compute time: 10.70 ms, fps: 93.43
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success

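As the numbers show, input quantization and output dequantization are still very time-consuming: of the 379.63 ms per inference, the NPU computation itself takes only about 10.70 ms.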
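To give a sense of what this overhead involves, here is a minimal NumPy sketch of 8-bit per-tensor asymmetric quantization, the scheme named in the cfg's quantize_method_for_activation. It illustrates the general technique only; how NOE_Engine actually quantizes and dequantizes is not shown in the source, and the function names are hypothetical.

```python
import numpy as np


def quantize_asymmetric(x):
    """Quantize a float array to uint8 with one scale and zero point for the whole tensor."""
    qmin, qmax = 0, 255
    x_min = min(float(x.min()), 0.0)  # the represented range must include zero
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map uint8 values back to float32."""
    return (q.astype(np.float32) - zero_point) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(1, 19, 512, 512)).astype(np.float32)  # same shape as the model output
    q, scale, zero_point = quantize_asymmetric(x)
    x_hat = dequantize(q, scale, zero_point)
    print(q.dtype, scale, zero_point, float(np.abs(x_hat - x).max()))
```

The 1 x 19 x 512 x 512 output holds roughly 5 million values, so even simple elementwise conversions like these touch a lot of memory on every inference.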

The visualization looks correct.
[Image: segmentation overlay from NPU inference]