
[“星睿O6” AI PC Development Kit Review] GPU matrix-instruction throughput, GPU bandwidth, and NPU compute tests

Arm China (安谋科技), Cix Technology (此芯科技), and Radxa (瑞莎计算机) have jointly built the “星睿O6” development kit, targeting AI PC, edge, robotics, and other scenarios.

The kit heterogeneously integrates Arm®v9 CPU cores, an Arm Immortalis™ GPU, and Arm China's “周易” (Zhouyi) NPU.

Unboxing and system configuration


Just flash the Debian system following the documentation here:
https://docs.radxa.com/orion/o6/getting-started/quick-start

Install the NPU tools according to the document below. Note that you must use Python 3.8 for the installation to succeed; if the system has multiple Python versions, use python3.8 -m pip install CixBuilder-6.1.2958.1-py3-none-any.whl
https://docs.radxa.com/orion/o6/app-development/artificial-intelligence/npu-introduction

Tips

  • The fan is loud; control the fan speed

Run as root; the valid range is 0–255

echo 30 > /sys/class/hwmon/hwmon1/pwm1
  • Fix the su permission error

Run as root

chmod 4755 /bin/su

GPU compute test

  • Lock the GPU at its highest frequency

Run as root

echo "performance" > /sys/class/misc/mali0/device/devfreq/15000000.gpu/governor

Install the vulkaninfo tool; it shows that the system already supports Vulkan out of the box

apt install vulkan-tools

The GPU uses the mali_kbase kernel driver, the same family as on Android, rather than the panthor driver used by the open-source Mesa stack, and the user-space driver is closed source

root@orion-o6:~# cat /sys/class/misc/mali0/device/devfreq/15000000.gpu/max_freq
900000000
root@orion-o6:~# lsmod | grep mali
mali_kbase           1044480  22

The GPU compute test uses the vkpeak project: https://github.com/nihui/vkpeak

vulkaninfo reveals that the Mali Immortalis-G720 also supports the cooperative matrix extension; the supported data types include fp16 × fp16 accumulated into fp32, with M=16, N=32, K=32

https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_cooperative_matrix.html

vkpeak has not yet been adapted to this MNK configuration, so I modified the Vulkan shader as shown below to add it; it runs matrix multiply-accumulate (mla) in a loop and derives GFLOPS from the measured time

Since Arm Mali GPUs generally have no dedicated shared-memory hardware, the broadcasting behavior of coopMatLoad is used to keep memory accesses to a minimum

#version 450

#extension GL_EXT_shader_16bit_storage: require
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require
#extension GL_KHR_memory_scope_semantics: require
#extension GL_EXT_shader_explicit_arithmetic_types: require
#extension GL_KHR_cooperative_matrix: require

layout (constant_id = 0) const int loop = 1;

layout (binding = 0) writeonly buffer c_blob { uvec4 c_blob_data[]; };

shared uvec4 tmp_a[2];
shared uvec4 tmp_b[4];
shared uvec4 tmp_c[4];

void main()
{
    const int gx = int(gl_GlobalInvocationID.x);
    const int lx = int(gl_LocalInvocationID.x);

    if (lx < 2)
    {
        tmp_a[lx] = uvec4(gx);
        tmp_b[lx] = uvec4(lx);
    }
    barrier();

    coopmat<float16_t, gl_ScopeSubgroup, 16, 32, gl_MatrixUseA> a;
    coopmat<float16_t, gl_ScopeSubgroup, 32, 32, gl_MatrixUseB> b;
    coopMatLoad(a, tmp_a, 0, 0, gl_CooperativeMatrixLayoutRowMajor);
    coopMatLoad(b, tmp_b, 0, 0, gl_CooperativeMatrixLayoutRowMajor);

    coopmat<float, gl_ScopeSubgroup, 16, 32, gl_MatrixUseAccumulator> c = coopmat<float, gl_ScopeSubgroup, 16, 32, gl_MatrixUseAccumulator>(0.f);

    for (int i = 0; i < loop; i++)
    {
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
        c = coopMatMulAdd(a, b, c);
    }

    coopMatStore(c, tmp_c, 0, 0, gl_CooperativeMatrixLayoutRowMajor);
    barrier();

    if (lx < 4)
    {
        c_blob_data[gx] = tmp_c[lx];
    }
}
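
For reference, the GFLOPS figure simply comes from the number of coopMatMulAdd operations issued divided by the measured time. The sketch below shows that accounting; the loop count, subgroup count, and elapsed time are placeholder values of my own, and vkpeak's internal bookkeeping may differ in detail.

# Back-of-envelope GFLOPS accounting for the shader above.
# loop, subgroups and elapsed are hypothetical placeholders, not measured values.
M, N, K = 16, 32, 32
flops_per_mma = 2 * M * N * K        # each coopMatMulAdd does M*N*K multiply-adds per subgroup
mma_per_iteration = 16               # the loop body is unrolled 16 times
loop = 1024                          # specialization constant (placeholder)
subgroups = 65536                    # number of subgroups dispatched (placeholder)
elapsed = 0.05                       # measured seconds (placeholder)

total_flops = flops_per_mma * mma_per_iteration * loop * subgroups
print('GFLOPS =', total_flops / elapsed / 1e9)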

vkpeak test screenshot: fp16 delivers twice the fp32 throughput, while fp16-fp32 matrix throughput is close to plain fp32. Because the current driver does not implement an fp16 accumulation type, the practical gain for neural-network workloads may be smaller than simply computing with fp16 vec4

The GPU does not support fp64

[Screenshot: vkpeak results]

GPU bandwidth test

GPU reduce-sum is a classic computation; when highly optimized it is usually limited by GPU memory bandwidth

There is a very good optimization tutorial here: https://developer.download.nvidia.cn/assets/cuda/files/reduction.pdf

Building on that tutorial, I added several kernel variants tuned to the characteristics of mobile GPUs and other GPUs, and measured the reduce memory bandwidth of each

The results show the Mali Immortalis-G720 is fastest with the v6 kernel, at a bandwidth of 13.47 GB/s

llvmpipe is Mesa's CPU software implementation, included here for comparison

[Screenshot: reduce-sum bandwidth comparison]

The relevant Vulkan reduce-sum code is below (for brevity, parts not relevant to the Mali G720 have been removed)

#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

layout (local_size_x_id = 0) in;

layout (binding = 0) readonly buffer in_blob { int in_data[]; };
layout (binding = 1) writeonly buffer out_blob { int out_data[]; };

layout (push_constant) uniform parameter
{
    int count;
} p;

shared int sdata[gl_WorkGroupSize.x];

void main()
{
    const uint lx = gl_LocalInvocationID.x;

    const uint gx0 = gl_WorkGroupID.x * 2 * gl_WorkGroupSize.x + lx;
    const uint gx1 = gx0 + gl_WorkGroupSize.x;

    // load data from global memory to shared memory
    int in0 = gx0 < p.count ? in_data[gx0] : 0;
    int in1 = gx1 < p.count ? in_data[gx1] : 0;
    sdata[lx] = in0 + in1;

    // synchronize to ensure all data is loaded
    barrier();
    memoryBarrierShared();

    // perform reduction in shared memory
    if (gl_WorkGroupSize.x >= 64 && gl_SubgroupSize < 32)
    {
        if (lx < 32) sdata[lx] += sdata[lx + 32];
        barrier();
        memoryBarrierShared();
    }

    // subgroup reduce
    const uint sid = gl_SubgroupInvocationID;
    int s = 0;
    if (gl_SubgroupID == 0)
    {
        s = sdata[sid] + sdata[sid + gl_SubgroupSize];
        s = subgroupAdd(s);
    }

    // write result for this block to global memory
    if (lx == 0)
    {
        out_data[gl_WorkGroupID.x] = s;
    }
}
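
The bandwidth number reported for each kernel is just bytes moved divided by elapsed time. A minimal host-side sketch of that calculation (the element count and timing below are placeholders, not the actual test configuration):

# Effective bandwidth estimate for one reduce pass over int32 data.
# count and elapsed are placeholders; the real harness supplies its own values.
count = 64 * 1024 * 1024             # number of int32 elements reduced (placeholder)
elapsed = 0.02                       # measured seconds for the pass (placeholder)
bytes_moved = count * 4              # the kernel reads each int32 input once; partial-sum writes are negligible
print('bandwidth GB/s =', bytes_moved / elapsed / 1e9)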

Notes on converting a PyTorch model to the NPU

Define a simple PyTorch model that just performs ten matrix multiplications, which makes compute easy to measure. Export the ONNX model together with x.npy, which the NPU compiler will later use for quantization calibration

import torch
import numpy

class MatMulNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4000, 4000, bias=False)

    def forward(self, x):
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        x = self.linear(x)
        return x

x = torch.rand((1, 4000, 4000))

model = MatMulNet()

torch.onnx.export(model, x, 'matmulnet.onnx', input_names=['in0'], output_names=['out0'])

numpy.save('x.npy', x.numpy())
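
Optionally (this is my own extra step, not required by the NPU toolchain), the exported ONNX model can be sanity-checked against the PyTorch model with onnxruntime before spending time in the compiler, continuing from the script above:

# Optional check: compare ONNX output with PyTorch output (assumes onnxruntime is installed).
import onnxruntime as ort

sess = ort.InferenceSession('matmulnet.onnx')
onnx_out = sess.run(None, {'in0': x.numpy()})[0]
with torch.no_grad():
    torch_out = model(x).numpy()
print('max abs diff =', numpy.abs(onnx_out - torch_out).max())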

Then write a matching matmulnet.cfg configuration; the main task is to adjust input_shape, input, calibration_data, and related settings to the model

[Common]
mode = build

[Parser]
model_type = onnx
model_name = matmulnet
detection_postprocess =
model_domain = image_classification
input_model = ./matmulnet.onnx
output_dir = ./
input_shape = [1, 4000, 4000]
input = in0

[Optimizer]
calibration_data = x.npy
calibration_batch_size = 1
metric_batch_size = 1
output_dir = ./
dataset = numpydataset
save_statistic_info = True
cast_dtypes_for_lib = True

[GBuilder]
target = X2_1204MP3
outputs = matmulnet.cix
profile = True
tiling = fps
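
Before launching the build it is worth a quick check (again my own habit, not part of the official flow) that x.npy really matches the input_shape declared in the cfg, since a mismatch only surfaces after a long compile:

# Sanity check that the calibration data matches input_shape in matmulnet.cfg.
import numpy

x = numpy.load('x.npy')
assert x.shape == (1, 4000, 4000), x.shape
print('calibration data OK:', x.shape, x.dtype)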

Run cixbuild matmulnet.cfg to perform NPU model optimization and quantization calibration and save the final matmulnet.cix model file. This step is slow and heavy on CPU, memory, and disk, almost as if you were training the model on the CPU

nihui@nihui-pc:~/dev/o6-test$ cixbuild matmulnet.cfg
[I] Build with version 6.1.2958
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model matmulnet...
2025-03-30 17:16:02.520291: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/nihui/.local/lib/python3.8/site-packages/cv2/../../lib64:
2025-03-30 17:16:02.520314: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-03-30 17:16:03.533484: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/nihui/.local/lib/python3.8/site-packages/cv2/../../lib64:
2025-03-30 17:16:03.533509: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2025-03-30 17:16:03.533539: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (nihui-pc): /proc/driver/nvidia/version does not exist
2025-03-30 17:16:05.448576: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I] [Parser]: The input tensor(s) is/are: in0_0
[I] [Parser]: Input in0 from cfg is shown as tensor in0_0 in IR!
[I] [Parser]: 0 error(s), 0 warning(s) generated.
[I] [Parser]: Parser done!
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_3_30_17_16_1_mhe2r/matmulnet.txt
[I] [IRChecker] model_name: matmulnet
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_3_30_17_16_1_mhe2r/matmulnet.bin size: 0x26281100
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] Simplify Done.
[I] Simplify float model Done.
[I] Optimizing model....
[I] [OPT] [17:16:10]: [arg_parser] is running.
[I] [OPT] [17:16:10]: tool name: Compass-Optimizer, version: 1.3.2958, use cuda: False, running device: cpu
[I] [OPT] [17:16:10]: [quantization config Info][model name]: matmulnet, [quantization method for weight]: per_tensor_symmetric_restricted_range, [quantization method for activation]: per_tensor_symmetric_full_range, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8
[I] [OPT] [17:16:10]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
[I] [OPT] [17:16:10]: IR loaded.
Building graph: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 1651.05it/s]
[I] [OPT] [17:16:10]: Begin to load weights.
[I] [OPT] [17:16:10]: Weights loaded.
Deserializing bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 89.12it/s]
[I] [OPT] [17:16:10]: Successfully parsed IR with python API.
[I] [OPT] [17:16:10]: init graph by forwarding one sample filled with zeros
forward_to: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00,  5.23it/s]
[I] [OPT] [17:16:13]: [graph_optimize_stage1] is running.
[I] [OPT] [17:16:13]: [statistic] is running.
statistic batch: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.98s/it]
[I] [OPT] [17:16:19]: [graph_optimize_stage2] is running.
[I] [OPT] [17:16:19]: applying calibration strategy based on statistic info
calibration: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 5166.87it/s]
[I] [OPT] [17:16:19]: [quantize] is running.
[I] [OPT] [17:16:20]: These OPs will automatically cast dtypes to adapt to lib's dtypes' spec (may cause model accuracy loss due to corresponding spec's restriction): {'OpType.Input', 'OpType.Reshape', 'OpType.FullyConnected'}
quantize each layer: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 16.98it/s]
[I] [OPT] [17:16:22]: collecting per-layer similarity infomation between float graph and quanted graph by forwarding 1 sample on both of them
[I] [OPT] [17:16:29]: [graph_optimize_stage3] is running.
[I] [OPT] [17:16:29]: [serialize] is running.
[I] [OPT] [17:16:29]: check the final graph by forwarding one sample filled with zeros
forward_to: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:03<00:00,  3.78it/s]
[I] [OPT] [17:16:33]: Begin to serialzie IR
Writing IR: 13it [00:00, 628.59it/s]
Serializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 419.53it/s]
[I] [OPT] [17:16:33]: IR has been saved into /home/nihui/dev/o6-test/./internal_2025_3_30_17_16_1_mhe2r
[I] [OPT] [17:16:33]: Compass-Optimizer has done at [serialize] period.
[I] [OPT] [17:16:33]: [Done]cost time: 31s, and [scale]: out: [tensor([9942.5059])] in: [tensor([255.0000])] [output tensors cosine]: [0.9991913553234072][output tensors MSE]: [8.952337537948551e-09]
[I] Optimizing model complete
[I] Simplifying quant model...
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_3_30_17_16_1_mhe2r/matmulnet_quant.txt
[I] [IRChecker] model_name: matmulnet
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_3_30_17_16_1_mhe2r/matmulnet_quant.bin size: 0x98bd900
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] Simplify Done.
[I] Simplify quant model Done.
[I] Building ...
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_3_30_17_16_1_mhe2r/matmulnet_quant_s.txt
[I] [IRChecker] model_name: matmulnet
[I] [IRChecker] IRChecker: All IR pass
[I] [tools.cpp : 342] BuildTool version: 6.1.2958. Build for target X2_1204MP3 PID: 24109
[I] [tools.cpp : 362] using default profile events to profile default
[I] [tools.cpp : 781] global cwd: /tmp/9845fce62963e3e71cf53fe8278fa0a4fdb2f2accb810446318bc27aff74
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_3_30_17_16_1_mhe2r/matmulnet_quant_s.bin size: 0x98bd900
[I] [tiling.cpp:5112] Auto tiling now, please wait ...
[I] [aipu_plugin.cpp: 344] Convolution(/linear/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_1/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_2/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_3/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_4/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_5/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_6/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_7/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_8/MatMul) uses performance-lib
[I] [aipu_plugin.cpp: 344] Convolution(/linear_9/MatMul) uses performance-lib
[I] [actg.cpp  : 473] new sgnode with actg: 0
[I] [datalayout_schedule2.cpp:1067] Layout loss: 10
[I] [datalayout_schedule2.cpp:1068] Layout scheduling ...
[I] [datalayout_schedule2.cpp:1071] The layout loss for graph matmulnet: 1
[I] [datalayout_schedule.cpp: 776] The graph matmulnet post optimized score:0
[I] [datalayout_schedule.cpp: 789] layout schedule costs: 0.392489ms
[I] [IRChecker] Start to check IR:
[I] [IRChecker] model_name: cost_model
[I] [IRChecker] IRChecker: All IR pass
[I] [load_balancer.cpp:2152] enable multicore schedule optimization for load balance strategy 0 it may degrade performance on single core targets.
[I] [load_balancer.cpp:1233] ----------------------------------------------
[I] [load_balancer.cpp:1234] Scheduler Optimization Performance Evaluation:
[I] [load_balancer.cpp:1271] level: 0 cycles: 0 utils: 0 0 0
[I] [load_balancer.cpp:1271] level: 1 cycles: 93004044 utils: 1 0 0
[I] [load_balancer.cpp:1277] total cycles: 93004044
[I] [load_balancer.cpp:1278] ----------------------------------------------
[I] [load_balancer.cpp: 141] schedule level: done
[I] [load_balancer.cpp: 144] [level 0]
[I] [load_balancer.cpp:  93] subgraph_in0
[I] [load_balancer.cpp: 104] -*-[real]in0
[I] [load_balancer.cpp: 148] [load] 0
[I] [load_balancer.cpp: 144] [level 1]
[I] [load_balancer.cpp:  93] subgraph_subgraph_reshape
[I] [load_balancer.cpp: 104] -*-[real]subgraph_reshape_sg_input_0
[I] [load_balancer.cpp: 104] -*-[real]reshape
[I] [load_balancer.cpp: 104] -*-[real]reshape/layout/NCHWC32
[I] [load_balancer.cpp:  93] -*-subgraph_/linear/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_1/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_1/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_2/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_2/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_3/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_3/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_4/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_4/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_5/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_5/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_6/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_6/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_7/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_7/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_8/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_8/MatMul
[I] [load_balancer.cpp:  93] -*-subgraph_/linear_9/MatMul
[I] [load_balancer.cpp: 104] -*--*-[real]/linear_9/MatMul
[I] [load_balancer.cpp: 104] -*-[real]/linear_9/MatMul_post_reshape
[I] [load_balancer.cpp: 148] [load] 93004044
[I] [load_balancer.cpp: 151] schedule level: done done
[I] [mc_scheduler_mem_alloc.cpp: 422] with GM optimization reduce footprint:0B
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(reshape/layout/NCHWC32)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(reshape/layout/NCHWC32)uses tensor-process-lib
[I] [layoutconvertor.cpp: 258] Building reshape/layout/NCHWC32...
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(reshape/layout/NCHWC32)uses tensor-process-lib
[I] [builder.cpp:1788] The graph DDR Footprint requirement(estimation) of feature maps:
[I] [builder.cpp:1789]     Read and Write:335.69MB
[I] [builder.cpp:1043] Reduce constants memory size: 137.404MB
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp:  63] /usr/bin//../lib//libmcheck.ais not a archive file.
[I] [builder.cpp:2250] memory statistics for this graph (matmulnet)
[I] [builder.cpp: 559] Total memory     :       0x01004f90 Bytes ( 16.019MB)
[I] [builder.cpp: 559] Text      section:       0x00024750 Bytes (  0.142MB)
[I] [builder.cpp: 559] RO        section:       0x00000d00 Bytes (  0.003MB)
[I] [builder.cpp: 559] Desc      section:       0x00048300 Bytes (  0.282MB)
[I] [builder.cpp: 559] Data      section:       0x00f42850 Bytes ( 15.260MB)
[I] [builder.cpp: 559] BSS       section:       0x00014bf0 Bytes (  0.081MB)
[I] [builder.cpp: 559] Stack            :       0x00040400 Bytes (  0.251MB)
[I] [builder.cpp: 559] Workspace(BSS)   :       0x00000000 Bytes (  0.000MB)
[I] [builder.cpp:2266]
[I] [tools.cpp :1127]  -  compile time: 2.624 s
[I] [tools.cpp :1033] With GM optimization, DDR Footprint stastic(estimation):
[I] [tools.cpp :1040]     Read and Write:488.36MB
[I] [tools.cpp :1083]  -  draw graph time: 0.002 s
[I] [tools.cpp :1766] remove global cwd: /tmp/9845fce62963e3e71cf53fe8278fa0a4fdb2f2accb810446318bc27aff74
Serialization Model: /home/nihui/dev/o6-test/matmulnet.cix
build success.......
Total errors: 0,  warnings: 2

Copy both matmulnet.cix and x.npy to the Orion O6 board, then write a minimal NPU inference test that records the time of one inference and derives the int8 TOPS

The Orion O6 Debian system ships with the libnoe runtime library preinstalled, so it can be used directly

import numpy
import time
from libnoe import *

npu = NPU()

npu.noe_init_context()
print('noe_init_context done')

graph_id = npu.noe_load_graph('./matmulnet.cix')['data']
print('noe_load_graph done')

input_datatype = npu.noe_get_tensor_descriptor(graph_id, NOE_TENSOR_TYPE_INPUT, 0).data_type
output_datatype = npu.noe_get_tensor_descriptor(graph_id, NOE_TENSOR_TYPE_OUTPUT, 0).data_type
print('noe_get_tensor_descriptor done')

job_cfg = { "partition_id": 0, "dbg_dispatch": 0, "dbg_core_id": 0, "qos_level": 0, }
fm_idxes = []
wt_idxes = []
job_id = npu.noe_create_job(graph_id, job_cfg, fm_idxes, wt_idxes)['data']
print('noe_create_job done')

x = numpy.load('x.npy')
npu.noe_load_tensor(job_id, 0, x.tobytes())
print('noe_load_tensor done')

# infer
t0 = time.perf_counter()
npu.noe_job_infer_sync(job_id, -1)
t1 = time.perf_counter()
duration = t1 - t0
print('noe_job_infer_sync done ', duration * 1000, ' ms')
print('gi8ops = ', 4000 * 4000 * 4000 * 10.0 / (1024 * 1024 * 1024) / duration * 2)

out = npu.noe_get_tensor(job_id, NOE_TENSOR_TYPE_OUTPUT, 0, D_INT8)['data']
print('noe_get_tensor done')

npu.noe_clean_job(job_id)
print('noe_clean_job done')

npu.noe_unload_graph(graph_id)
print('noe_unload_graph done')

npu.noe_deinit_context()
print('noe_deinit_context done')

The run is shown below: ten 4000×4000 matrix multiplications, quantized to int8, take 343 ms on the NPU, which is equivalent to 3.47 TOPS
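
The 3.47 TOPS figure follows directly from the measured time and the op count used in the gi8ops print above:

# Same accounting as the gi8ops print in the test script.
duration = 0.343                                   # seconds, from the run above
ops = 4000 * 4000 * 4000 * 10 * 2                  # 10 matmuls, multiply + add
print('gi8ops =', ops / (1024 ** 3) / duration)    # ~3475, i.e. about 3.47 TOPS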

[Screenshot: NPU inference run output]

conv3x3 NPU compute test

Since matmul is relatively bandwidth-hungry, switch to conv3x3 convolution, which has a higher compute density

class Conv3x3Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(800, 800, (3,3), padding=(1,1), bias=False)

    def forward(self, x):
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        x = self.conv(x)
        return x

x = torch.rand((1, 800, 100, 100))

The NPU conversion flow is the same as above; the TOPS calculation becomes print('gi8ops = ', 800 * 800 * 3 * 3 * 102 * 102 * 10.0 / (1024 * 1024 * 1024) / duration * 2)

As expected, conv3x3 is more efficient, reaching 11.71 TOPS
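
Working backwards from the reported 11.71 TOPS with the same op-count formula, a single inference (ten convolutions) takes roughly 95 ms; this is derived from the printed figures rather than measured separately:

# Back-calculate the per-inference time implied by the reported gi8ops value.
gi8ops = 11710                                     # reported result (~11.71 TOPS)
ops = 800 * 800 * 3 * 3 * 102 * 102 * 10 * 2       # op count used in the gi8ops print
duration = ops / (1024 ** 3) / gi8ops
print('implied duration =', duration * 1000, 'ms') # roughly 95 ms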

According to the specifications, the NPU supports INT4 / INT8 / INT16 / FP16 / BF16 and TF32 acceleration, with up to 28.8 TOPS of compute

However, the NPU manual only mentions 8-bit and 16-bit quantization, and the tools in practice support int8 quantization at the lowest, so int4 throughput cannot be tested for now :(

[Screenshot: NPU specification excerpt]
