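【“星瑞” O6 Review】 Comparing llama.cpp CPU speed across build optimizations
Preface
As large-model applications keep expanding, Arm CPUs are becoming increasingly important for large-model inference. Their performance, power efficiency, and broad architecture support help bring large models to more deployment scenarios.
1. Introduction to KleidiAI
Arm Kleidi targets exactly this need: it provides seamless performance optimization for AI inference workloads running on Arm CPUs. KleidiAI is a lightweight, high-performance, open-source library of Arm routines designed for AI acceleration. It provides matrix multiplication kernels optimized for hardware features such as SME, i8mm, and dot-product instructions, and has been integrated into recent versions of mainstream on-device AI frameworks, including ExecuTorch, llama.cpp, LiteRT (via XNNPACK), and MediaPipe, so that millions of developers get a significant AI performance boost automatically, without extra work.
Here we take a single model and compare the speedup that different CPU optimization options bring when compiling llama.cpp.
2. Installing dependencies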
sudo apt install cmake libcurl4-openssl-dev
Download the code
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
## Switch to the tag I tested (optional)
git checkout b5195
3. Benchmarks with different compile-time optimization options
3.1 No optimizations enabled
cmake -B build
cmake --build build --config Release -j
3.2 Downloading, converting, and quantizing the model
Download the model from https://www.modelscope.cn/models/Qwen/Qwen2.5-3B-Instruct/files
Convert
pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py /home/radxa/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct
Quantize
The model weights can be quantized to Q4_0:
./build/bin/llama-quantize /home/radxa/.cache/modelscope/hub/models/Qwen/Qwen2.5-3B-Instruct/Qwen2.5-3B-Instruct-F16.gguf asserts/Qwen2.5-3B-Instruct-Q4_0.gguf Q4_0
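To give a feel for what Q4_0 means, here is a minimal Python sketch of block-wise 4-bit quantization. It assumes the commonly described Q4_0 layout (blocks of 32 weights sharing one 16-bit scale, each weight stored as a 4-bit code) and is only an illustration: the function names are made up for this example, the scale is kept as a Python float, and the codes are not packed into nibbles, so this is not the llama.cpp implementation itself.

```python
# Simplified sketch of Q4_0-style block quantization (illustrative only:
# no fp16 scale storage and no nibble packing, unlike a real implementation).
BLOCK_SIZE = 32  # assumed Q4_0 block size: 32 weights per block


def quantize_block(values):
    """Return (scale, 4-bit codes) for one block of 32 floats."""
    assert len(values) == BLOCK_SIZE
    # Pick the element with the largest magnitude, keeping its sign.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inv = 1.0 / scale if scale != 0.0 else 0.0
    # Shift into the 0..16 range, truncate, and clamp to the 4-bit range 0..15.
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats: w = scale * (code - 8)."""
    return [scale * (c - 8) for c in codes]


def bits_per_weight():
    """One 16-bit scale plus 32 four-bit codes, averaged over the block."""
    return (16 + 4 * BLOCK_SIZE) / BLOCK_SIZE


# Hypothetical test weights in the range [-0.8, 0.775].
weights = [((i * 37) % 64 - 32) / 40.0 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f} max_error={max_error:.4f} bpw={bits_per_weight()}")
```

Under these assumptions a pure Q4_0 tensor costs 4.5 bits per weight. The model log later in this post reports 4.71 BPW for the whole file, which also contains f32 tensors and one q6_K tensor.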
Verify that the model works
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-cli -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -c 4096 -t 8 --conversation
Output
> hello
Hello! How can I assist you today? Do you have any questions or topics you'd like to discuss?
>
llama_perf_sampler_print: sampling time = 2.79 ms / 32 runs ( 0.09 ms per token, 11477.76 tokens per second)
llama_perf_context_print: load time = 498.94 ms
llama_perf_context_print: prompt eval time = 592.82 ms / 9 tokens ( 65.87 ms per token, 15.18 tokens per second)
llama_perf_context_print: eval time = 1711.00 ms / 22 runs ( 77.77 ms per token, 12.86 tokens per second)
llama_perf_context_print: total time = 6498.13 ms / 31 tokens
Interrupted by user
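The per-token figures in these timing lines follow directly from the total time and the token count. A small Python check, using only the numbers printed above (the helper name is made up for this example):

```python
def per_token_stats(total_ms, count):
    """Convert a total time and a token count into (ms/token, tokens/s)."""
    ms_per_token = total_ms / count
    tokens_per_second = 1000.0 / ms_per_token
    return ms_per_token, tokens_per_second


prompt = per_token_stats(592.82, 9)    # "prompt eval time" line (prefill)
decode = per_token_stats(1711.00, 22)  # "eval time" line (decode)
print(f"prefill: {prompt[0]:.2f} ms/token, {prompt[1]:.2f} tokens/s")
print(f"decode:  {decode[0]:.2f} ms/token, {decode[1]:.2f} tokens/s")
```

With only 9 prompt tokens the prefill figure here is a rough indication at best; the llama-bench runs below use 128 tokens for each phase.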
3.3 Benchmark with no optimizations enabled
taskset -c 0,5,6,7,8,9,10,11 ./build/bin/llama-bench -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -p 128 -n 128 -t 8
Results
model | size | params | backend | threads | test | t/s |
---|---|---|---|---|---|---|
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | pp128 | 17.16 ± 0.08 |
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | tg128 | 12.85 ± 0.09 |
3.4 Enabling armv9 optimizations
Build
cmake -B build_armv9 -DCMAKE_CXX_FLAGS="-march=armv9-a" -DCMAKE_C_FLAGS="-march=armv9-a"
cmake --build build_armv9 --config Release -j
Benchmark command:
taskset -c 0,5,6,7,8,9,10,11 ./build_armv9/bin/llama-bench -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -p 128 -n 128 -t 8
Results
model | size | params | backend | threads | test | t/s |
---|---|---|---|---|---|---|
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | pp128 | 84.39 ± 0.80 |
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | tg128 | 18.76 ± 0.22 |
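To compare runs without copying numbers by hand, the table rows can be parsed with a few lines of Python. This is a sketch that assumes the pipe-separated row layout shown in this post (seven cells, with the last one of the form "mean ± stddev"); the function name is made up for the example.

```python
def parse_bench_rows(table_text):
    """Map test name (e.g. 'pp128') to mean tokens/s from llama-bench rows."""
    results = {}
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have 7 cells and a t/s cell like '84.39 ± 0.80'.
        if len(cells) != 7 or "±" not in cells[6]:
            continue
        results[cells[5]] = float(cells[6].split("±")[0])
    return results


BASELINE = """
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | pp128 | 17.16 ± 0.08 |
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | tg128 | 12.85 ± 0.09 |
"""
ARMV9 = """
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | pp128 | 84.39 ± 0.80 |
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | tg128 | 18.76 ± 0.22 |
"""

base = parse_bench_rows(BASELINE)
armv9 = parse_bench_rows(ARMV9)
speedup = {test: armv9[test] / base[test] for test in base}
for test, factor in speedup.items():
    print(f"{test}: {base[test]} -> {armv9[test]} t/s ({factor:.2f}x)")
```

On the numbers above, the armv9 build is roughly 4.9x faster than the default build for prefill (pp128) and roughly 1.5x faster for decode (tg128).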
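3.5 Enabling KleidiAI optimizations
KleidiAI is already integrated into the llama.cpp backend, so you only need to pass the right option at build time.
The officially documented build commands failed for me: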
cmake -B build_kle -DGGML_CPU_KLEIDIAI=ON
cmake --build build_kle --config Release -j
Error:
/home/radxa/1_AI_models/llama.cpp/ggml/src/ggml-cpu/kleidiai/kernels.cpp:22:30: error: zero-size array ‘gemm_gemv_kernels’
   22 | static ggml_kleidiai_kernels gemm_gemv_kernels[] = {
      |                              ^~~~~~~~~~~~~~~~~
gmake[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:272: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/kleidiai/kernels.cpp.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
So I switched to the clang++ compiler:
## Install dependencies
sudo apt install clang libomp-dev
## Build
cmake -B build_kle -DGGML_CPU_KLEIDIAI=ON -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++
cmake --build build_kle --config Release -j
Benchmark command:
taskset -c 0,5,6,7,8,9,10,11 ./build_kle/bin/llama-bench -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -p 128 -n 128 -t 8
Results
model | size | params | backend | threads | test | t/s |
---|---|---|---|---|---|---|
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | pp128 | 129.53 ± 6.59 |
qwen2 3B Q4_0 | 1.69 GiB | 3.09 B | CPU | 8 | tg128 | 16.25 ± 0.18 |
The output contains the line
load_tensors: CPU_KLEIDIAI model buffer size = 1488.38 MiB
and the flag KLEIDIAI = 1, which shows that the build option was enabled correctly.
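Rather than scanning the log by eye, the feature flags can be pulled out of the system_info line with a short Python helper. This is a sketch that assumes the "NAME = value |" layout seen in the log of this post; the function name is made up for the example.

```python
def parse_feature_flags(system_info_line):
    """Extract 'NAME = <integer>' pairs from a system_info log line."""
    flags = {}
    for part in system_info_line.split("|"):
        name, sep, value = part.partition("=")
        # Drop prefixes such as 'CPU :' that precede the first flag name.
        name = name.split(":")[-1].strip()
        value = value.strip()
        # Skip segments without '=' or with a non-integer value.
        if sep and value.isdigit():
            flags[name] = int(value)
    return flags


# system_info line copied from the log in this post.
LINE = (
    "system_info: n_threads = 8 (n_threads_batch = 8) / 12 | CPU : NEON = 1 | "
    "ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | "
    "SVE_CNT = 16 | OPENMP = 1 | KLEIDIAI = 1 | AARCH64_REPACK = 1 |"
)

flags = parse_feature_flags(LINE)
print("KleidiAI enabled:", flags.get("KLEIDIAI") == 1)
print(flags)
```

The same helper also shows which other CPU features (for example MATMUL_INT8, SVE and DOTPROD) the binary reports.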
Full output:
build: 5195 (2d451c80) with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 35 key-value pairs and 434 tensors from asserts/Qwen2.5-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 3B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: general.license str = other
llama_model_loader: - kv 7: general.license.name str = qwen-research
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 3B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-3B
llama_model_loader: - kv 13: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 15: qwen2.block_count u32 = 36
llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
llama_model_loader: - kv 17: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 33: general.quantization_version u32 = 2
llama_model_loader: - kv 34: general.file_type u32 = 2
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q4_0: 252 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 1.69 GiB (4.71 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 2048
print_info: n_layer = 36
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 3B
print_info: model params = 3.09 B
print_info: general.name = Qwen2.5 3B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 1720.63 MiB
load_tensors: CPU_KLEIDIAI model buffer size = 1488.38 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
init: CPU KV buffer size = 144.00 MiB
llama_context: KV self size = 144.00 MiB, K (f16): 72.00 MiB, V (f16): 72.00 MiB
llama_context: CPU compute buffer size = 300.75 MiB
llama_context: graph nodes = 1338
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 8 (n_threads_batch = 8) / 12 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | SVE = 1 | DOTPROD = 1 | SVE_CNT = 16 | OPENMP = 1 | KLEIDIAI = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 3948005486
sampler params:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
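As a side note, the KV cache size reported in this log (72.00 MiB each for K and V, 144.00 MiB in total) can be reproduced from the other values the log prints. The sketch below assumes the cache holds n_embd_k_gqa (or n_embd_v_gqa) f16 values per token per layer; that formula is my assumption and not something the log states, although it matches the reported numbers.

```python
# Values copied from the log above.
n_ctx = 4096         # kv_size
n_layer = 36
n_embd_k_gqa = 256
n_embd_v_gqa = 256
BYTES_PER_F16 = 2    # type_k = type_v = 'f16'
MIB = 2 ** 20

# Assumed layout: one f16 vector per token per layer, for K and for V.
k_mib = n_ctx * n_layer * n_embd_k_gqa * BYTES_PER_F16 / MIB
v_mib = n_ctx * n_layer * n_embd_v_gqa * BYTES_PER_F16 / MIB
total_mib = k_mib + v_mib
print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {total_mib:.2f} MiB")
```

Under that assumption the cache grows linearly with the -c context size.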
Problem
However, the executable built this way produces broken model output when I run it; this still needs investigation.
./build_kle/bin/llama-cli -m asserts/Qwen2.5-3B-Instruct-Q4_0.gguf -c 4096 -t 8 --conversation
## Output
> hello
共和国owan続きMAR composition composition分 mutationorphAug AovOransition""""""""""" "" "" "amyamy.tom Entriesreta_suffix"卫生ventions警MessageBox
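4. Summary
On the same hardware with the same model, the benchmarks above show that enabling KleidiAI, compared with the armv9 build, improves the prefill stage (pp128) by about 53.5% (84.39 to 129.53 t/s), while the decode stage (tg128) drops by about 13.4% (18.76 to 16.25 t/s).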
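The relative changes can be recomputed from the llama-bench tables in sections 3.4 and 3.5 with a few lines of Python (the helper name is made up for this example):

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (negative means slower)."""
    return (new - old) / old * 100.0


# Mean t/s values from the armv9 and KleidiAI benchmark tables.
prefill = percent_change(84.39, 129.53)  # pp128
decode = percent_change(18.76, 16.25)    # tg128
print(f"prefill (pp128): {prefill:+.2f}%")
print(f"decode  (tg128): {decode:+.2f}%")
```

These figures use only the mean values; the reported standard deviations (for example ± 6.59 t/s on the KleidiAI pp128 run) are not taken into account.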