
MindSpeed-RL Usage Notes

1. Installation

Reference 1: docs/install_guide.md · R1-CLM/MindSpeed-RL - Gitee.com

Reference 2: "VLLM x Ascend framework" (vllm-ascend) - CSDN blog

2. SFT Fine-tuning

Follow docs/supervised_finetune.md for the overall flow.

The custom data format is the same as in "AUTO-DL 910B + mindspeed-llm 4-layer DeepSeek V3 fine-tuning" (CSDN blog), Section 4, domain corpus.

(1) Under the configs/datasets directory, add a new search_instruction_non_pack.yaml file (modeled on alpaca_instruction_non_pack.yaml). Note the difference between pack and non-pack here: pack is generally used for multi-turn data and contains fields such as history, while in non-pack mode the instruction, input, and output fields are sufficient.
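For reference, a non-pack domain record could look like the sketch below, using the instruction/input/output fields just described (the raw-data path, jsonl layout, and sample texts are assumptions; align them with what search_instruction_non_pack.yaml actually expects):

import json
import os

# Hypothetical raw-data location; point the dataset yaml at the same place.
raw_path = "./data/raw/search_instruction.jsonl"
os.makedirs(os.path.dirname(raw_path), exist_ok=True)

# One record per line, with only the three non-pack fields.
samples = [
    {
        "instruction": "Answer the search question concisely.",
        "input": "What does MindSpeed-RL provide?",
        "output": "Reinforcement-learning post-training for LLMs on Ascend NPUs.",
    },
]

with open(raw_path, "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")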

(2) Run sh examples/data/preprocess_data.sh search_instruction_non_pack. The preprocess_data script has a few problems; modify it as follows:

SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)
export PYTHONPATH=$SCRIPT_DIR/../..:$PYTHONPATH
PROJECT_PATH=$SCRIPT_DIR/../..

# default value
default_config="alpaca_pairwise"
config=${1:-$default_config}

python "$PROJECT_PATH"/cli/preprocess_data.py $config

(3) Convert the checkpoint from HF format to mcore format

Modify the model directories, set pp to 1, then run: sh examples/ckpt/ckpt_convert_qwen25_hf2mcore.sh

export CUDA_DEVICE_MAX_CONNECTIONS=1

# Modify the ascend-toolkit path as needed
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Set the required weight-conversion parameters
python cli/convert_ckpt.py \
    --use-mcore-models \
    --model-type GPT \
    --load-model-type hf \
    --save-model-type mg \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --add-qkv-bias \
    --load-dir /root/autodl-tmp/qwen2.5-0.5b \
    --save-dir /root/autodl-tmp/qwen2.5-0.5b-mcore \
    --tokenizer-model /root/autodl-tmp/qwen2.5-0.5b/tokenizer.json \
    --model-type-hf llama2 \
    --params-dtype bf16

(4) Make a copy of sft_qwen25_0.5b.sh and modify it as follows:

Note: the SOCKET_IFNAME-related settings have been removed here and replaced with HCCL_CONNECT_TIMEOUT.

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=3600
export HYDRA_FULL_ERROR=1

GPUS_PER_NODE=1
MASTER_ADDR=localhost
MASTER_PORT=6005
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS cli/train_sft.py \
    --config-name sft_qwen25_0.5b \
    | tee logs/sft_qwen25_0.5b_rank${NODE_RANK}.log

Make a copy of sft_qwen25_0.5b.yaml and modify it as follows:

defaults:
  - model:
      - qwen25_0.5b

sft:
  # tune_args:
  finetune: true
  stage: sft
  is_instruction_dataset: true
  variable_seq_lengths: true
  tokenizer_not_use_fast: true
  prompt_type: qwen

  # gpt_args:
  norm_epsilon: 1e-6
  micro_batch_size: 4
  global_batch_size: 128
  tokenizer_type: PretrainedFromHF
  tokenizer_name_or_path: /root/autodl-tmp/qwen2.5-0.5b/
  train_iters: 5000
  lr: 5e-5
  lr_decay_style: cosine
  min_lr: 1.25e-7
  lr_warmup_fraction: 0.01
  weight_decay: 1e-1
  clip_grad: 1.0
  initial_loss_scale: 4096
  use_distributed_optimizer: true
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 2
  sequence_parallel: false
  use_mcore_models: true
  use_fused_rmsnorm: true
  use_flash_attn: true
  no_masked_softmax_fusion: true
  no_gradient_accumulation_fusion: true
  use_fused_swiglu: true
  use_fused_rotary_pos_emb: true
  bf16: true
  seq_length: 4096
  adam_beta1: 0.9
  adam_beta2: 0.95
  attention_dropout: 0.0
  init_method_std: 0.01
  hidden_dropout: 0.0
  overlap_grad_reduce: true
  overlap_param_gather: true

  # data_args:
  data_path: ./data/search/search_train
  split: 100,0,0
  no_shuffle: false

  # ckpt_args:
  no_load_optim: true
  no_load_rng: true
  no_save_optim: true
  no_save_rng: true
  seed: 1234
  model: qwen25_0.5b
  load: /root/autodl-tmp/qwen2.5-0.5b-mcore
  save: /root/autodl-tmp/output-rl-0.5b-sft

  # output_args:
  log_interval: 1
  save_interval: 5000
  eval_interval: 5000
  eval_iters: 0
  log_throughput: true

qwen25_0.5b:
  use_mcore_models: true
  num_layers: 24
  hidden_size: 896
  ffn_hidden_size: 4864
  num_attention_heads: 14
  rotary_base: 1000000
  max_position_embeddings: 32768
  make_vocab_size_divisible_by: 1
  padded_vocab_size: 151936
  untie_embeddings_and_output_weights: true
  add_qkv_bias: true
  disable_bias_linear: true
  group_query_attention: true
  num_query_groups: 2
  position_embedding_type: rope
  normalization: RMSNorm
  swiglu: true
  attention_softmax_in_fp32: true

Run: sh examples/sft/sft_qwen25_0.5b.sh

This fails with:

[rank0]: RuntimeError: Error(s) in loading state_dict for GPTModel:
[rank0]:        Missing key(s) in state_dict: "output_layer.weight". 

This defect was already reported in February but remains unresolved: "MindSpeed-r1 weight loading reports a missing output_layer.weight key" · Issue #IBNT8L · Ascend/MindSpeed-LLM - Gitee.com
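To see which keys the converted mcore checkpoint actually contains (and confirm that output_layer.weight is indeed absent), a minimal inspection sketch such as the following can be used; the checkpoint path and the iteration directory name are assumptions and depend on the converter's output:

import torch

# Assumed location of the converted checkpoint; the iteration directory may be
# "release" or "iter_XXXXXXX" depending on how convert_ckpt.py saved it.
ckpt_path = "/root/autodl-tmp/qwen2.5-0.5b-mcore/release/mp_rank_00/model_optim_rng.pt"

state = torch.load(ckpt_path, map_location="cpu")
model_state = state.get("model", state)

for name in sorted(model_state.keys()):
    print(name)

print("has output_layer.weight:", "output_layer.weight" in model_state)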

3. GRPO

Use the single-layer R1 model fine-tuned with mindspeed-llm as the inference model. The following errors were encountered:

(1) AttributeError: 'AscendQuantConfig' object has no attribute 'packed_modules_mapping'

Reference: https://github.com/vllm-project/vllm-ascend/issues/420

The suggested fix is to upgrade to vllm-ascend RC2. Note that the original installation instructions are problematic; you need to download the RC2 archive manually, then extract and install it.

(2)KeyError: 'model.layers.0.self_attn.q_a_proj.weight'

  File "/root/autodl-tmp/vllm-ascend-0.7.3rc2/vllm_ascend/quantization/quant_config.py", line 93, in get_quant_method
    if self.is_layer_skipped_ascend(prefix,
  File "/root/autodl-tmp/vllm-ascend-0.7.3rc2/vllm_ascend/quantization/quant_config.py", line 135, in is_layer_skipped_ascend
    is_skipped = self.quant_description[prefix + '.weight'] == "FLOAT" 

The model's config.json contains a quantization_config section, and /root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/llm.py carries a hint about this; removing the quantization-related configuration from config.json resolves the error.
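A minimal sketch of that edit (the checkpoint directory is a placeholder; keep a backup of config.json):

import json
import shutil

# Placeholder: point this at the checkpoint directory whose config.json needs editing.
config_path = "/root/autodl-tmp/your-deepseek-ckpt/config.json"

shutil.copy(config_path, config_path + ".bak")  # backup before editing

with open(config_path, "r", encoding="utf-8") as f:
    cfg = json.load(f)

# Drop the quantization section so vllm-ascend no longer takes the quantized path.
cfg.pop("quantization_config", None)

with open(config_path, "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)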

(3) Weight loading cannot find the directory:

if args.load_format == "megatron":
    tp_rank = ps._TP.rank_in_group
    weights_path = os.path.join(args.load, f"iter_0000100/mp_rank_{tp_rank:02}/model_optim_rng.pt")

If the load format is set to megatron, the checkpoint files must match this directory layout exactly.
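As a quick sanity check, something along these lines can verify that the expected layout is present before launching (the load directory, iteration number, and TP size are assumptions to adjust):

import os

load_dir = "/root/autodl-tmp/your-megatron-ckpt"  # assumed --load directory
tp_size = 1                                       # assumed tensor-parallel size

for tp_rank in range(tp_size):
    path = os.path.join(load_dir, f"iter_0000100/mp_rank_{tp_rank:02}/model_optim_rng.pt")
    print(path, "OK" if os.path.isfile(path) else "MISSING")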

(4) File "/root/autodl-tmp/mindspeed-rl/mindspeed_rl/models/rollout/vllm_adapter/megatron_weight_loaders.py", line 101, in _get_model_weight_loader
[rank0]:     raise ValueError(f"Model architectures {arch} are not supported for now. "

Some of the changes are as follows:

  • config.json file:

"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV3Config"
去掉模型本地实现
},
"hidden_size": 1024,
"intermediate_size": 1024,
去掉quantization_config配置

  • Modify the mindspeed_rl/models/rollout/vllm_adapter/megatron_weight_loaders.py file:

Add the following entry to this registry:

MODEL_MEGATRON_WEIGHT_LOADER_REGISTRY = {
    "CustomDeepseekV3ForCausalLM": deepseek_megatron_weight_loader,
}

Reason: in the vllm-ascend-0.7.3rc2 branch, the commit https://github.com/vllm-project/vllm-ascend/pull/391/files overrides the original implementation with CustomDeepseekV3ForCausalLM:

ModelRegistry.register_model(
    "DeepseekV3ForCausalLM",
    "vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM")

4. FAQ

Stop ray-related processes: ray stop

5. VLLM Testing

(1) Modify infer_vllm.py as follows:


def chat_task(inference_engine, query):
    # Requires AutoTokenizer (from transformers import AutoTokenizer) at the top of infer_vllm.py.
    conversation = [
        {
            "role": "user",
            "content": query,
        },
    ]

    import time
    tokenizer = AutoTokenizer.from_pretrained("/root/autodl-tmp/llama3.2-1b")

    # Single request
    start_time = time.time()
    outputs = inference_engine.chat(conversation)
    res = process_outputs(outputs)
    out = tokenizer(query + res)
    logger.info(f'out len: {len(out["input_ids"])}')
    logger.info('Query: {}'.format(query))
    logger.info('Responses:\n{}'.format(res))
    logger.info('costs:{} s'.format(time.time() - start_time))

    # Batch of four identical conversations
    start_time = time.time()
    outputs = inference_engine.chat([conversation, conversation, conversation, conversation])
    res = process_outputs(outputs)
    out = tokenizer(query + res)
    logger.info(f'out len: {len(out["input_ids"])}')
    logger.info('Query: {}'.format(query))
    logger.info('Responses:\n{}'.format(res))
    logger.info('costs:{} s'.format(time.time() - start_time))

    # Single request again
    start_time = time.time()
    outputs = inference_engine.chat(conversation)
    res = process_outputs(outputs)
    out = tokenizer(query + res)
    logger.info(f'out len: {len(out["input_ids"])}')
    logger.info('Query: {}'.format(query))
    logger.info('Responses:\n{}'.format(res))
    logger.info('costs:{} s'.format(time.time() - start_time))


def generate_task(inference_engine, query):
    outputs = inference_engine.llm.generate(
        prompts=[query],
        sampling_params=inference_engine.sampling_params,
    )
    res = process_outputs(outputs)
    logger.info('Query: {}'.format(query))
    logger.info('Responses:\n{}'.format(res))

(2) Add a new infer_vllm_llama32_1b.sh:

#!/bin/bash

#export GLOO_SOCKET_IFNAME="Your SOCKET IFNAME"
#export TP_SOCKET_IFNAME="Your SOCKET IFNAME"
export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=1
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK="0"

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

INFER_ARGS="--tokenizer-name-or-path /root/autodl-tmp/llama3.2-1b-tp1-pp1/ \
    --load-format megatron \
    --load /root/autodl-tmp/llama3.2-1b-tp1-pp1/ \
    --tensor-parallel-size 1 \
    --task chat \
    --prompt-type-path ./configs/model/templates.json \
    --prompt-type llama3"

torchrun $DISTRIBUTED_ARGS cli/infer_vllm.py \
    $INFER_ARGS \
    --query "Write an essay about the importance of higher education." \
    --distributed-backend nccl

(3) llama32_1b model definition:

llama32_1b:
  use_mcore_models: true
  sequence_parallel: true
  use_flash_attn: true
  use_rotary_position_embeddings: true
  use_fused_rmsnorm: true
  use_fused_swiglu: true
  rope_scaling_type: llama3
  rope_scaling_factor: 32.0
  low_freq_factor: 1.0
  high_freq_factor: 4.0
  original_max_position_embeddings: 8192
  max_position_embeddings: 8192
  num_layers: 16
  hidden_size: 2048
  ffn_hidden_size: 8192
  num_attention_heads: 32
  group_query_attention: true
  num_query_groups: 8
  make_vocab_size_divisible_by: 1
  padded_vocab_size: 128256
  disable_bias_linear: true
  attention_dropout: 0.0
  init_method_std: 0.01
  hidden_dropout: 0.0
  position_embedding_type: rope
  rotary_base: 500000
  normalization: RMSNorm
  norm_epsilon: 1e-5
  swiglu: true
  no_masked_softmax_fusion: true
  attention_softmax_in_fp32: true
  no_gradient_accumulation_fusion: true
  bf16: true

Launch script: sh examples/infer/infer_vllm_llama32_1b.sh
