
[Study Notes] RL4LLM (Part 3)

The per-post word limit is currently 90,000 words and it overflowed again, so the notes continue across parts:

  • RL4LLM (Part 2)
  • RL4LLM (Part 3)

Table of Contents

  • 14 [RL4LLM] base vs. instruct model, custom chat template (make prefix)
    • completion vs. chat
    • base model inference
    • basics
    • vllm inference
    • custom chat template
    • no think
  • 15 [veRL] Understanding training parameters from first principles: PPO & GRPO, batch size, KL & entropy
    • 1 PPO & GRPO
    • 2 batch size
    • 3 Other (metrics)
      • KL Loss
      • entropy
    • 4 Getting it running, and running fast


14 [RL4LLM] base vs. instruct model, custom chat template (make prefix)

  • https://www.bilibili.com/video/BV1JZLcz4EUC
  • https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/base_instruct.ipynb
  • https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/template_make_prefix.ipynb

This installment is mainly about how to get a completion (base) model to answer QA-style questions.

completion vs. chat

  • Q/A, U/A, User/Assistant
    • A base model has no concept of roles;
    • it is a language model in the strict sense: pure next-token prediction (word chaining);
    • to get it to answer QA-style questions, define the roles inside the prompt (and set a max response length, stop words, etc.), as in the snippets below:
prompt = f"Q: {question}\nA:"# 也可以尝试 few-shot,提供一些例子
prompt = f"""
Q: 西班牙的首都是哪里?
A: 马德里Q: 德国的首都是哪里?
A: 柏林Q: {question}
A:
"""
prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
prompt += "<|im_start|>assistant\n" # 模型将从这里开始生成
from transformers import AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B')
instruct_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
print(base_tokenizer.chat_template)

def make_prefix(numbers, target, template_type):
    # NOTE: also need to change reward_score/countdown.py
    if template_type == 'base':
        # follow deepseek-r1-zero; this works for any base model
        prefix = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>"""
    elif template_type == 'qwen-instruct':
        # this works for Qwen Instruct models
        prefix = f"""<|im_start|>system\nYou are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>\n<|im_start|>user\n Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>\n<|im_start|>assistant\nLet me solve this step by step.\n<think>"""
    return prefix

numbers = [44, 19, 35]
target = 99
base_prompt = make_prefix(numbers, target, 'base')
print(base_prompt)
"""
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>
"""
instruct_prompt = make_prefix(numbers, target, 'qwen-instruct')
print(instruct_prompt)
"""
<|im_start|>system
You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>
<|im_start|>user
 Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>
"""

base model inference

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.6, max_tokens=1024)
base_llm = LLM(model='Qwen/Qwen2.5-3B', max_model_len=1024)

base_resp = base_llm.generate(base_prompt, sampling_params)[0]
print(base_resp.outputs[0].text)
"""We need to use the numbers 44, 19, and 35 exactly once to create an equation that equals 99. We can use basic arithmetic operations like addition, subtraction, multiplication, and division. Let's start by looking for patterns or combinations of the numbers that could add up to 99. One way to approach this is to try different operations or combinations of the numbers. </think>
The final answer is: <answer> 44 + 35 + 19 = 99 </answer>
"""
test_resp = base_llm.generate('The captail of China is', sampling_params)[0]
print(test_resp.outputs[0].text)
"""Beijing.____
A. The capital of China is Beijing.
B. Beijing is the capital of China.
C. The capital of China is Beijing.
D. Beijing is the capital of China.
Answer:
DThe most abundant element in the Earth's crust is ____
A. Oxygen
B. Silicon
C. Aluminum
D. Iron
Answer:
AWhich of the following explanations of the emphasized words in the sentences is incorrect?
A. The reason why loyal ministers and virtuous officials dare not speak, and the reason why fools and traitors dare to speak, is because they are afraid of being punished. Punishment: Punishment.
B. If you want to know the truth, I will tell you. Know: Understand.
C. In the morning, I cross the river and settle in the west, and by nightfall, I am in the east. Cross: Cross.
D. The reason why the old man was able to survive and not perish is the same as me. Pity: Like.
Answer:
AThe starting point of human life is ____
A. Fertilized egg
B. Embryo
C. Infant
D. Newborn
Answer:
AThe solution set for the inequality x^{2}-2x-3>0 is ____
A. (-1, 3)
B. (-∞, -1) ∪ (3, +∞)
C. (-3, 1)
D. (-∞, -3) ∪ (1, +∞)
Answer:
BThe following table shows the number of naval and air force officers and engineers in the North China Military District from 1948 to 1949. This table reflects that the People's Liberation Army ____. | Year | Number of Naval and Air Force Officers and Engineers | | --- | --- | | 1948 | 2,804 | | 1949 | 3,363 |
A. Gradually expanded its scale
B. Won many victories in the southern theater
C. Had a relatively strong combat capability
D. Effectively thwarted the Nationalist army's rearward defense strategy
Answer:
C
"""
test_resp = base_llm.generate('My name is', sampling_params)[0]
print(test_resp.outputs[0].text)
"""Tom. I am a student. I am in Class Two, Grade Eight. This is my friend, Jack. He is a student, too. He is in Class One, Grade Eight. My Chinese teacher is Mr. Zhang. He is a good teacher. He likes us very much. My English teacher is Miss. Wang. She is very young. She is good with us. She likes us, too. We like them. 根据短文内容,判断正误(正确的写"正确",错误的写"错误")。 (1). 2. Miss. Wang is a good Chinese teacher. (2). 3. Tom is in Class Two, Grade Eight. (3). 4. Mr. Zhang is Tom's English teacher. (4). 5. Jack and Tom are in the same class. (5). 1. Jack is a student, too.【小题1】错误 【小题2】正确 【小题3】正确 【小题4】错误 【小题5】错误根据汉语意思完成句子。 【 1 】 这个房间是用空气新鲜的木材做的。 This room is made of ___________. 【 2 】 我们必须阻止人们在森林里砍伐树木。 We must _______________ people from cutting down trees in the forest. 【 3 】 请不要把纸屑扔在地板上。 Please don't ___________ the paper on the floor. 【 4 】 环保对我们来说非常重要。 It is ___________ for us to protect the environment. 【 5 】 为了保护我们美丽的地球,我们不能乱扔垃圾。 We can't ___________ rubbish because we must protect our beautiful earth.【 1 】 fresh air 【 2 】 stop 【 3 】 throw away 【 4 】 important 【 5 】 throw away阅读下面的文字,完成下列小题。 雪山 谢大立 10月25日,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、
"""
test_resp = base_llm.generate('Long long ago, there', sampling_params)[0]
print(test_resp.outputs[0].text)
"""was a little girl who loved to play in the house. She picked up everything. She put it away, and then she picked it up again. She put it away, and then she picked it up again. Finally, her mother said, "I'm going to put a sign on the door. Then you won't be able to come in any more." "What sign, Mom?" "It'll say, 'Out of Order'," said her mother. "Oh," said the little girl. Then she went and hid under the bed. A few minutes later, her mother called her, "Come in here." The little girl came out from under the bed. "What's wrong, Mom?" "I put the sign on the door," said her mother, "and I can't open it." 【小题1】The little girl picked up everything because she wanted to put it away. 【小题2】The little girl put it away because her mother asked her to do so. 【小题3】The little girl was very angry with her mother. 【小题4】The mother didn't want to play with the little girl. 【小题5】The mother could not open the door because the sign was on it. 【小题1】T 【小题2】F 【小题3】T 【小题4】T 【小题5】T阅读下面的文章,完成后面题目。 《红楼梦》中女性形象的复杂性 一、《红楼梦》中女性形象的复杂性 《红楼梦》中人物众多,女性形象更是丰富多彩。 《红楼梦》中女性形象的复杂性,主要表现在以下方面: 1.女性的阶级性。阶级是社会上最本质、最直接的差别。《红楼梦》中女性形象的阶级性,主要表现在她们所处的社会地位的不同。《红楼梦》中女性形象的阶级性,是决定其性格的重要因素,也是决定其命运的重要因素。 2.女性的性别特征。《红楼梦》中女性形象的性别特征,主要表现在其在性别方面所特有的差异上。 3.女性的文学性。文学性是指作品中人物形象所具有的审美价值和艺术魅力。《红楼梦》中女性形象的文学性,主要表现在以下方面:①《红楼梦》中女性形象的典型性。②《红楼梦》中女性形象的艺术性。 4.女性的象征性。《红楼梦》中女性形象的象征性,主要表现在两个方面:①女性形象的隐喻性。②女性形象的隐喻性。 《红楼梦》中女性形象的复杂性,是个性与共性的统一。个性是指《红楼梦》中女性形象所具有的特殊性。共性是指《红楼梦》中女性形象所具有的普遍性,即《红楼梦》中女性形象所具有的共有的品格、气质、思想、性格等。 总之,《红楼梦》中女性形象的复杂性,是个性与共性的统一,是人物形象与社会现实的统一,是人物形象与民族心理的统一。 (选自《红楼梦论丛》,有改动) 1.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,主要表现在她们所处的社会地位的不同。 B.《红楼梦》中女性形象的阶级性,是决定其性格和命运的重要因素。 C.《红楼梦》中女性形象的性别特征,主要表现在其在性别方面所特有的差异上。 D.《红楼梦》中女性形象的文学性,主要表现在其典型性和艺术性。 2.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与社会现实的统一。 B.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与民族心理的统一。 C.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象个性与共性的统一。 D.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中社会现实的统一。 3.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中民族心理的统一。 B.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中社会现实的统一。 C.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中文学性的统一。 D.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中阶级
"""
test_resp = base_llm.generate(instruct_prompt, sampling_params)[0]
print(test_resp.outputs[0].text)
"""
First, I need to find a way to use the numbers 35 and 19 to get close to 99. I can start by adding 35 and 19, which gives me 54. Then, I can subtract 54 from 99, which gives me 45. Now, I need to find a way to get from 45 to 44. I can subtract 45 by 1, which gives me -1. But that doesn't work because I can't use -1 as a number in my equation. So, I need to find another way to get from 45 to 44. I can divide 45 by 1.1, which gives me 40.90909090909091. Then, I can subtract 40.90909090909091 by 0.9090909090909091, which gives me 40. Now, I need to find a way to get from 40 to 44. I can multiply 40 by 1.1, which gives me 44. But that doesn't work because I can't use 1.1 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 
"""

basics

  • prompt vs. response
    • prompt: resp.prompt, resp.prompt_token_ids
    • response: resp.outputs[0].text, resp.outputs[0].token_ids
  • make_prefix (TinyZero)
    • https://github.com/Jiayi-Pan/TinyZero/blob/main/examples/data_preprocess/countdown.py#L57-L66
    prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False)
    # '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.'

    prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
    # '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n'

    # custom template (tokenizer.chat_template overridden between calls; see the sketch after the setup code below)
    prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
    # '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>'

    # custom template, no think: pre-fill an empty think block
    prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
    # '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n</think>'
    
  • load the parquet dataset
    • https://github.com/Jiayi-Pan/TinyZero/blob/main/verl/utils/dataset/rl_dataset.py#L128
    • default
      • https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L169
      • prompt_with_chat_template = self.tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
  • generate & reward func
    • reward func
    sequences = torch.cat((valid_prompt_ids, valid_response_ids))
    sequences_str = self.tokenizer.decode(sequences)
    score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth)
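The score function then parses the <answer> span out of the decoded string and checks the equation. A minimal sketch of that idea for the countdown task (TinyZero's actual countdown.py at the link above differs in details; the ground_truth layout and the 0.1 format reward here are assumptions):

import re

def countdown_score(solution_str, ground_truth):
    # ground_truth: {'numbers': [44, 19, 35], 'target': 99}  (assumed layout)
    m = re.search(r'<answer>(.*?)</answer>', solution_str, re.DOTALL)
    if m is None:
        return 0.0                                   # no parsable answer
    expr = m.group(1).strip().split('=')[0]          # tolerate "a + b = target"
    if not re.fullmatch(r'[\d\s+\-*/().]+', expr):
        return 0.0                                   # reject anything but arithmetic
    used = sorted(int(n) for n in re.findall(r'\d+', expr))
    if used != sorted(ground_truth['numbers']):
        return 0.0                                   # each number used exactly once
    try:
        value = eval(expr)                           # charset was checked above
    except Exception:
        return 0.0
    # full reward if correct, small format reward if parsable but wrong
    return 1.0 if abs(value - ground_truth['target']) < 1e-6 else 0.1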
    
from transformers import AutoTokenizer
import re
import torch

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
basic_messages = [
    {"role": "user", "content": "3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}."}
]
tokenizer.apply_chat_template(basic_messages, tokenize=False)
tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
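The "custom" variants earlier assume tokenizer.chat_template was overridden between calls. A minimal sketch of such an override; the Jinja string here is an assumption for illustration, not the model's official template:

# Hypothetical minimal template: it renders user/assistant turns, and its
# generation prompt pre-fills an empty <think> block to suppress visible reasoning.
no_think_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}<|User|>{{ message['content'] }}"
    "{% elif message['role'] == 'assistant' %}<|Assistant|>{{ message['content'] }}<|end▁of▁sentence|>"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|Assistant|><think>\n</think>{% endif %}"
)
tokenizer.chat_template = no_think_template
print(tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True))
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? ...<|Assistant|><think>\n</think>'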

vllm inference

Asking the distilled DeepSeek model (e.g. via ollama) to reason about whether 9.11 or 9.9 is bigger sometimes yields a response with no think block at all.

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.6, max_tokens=32768)
llm = LLM(model=model_id, max_model_len=32768)

prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.prompt)
print(resp.prompt_token_ids)
assert tokenizer.encode(resp.prompt) == resp.prompt_token_ids
tokenizer.decode(151646), tokenizer.decode(7810)  # inspect individual token ids
len(resp.outputs[0].token_ids), len(tokenizer.encode(resp.outputs[0].text))

custom chat template

prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
**
</think>To determine which number is bigger between **3.11** and **3.9**, follow these steps:1. **Compare the whole number part** of both numbers. Both have **3** as the whole number.
2. **Compare the decimal parts**:- **0.11** (from 3.11)- **0.9** (from 3.9, which can be written as 3.90)
3. **Convert 3.9 to two decimal places**: 3.90
4. **Compare 0.11 and 0.90**:- **0.11** is less than **0.90**
5. **Conclusion**: Since 0.11 is less than 0.90, **3.90** is larger than **3.11**.**Final Answer**: \boxed{3.9}
"""
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]

# custom prefix: the generation prompt without the pre-filled <think>
prompt = '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>'
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
<think>
Alright, so I've got this problem here: 3.11 and 3.9, and I need to figure out which one is bigger. Hmm, okay. Let me think about how to approach this. I'm pretty sure that when comparing decimals, you start from the left and compare each digit one by one. So, first, I should look at the whole number part of both numbers. Both 3.11 and 3.9 have the same whole number part, which is 3. That means the whole numbers are equal, so I can't say one is bigger just yet. I need to look at the decimal parts. The first decimal place after the decimal point is the tenths place. In 3.11, the tenths place is 1, and in 3.9, the tenths place is 9. Since 9 is greater than 1, that means 3.9 is larger than 3.11. Wait, let me make sure I'm doing this right. So, if I write both numbers aligned by their decimal points:

3.11
3.9

I can think of 3.9 as 3.90 to make the comparison easier. Now, comparing 3.11 and 3.90. The first digit after the decimal is 1 vs. 9. Since 9 is bigger, 3.90 is bigger than 3.11. Yeah, that makes sense.

Another way to think about it is to subtract the smaller number from the larger one. If the result is positive, then the first number is bigger. So, 3.90 minus 3.11 is 0.79, which is positive, so 3.90 is indeed bigger. Wait, but what if the numbers were, say, 3.11 and 3.99? Then, the tenths place is 1 vs. 9, so 3.99 would still be bigger. But in this case, since the tenths place is only 1 for 3.11, it's clear that 3.9 has a higher tenths place.

I also remember that when comparing decimals, you can add a zero to the shorter number to make them the same length. So, 3.9 becomes 3.90, and then comparing 3.11 and 3.90 is straightforward. Is there any chance I might have made a mistake here? Maybe if I misaligned the decimals or added incorrectly. Let me try another approach. I can convert both numbers to fractions. 3.11 is equal to 311/100, right? Because 3.11 is 3 + 11/100. Similarly, 3.9 is 39/10, which is 390/100. So, comparing 311/100 and 390/100, since 390 is greater than 311, 3.9 is bigger. Wait, let me check that. 390 divided by 100 is 3.9, and 311 divided by 100 is 3.11. So, yes, 3.9 is bigger. I think that's solid.

Alternatively, I could think about money. If I have $3.11 and someone else has $3.90, which is more money? Well, $3.90 is more than $3.11 because 90 cents is more than 11 cents. That's a practical way to remember.

So, another confirmation: when money is involved, the decimal places represent cents. So, 3.11 is 3 dollars and 11 cents, and 3.90 is 3 dollars and 90 cents. Clearly, 90 cents is more than 11 cents, so 3.90 is more than 3.11.

Is there any other way to think about this? Maybe using number lines. If I imagine a number line starting at 3.00, then 3.11 is somewhere between 3.00 and 4.00, and 3.90 is even closer to 4.00. Since 3.90 is closer to 4.00, it must be larger than 3.11.

Wait, but how far is each from 3.00? 3.11 is 0.11 away, and 3.90 is 0.90 away. So, clearly, 3.90 is further along the number line, which means it's bigger.

I think I'm overcomplicating it. The straightforward way is to look at the tenths place. Since 9 is greater than 1, 3.9 is bigger than 3.11. But just to make sure, let me compare each place step by step. Starting from the left, the units place is the same: 3 in both. Then, moving to the tenths place: 1 vs. 9. Since 9 is bigger, we don't need to check the next decimal places. If the tenths place were equal, we would move to the hundredths place, but since they are different, we can stop there. Alternatively, I can also think in terms of fractions. 3.11 is 3 and 11/100, and 3.9 is 3 and 90/100. So, 90/100 is definitely larger than 11/100, so 3.9 is larger. Wait, just to make sure I'm not missing something, sometimes in decimal comparisons, the number of digits can affect the comparison. For example, if one number has more decimal places, does that mean it's automatically bigger? Well, no, because the more decimal places a number has, the more precise it is. But in this case, both numbers have two decimal places, so the extra digit beyond the decimal point doesn't affect the comparison.

So, 3.11 and 3.90, both have two decimal places, so the difference must be in the tenths place. Therefore, 3.90 is larger than 3.11.

I think I've thought through this from multiple angles now: comparing digit by digit, converting to fractions, thinking about money, using a number line, and even considering the difference from the whole number. All these methods consistently show that 3.9 is bigger than 3.11.

Just to recap, the process is:

1. Compare the whole number parts. Both are 3, so equal.
2. Move to the tenths place: 1 vs. 9. 9 is larger, so 3.9 is bigger.
3. If needed, check the hundredths place, but since they are equal, we can stop here.

So, I can confidently say that 3.9 is bigger than 3.11.

**Final Answer**
The larger number is \boxed{3.9}.
</think>

To determine which number is larger between 3.11 and 3.9, we can follow these steps:

1. Compare the whole number parts. Both numbers have 3 as the whole number part, so they are equal.
2. Move to the tenths place. In 3.11, the tenths place is 1, and in 3.9, the tenths place is 9. Since 9 is greater than 1, 3.9 is larger.

Thus, the larger number is \boxed{3.9}.
"""

no think

Sometimes there are no think tags at all:

  • https://www.bilibili.com/video/BV1ugRxYeEt4/
prompt = '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n</think>'
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
To determine which number is larger between **3.11** and **3.9**, follow these steps:

1. **Compare the whole number parts**: Both numbers have the same whole number part, which is **3**.
2. **Compare the decimal parts**:
   - **0.11** (from 3.11)
   - **0.90** (from 3.9, which can be written as **0.90** to have the same number of decimal places)
3. **Compare the tenths place**:
   - **1** (from 3.11)
   - **9** (from 3.9)

   Since **9** is greater than **1**, the tenths place of **3.9** is larger than that of **3.11**.
4. **Conclusion**: Because the tenths place of **3.9** is larger, **3.9** is the larger number.

**Final Answer**: \boxed{3.9}
"""

15 [veRL] Understanding training parameters from first principles: PPO & GRPO, batch size, KL & entropy

video
code

This installment mainly walks through how to write veRL config files; it is a bit dry.

The goals, in order: get it running, get it right, get it fast.

  • https://verl.readthedocs.io/en/latest/examples/config.html
  • https://verl.readthedocs.io/en/latest/perf/perf_tuning.html
  • https://verl.readthedocs.io/en/latest/perf/device_tuning.html
    • https://github.com/volcengine/verl/blob/main/examples/tuning/7b/qwen2-7b_grpo_2_h800_fsdp_vllm.sh
    • https://github.com/volcengine/verl/blob/main/examples/tuning/14b/qwen2_14b_grpo_4_h800_fsdp_vllm.sh
  • https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2-7b.sh
    • github => deepwiki

The running example is GRPO on a 7B model.

  • main_ppo.py
    • instantiates trainer = RayPPOTrainer
    • calls trainer.fit
  • ray_trainer.py defines the generation/training workflow/pipeline (task scheduling); a pseudocode sketch of one iteration follows this list
    • generation (experience preparation)
      • generate_sequences
        • ray::WorkerDict.actor_rollout_generate_sequences
      • compute_log_prob
      • compute_ref_log_prob
      • reward_fn
      • advantage
    • training
      • update_actor
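A pseudocode sketch of one fit iteration assembled from the method names above (the worker-group handles and control flow are a simplification, not verl's exact code):

for batch in dataloader:
    gen_out = actor_rollout_wg.generate_sequences(batch)       # rollout (experience preparation)
    old_log_prob = actor_rollout_wg.compute_log_prob(gen_out)  # log-probs under pi_theta_old
    ref_log_prob = ref_wg.compute_ref_log_prob(gen_out)        # log-probs under pi_ref
    rewards = reward_fn(gen_out)                               # e.g. rule-based scoring
    advantages = compute_advantage(rewards, gen_out)           # GAE (PPO) or group norm (GRPO)
    metrics = actor_rollout_wg.update_actor(gen_out)           # PPO/GRPO update on the actor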


1 PPO & GRPO

A quick review of how the two differ and relate: essentially, GRPO drops the value model, but compensates by sampling a group of responses per prompt.

$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)] \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[ \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})} A_t, \text{clip} \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_t \right]$$

  • computing $r$ (defined at the token level; the (reverse) KL term sits inside the reward):
    • $r_t = r_{\phi}(q, o_{\le t}) - \beta \log \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{ref}(o_t|q, o_{<t})}$
  • computing the GAE advantage ($r, v$ => GAE); see the sketch after this list:
    • $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
    • $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$
    • $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l (r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}))$
      • $\hat{A}_t^{GAE(\gamma, \lambda)}$: the generalized advantage estimate at timestep $t$.
      • $\gamma$: the discount factor, usually in $[0, 1]$, weighting how much future rewards matter.
      • $\lambda$: the GAE parameter, usually in $[0, 1]$, trading off bias against variance.
        • With $\lambda = 0$, GAE reduces to the standard one-step TD advantage estimate $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (high bias, low variance).
        • With $\lambda = 1$, GAE sums the discounted TD residuals up to the end of the episode, resembling a Monte Carlo advantage estimate (low bias, high variance).
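A minimal sketch of this recursion under the definitions above (the function name and tensor shapes are assumptions for illustration):

import torch

def compute_gae(rewards, values, gamma=1.0, lam=1.0):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    # A_t = delta_t + gamma * lam * A_{t+1}, accumulated right to left
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # no value beyond the last token
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages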

$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)] \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}[\pi_{\theta}||\pi_{ref}] \right\}$$

  • advantage (group-normalized, no value model); see the sketch after this list:

    • $\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$
  • kl estimation (the (reverse) KL term sits inside the loss):

    • $\mathbb{D}_{KL}[\pi_\theta || \pi_{ref}] = \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1$
    • at the sequence level, the KL is an expectation over sums of per-token log-ratios: $\mathbb{D}_{KL}[\pi_\theta || \pi_{ref}] = \sum_{y} \pi_\theta(y|q) \log \frac{\pi_\theta(y|q)}{\pi_{ref}(y|q)} = \mathbb{E}_{y \sim \pi_\theta(\cdot|q)} \left[ \sum_{t=1}^{T} \log \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{ref}(o_t | q, o_{<t})} \right]$
      • $\pi(y|q) = \pi(o_1, ..., o_T | q) = \prod_{t=1}^{T} \pi(o_t | q, o_{<t})$
      • $\log \frac{\pi_\theta(y|q)}{\pi_{ref}(y|q)} = \log \frac{\prod_{t=1}^{T} \pi_\theta(o_t | q, o_{<t})}{\prod_{t=1}^{T} \pi_{ref}(o_t | q, o_{<t})} = \sum_{t=1}^{T} \left[ \log \pi_\theta(o_t | q, o_{<t}) - \log \pi_{ref}(o_t | q, o_{<t}) \right] = \sum_{t=1}^{T} \log \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{ref}(o_t | q, o_{<t})}$
  • actor.kl_loss_coef: defaults to 0.001 (ppo_trainer.yaml)

    • GRPO (with use_kl_loss enabled)
    • kl_loss_type: low_var_kl
      • the k3 estimator
  • algorithm.kl_penalty (=> algorithm.use_kl_in_reward)

    • the in-reward KL penalty (the PPO-style penalty above), as opposed to the in-loss KL used by GRPO.
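A minimal sketch of the group-normalized advantage and the k3 (low_var_kl) estimator above; the tensor names and shapes are illustrative, not verl's actual implementation:

import torch

def grpo_advantages(rewards, eps=1e-6):
    # rewards: (G,) scalar outcome rewards for one group of G responses
    # A_{i,t} = (r_i - mean(r)) / std(r), broadcast over all tokens of response i
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def k3_kl(logp_theta, logp_ref):
    # per-token low-variance estimator: r - log r - 1, with r = pi_ref / pi_theta
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0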

2 batch size

  • Algorithmic metrics (train batch size, PPO mini-batch size) are global (from a single-controller perspective), normalized in each worker. See the normalization code.
  • Performance-related parameters (micro batch size, max token length for dynamic batch size) are local parameters that define the per-GPU data allocations. See the normalization code.
  • data.train_batch_size=32
    • the number of prompts per RL step
  • actor_rollout_ref.rollout.n=8
    • how many responses are sampled per prompt (the GRPO group size)
    • generation therefore produces train_batch_size * rollout_n sequences
  • actor.ppo_epochs=1
    • actor_rollout_ref.actor.ppo_mini_batch_size=16
      • train_batch_size // ppo_mini_batch_size => the number of PPO update steps per epoch
    • actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
      • this is the actual per-GPU batch size of the PPO forward/backward pass
  • forward-only passes (no grad, no loss), used to compute the log-probs below:
    • actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32
    • actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32

$$\frac{\pi_{\theta}}{\pi_{ref}} = \exp(\log \pi_\theta - \log \pi_{ref})$$

if not config.actor_rollout_ref.actor.use_dynamic_bsz:
    assert config.data.train_batch_size >= config.actor_rollout_ref.actor.ppo_mini_batch_size
    sp_size = config.actor_rollout_ref.actor.get('ulysses_sequence_parallel_size', 1)
    if config.actor_rollout_ref.actor.ppo_micro_batch_size is not None:
        assert config.actor_rollout_ref.actor.ppo_mini_batch_size % config.actor_rollout_ref.actor.ppo_micro_batch_size == 0
        assert config.actor_rollout_ref.actor.ppo_micro_batch_size * sp_size >= n_gpus

...

self.config.actor.ppo_mini_batch_size *= self.config.rollout.n
self.config.actor.ppo_mini_batch_size //= (self.device_mesh.size() // self.ulysses_sequence_parallel_size)
  • after this normalization (×rollout.n, ÷number of GPUs): ppo_mini_batch_size = 16 * 8 / 2 = 64

  • gradient accumulation steps: ga = ppo_mini_batch_size / ppo_micro_batch_size_per_gpu = 64 / 8 = 8

  • some constraints (see the sketch below):

    • config.data.train_batch_size >= config.actor_rollout_ref.actor.ppo_mini_batch_size
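A minimal sketch of the bookkeeping above, under the same assumptions (32 prompts, group size 8, 2 GPUs, no Ulysses sequence parallelism); the variable names are illustrative, not verl's:

# per RL step
train_batch_size = 32            # data.train_batch_size (prompts)
rollout_n = 8                    # actor_rollout_ref.rollout.n (GRPO group size)
n_gpus = 2
ppo_mini_batch_size = 16         # actor_rollout_ref.actor.ppo_mini_batch_size
ppo_micro_batch_size_per_gpu = 8

num_sequences = train_batch_size * rollout_n               # 256 responses generated
num_ppo_updates = train_batch_size // ppo_mini_batch_size  # 2 optimizer steps per epoch

# normalization done inside the worker (mirrors the verl snippet above)
mini_per_gpu = ppo_mini_batch_size * rollout_n // n_gpus   # 16 * 8 / 2 = 64
grad_accum = mini_per_gpu // ppo_micro_batch_size_per_gpu  # 64 / 8 = 8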

3 Other (metrics)

$$\mathcal{L}_{\text{actor}}(\theta) = \mathcal{L}_{\text{PG}}(\theta) - c_1 \mathcal{L}_{\text{entropy}}(\theta) + c_2 \mathcal{L}_{\text{KL}}(\theta)$$

KL Loss

$$\log\frac{\pi_\theta}{\pi_{ref}} = \log\pi_\theta - \log \pi_{ref}$$

In general, kl_loss is greater than 0:

  • kl_loss > 0: the current policy $\pi_\theta$, on average, assigns higher probability to the sampled response sequences $a$ than the reference policy $\pi_{ref}$ does. This is what we expect during PPO training, since the policy is learning to raise the probability of high-reward sequences.
  • kl_loss < 0: the current policy $\pi_\theta$, on average, assigns lower probability to the sampled response sequences $a$ than the reference policy $\pi_{ref}$ does. This can appear transiently during optimization, or when the reference policy is itself already good at generating high-reward sequences.
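A tiny numeric illustration of the sign (the per-token log-probs here are made up): individual tokens of the k1 estimate $\log\pi_\theta - \log\pi_{ref}$ can be negative even though the true KL is non-negative in expectation:

import torch

logp_theta = torch.tensor([-1.2, -0.7, -2.0])  # hypothetical per-token log-probs
logp_ref   = torch.tensor([-1.5, -0.6, -2.4])
kl_per_token = logp_theta - logp_ref           # tensor([ 0.3000, -0.1000,  0.4000])
kl_loss = kl_per_token.mean()                  # 0.2 > 0: the policy upweights this sample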
entropy

$$H_t = H(\pi_{\theta}(\cdot | s, a_{<t})) = - \sum_{a'} \pi_{\theta}(a'|s, a_{<t}) \log \pi_{\theta}(a'|s, a_{<t})$$

  • minimum 0, maximum $\log|V|$ (uniform over the vocabulary $V$); see the sketch after this list
    • high entropy: a flat distribution; the model is uncertain about the next token and leans toward random exploration.
    • low entropy: a peaked distribution; the model is very confident about one or a few tokens.
  • The main purposes of adding an entropy loss (as a regularization term) in PPO training:
      1. Encourage exploration: prevent the policy from converging prematurely to a local optimum by keeping enough randomness to explore more candidate responses.
      2. Prevent policy collapse: keep the policy network from becoming overly deterministic and emitting only a few fixed patterns, preserving generation diversity.
  • Note that the negative sign means the optimizer, while minimizing the total loss, effectively maximizes the entropy term, which encourages exploration.
  • Early in PPO training the policy is still fairly random and entropy is high; as training progresses and the policy sharpens, entropy tends to drop. entropy_coeff is there to keep entropy from falling too fast or too low.
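A minimal sketch of computing $H_t$ per position from raw logits (the shapes are assumptions for illustration):

import torch
import torch.nn.functional as F

def token_entropy(logits):
    # H_t = -sum_a p(a) * log p(a), one value per sequence position
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

logits = torch.randn(5, 32000)  # (seq_len, vocab_size), hypothetical
H = token_entropy(logits)       # each entry lies in [0, log(32000) ≈ 10.37]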

4 Getting it running, and running fast

  • actor_rollout_ref.model.use_remove_padding=True
  • fsdp
    • actor_rollout_ref.model.enable_gradient_checkpointing=True
    • actor_rollout_ref.actor.fsdp_config.param_offload=False
    • actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
  • vllm
    • >= 0.8
      • https://verl.readthedocs.io/en/latest/README_vllm0.8.html
