
[Reinforcement Learning (Practice)] #1 Multi-Armed Bandit & Grid World

This series contains practice material based on Sutton's Reinforcement Learning (2nd edition). It does not cover the theoretical foundations in detail and is best read together with the [Reinforcement Learning] series.

Drawing on the algorithms and examples in the book, I have picked out what I consider fairly typical problems and tried to implement their solutions in Python. Since this series is essentially homework that I set and completed myself while studying, please bear with any shortcomings in the explanations or code and feel free to point them out; updates will also take lower priority than my other series.

The prerequisites for this post correspond to parts 1-3 of the [Reinforcement Learning] series.

Multi-Armed Bandit

Problem Overview

In a $k$-armed bandit, at each time step the agent pulls one of $k$ levers. Pulling lever $i$ yields a reward of $1$ with probability $p_i$ and no reward with probability $1-p_i$.
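Under this Bernoulli reward model, the true value of lever $i$ is simply its win probability:

$$q_*(i) = \mathbb{E}[R_t \mid A_t = i] = 1 \cdot p_i + 0 \cdot (1 - p_i) = p_i,$$

so the optimal lever is the one with the largest $p_i$ and the optimal expected reward is $\max_i p_i$.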

Program Walkthrough

For the $k$-armed bandit, define an MAB class that stores each lever's win probability, the optimal lever, and the optimal expected reward, and give it a step method that returns a random reward according to the lever pulled.

The solution method is the $\epsilon$-greedy algorithm. Likewise, define an EpsilonGreedy class that stores each lever's pull count and action-value estimate, and give it an update method that performs incremental updates and an action method that selects actions.
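The incremental update implemented by update is the sample-average estimate rewritten in incremental form: after the $n$-th pull of a lever returns reward $R_n$,

$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = Q_n + \frac{1}{n}\bigl(R_n - Q_n\bigr),$$

which matches the update line in the code below.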

The pseudocode of the solving procedure is as follows:
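A minimal sketch, assuming the standard simple bandit algorithm from Sutton and Barto (Section 2.4):

Initialize, for each lever a = 1, ..., k: Q(a) ← 0, N(a) ← 0
Loop for t = 1, ..., T:
    A ← a random lever with probability ε, otherwise argmax_a Q(a)
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (R − Q(A)) / N(A)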

During the run, the algorithm's average reward and the rate at which the optimal lever is pulled are recorded.

Code Implementation

import numpy as np
import matplotlib.pyplot as plt

# K-armed bandit model
class MAB:
    def __init__(self, k):
        # number of levers
        self.k = k
        # win probability of each lever
        self.probs = np.random.uniform(size=k)
        # optimal lever
        self.best_action = np.argmax(self.probs)
        # optimal expected reward
        self.best_prob = self.probs[self.best_action]

    # pull a lever
    def step(self, action):
        if np.random.rand() < self.probs[action]:
            return 1
        else:
            return 0

# Epsilon-greedy algorithm
class EpsilonGreedy:
    def __init__(self, mab, epsilon):
        # bandit model
        self.mab = mab
        # exploration probability
        self.epsilon = epsilon
        # initialize action values
        self.q = np.zeros(self.mab.k)
        # pull count of each lever
        self.n = np.zeros(self.mab.k)

    # update the action value
    def update(self, action, reward):
        self.n[action] += 1
        self.q[action] = self.q[action] + (reward - self.q[action]) / self.n[action]

    # choose an action
    def action(self):
        if np.random.rand() < self.epsilon:
            return np.random.randint(0, self.mab.k)
        else:
            return np.argmax(self.q)

# solve
def solve(bandit, solver, times):
    # track the running average reward and the optimal-action rate
    total_reward, avg_reward = 0, []
    total_acc, acc_rate = 0, []
    for t in range(times):
        action = solver.action()
        reward = bandit.step(action)
        total_reward += reward
        avg_reward.append(total_reward / (t + 1))
        solver.update(action, reward)
        if action == bandit.best_action:
            total_acc += 1
        acc_rate.append(total_acc / (t + 1))
    return avg_reward, acc_rate

np.random.seed(1)
K = 10
Bandit = MAB(K)
print("Best Action:", Bandit.best_action)
print("Best Prob:", Bandit.best_prob)

Epsilon = 0.1
Times = 1000
Solver = EpsilonGreedy(Bandit, Epsilon)
Avg_Reward, Acc_Rate = solve(Bandit, Solver, Times)

plt.title("Average Reward")
plt.plot(Avg_Reward)
plt.show()
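The script computes Acc_Rate but only plots the average reward. If you also want to see how often the optimal lever is pulled over time, a small optional addition (not part of the original script) could be:

plt.title("Optimal Action Rate")
plt.plot(Acc_Rate)
plt.show()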

Results

The optimal lever and the optimal expected reward:

Best Action: 1
Best Prob: 0.7203244934421581

Grid World

Problem Overview

In a Width × Height grid world, the goal (terminal state) is located at Goal_State, obstacles occupy the cells in Block_State, and the state set consists of all cells except those in Block_State. Until it reaches the goal, the agent receives a penalty of Step_Reward (a negative reward) for every move, and its state transitions to the next cell accordingly; hitting an obstacle or the boundary incurs a penalty of Block_Reward and leaves the position unchanged; reaching the goal yields a reward of Goal_Reward.


Cells marked X on a red background are obstacles; the cell marked G on a green background is the goal.

This grid world has the simplest possible environment dynamics: once the current state and action are fixed, the successor state and reward are fixed as well. More complex models can be obtained by giving the cells richer attributes.
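In other words, the dynamics are deterministic: for every state-action pair there is exactly one successor-reward pair with

$$p(s', r \mid s, a) = 1,$$

so the expectation in the Bellman backup collapses to a single term.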

Program Walkthrough

The grid world is modeled exactly as described in the problem overview.

The solution method is value iteration, whose pseudocode is as follows:
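A minimal sketch, assuming the standard value-iteration algorithm from Sutton and Barto (Section 4.4): repeat sweeps over the states, backing up each state with

$$V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V(s')\bigr],$$

while tracking $\Delta \leftarrow \max\bigl(\Delta, |v - V(s)|\bigr)$ for the old value $v$ of each state, and stop once $\Delta < \theta$ for a small threshold $\theta$.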

$V(s)$ is initialized to all zeros. The state sweep works as follows: for every coordinate $(i, j)$ in the grid world, skip it if it belongs to an obstacle or the goal (obstacles are not in the state set, and the goal's value is always $0$); otherwise apply the inner-loop operations from the pseudocode. The state transition (coordinate change) caused by each action is stored in the Action_effects dictionary. For every action available in the current state, first compute the successor state's coordinates, then set the reward according to whether the successor is the goal, an obstacle or out of bounds, or an in-bounds non-terminal cell, and compute the corresponding action value. After sweeping all actions, take the best action value as the current state's value.
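Written out, the backup performed for each non-obstacle, non-terminal cell is

$$V(i, j) \leftarrow \max_{a}\bigl[r_a + \gamma V(s'_a)\bigr],$$

where $s'_a$ is the successor cell (the cell the move lands on, or $(i, j)$ itself if the move is blocked) and

$$r_a = \begin{cases} \text{Goal\_Reward}, & \text{if the move reaches the goal},\\ \text{Block\_Reward}, & \text{if the move hits an obstacle or the boundary},\\ \text{Step\_Reward}, & \text{otherwise}. \end{cases}$$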

Policy generation works as follows: the obstacle and goal positions are excluded (and labeled as such), and for every remaining state the actions are swept with the same procedure, with the best action chosen as that state's policy.
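That is, the extracted policy is greedy with respect to the converged value function, using the same reward cases as above:

$$\pi(i, j) = \arg\max_{a}\bigl[r_a + \gamma V(s'_a)\bigr].$$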

Code Implementation

import numpy as np
from matplotlib import pyplot as plt
from matplotlib.patches import Arrow, Rectangle

# Grid world model
Width = 4
Height = 4
State = [(i, j) for i in range(Width) for j in range(Height)]
Block_State = [(1, 2), (2, 1)]
Goal_State = [(3, 3)]
Action = ["up", "down", "left", "right"]
Action_effects = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1)
}
Goal_Reward = 0
Block_Reward = -1
Step_Reward = -1

# Solver parameters (convergence threshold, discount factor)
Theta = 1e-7
Gamma = 0.9

# State values and policy
v = np.zeros([Width, Height])
pi = np.empty([Width, Height], dtype=object)

# Value iteration
while True:
    delta = 0
    for i in range(Width):
        for j in range(Height):
            if (i, j) in Goal_State + Block_State:
                continue
            v_temp = v[i, j]
            max_value = -np.inf
            for action in Action:
                next_i = i + Action_effects[action][0]
                next_j = j + Action_effects[action][1]
                next_state = (next_i, next_j)
                if next_state in Goal_State:
                    reward = Goal_Reward
                elif next_state in Block_State or next_state not in State:
                    reward = Block_Reward
                    next_state = (i, j)
                else:
                    reward = Step_Reward
                value = reward + Gamma * v[next_state]
                if value > max_value:
                    max_value = value
            delta = max(delta, np.abs(max_value - v_temp))
            v[i, j] = max_value
    if delta < Theta:
        break

# Policy generation
for i in range(Width):
    for j in range(Height):
        if (i, j) in Goal_State:
            pi[i, j] = "Goal"
        elif (i, j) in Block_State:
            pi[i, j] = "Block"
        else:
            best_action = None
            best_value = -np.inf
            for action in Action:
                next_i = i + Action_effects[action][0]
                next_j = j + Action_effects[action][1]
                next_state = (next_i, next_j)
                if next_state in Goal_State:
                    reward = Goal_Reward
                elif next_state in Block_State or next_state not in State:
                    reward = Block_Reward
                    next_state = (i, j)
                else:
                    reward = Step_Reward
                value = reward + Gamma * v[next_state]
                if value > best_value:
                    best_value = value
                    best_action = action
            pi[i, j] = best_action

print(pi)
print(v)

# Visualization
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_xlim(-0.5, Width - 0.5)
ax.set_ylim(-0.5, Height - 0.5)
ax.set_xticks(np.arange(-0.5, Width, 1), minor=True)
ax.set_yticks(np.arange(-0.5, Height, 1), minor=True)
ax.grid(which="minor", color="black", linestyle="-", linewidth=2)
ax.set_xticks(np.arange(Width))
ax.set_yticks(np.arange(Height))
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.invert_yaxis()

arrow_length = 0.3
text_offset = 0.1
for i in range(Width):
    for j in range(Height):
        center_x, center_y = j, i
        if (i, j) in Block_State:
            ax.add_patch(Rectangle((center_x - 0.5, center_y - 0.5), 1, 1, color="red", alpha=0.3))
            ax.text(center_x, center_y, "X", ha="center", va="center", fontsize=50, weight="bold")
            continue
        if (i, j) in Goal_State:
            ax.add_patch(Rectangle((center_x - 0.5, center_y - 0.5), 1, 1, color="green", alpha=0.3))
            ax.text(center_x, center_y, "G", ha="center", va="center", fontsize=50, weight="bold")
            continue
        action = pi[i, j]
        if action == "up":
            dx, dy = 0, -arrow_length
        elif action == "down":
            dx, dy = 0, arrow_length
        elif action == "left":
            dx, dy = -arrow_length, 0
        elif action == "right":
            dx, dy = arrow_length, 0
        ax.arrow(center_x, center_y, dx, dy, head_width=0.1, head_length=0.15, fc="blue", ec="blue", lw=5)
        ax.text(center_x, center_y + text_offset, f"{v[i, j]:.1f}",
                ha="center", va="center", fontsize=40, color="lightgray", weight="bold", alpha=0.8)

for i in range(Width):
    for j in range(Height):
        ax.text(j, -0.55, f"{j}", ha="center", va="center", fontsize=20)
        ax.text(-0.55, i, f"{i}", ha="center", va="center", fontsize=20, rotation=90)

plt.title("Grid World", pad=20)
plt.tight_layout()
plt.show()

Results

[['down' 'right' 'right' 'down']
 ['down' 'up' 'Block' 'down']
 ['down' 'Block' 'down' 'down']
 ['right' 'right' 'right' 'Goal']]
[[-4.0951 -3.439  -2.71   -1.9   ]
 [-3.439  -4.0951  0.     -1.    ]
 [-2.71    0.     -1.      0.    ]
 [-1.9    -1.      0.      0.    ]]

The arrows show the policy, and the gray numbers are the state-value function under that policy.
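As a quick sanity check of the printed values: cells from which a single move reaches the goal, such as $(2, 3)$ and $(3, 2)$, get $\text{Goal\_Reward} + \gamma \cdot 0 = 0$, and cell $(0, 3)$ backs up through $(1, 3)$ as $-1 + 0.9 \times (-1) = -1.9$, both matching the arrays above.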
