当前位置：首页 > news >正文

PyTorch深度学习框架60天进阶学习计划 - 第51天：扩散模型原理（二）

news 来源：原创 2025/4/26 6:35:24

PyTorch深度学习框架60天进阶学习计划 - 第51天：扩散模型原理（二）

第二部分：扩散模型的高级理论与优化方法

在第一部分中，我们详细介绍了DDPM的基本原理、变分下界推导和基本实现。在这第二部分中，我们将深入探讨扩散模型的高级理论、加速采样方法、连续时间建模，以及各种优化技巧。我们还将分析不同变体模型的核心思想，为读者提供全面的理论理解和实践指导。

1. DDIM: 确定性采样与加速生成

DDPM的一个主要缺点是需要很多采样步骤（通常是1000步），这使得生成过程相当慢。去噪扩散隐式模型(DDIM)提出了一种巧妙的方法来加速采样过程，同时保持生成质量。

1.1 从DDPM到DDIM的理论推导

DDIM的核心思想是将DDPM重新解释为一个更一般的非马尔可夫过程，这样可以设计出更高效的采样方案。

在DDPM中，前向过程定义了从x₀到xₜ的转移关系：

$q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$

而反向过程被参数化为：

$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$

DDIM的关键洞见是：我们可以定义一个更一般的条件分布 $q_\sigma(x_{t-1}|x_t, x_0)$ ：

$q_\sigma(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \mu_t(x_t, x_0), \sigma_t^2 I)$

当 $\sigma_t = \sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t}$ 时，这个分布与DDPM一致。但如果我们设置 $\sigma_t = 0$ ，则得到一个确定性的过程，这正是DDIM的关键。

1.2 DDIM的确定性采样算法

DDIM的采样公式为：

$x_{t-1} = \sqrt{\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\alpha_t}\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}}\epsilon_\theta(x_t, t)$

通过跳过中间步骤，我们可以大大减少采样所需的步数，从而加速生成过程。

以下是DDIM采样算法的PyTorch实现：

def ddim_sample(model, n_samples, image_size, channels=3, device="cuda", n_steps=100, eta=0.0):"""使用DDIM进行加速采样参数:model: 噪声预测模型n_samples: 样本数量image_size: 图像大小channels: 通道数device: 计算设备n_steps: 采样步数 (通常远小于训练使用的步数)eta: 随机性参数 (0为完全确定性, 1为DDPM)返回:生成的样本"""# 设置采样步长with torch.no_grad():# 初始化为纯噪声x = torch.randn(n_samples, channels, image_size, image_size).to(device)# 设置采样时间步（为了加速，我们使用更少的步骤）timesteps = torch.linspace(1, 999, n_steps).long().to(device)# 初始化进度条progress_bar = tqdm(timesteps, desc="DDIM Sampling")# 逐步去噪for i, t in enumerate(progress_bar):# 预测噪声predicted_noise = model(x, t.expand(n_samples))# 计算当前时间步的alpha和alpha_baralpha = 1 - betas[t]alpha_bar = alpha_bars[t]# 如果这不是最后一步，获取下一个时间步的值if i < len(timesteps) - 1:next_t = timesteps[i + 1]alpha_next = 1 - betas[next_t]alpha_bar_next = alpha_bars[next_t]else:next_t = torch.tensor([0]).to(device)alpha_next = 1.0alpha_bar_next = 1.0# 计算x0预测值（denoised image）pred_x0 = (x - torch.sqrt(1 - alpha_bar) * predicted_noise) / torch.sqrt(alpha_bar)# 指定方差（随机性）sigma = eta * torch.sqrt((1 - alpha_bar_next) / (1 - alpha_bar) * (1 - alpha_bar / alpha_bar_next))# 计算均值c1 = torch.sqrt(alpha_bar_next / alpha_bar)c2 = torch.sqrt(1 - alpha_bar_next - sigma**2)mean = c1 * pred_x0 + c2 * predicted_noise# 添加噪声（如果eta > 0）noise = torch.randn_like(x) if eta > 0 else torch.zeros_like(x)x = mean + sigma * noise# 每隔一定步数显示中间结果if i % (n_steps // 5) == 0 or i == len(timesteps) - 1:progress_bar.set_postfix({"step": f"{i+1}/{n_steps}"})# 将图像剪裁到正确的范围 [-1, 1]x = torch.clamp(x, -1.0, 1.0)# 转换到 [0, 1]x = (x + 1) / 2return x

1.3 DDPM与DDIM的对比分析

特性	DDPM	DDIM
理论基础	马尔可夫链	非马尔可夫过程
采样步数	通常1000步	可以减少到10-50步
确定性	随机过程	可以是确定性的
计算复杂度	高	中等
内插/外插	不支持	支持图像编辑和内插
生成质量	高	略低，但在步数增加时接近DDPM

2. 连续时间扩散模型与SDE表示

扩散模型的另一个强大表述是将其视为连续时间随机微分方程(SDE)的解。这一视角不仅提供了更优雅的理论框架，还启发了新的采样算法。

2.1 扩散模型的SDE表示

当时间步长趋于零时，DDPM的离散过程收敛到一个连续时间的SDE：

$d x = f (x, t) d t + g (t) d w$

其中 $f (x, t)$ 是漂移项， $g (t)$ 是扩散系数， $w$ 是标准维纳过程。对于方差保持(VP)SDE，这些项为：

$-\frac{\beta(t)}{2}x$
$\sqrt{\beta(t)}$

其中 $\beta(t)$ 是连续时间上的噪声调度。

2.2 反向SDE和采样

最令人惊讶的是，生成过程对应的是原始SDE的时间反向版本：

$g(t)^2\nabla_x \log p_t(x)]dt + g(t)d\bar{w}$

其中 $\nabla_x \log p_t(x)$ 是分数函数（score function）， $\bar{w}$ 是反向时间的维纳过程。

通过估计分数函数，我们可以使用各种数值求解器来求解这个反向SDE，从而实现更高效的采样。

def sde_sample(score_model, n_samples, image_size, channels=3, device="cuda", n_steps=100, sde_type="VP", solver="euler"):"""使用SDE方法采样参数:score_model: 分数估计模型n_samples: 样本数量image_size: 图像大小channels: 通道数device: 计算设备n_steps: 积分步数sde_type: SDE类型，"VP"或"VE"solver: 积分求解器，"euler"或"heun"返回:生成的样本"""# 初始化为标准正态分布x = torch.randn(n_samples, channels, image_size, image_size).to(device)# 定义SDE参数if sde_type == "VP":# 方差保持SDEbeta_min, beta_max = 0.1, 20.0beta_fn = lambda t: beta_min + t * (beta_max - beta_min)drift_fn = lambda x, t: -0.5 * beta_fn(t) * xdiffusion_fn = lambda t: torch.sqrt(torch.tensor(beta_fn(t)))else:# 方差爆炸SDE（简化）sigma_min, sigma_max = 0.01, 50.0sigma_fn = lambda t: sigma_min * (sigma_max / sigma_min) ** tdrift_fn = lambda x, t: torch.zeros_like(x)diffusion_fn = lambda t: torch.sqrt(torch.tensor(sigma_fn(t) * 2 * torch.log(sigma_max / sigma_min)))# 设置积分时间点time_steps = torch.linspace(1.0, 0.0, n_steps + 1).to(device)dt = time_steps[0] - time_steps[1]# 逆向SDE积分with torch.no_grad():for i in range(n_steps):t = time_steps[i]# 获取分数估计score = score_model(x, t.expand(n_samples))# 计算漂移项drift = drift_fn(x, t)diffusion = diffusion_fn(t)# 反向漂移（添加分数项）drift_with_score = drift - diffusion**2 * scoreif solver == "euler":# Euler-Maruyama方法x = x - drift_with_score * dtif i < n_steps - 1:  # 最后一步不添加噪声x = x + diffusion * torch.sqrt(dt) * torch.randn_like(x)elif solver == "heun":# Heun方法（二阶Runge-Kutta）x_prime = x - drift_with_score * dtscore_prime = score_model(x_prime, time_steps[i+1].expand(n_samples))drift_prime = drift_fn(x_prime, time_steps[i+1])drift_with_score_prime = drift_prime - diffusion_fn(time_steps[i+1])**2 * score_primex = x - 0.5 * (drift_with_score + drift_with_score_prime) * dtif i < n_steps - 1:x = x + diffusion * torch.sqrt(dt) * torch.randn_like(x)if i % (n_steps // 10) == 0:print(f"Step {i+1}/{n_steps}, t={t.item():.4f}")# 将图像剪裁到正确的范围x = torch.clamp(x, -1.0, 1.0)x = (x + 1) / 2  # 转换到 [0, 1] 范围return x

2.3 不同数值求解器的对比

SDE表示的一个主要优势是可以使用各种高级数值求解器来提高采样效率：

求解器	描述	优点	缺点
Euler-Maruyama	一阶方法	简单、计算量小	精度较低
Heun	二阶Runge-Kutta	精度提高	计算量增加一倍
DPM-Solver	高阶求解器	高精度、加速	实现复杂
PNDM	伪数值方法	加速采样	适用性受限

3. Score-Based生成模型与扩散模型的统一

Score-Based生成模型(SGM)和扩散模型(DM)虽然起源不同，但已被证明在数学上是等价的。这种统一观点不仅加深了我们的理论理解，还促进了更高效算法的发展。

3.1 评分匹配与去噪扩散

评分匹配的目标是估计数据分布的对数梯度（评分函数）：

$\nabla_x \log p(x)$

通过扭曲数据分布（添加噪声），我们可以在一系列噪声水平上训练评分估计器。对于多个噪声水平的加权评分匹配目标，可以重写为：

$L_{SM} = \mathbb{E}_{t\sim\mathcal{U}[0,1], x_0, \epsilon}\left[w(t)\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$

这与DDPM的目标函数惊人地相似，表明两种方法本质上是一致的。

3.2 统一视角下的损失函数

从统一的视角来看，不同的权重函数 $w (t)$ 对应不同的训练目标：

$w (t) = 1$ : 简化的DDPM目标
$\sigma_t^2$ : 对数似然的变分下界
$\sigma_t$ : 改进的SGM目标

这种灵活性允许我们根据需要调整训练重点。

def unified_diffusion_loss(model, x_0, t, noise_schedule, loss_type="simple"):"""统一的扩散模型损失函数参数:model: 神经网络模型x_0: 原始数据t: 时间步noise_schedule: 噪声调度loss_type: 损失类型: "simple", "vlb", "sgm"返回:计算的损失"""# 计算噪声参数alpha_bars = noise_schedule.alpha_bars[t]sqrt_alpha_bars = torch.sqrt(alpha_bars)sqrt_one_minus_alpha_bars = torch.sqrt(1 - alpha_bars)# 添加噪声epsilon = torch.randn_like(x_0)x_t = sqrt_alpha_bars.view(-1, 1, 1, 1) * x_0 + sqrt_one_minus_alpha_bars.view(-1, 1, 1, 1) * epsilon# 预测噪声predicted_noise = model(x_t, t)# 根据损失类型选择权重if loss_type == "simple":weight = 1.0elif loss_type == "vlb":weight = sqrt_one_minus_alpha_bars ** 2elif loss_type == "sgm":weight = sqrt_one_minus_alpha_barselse:raise ValueError(f"未知的损失类型: {loss_type}")# 加权MSE损失loss = torch.mean(weight.view(-1, 1, 1, 1) * (epsilon - predicted_noise) ** 2)return loss

4. 变分下界的深入解析和改进

虽然DDPM使用了简化的目标函数，但变分下界(ELBO)的完整形式包含了更多信息，对理解和改进模型很有价值。

4.1 完整ELBO的组成部分

DDPM的完整ELBO可以分解为：

$L_{ELBO} = L_0 + L_1 + ... + L_T$

其中：

$L_0$ 是重构项，衡量 $p_\theta(x_0|x_1)$ 的准确性
$L_1$ 到 $L_{T-1}$ 是KL项，衡量每一步预测的准确性
$L_T$ 是先验匹配项，衡量 $q(x_T|x_0)$ 与 $p(x_T)$ 的接近程度

4.2 完整ELBO的PyTorch实现

def compute_full_elbo(model, x_0, noise_schedule, n_samples=1):"""计算完整的ELBO损失参数:model: 神经网络模型x_0: 原始数据noise_schedule: 噪声调度n_samples: 蒙特卡洛采样数量返回:完整的ELBO损失"""batch_size = x_0.shape[0]device = x_0.deviceT = len(noise_schedule.betas)# 预先计算噪声参数betas = noise_schedule.betasalphas = 1 - betasalpha_bars = noise_schedule.alpha_bars# 初始化损失L_0 = torch.zeros(batch_size, device=device)L_kl = torch.zeros(batch_size, device=device)L_T = torch.zeros(batch_size, device=device)for s in range(n_samples):# 计算L_0（重构项）t = torch.ones(batch_size, device=device).long()noise = torch.randn_like(x_0)x_1 = torch.sqrt(alpha_bars[t]).view(-1, 1, 1, 1) * x_0 + \torch.sqrt(1 - alpha_bars[t]).view(-1, 1, 1, 1) * noisepredicted_noise = model(x_1, t)predicted_x0 = (x_1 - torch.sqrt(1 - alpha_bars[t]).view(-1, 1, 1, 1) * predicted_noise) / \torch.sqrt(alpha_bars[t]).view(-1, 1, 1, 1)# 对于简单起见，使用离散正态分布的负对数似然variance = betas[1] * (1 - alpha_bars[0]) / (1 - alpha_bars[1])L_0 += 0.5 * torch.sum((predicted_x0 - x_0) ** 2, dim=[1, 2, 3]) / variance# 计算L_1到L_{T-1}（KL项）for t in range(2, T):t_tensor = torch.ones(batch_size, device=device).long() * tnoise = torch.randn_like(x_0)x_t = torch.sqrt(alpha_bars[t_tensor]).view(-1, 1, 1, 1) * x_0 + \torch.sqrt(1 - alpha_bars[t_tensor]).view(-1, 1, 1, 1) * noisepredicted_noise = model(x_t, t_tensor)predicted_x0 = (x_t - torch.sqrt(1 - alpha_bars[t_tensor]).view(-1, 1, 1, 1) * predicted_noise) / \torch.sqrt(alpha_bars[t_tensor]).view(-1, 1, 1, 1)# 计算均值和方差mu_t = predicted_x0 * torch.sqrt(alpha_bars[t_tensor-1]).view(-1, 1, 1, 1) + \predicted_noise * torch.sqrt(1 - alpha_bars[t_tensor-1]).view(-1, 1, 1, 1)posterior_variance = betas[t] * (1 - alpha_bars[t-1]) / (1 - alpha_bars[t])posterior_log_variance = torch.log(posterior_variance)# 计算KL散度x_t_1 = torch.sqrt(alpha_bars[t_tensor-1]).view(-1, 1, 1, 1) * x_0 + \torch.sqrt(1 - alpha_bars[t_tensor-1]).view(-1, 1, 1, 1) * noisekl = 0.5 * torch.sum((x_t_1 - mu_t) ** 2, dim=[1, 2, 3]) / posterior_variance - \0.5 * np.prod(x_0.shape[1:]) - \0.5 * posterior_log_varianceL_kl += kl# 计算L_T（先验匹配项）x_T = torch.sqrt(alpha_bars[-1]).view(-1, 1, 1, 1) * x_0 + \torch.sqrt(1 - alpha_bars[-1]).view(-1, 1, 1, 1) * noiseL_T += 0.5 * torch.sum(x_T ** 2, dim=[1, 2, 3])# 平均多个样本L_0 /= n_samplesL_kl /= n_samplesL_T /= n_samples# 总ELBOelbo = L_0 + L_kl + L_Treturn elbo.mean(), (L_0.mean(), L_kl.mean(), L_T.mean())

4.3 改进的变分目标

研究表明，标准ELBO可能不是最优训练目标。已经提出了几种改进方案：

混合损失: 结合简化目标和变分下界
重新加权目标: 根据时间步调整权重
级联重新加权: 在训练过程中动态调整重点

def reweighted_elbo_loss(model, x_0, noise_schedule, gamma=1.0):"""重新加权的ELBO损失参数:model: 神经网络模型x_0: 原始数据noise_schedule: 噪声调度gamma: 重新加权系数返回:重新加权的ELBO损失"""batch_size = x_0.shape[0]device = x_0.deviceT = len(noise_schedule.betas)# 采样时间步t = torch.randint(1, T, (batch_size,), device=device)# 添加噪声noise = torch.randn_like(x_0)x_t = noise_schedule.q_sample(x_0, t, noise)# 预测噪声predicted_noise = model(x_t, t)# 计算SNR权重SNR = noise_schedule.alpha_bars[t] / (1 - noise_schedule.alpha_bars[t])weight = (SNR ** gamma) / (1 + SNR)# 加权MSE损失loss = torch.mean(weight.view(-1, 1, 1, 1) * (noise - predicted_noise) ** 2)return loss

5. 分析离散与连续时间模型的实际差异

虽然理论上离散和连续时间模型在极限情况下是等价的，但在实际应用中它们有显著差异。下面我们通过代码和实验来分析这些差异。

5.1 噪声调度的影响

不同的噪声调度对模型性能有显著影响：

def compare_noise_schedules():"""比较不同噪声调度的影响"""# 定义不同类型的噪声调度schedules = {"线性": lambda t: 1e-4 + t * (0.02 - 1e-4),"余弦": lambda t: 0.008 * (1 - torch.cos(t * math.pi / 2)),"二次": lambda t: 1e-4 + (t ** 2) * (0.02 - 1e-4),"sigmoid": lambda t: 1e-4 + (0.02 - 1e-4) * torch.sigmoid(10 * (t - 0.5))}# 创建时间步长t = torch.linspace(0, 1, 1000)# 计算每种调度的beta值plt.figure(figsize=(12, 8))for name, schedule_fn in schedules.items():beta_t = schedule_fn(t)alpha_t = 1 - beta_talpha_bar_t = torch.cumprod(alpha_t, dim=0)plt.subplot(2, 2, 1)plt.plot(t.numpy(), beta_t.numpy(), label=name)plt.xlabel('t')plt.ylabel('β(t)')plt.title('噪声强度')plt.legend()plt.subplot(2, 2, 2)plt.plot(t.numpy(), alpha_bar_t.numpy(), label=name)plt.xlabel('t')plt.ylabel('α̅(t)')plt.title('信号保留率')plt.legend()# 可视化不同时间步的噪声水平plt.subplot(2, 2, 3)plt.plot(t.numpy(), torch.sqrt(1 - alpha_bar_t).numpy(), label=name)plt.xlabel('t')plt.ylabel('√(1-α̅(t))')plt.title('噪声水平')plt.legend()# 可视化SNRplt.subplot(2, 2, 4)snr = alpha_bar_t / (1 - alpha_bar_t)plt.plot(t.numpy(), torch.log(snr).numpy(), label=name)plt.xlabel('t')plt.ylabel('log(SNR)')plt.title('信噪比(对数尺度)')plt.legend()plt.tight_layout()plt.savefig("noise_schedules_comparison.png")plt.show()

5.2 离散DDPM与连续SDE的实验对比

下面我们设计一个实验，直接比较DDPM和SDE方法在相同数据集上的性能：

def discrete_vs_continuous_experiment(n_steps_list=[1000, 250, 100, 50, 20, 10]):"""对比离散DDPM和连续SDE方法在不同采样步数下的性能"""# 假设我们已经有训练好的模型device = torch.device("cuda" if torch.cuda.is_available() else "cpu")model = load_pretrained_model().to(device)# 设置评估参数n_samples = 16image_size = 32# 为每种方法和步数组合生成样本results = {"DDPM": {},"DDIM": {},"SDE-Euler": {},"SDE-Heun": {}}for n_steps in n_steps_list:print(f"生成样本，步数: {n_steps}")# DDPM采样start_time = time.time()ddpm_samples = ddpm_sample(model, n_samples, image_size, n_steps=n_steps)ddpm_time = time.time() - start_timeresults["DDPM"][n_steps] = {"samples": ddpm_samples.cpu(), "time": ddpm_time}# DDIM采样start_time = time.time()ddim_samples = ddim_sample(model, n_samples, image_size, n_steps=n_steps, eta=0.0)ddim_time = time.time() - start_timeresults["DDIM"][n_steps] = {"samples": ddim_samples.cpu(), "time": ddim_time}# SDE-Euler采样start_time = time.time()sde_euler_samples = sde_sample(model, n_samples, image_size, n_steps=n_steps, solver="euler")sde_euler_time = time.time() - start_timeresults["SDE-Euler"][n_steps] = {"samples": sde_euler_samples.cpu(), "time": sde_euler_time}# SDE-Heun采样 (对于非常小的步数使用更高级的求解器)if n_steps <= 100:start_time = time.time()sde_heun_samples = sde_sample(model, n_samples, image_size, n_steps=n_steps, solver="heun")sde_heun_time = time.time() - start_timeresults["SDE-Heun"][n_steps] = {"samples": sde_heun_samples.cpu(), "time": sde_heun_time}# 计算FID评分 (假设有一个计算FID的函数)for method in results:for n_steps in results[method]:if "samples" in results[method][n_steps]:fid = compute_fid(results[method][n_steps]["samples"])results[method][n_steps]["fid"] = fid# 可视化结果plt.figure(figsize=(20, 15))# 采样时间比较plt.subplot(2, 2, 1)for method in results:steps = sorted(results[method].keys())times = [results[method][s]["time"] for s in steps if "time" in results[method][s]]if times:  # 只有当有数据时才绘制plt.plot(steps[:len(times)], times, marker='o', label=method)plt.xlabel('采样步数')plt.ylabel('采样时间 (秒)')plt.title('不同方法的采样时间')plt.legend()plt.grid(True)# FID比较plt.subplot(2, 2, 2)for method in results:steps = sorted(results[method].keys())fids = [results[method][s]["fid"] for s in steps if "fid" in results[method][s]]if fids:  # 只有当有数据时才绘制plt.plot(steps[:len(fids)], fids, marker='o', label=method)plt.xlabel('采样步数')plt.ylabel('FID评分 (越低越好)')plt.title('不同方法的生成质量')plt.legend()plt.grid(True)# 样本可视化middle_steps_idx = len(n_steps_list) // 2middle_steps = n_steps_list[middle_steps_idx]plt.subplot(2, 2, 3)plot_samples_grid(results["DDPM"][middle_steps]["samples"][:4], results["DDIM"][middle_steps]["samples"][:4],title=f"DDPM vs DDIM ({middle_steps}步)")plt.subplot(2, 2, 4)if "samples" in results["SDE-Euler"][middle_steps] and "samples" in results["SDE-Heun"].get(middle_steps, {}):plot_samples_grid(results["SDE-Euler"][middle_steps]["samples"][:4], results["SDE-Heun"][middle_steps]["samples"][:4],title=f"SDE-Euler vs SDE-Heun ({middle_steps}步)")plt.tight_layout()plt.savefig("discrete_vs_continuous_comparison.png")plt.show()return resultsdef plot_samples_grid(samples1, samples2, title="样本对比"):"""绘制样本网格进行视觉比较"""n = len(samples1)fig, axes = plt.subplots(2, n, figsize=(n*2, 4))for i in range(n):axes[0, i].imshow(samples1[i].squeeze(), cmap='viridis')axes[0, i].axis('off')if i == 0:axes[0, i].set_title("方法1")axes[1, i].imshow(samples2[i].squeeze(), cmap='viridis')axes[1, i].axis('off')if i == 0:axes[1, i].set_title("方法2")plt.suptitle(title)plt.tight_layout()

6. 变分下界与梯度流形

扩散模型的变分下界可以从随机过程的梯度流角度来理解，这为我们提供了另一种理论视角。

6.1 分数匹配与梯度流

在Score-SDE框架中，我们可以将扩散过程解释为梯度流：

$-\nabla_x U(x)dt + \sqrt{2}dw$

其中 $U (x)$ 是能量函数，满足 $\propto e^{-U(x)}$ 。

分数函数 $\nabla_x \log p(x)$ 正是 $-\nabla_x U(x)$ ，表示数据分布的梯度流方向。

6.2 连续时间ELBO

在连续时间设定下，ELBO可以表示为：

$\mathcal{L} = \mathbb{E}_{q}\left[\int_0^T \|\nabla_x \log p_t(x_t) - s_\theta(x_t, t)\|^2 dt\right]$

其中 $s_\theta(x_t, t)$ 是我们的分数估计器。这表明，我们的目标是使估计的分数尽可能接近真实分数函数。

def continuous_time_elbo_loss(score_model, x_0, t, noise_schedule):"""连续时间ELBO损失参数:score_model: 分数估计模型x_0: 原始数据t: 连续时间点 (0到1)noise_schedule: 噪声调度返回:ELBO损失"""batch_size = x_0.shape[0]device = x_0.device# 获取连续时间噪声参数alpha_bar_t = noise_schedule.alpha_bar_continuous(t)# 添加噪声noise = torch.randn_like(x_0)x_t = torch.sqrt(alpha_bar_t).view(-1, 1, 1, 1) * x_0 + \torch.sqrt(1 - alpha_bar_t).view(-1, 1, 1, 1) * noise# 真实分数true_score = -noise / torch.sqrt(1 - alpha_bar_t).view(-1, 1, 1, 1)# 估计分数estimated_score = score_model(x_t, t)# 计算损失loss = 0.5 * torch.mean((true_score - estimated_score) ** 2)return loss

7. 离散与连续表示的统一理解

离散和连续时间表示看似不同，但在理论上可以统一理解。让我们建立这两种表示之间的桥梁。

7.1 噪声预测与分数估计

DDPM训练中的噪声预测与分数估计可以通过以下关系联系起来：

$\epsilon_\theta(x_t, t) = -\sqrt{1 - \bar{\alpha}_t}s_\theta(x_t, t)$

这表明，噪声预测网络其实就是在预测分数函数的缩放版本。

7.2 ODE表示与确定性流

Song等人指出，扩散模型也可以表示为一个常微分方程(ODE)：

$\frac{1}{2}g(t)^2\nabla_x \log p_t(x)]dt$

这种表示使我们能够通过求解ODE实现确定性采样，类似于DDIM的确定性过程。

def ode_sample(score_model, n_samples, image_size, channels=3, device="cuda", n_steps=50):"""使用ODE求解器进行确定性采样参数:score_model: 分数估计模型n_samples: 样本数量image_size: 图像大小channels: 通道数device: 计算设备n_steps: 积分步数返回:生成的样本"""# 初始化为标准正态分布x = torch.randn(n_samples, channels, image_size, image_size).to(device)# 设置噪声调度beta_min, beta_max = 0.1, 20.0beta_fn = lambda t: beta_min + t * (beta_max - beta_min)# 设置积分时间点time_steps = torch.linspace(1.0, 0.0, n_steps + 1).to(device)dt = time_steps[0] - time_steps[1]# ODE积分with torch.no_grad():for i in range(n_steps):t = time_steps[i]# 获取分数估计score = score_model(x, t.expand(n_samples))# 计算ODE右边项drift = -0.5 * beta_fn(t) * xdiffusion_term = -0.5 * beta_fn(t) * score# 更新xx = x + (drift + diffusion_term) * dtif i % (n_steps // 10) == 0:print(f"Step {i+1}/{n_steps}, t={t.item():.4f}")# 将图像剪裁到正确的范围x = torch.clamp(x, -1.0, 1.0)x = (x + 1) / 2  # 转换到 [0, 1] 范围return x

8. 实用化与应用优化

在实际应用中，扩散模型面临的主要挑战是生成速度慢。下面我们介绍几种实用化优化技术。

8.1 加速采样的技术

预训练快速采样器: 训练一个专门的采样模型，用更少的步骤生成高质量样本
进步式蒸馏: 将大模型知识蒸馏到更小、更快的模型中
自适应步长: 根据生成过程中的不确定性动态调整步长

def adaptive_step_sampling(model, n_samples, image_size, channels=3, device="cuda", min_steps=10, max_steps=100, uncertainty_threshold=0.1):"""使用自适应步长的采样方法参数:model: 噪声预测模型n_samples: 样本数量image_size: 图像大小channels: 通道数device: 计算设备min_steps: 最小步数max_steps: 最大步数uncertainty_threshold: 不确定性阈值返回:生成的样本"""# 初始化为纯噪声x = torch.randn(n_samples, channels, image_size, image_size).to(device)# 设置噪声调度beta_min, beta_max = 0.1, 20.0beta_fn = lambda t: beta_min + t * (beta_max - beta_min)# 初始时间步t = torch.ones(n_samples, device=device)# 步数计数step_count = 0# 记录每个样本使用的步数sample_steps = torch.zeros(n_samples, device=device)# 自适应采样with torch.no_grad():while torch.any(t > 0) and step_count < max_steps:# 预测噪声predicted_noise = model(x, t)# 计算不确定性（这里使用一个简单的启发式方法）if step_count > 0:uncertainty = torch.mean((predicted_noise - prev_noise) ** 2, dim=[1, 2, 3])uncertainty = uncertainty / torch.mean(predicted_noise ** 2, dim=[1, 2, 3])else:uncertainty = torch.ones(n_samples, device=device) * 2 * uncertainty_threshold# 保存当前噪声预测prev_noise = predicted_noise.clone()# 确定步长dt = torch.where(uncertainty > uncertainty_threshold,torch.ones_like(t) * (1.0 / max_steps),  # 小步长torch.ones_like(t) * (1.0 / min_steps)   # 大步长)# 确保不会超过0dt = torch.min(dt, t)# 更新时间步t = t - dt# 更新样本步数sample_steps = torch.where(t > 0, sample_steps + 1, sample_steps)# 计算去噪步骤alpha_t = 1 - beta_fn(t)alpha_t_minus_dt = 1 - beta_fn(torch.max(t - dt, torch.zeros_like(t)))# 更新xx_0_pred = (x - torch.sqrt(1 - alpha_t).view(-1, 1, 1, 1) * predicted_noise) / \torch.sqrt(alpha_t).view(-1, 1, 1, 1)mean = torch.sqrt(alpha_t_minus_dt).view(-1, 1, 1, 1) * x_0_pred + \torch.sqrt(1 - alpha_t_minus_dt).view(-1, 1, 1, 1) * predicted_noise# 添加噪声（如果需要）noise = torch.randn_like(x)sigma = torch.sqrt(beta_fn(t)).view(-1, 1, 1, 1) * dt.view(-1, 1, 1, 1)x = mean + sigma * noisestep_count += 1print(f"Step {step_count}, Avg steps per sample: {torch.mean(sample_steps).item():.2f}")# 将图像剪裁到正确的范围x = torch.clamp(x, -1.0, 1.0)x = (x + 1) / 2  # 转换到 [0, 1] 范围print(f"完成采样，平均步数: {torch.mean(sample_steps).item():.2f}")return x

8.2 内存优化

生成高分辨率图像时，内存消耗是一个重要问题。下面是一些内存优化技术：

def memory_efficient_sampling(model, n_samples, image_size, channels=3, device="cuda", n_steps=100):"""内存高效的采样方法参数:model: 噪声预测模型n_samples: 样本数量image_size: 图像大小channels: 通道数device: 计算设备n_steps: 采样步数返回:生成的样本"""# 分块处理大图像max_batch_size = 4  # 根据GPU内存调整all_samples = []for i in range(0, n_samples, max_batch_size):batch_size = min(max_batch_size, n_samples - i)# 初始化为纯噪声x = torch.randn(batch_size, channels, image_size, image_size).to(device)# 设置噪声调度betas = torch.linspace(0.0001, 0.02, 1000).to(device)alphas = 1. - betasalphas_cumprod = torch.cumprod(alphas, dim=0)# 逐步去噪for t in tqdm(reversed(range(1, 1000, 1000 // n_steps)), desc=f"Batch {i//max_batch_size + 1}"):t_tensor = torch.ones(batch_size, device=device).long() * t# 预测噪声with torch.no_grad():predicted_noise = model(x, t_tensor)# 计算去噪参数alpha = alphas[t]alpha_bar = alphas_cumprod[t]beta = betas[t]if t > 1:noise = torch.randn_like(x)else:noise = torch.zeros_like(x)# 更新x（使用较少的中间变量）x = (1 / torch.sqrt(alpha)) * (x - ((1 - alpha) / torch.sqrt(1 - alpha_bar)) * predicted_noise) + \torch.sqrt(beta) * noise# 手动释放内存if t % 100 == 0:torch.cuda.empty_cache()# 将图像剪裁到正确的范围x = torch.clamp(x, -1.0, 1.0)x = (x + 1) / 2  # 转换到 [0, 1] 范围all_samples.append(x.cpu())# 合并所有批次samples = torch.cat(all_samples, dim=0)return samples

9. 扩散模型的实际应用案例

9.1 图像生成与编辑

扩散模型已经成功应用于多种图像生成和编辑任务：

def image_inpainting(model, image, mask, device="cuda", n_steps=100):"""使用扩散模型进行图像修复参数:model: 噪声预测模型image: 待修复的图像 (带有缺失区域)mask: 二进制掩码，指示哪些区域需要修复 (1表示保留，0表示缺失)device: 计算设备n_steps: 采样步数返回:修复后的图像"""# 确保图像和掩码在正确的设备上image = image.to(device)mask = mask.to(device)# 初始化为纯噪声x = torch.randn_like(image).to(device)# 设置噪声调度betas = torch.linspace(0.0001, 0.02, 1000).to(device)alphas = 1. - betasalphas_cumprod = torch.cumprod(alphas, dim=0)# 逐步去噪for t in tqdm(reversed(range(1, 1000, 1000 // n_steps)), desc="Image Inpainting"):t_tensor = torch.ones(image.shape[0], device=device).long() * t# 预测噪声with torch.no_grad():predicted_noise = model(x, t_tensor)# 计算去噪参数alpha = alphas[t]alpha_bar = alphas_cumprod[t]beta = betas[t]if t > 1:noise = torch.randn_like(x)else:noise = torch.zeros_like(x)# 更新xx_update = (1 / torch.sqrt(alpha)) * (x - ((1 - alpha) / torch.sqrt(1 - alpha_bar)) * predicted_noise) + \torch.sqrt(beta) * noise# 对已知区域进行引导（使用原始图像）known_update = imageif t > 1:  # 如果不是最后一步，为已知区域添加相应的噪声t_prev = t - 1000 // n_stepst_prev = max(t_prev, 0)alpha_bar_prev = alphas_cumprod[t_prev] if t_prev > 0 else torch.tensor(1.0).to(device)known_update = torch.sqrt(alpha_bar_prev) * image + \torch.sqrt(1 - alpha_bar_prev) * torch.randn_like(image)# 组合已知区域和生成区域x = mask * known_update + (1 - mask) * x_update# 将图像剪裁到正确的范围x = torch.clamp(x, -1.0, 1.0)x = (x + 1) / 2  # 转换到 [0, 1] 范围return x

9.2 多模态扩散模型

扩散模型也可以扩展到处理多模态数据：

class MultimodalDiffusionModel(nn.Module):"""多模态扩散模型"""def __init__(self, image_size=64, text_dim=768):super().__init__()self.image_size = image_sizeself.text_dim = text_dim# 文本编码器（假设我们使用预训练的CLIP模型）self.text_encoder = None  # 实际中会加载预训练模型# U-Net骨干网络self.unet = SimpleUNet(channels=3, time_dim=256)# 添加文本条件self.text_proj = nn.Sequential(nn.Linear(text_dim, 512),nn.SiLU(),nn.Linear(512, 256),)def forward(self, x, t, text_embed):"""前向传播参数:x: 噪声图像t: 时间步text_embed: 文本嵌入"""# 处理文本嵌入text_features = self.text_proj(text_embed)# 获取时间嵌入（在U-Net内部实现）# 添加文本条件unet_out = self.unet(x, t, text_features)return unet_out

10. 扩散模型未来方向与当前挑战

10.1 主要挑战与解决方向

挑战	现有解决方案	未来研究方向
生成速度慢	DDIM、高级ODE求解器、蒸馏	单步生成、并行推理
内存消耗大	分块处理、梯度检查点	更高效的架构设计、稀疏注意力
训练不稳定	重新加权目标、学习率调度	自适应训练策略、改进的正则化
文本条件控制	CLIP引导、交叉注意力	更强的语义理解、可解释控制
3D生成能力	NeRF+扩散、视图一致性约束	统一的3D生成框架

10.2 扩散模型研究的未来方向

def diffusion_future_research():"""可视化扩散模型未来研究方向"""research_areas = {"速度优化": ["单步或少步生成","并行解码策略","预计算和模型缓存","自适应采样"],"架构创新": ["混合模型架构","稀疏注意力机制","视觉-语言-音频联合建模","模块化设计"],"理论延伸": ["更统一的生成理论","与能量模型的连接","与最优传输理论的联系","贝叶斯观点的扩展"],"应用拓展": ["3D和视频生成","科学数据建模","医疗应用","工业设计辅助"]}# 创建方向图plt.figure(figsize=(15, 10))# 使用雷达图表示研究方向categories = list(research_areas.keys())N = len(categories)# 创建角度均匀分布的点angles = [n / float(N) * 2 * np.pi for n in range(N)]angles += angles[:1]  # 闭合图形# 初始化雷达图ax = plt.subplot(111, polar=True)# 绘制每个类别的轴并标记plt.xticks(angles[:-1], categories)# 绘制边界max_areas = max([len(areas) for areas in research_areas.values()])ax.set_ylim(0, max_areas + 1)values = [len(research_areas[c]) for c in categories]values += values[:1]  # 闭合多边形ax.plot(angles, values)ax.fill(angles, values, alpha=0.1)# 为每个类别添加研究点for i, category in enumerate(categories):angle = angles[i]for j, area in enumerate(research_areas[category]):radius = j + 1x = angley = radiusplt.plot([x], [y], 'o', markersize=10)plt.text(x, y + 0.1, area, horizontalalignment='center' if np.cos(x) < 0.1 else ('right' if np.cos(x) < 0 else 'left'),verticalalignment='center')plt.title("扩散模型未来研究方向", size=20)plt.tight_layout()plt.savefig("diffusion_future_research.png")plt.show()

结论

在本文的第二部分，我们深入探讨了扩散模型的高级理论和优化方法。我们从DDIM的确定性采样开始，讨论了连续时间扩散模型的SDE表示，分析了Score-Based生成模型与扩散模型的统一观点，并详细推导了变分下界的数学基础。我们还比较了离散与连续时间建模的差异，介绍了多种加速采样和内存优化技术，并探讨了扩散模型的实际应用案例和未来发展方向。

通过这两部分的学习，我们现在对扩散模型有了全面的理解，从基本原理到高级应用。扩散模型作为一个强大的生成模型框架，不仅在图像生成领域取得了突破性进展，还在多模态生成、科学数据建模等方面展现出巨大潜力。

清华大学全五版的《DeepSeek教程》完整的文档需要的朋友，关注我私信：deepseek 即可获得。

怎么样今天的内容还满意吗？再次感谢朋友们的观看，关注GZH：凡人的AI工具箱，回复666，送您价值199的AI大礼包。最后，祝您早日实现财务自由，还请给个赞，谢谢！