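Actor-Critic Methods

1. Core Idea

Actor-Critic methods combine policy-based and value-based reinforcement learning. They use two neural networks at the same time:

Actor: the policy network $\pi(a|s; \theta)$, which selects actions
Critic: the value network $q(s, a; w)$, which evaluates how good an action is

The actor and the critic work together: the actor learns how to act, the critic learns how to judge, and the actor improves its policy based on the critic's judgments.

Why is a critic needed? In pure policy-gradient methods such as REINFORCE, the Q-value estimate is the return observed after playing out an entire episode, which has high variance. The critic provides an online, lower-variance (though biased) estimate of the Q-value.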

Actor-Critic overview

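2. Actor: The Policy Network

2.1 Definition

The policy network (actor) uses a neural network to approximate the policy function:

$\pi(a|s) \approx \pi(a|s; \theta)$

where $\theta$ denotes the trainable parameters of the neural network.

2.2 Input and Output

Input: the state $s$, for example a single frame of the game screen
Output: a probability distribution over the action space

2.3 Network Architecture

State s (game frame)
    ↓
Convolutional layers (Conv) → extract image features
    ↓
Fully connected layer (Dense)
    ↓
Softmax layer → outputs a probability distribution
    ↓
{"left": 0.2, "right": 0.1, "up": 0.7}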

Policy Network Actor

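2.4 Softmax Output

The output is a valid probability distribution: every probability is non-negative and the probabilities of all actions sum to 1:

$\sum_{a \in \mathcal{A}} \pi(a|s; \theta) = 1$

This is why the output layer uses a softmax activation.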
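As a minimal sketch of the softmax step, the snippet below turns raw scores into action probabilities. The `logits` values are made up for illustration and stand in for the outputs of the last dense layer; subtracting the maximum is a common numerical-stability trick and is not part of the original description.

```python
import math

def softmax(logits):
    """Map raw scores to a probability distribution."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

actions = ["left", "right", "up"]
logits = [0.5, -0.2, 1.7]  # hypothetical outputs of the last dense layer
probs = softmax(logits)
policy = dict(zip(actions, probs))  # action name -> probability
```

The largest logit always receives the largest probability, and the probabilities sum to 1 regardless of the scale of the logits.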

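3. Critic: The Value Network

3.1 Definition

The value network (critic) uses a neural network to approximate the action-value function:

$Q_\pi(s, a) \approx q(s, a; w)$

where $w$ denotes the trainable parameters of the neural network.

3.2 Input and Output

Input: the state $s$ and the action $a$
Output: the approximate action value, a scalar $q(s, a; w)$

3.3 Network Architecture

The core design of the value network is a two-stream input whose features are concatenated:

State s (image)         Action a (discrete)
    ↓                      ↓
  Conv                    Dense
    ↓                      ↓
  features ──→ [concatenate] ──→ Dense ──→ q(s, a; w)

In detail:

The state $s$ passes through convolutional layers that extract image features
The action $a$ is encoded into features by a fully connected layer
The two feature vectors are concatenated and passed through fully connected layers that output a single scalar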
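The two-stream design can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the state is a small feature vector and a dense layer stands in for the convolutional stream, the action is one-hot encoded, and all sizes, the ReLU activation, and the random weights are hypothetical choices not taken from the text.

```python
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, x) for x in v]

def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N_ACTIONS, STATE_DIM, HIDDEN = 3, 4, 5                    # hypothetical sizes
W_s, b_s = init(HIDDEN, STATE_DIM, rng), [0.0] * HIDDEN   # state stream (stand-in for Conv)
W_a, b_a = init(HIDDEN, N_ACTIONS, rng), [0.0] * HIDDEN   # action stream
W_o, b_o = init(1, 2 * HIDDEN, rng), [0.0]                # head on the concatenated features

def q_value(state, action):
    """Approximate q(s, a; w) for a feature-vector state and a discrete action index."""
    one_hot = [1.0 if i == action else 0.0 for i in range(N_ACTIONS)]
    f_s = relu(dense(state, W_s, b_s))
    f_a = relu(dense(one_hot, W_a, b_a))
    return dense(f_s + f_a, W_o, b_o)[0]  # list concatenation, then a scalar output

q = q_value([0.1, 0.2, 0.3, 0.4], action=1)
```

Here `w` corresponds to all of `W_s`, `W_a`, `W_o` and their biases taken together.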

Value Network Critic

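3.4 Why Is the Action an Input?

Unlike a DQN, which takes only the state as input and outputs a Q-value for every action, this critic takes both the state and the action as input, because it estimates $Q(s, a)$: the value of taking a specific action in a specific state.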

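4. Approximating the State-Value Function

4.1 True State-Value Function

$V_\pi(s) = \sum_a \pi(a|s) \cdot Q_\pi(s, a)$

4.2 Approximation with Neural Networks

Approximate the policy with the actor network $\pi(a|s; \theta)$ and the Q-value with the critic network $q(s, a; w)$: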

$V(s; \theta, w) = \sum_a \pi(a|s; \theta) \cdot q(s, a; w)$
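A short worked example of this weighted sum, with made-up numbers: the probabilities reuse the three-action distribution from the architecture section, and the Q-values are purely hypothetical.

```python
def state_value(probs, q_values):
    """V(s) = sum over actions of pi(a|s) * q(s, a)."""
    return sum(p * q for p, q in zip(probs, q_values))

probs = [0.2, 0.1, 0.7]        # pi(a|s; theta) for left, right, up
q_values = [1.0, -2.0, 3.0]    # hypothetical q(s, a; w) for the same actions
v = state_value(probs, q_values)  # 0.2*1.0 + 0.1*(-2.0) + 0.7*3.0 = 2.1
```

Raising the probability of the action with the highest Q-value raises `v`, which is the direction in which the actor is trained.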

State-Value Function Approximation

4.3 How the Two Networks Cooperate

Component	Network	Parameters	Role
Actor	$\pi(a|s; \theta)$	$\theta$	Selects actions
Critic	$q(s, a; w)$	$w$	Evaluates actions

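Both networks are tied to the same quantity, $V(s; \theta, w)$, but they are trained toward different goals: the actor updates $\theta$ to increase $V(s; \theta, w)$, while the critic updates $w$ so that $q(s, a; w)$ estimates the return more accurately.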

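5. Training Procedure

At every time step, Actor-Critic performs the following operations:

Training flow

Overall steps

Observe the state $s_t$
Sample an action $a_t \sim \pi(\cdot|s_t; \theta_t)$
Execute $a_t$; the environment returns the new state $s_{t+1}$ and the reward $r_t$
Update the critic: use temporal difference (TD) learning to update the value-network parameters $w$
Update the actor: use the policy gradient to update the policy-network parameters $\theta$

Key point: an update can be made at every time step, with no need to wait for the episode to end. This is a major advantage over REINFORCE.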

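6. Updating the Value Network: Temporal Difference

6.1 TD Target

The critic is trained with the temporal difference (TD) method.

First compute two Q-value estimates:

Current Q-value: $q(s_t, a_t; w_t)$
Next-state Q-value: $q(s_{t+1}, a_{t+1}; w_t)$, where $a_{t+1}$ is sampled from $\pi(\cdot|s_{t+1}; \theta_t)$

The TD target is defined as:

$y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; w_t)$

where $\gamma$ is the discount factor.

6.2 Loss Function

The TD loss is the squared error between the current estimate and the TD target, with $y_t$ treated as a constant:

$L(w) = \frac{1}{2} \left[ q(s_t, a_t; w) - y_t \right]^2$

6.3 Gradient Descent Update

Minimize the loss by gradient descent on $w$:

$w_{t+1} = w_t - \alpha \cdot \left. \frac{\partial L(w)}{\partial w} \right|_{w = w_t}$

where $\alpha$ is the critic's learning rate.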

Update value network using TD

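6.4 TD Error

Because $y_t$ is treated as a constant, the chain rule gives the gradient of the loss:

$\frac{\partial L(w)}{\partial w} = \left[ q(s_t, a_t; w) - y_t \right] \cdot \frac{\partial q(s_t, a_t; w)}{\partial w}$

Evaluated at $w = w_t$, the term in square brackets is the TD error:

$\delta_t = q(s_t, a_t; w_t) - \left( r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; w_t) \right)$

The final update rule is therefore: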

$w_{t+1} = w_t - \alpha \cdot \delta_t \cdot \left. \frac{\partial q(s_t, a_t; w)}{\partial w} \right|_{w = w_t}$
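The TD target, TD error, and critic update can be traced on a tiny example. The sketch below assumes a linear critic $q(s, a; w) = w^\top \phi(s, a)$, so that $\partial q / \partial w = \phi(s, a)$; the feature vectors `phi_t` and `phi_next`, the reward, and the hyperparameters are hypothetical values chosen so the arithmetic is easy to follow.

```python
def q_linear(w, phi):
    """Linear critic: q = w . phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def critic_td_step(w, phi_t, r_t, phi_next, gamma=0.9, alpha=0.1):
    """One TD update of the critic; returns the new weights and the TD error."""
    q_t = q_linear(w, phi_t)
    q_next = q_linear(w, phi_next)
    y_t = r_t + gamma * q_next   # TD target, treated as a constant
    delta = q_t - y_t            # TD error
    # For a linear critic the gradient of q with respect to w is phi_t.
    new_w = [wi - alpha * delta * xi for wi, xi in zip(w, phi_t)]
    return new_w, delta

w0 = [0.0, 0.0]
phi_t, phi_next = [1.0, 0.0], [0.0, 1.0]   # hypothetical features of (s_t, a_t) and (s_{t+1}, a_{t+1})
w1, delta = critic_td_step(w0, phi_t, r_t=1.0, phi_next=phi_next)
# q_t = 0 and y_t = 1, so delta = -1 and the first weight moves from 0 to 0.1
```

After the step, $q(s_t, a_t; w)$ has moved from 0 toward the target 1, which is exactly what minimizing the TD loss is meant to do.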

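7. Updating the Policy Network: Policy Gradient

7.1 Policy Gradient Recap

The derivative of the state-value function with respect to $\theta$ (the policy gradient) is:

$\frac{\partial V(s; \theta, w)}{\partial \theta} = \mathbb{E}_A \left[ \frac{\partial \log \pi(A|s; \theta)}{\partial \theta} \cdot q(s, A; w) \right]$

7.2 An Auxiliary Function

Let:

$g(a, \theta) = \frac{\partial \log \pi(a|s; \theta)}{\partial \theta} \cdot q(s, a; w)$

Then the policy gradient can be written as:

$\frac{\partial V(s; \theta, w)}{\partial \theta} = \mathbb{E}_A [g(A, \theta)]$

7.3 Stochastic Policy Gradient Ascent

Because the expectation generally cannot be computed exactly, it is approximated by sampling:

Random sampling: $a \sim \pi(\cdot|s_t; \theta_t)$ (so $g(a, \theta)$ is an unbiased estimate of the policy gradient)
Gradient ascent:

$\theta_{t+1} = \theta_t + \beta \cdot g(a, \theta_t)$

where $\beta$ is the actor's learning rate.

Written out in full: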

$\theta_{t+1} = \theta_t + \beta \cdot q(s_t, a_t; w) \cdot \left. \frac{\partial \log \pi(a_t|s_t; \theta)}{\partial \theta} \right|_{\theta = \theta_t}$
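To make the actor update concrete, here is a minimal sketch for a tabular softmax policy, where $\theta$ holds one logit per action of the current state. For this parameterization, $\partial \log \pi(a|s; \theta) / \partial \theta_b = \mathbb{1}[b = a] - \pi(b|s; \theta)$. The tabular form, the `weight` value, and the learning rate are assumptions made for illustration; a neural network would obtain the same gradient by backpropagation.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_step(theta_s, action, weight, beta=0.1):
    """One stochastic policy-gradient ascent step on the logits of a single state."""
    probs = softmax(theta_s)
    # d log pi(action|s) / d theta_b for a softmax over logits
    grad_log_pi = [(1.0 if b == action else 0.0) - probs[b] for b in range(len(theta_s))]
    return [t + beta * weight * g for t, g in zip(theta_s, grad_log_pi)]

theta_s = [0.0, 0.0, 0.0]                               # uniform policy over 3 actions
new_theta = actor_step(theta_s, action=2, weight=5.0)   # weight plays the role of q(s_t, a_t; w)
new_probs = softmax(new_theta)
```

With a positive weight, the sampled action becomes more likely and the other actions less likely; a negative weight would push in the opposite direction.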

Update policy network using policy gradient

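8. The Complete Algorithm

8.1 The Actor-Critic Algorithm (9 Steps)

The operations at each step are as follows:

Steps 1-3: Interaction and sampling

$\begin{aligned} &\text{1. Observe the state } s_t \text{ and sample an action } a_t \sim \pi(\cdot|s_t; \theta_t) \\ &\text{2. Execute } a_t \text{; the environment returns } s_{t+1} \text{ and } r_t \\ &\text{3. Sample a virtual action } \tilde{a}_{t+1} \sim \pi(\cdot|s_{t+1}; \theta_t) \quad (\text{not executed}) \end{aligned}$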

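Note: the action $\tilde{a}_{t+1}$ in step 3 is a virtual action. It is used only to compute the Q-value estimate and is never executed in the environment.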
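Sampling the real action and the virtual action can be sketched with the standard library's `random.choices`. The two probability vectors below are hypothetical stand-ins for the actor's outputs at $s_t$ and $s_{t+1}$, and the fixed seed is only there to make the example repeatable.

```python
import random

def sample_action(probs, rng):
    """Draw one action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(42)
pi_t = [0.2, 0.1, 0.7]      # hypothetical pi(.|s_t; theta_t)
pi_next = [0.5, 0.4, 0.1]   # hypothetical pi(.|s_{t+1}; theta_t)

a_t = sample_action(pi_t, rng)           # step 1: this action is executed in the environment
a_virtual = sample_action(pi_next, rng)  # step 3: only fed to the critic, never executed

# Empirical check that the sampler follows pi_t
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_action(pi_t, rng)] += 1
```

At the next time step a fresh action is sampled from the updated policy, so the virtual action is simply discarded after the critic has used it.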

Steps 4-5: Compute the TD error

$\begin{aligned} &\text{4. Evaluate the value network:} \\ &\quad q_t = q(s_t, a_t; w_t), \quad q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; w_t) \\ &\text{5. Compute the TD error:} \\ &\quad \delta_t = q_t - (r_t + \gamma \cdot q_{t+1}) \end{aligned}$

Steps 6-7: Update the critic

$\begin{aligned} &\text{6. Compute the critic gradient:} \\ &\quad \mathbf{d}_{w,t} = \left. \frac{\partial q(s_t, a_t; w)}{\partial w} \right|_{w = w_t} \\ &\text{7. Update the value network:} \\ &\quad \mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \cdot \delta_t \cdot \mathbf{d}_{w,t} \end{aligned}$

Steps 8-9: Update the actor

$\begin{aligned} &\text{8. Compute the actor gradient:} \\ &\quad \mathbf{d}_{\theta,t} = \left. \frac{\partial \log \pi(a_t|s_t; \theta)}{\partial \theta} \right|_{\theta = \theta_t} \\ &\text{9. Update the policy network:} \\ &\quad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t} \end{aligned}$

Summary of Algorithm

8.2 Algorithm Flowchart

┌────────────────────────────────────────────────────────────┐
│                Actor-Critic algorithm flow                 │
├────────────────────────────────────────────────────────────┤
│                                                            │
│   ┌──────────────────────────────────────────────┐         │
│   │  1. Observe s_t, sample a_t ~ π(·|s_t; θ_t)  │         │
│   │  2. Execute a_t → receive s_{t+1}, r_t       │         │
│   │  3. Sample ã_{t+1} ~ π(·|s_{t+1}; θ_t)       │         │
│   │     (virtual action, not executed)           │         │
│   └──────────────────┬───────────────────────────┘         │
│                      ↓                                     │
│   ┌──────────────────────────────────────────────┐         │
│   │  Critic update (TD learning)                 │         │
│   │  4. q_t = q(s_t, a_t; w_t)                   │         │
│   │     q_{t+1} = q(s_{t+1}, ã_{t+1}; w_t)       │         │
│   │  5. δ_t = q_t - (r_t + γ·q_{t+1})            │         │
│   │  6. d_w,t = ∂q(s_t, a_t; w)/∂w               │         │
│   │  7. w_{t+1} = w_t - α·δ_t·d_w,t              │         │
│   └──────────────────┬───────────────────────────┘         │
│                      ↓                                     │
│   ┌──────────────────────────────────────────────┐         │
│   │  Actor update (policy gradient)              │         │
│   │  8. d_θ,t = ∂log π(a_t|s_t; θ)/∂θ            │         │
│   │  9. θ_{t+1} = θ_t + β·q_t·d_θ,t              │         │
│   └──────────────────┬───────────────────────────┘         │
│                      ↓                                     │
│      Proceed to the next time step t+1                     │
│                                                            │
└────────────────────────────────────────────────────────────┘

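9. Policy Gradient with a Baseline

9.1 Motivation

In step 9 of the original algorithm, the actor uses $q_t$ as the weight on the policy gradient:

$\theta_{t+1} = \theta_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta, t}$

The problem is that $q_t$ can have a large absolute value, which makes the gradient updates unstable. A baseline is needed to reduce the variance.

9.2 Using the TD Error as a Baseline

Improvement: replace $q_t$ with the TD error. Because $\delta_t$ is defined as the estimate minus the TD target, an action that turned out better than expected has $\delta_t < 0$, so the weight on the gradient is $-\delta_t$:

$\theta_{t+1} = \theta_t - \beta \cdot \delta_t \cdot \mathbf{d}_{\theta, t}$

Written out in full:

$\theta_{t+1} = \theta_t - \beta \cdot \delta_t \cdot \left. \frac{\partial \log \pi(a_t|s_t; \theta)}{\partial \theta} \right|_{\theta = \theta_t}$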

Policy Gradient with Baseline

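9.3 What the TD Error Tells Us

$\delta_t = q_t - (r_t + \gamma \cdot q_{t+1})$

Interpretation of the TD error:

$\delta_t > 0$: the current Q-value estimate is higher than the TD target, so the action turned out worse than expected
$\delta_t < 0$: the current Q-value estimate is lower than the TD target, so the action turned out better than expected

Weighting the policy gradient by the TD error therefore tells the actor more precisely which actions to reinforce ($\delta_t < 0$, where the weight $-\delta_t$ is positive) and which to suppress ($\delta_t > 0$, where the weight $-\delta_t$ is negative).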
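A quick numeric check of this sign convention, as a sketch with made-up numbers (the values of $q_t$, $r_t$, $q_{t+1}$ and $\gamma = 0.9$ are assumptions for illustration):

```python
def td_error(q_t, r_t, q_next, gamma=0.9):
    """delta_t = q_t - (r_t + gamma * q_{t+1})."""
    return q_t - (r_t + gamma * q_next)

# Surprisingly good outcome: the TD target 2.9 exceeds the estimate 1.0
better = td_error(q_t=1.0, r_t=2.0, q_next=1.0)    # 1.0 - 2.9 = -1.9

# Surprisingly bad outcome: the TD target -0.1 is below the estimate 1.0
worse = td_error(q_t=1.0, r_t=-1.0, q_next=1.0)    # 1.0 - (-0.1) = 1.1

# The actor's weight on d log pi is -delta_t, so it is positive in the first
# case (reinforce the action) and negative in the second (suppress it).
actor_weights = (-better, -worse)
```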

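Note: strictly speaking, $\delta_t$ is not itself a baseline; it is the gap between the estimate $q_t$ and the TD target $(r_t + \gamma \cdot q_{t+1})$. Read as $-\delta_t = (r_t + \gamma \cdot q_{t+1}) - q_t$, the TD target acts as a one-step estimate of the action value and $q_t$ is the quantity subtracted from it. A baseline that provably leaves the policy gradient unbiased must not depend on the action $a_t$; the common choice is a state-value estimate of $s_t$.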

9.4 Comparing the Two Versions

	Original version	With-baseline version
Actor update	$\theta_{t+1} = \theta_t + \beta \cdot q_t \cdot \mathbf{d}_{\theta,t}$	$\theta_{t+1} = \theta_t - \beta \cdot \delta_t \cdot \mathbf{d}_{\theta,t}$
Weight	$q_t$ (raw Q-value)	$-\delta_t$ (negated TD error)
Variance	Higher	Lower
Stability	Worse	Better

9.5 Complete Algorithm (with Baseline)

1. Observe the state s_t, sample a_t ~ π(·|s_t; θ_t)
2. Execute a_t → receive s_{t+1}, r_t
3. Sample a virtual action ã_{t+1} ~ π(·|s_{t+1}; θ_t)  (not executed)
4. q_t = q(s_t, a_t; w_t), q_{t+1} = q(s_{t+1}, ã_{t+1}; w_t)
5. δ_t = q_t - (r_t + γ·q_{t+1})
6. d_w,t = ∂q(s_t,a_t;w)/∂w |_{w=w_t}
7. w_{t+1} = w_t - α·δ_t·d_w,t              ← critic update
8. d_θ,t = ∂log π(a_t|s_t;θ)/∂θ |_{θ=θ_t}
9. θ_{t+1} = θ_t - β·δ_t·d_θ,t              ← actor update (with baseline)
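The nine steps can be put together in a small runnable sketch. Everything beyond the update rules is an assumption made for illustration: the actor and critic are tables (`theta[s][a]` holds logits, `q[s][a]` holds values) rather than neural networks, `toy_env` is a hypothetical two-state environment, and the hyperparameters are arbitrary. The actor step uses the weight $-\delta_t$, so that actions with $\delta_t < 0$ (better than expected, as described in section 9.3) become more likely.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def actor_critic_step(theta, q, s, a, r, s_next, a_virtual,
                      gamma=0.9, alpha=0.1, beta=0.05):
    """Steps 4-9 for tabular theta and q; updates both in place and returns delta."""
    q_t, q_next = q[s][a], q[s_next][a_virtual]      # step 4
    delta = q_t - (r + gamma * q_next)               # step 5
    q[s][a] -= alpha * delta                         # steps 6-7: dq/dw is 1 for the visited entry
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad = (1.0 if b == a else 0.0) - probs[b]   # step 8: d log pi(a|s) / d theta[s][b]
        theta[s][b] += beta * (-delta) * grad        # step 9: weight is -delta
    return delta

def toy_env(s, a):
    """Hypothetical 2-state environment: action 1 pays reward 1; the next state equals the action."""
    return a, float(a)

rng = random.Random(0)
theta = [[0.0, 0.0], [0.0, 0.0]]
q = [[0.0, 0.0], [0.0, 0.0]]
s = 0
for _ in range(2000):
    a = sample(softmax(theta[s]), rng)               # step 1
    s_next, r = toy_env(s, a)                        # step 2
    a_virtual = sample(softmax(theta[s_next]), rng)  # step 3 (not executed)
    actor_critic_step(theta, q, s, a, r, s_next, a_virtual)
    s = s_next

learned_policy = [softmax(row) for row in theta]
```

Inspecting `learned_policy` and `q` after the loop shows how the two tables have moved. To recover the original version from section 8, replace `-delta` in step 9 with `q_t`.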

10. Summary of Concept Relationships

┌────────────────────────────────────────────────────────────┐
│                    Actor-Critic method                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  ┌──────────────────────┐      ┌──────────────────────┐    │
│  │ Actor                │      │ Critic               │    │
│  │ policy network       │      │ value network        │    │
│  │                      │      │                      │    │
│  │ π(a|s; θ)            │      │ q(s,a; w)            │    │
│  │                      │      │                      │    │
│  │ output: probabilities│      │ output: scalar       │    │
│  │ role: selects actions│      │ role: scores actions │    │
│  └───────────┬──────────┘      └───────────┬──────────┘    │
│              │                             │               │
│              │───────── V(s;θ,w) ──────────│               │
│              │    = Σ π(a|s;θ)·q(s,a;w)    │               │
│              │                             │               │
│  policy-gradient update        TD gradient-descent update  │
│              │                             │               │
│  θ ← θ - β·δ_t·∂logπ/∂θ        w ← w - α·δ_t·∂q/∂w         │
│  (with baseline)               (minimizes TD error)        │
│                                                            │
│  ──────────────────────────────────────────────────        │
│  Shared signal: δ_t = q_t - (r_t + γ·q_{t+1})              │
│  The TD error drives the updates of both networks          │
│                                                            │
└────────────────────────────────────────────────────────────┘

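Actor-Critic vs. Pure Policy Gradient vs. Pure Value-Based Methods

Property	REINFORCE (pure policy)	Actor-Critic	DQN (pure value)
What is learned	$\pi(a|s; \theta)$	$\pi(a|s; \theta)$ + $q(s,a; w)$	$Q(s, a)$
Q-value source	Full-trajectory return	Critic network	Target network
Update timing	After the episode ends	Every time step	Every time step
Variance	High	Medium	Low
Applicable action spaces	Discrete + continuous	Discrete + continuous	Mainly discrete
Bias	Unbiased	Biased	Biased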

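Notation Summary

Symbol	Meaning
$\pi(a|s; \theta)$	Actor (policy network)
$\theta$	Actor network parameters
$\beta$	Actor learning rate
$q(s, a; w)$	Critic (value network)
$w$	Critic network parameters
$\alpha$	Critic learning rate
$V(s; \theta, w)$	Approximate state-value function
$y_t$	TD target
$\delta_t$	TD error
$\mathbf{d}_{w,t}$	Critic gradient
$\mathbf{d}_{\theta,t}$	Actor gradient
$\gamma$	Discount factor
$\tilde{a}_{t+1}$	Virtual sampled action (not executed)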

Core Formula Quick Reference

1. State-value approximation