[Quick Review Series] A simple estimate of overestimation in DDQN and DQN (written in Markdown)

Time: 2023-02-04 21:15:52

Double DQN

Interaction:

choose and execute an action at state $S_t$ using the main network $Q_{\omega}$ (parameters $\omega$)


receive reward $r$ and observe next state $S_{t+1}$

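Update:

choose the greedy action at state $S_{t+1}$ using the main network: $a^{*}=\arg\max_a Q_{\omega}(S_{t+1},a)$

update the main network $Q_{\omega}$ toward a target evaluated by the target network $Q_{\omega_{-}}$ (parameters $\omega_{-}$):

$$Q_{\omega}(S_t,a_t)\leftarrow r+\gamma Q_{\omega_{-}}(S_{t+1},a^{*})$$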

every $C$ steps, copy the main network's parameters into the target network:

$$\omega_{-}\leftarrow\omega$$

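Summary: the main network selects the actions at both $S_t$ and $S_{t+1}$; the target network supplies the value estimate used to update the main network.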
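The update step above can be sketched for a single transition as follows. This is a minimal illustration, not a full training loop: the function names `double_dqn_target` and `dqn_target` and the example Q-values are made up for this sketch, and the networks are replaced by plain lists of per-action values at $S_{t+1}$.

```python
def double_dqn_target(reward, gamma, q_main_next, q_target_next):
    """Double DQN target for one transition (illustrative helper).

    q_main_next:   Q_w(S_{t+1}, a) for every action a, from the main network
    q_target_next: Q_{w-}(S_{t+1}, a) for every action a, from the target network
    """
    # Action selection uses the main network ...
    best_action = max(range(len(q_main_next)), key=lambda a: q_main_next[a])
    # ... while the value of that action is read from the target network.
    return reward + gamma * q_target_next[best_action]


def dqn_target(reward, gamma, q_target_next):
    """Plain DQN target: the target network both selects and evaluates."""
    return reward + gamma * max(q_target_next)


# Hypothetical per-action values at S_{t+1}.
q_main_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 0.5, 4.0]

y_double = double_dqn_target(1.0, 0.9, q_main_next, q_target_next)
y_dqn = dqn_target(1.0, 0.9, q_target_next)
print(y_double, y_dqn)
```

Here the main network prefers action 1, so the Double DQN target uses the target network's value for action 1 rather than the target network's own maximum.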

DQN overestimation

Assume $Q^{*}(s,a)=V^{*}(s)$ for every action (no real advantage), and that the estimation errors $\epsilon_a=Q_{\omega_{-}}(s,a)-Q^{*}(s,a)$ are independent and uniformly distributed on $(-1,1)$.

If the action space $A$ has size $m$, the expected DQN overestimation is:

$$E[\max_aQ_{\omega_{-}}(S_{t+1},a)-\max_aQ^{*}(S_{t+1},a)]=E[\max_{a}\epsilon_{a}]=\frac{m-1}{m+1}$$

Proof:

$$\begin{aligned}P(\max_a\epsilon_a\le x)&=\prod^{m}_{a=1}P(\epsilon_a\le x) \\&=\left(\frac{1+x}{2}\right)^{m},\quad x\in[-1,1]\end{aligned}$$

$$\begin{aligned}E[\max_{a}\epsilon_{a}]&=\int_{-1}^{1}x\frac{d}{dx}P(\max_a\epsilon_a\le x)\,dx\\&=\int_{-1}^{1}x\frac{d}{dx}\left(\frac{1+x}{2}\right)^{m}dx\\&=\frac{1}{2^{m}}\int_{-1}^{1}x\,d\left[(1+x)^{m}\right]\\&=\frac{1}{2^{m}}\left[x(1+x)^{m}\Big|^{1}_{-1}-\int_{-1}^{1}(1+x)^{m}\,dx\right]\\&=\frac{1}{2^{m}}\left[x(1+x)^{m}-\frac{1}{m+1}(1+x)^{m+1}\right]^{1}_{-1}\\&=\frac{1}{2^{m}}\left[1\cdot 2^{m}-\frac{2^{m+1}}{m+1}\right]\\&=1-\frac{2}{m+1}=\frac{m-1}{m+1}\end{aligned}$$

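Summary: with no real advantage and uniformly distributed errors, the larger the action space, the larger the overestimation (it approaches 1, the upper bound of a single error).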
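The closed form can be sanity-checked numerically. The sketch below is a plain Monte Carlo estimate under the same assumptions (independent errors, uniform on $(-1,1)$); the function name `mean_max_error`, the trial count, and the seed are arbitrary choices for this illustration.

```python
import random


def mean_max_error(m, trials=50_000, seed=0):
    """Monte Carlo estimate of E[max_a eps_a] for m i.i.d. errors eps_a ~ U(-1, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(-1.0, 1.0) for _ in range(m))
    return total / trials


for m in (2, 5, 10):
    estimate = mean_max_error(m)
    exact = (m - 1) / (m + 1)
    # The two numbers should agree up to sampling noise.
    print(m, round(estimate, 3), round(exact, 3))
```

The estimate and the exact value should grow together with $m$, which matches the summary above.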

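The no-real-advantage assumption matters here: only when all true action values are equal does the error of the max reduce exactly to $\max_a\epsilon_a$. When the true values differ, $\max_a[Q^{*}(s,a)+\epsilon_a]-\max_aQ^{*}(s,a)\le\max_a\epsilon_a$, so $\frac{m-1}{m+1}$ is an upper bound on the expected overestimation rather than its exact value.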
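To see how much this assumption drives the result, one can rerun the simulation with unequal true values. This is a hypothetical setup for illustration only: the function name `mean_overestimation` and the chosen true values (four actions at 0 and one at 3) are not from the derivation above.

```python
import random


def mean_overestimation(true_q, trials=50_000, seed=0):
    """Average of max_a (Q*(a) + eps_a) - max_a Q*(a), with eps_a ~ U(-1, 1)."""
    rng = random.Random(seed)
    best_true = max(true_q)
    total = 0.0
    for _ in range(trials):
        noisy = [q + rng.uniform(-1.0, 1.0) for q in true_q]
        total += max(noisy) - best_true
    return total / trials


# Case 1: no real advantage, all five true values are equal.
flat = mean_overestimation([0.0, 0.0, 0.0, 0.0, 0.0])

# Case 2: one action is better by more than the error range,
# so noise can never make another action look best.
spread = mean_overestimation([0.0, 0.0, 0.0, 0.0, 3.0])

print(round(flat, 3), round(spread, 3))
```

In the first case the bias should be close to $\frac{m-1}{m+1}$ with $m=5$; in the second case the max always picks the truly best action, so the bias is just the mean of a single error and should be close to zero.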


Source: Bilibili