Interaction:
choose and execute an action with $\omega$ (the main $Q_{\omega}$ network's parameters) at state $S_t$
receive reward $r$ and the next state $S_{t+1}$
Update:
choose the argmax action at state $S_{t+1}$ using the main network's estimate $Q_{\omega}$
update the main $Q_{\omega}$ network toward a target computed with $\omega_{-}$ (the target $Q_{\omega_{-}}$ network's parameters):
$$Q_{\omega}(S_t,A_t) \leftarrow r+\gamma\, Q_{\omega_{-}}\big(S_{t+1},\arg\max_{a}Q_{\omega}(S_{t+1},a)\big)$$
every C steps:
$$\omega_{-} \leftarrow \omega$$
Summary: the main network selects the actions at both $S_t$ and $S_{t+1}$; the target network provides the value estimate used to update the main network, as in the sketch below.
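A minimal sketch of this target computation, assuming a PyTorch setup; `q_main` (parameters $\omega$), `q_target` (parameters $\omega_{-}$), and the tensor shapes are illustrative assumptions, not part of the original notes.

```python
import torch

# Sketch of the update step above (assumed PyTorch setup):
# q_main / q_target are hypothetical nn.Module Q-networks that map a
# batch of states to per-action values; reward is a 1-D float tensor.
def double_dqn_target(q_main, q_target, reward, next_state, gamma=0.99):
    with torch.no_grad():
        # main network (omega) selects the argmax action at S_{t+1}
        next_action = q_main(next_state).argmax(dim=1, keepdim=True)
        # target network (omega_-) evaluates the selected action
        next_value = q_target(next_state).gather(1, next_action).squeeze(1)
        # TD target: r + gamma * Q_{omega_-}(S_{t+1}, argmax_a Q_omega(S_{t+1}, a))
        return reward + gamma * next_value

def sync_target(q_main, q_target):
    # every C steps: copy the main parameters into the target network (omega_- <- omega)
    q_target.load_state_dict(q_main.state_dict())
```

The main network is then regressed toward this target, with gradients flowing only through $\omega$.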
When $Q^{*}(s,a)=V^{*}(s)$ for every action (no real advantage) and the estimation errors $\epsilon_a = Q_{\omega_{-}}(S_{t+1},a)-Q^{*}(S_{t+1},a)$ are independent and uniformly distributed on $(-1,1)$, then for an action space $A$ of size $m$ the expected DQN overestimation is
$$E[\max_{a}\epsilon_{a}]=E\Big[\max_aQ_{\omega_{-}}(S_{t+1},a)-\max_aQ^{*}(S_{t+1},a)\Big]=\frac{m-1}{m+1}$$
Proof:
$$\begin{align}P(\max_a\epsilon_a\le x)&=\prod^{m}_{a=1}P(\epsilon_a\le x) &&\text{(independence)}\\&=\left(\frac{1+x}{2}\right)^{m},\qquad x\in[-1,1]\end{align}$$
$$\begin{align}E[\max_{a}\epsilon_{a}]&=\int_{-1}^{1}x\frac{d}{dx}P(\max_a\epsilon_a\le x)\,dx\\&=\int_{-1}^{1}x\frac{d}{dx}\left(\frac{1+x}{2}\right)^{m}dx\\&=\frac{1}{2^{m}}\int_{-1}^{1}x\,\mathrm{d}\!\left[(1+x)^{m}\right]\\&=\frac{1}{2^{m}}\left[x(1+x)^{m}\Big|^{1}_{-1}-\int_{-1}^{1}(1+x)^{m}\,dx\right]\\&=\frac{1}{2^{m}}\left[x(1+x)^{m}-\frac{1}{m+1}(1+x)^{m+1}\right]^{1}_{-1}\\&=\frac{1}{2^{m}}\left[1\cdot 2^{m}-\frac{2^{m+1}}{m+1}\right]\\&=1-\frac{2}{m+1}=\frac{m-1}{m+1}\end{align}$$
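A quick Monte Carlo check of this closed form (NumPy; the seed, trial count, and the values of $m$ are arbitrary choices for illustration):

```python
import numpy as np

# Empirical check of E[max_a epsilon_a] = (m-1)/(m+1) for
# epsilon_a ~ Uniform(-1, 1), independent across the m actions.
rng = np.random.default_rng(0)
for m in (2, 4, 10, 100):
    eps = rng.uniform(-1.0, 1.0, size=(200_000, m))  # trials x actions
    simulated = eps.max(axis=1).mean()               # estimate of E[max_a epsilon_a]
    print(f"m={m:4d}  simulated={simulated:.4f}  closed form={(m - 1) / (m + 1):.4f}")
```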
Summary: with no real advantage and uniformly distributed errors, the larger the action space, the larger the overestimation (approaching the upper bound of a single error, 1). The no-real-advantage assumption is important here.