



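Correlated samples cause problems for online Q-learning. Running several workers in parallel is one way to solve this, because each batch then mixes transitions from different trajectories.
Another solution is a replay buffer, which takes full advantage of the fact that Q-learning is off-policy: stored transitions can be reused no matter which policy collected them.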
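The replay-buffer idea can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the class name `ReplayBuffer`, the tuple layout `(s, a, r, s_next, done)`, and the capacity and batch size are all assumptions chosen for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        # deque with maxlen silently drops the oldest transition when full
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement: consecutive (correlated)
        # transitions are unlikely to land in the same batch.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


# Usage: fill with 250 dummy transitions, keep only the latest 100.
buffer = ReplayBuffer(capacity=100, seed=0)
for t in range(250):
    buffer.add(t, 0, 1.0, t + 1, False)
batch = buffer.sample(32)
```

Because Q-learning is off-policy, it does not matter that the sampled transitions were collected by older versions of the policy.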




There's still a problem: Q-learning is not gradient descent, because no gradient flows through the target value.
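To make this concrete, the online Q-learning update can be written as follows. The notation is assumed for illustration: parameters \phi, step size \alpha, discount \gamma, and a transition (s_i, a_i, s_i').

```latex
\phi \leftarrow \phi - \alpha \, \frac{dQ_\phi}{d\phi}(s_i, a_i)\,\bigl(Q_\phi(s_i, a_i) - y_i\bigr),
\qquad
y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a')
```

The target y_i itself depends on \phi, but the update treats it as a constant, so the step is not the gradient of any fixed objective.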


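Split the Q-function into two networks: a target network, which is held fixed for a while, and the evolving network, which is updated at every step.
This sacrifices some speed in exchange for better convergence.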
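A minimal tabular sketch of the target-network idea follows. The toy chain environment, the `step` function, and the constants `GAMMA`, `ALPHA`, and `SYNC_EVERY` are hypothetical and chosen only for illustration. A lookup table stands in for the neural network.

```python
import copy
import random

GAMMA = 0.9       # discount factor (assumed)
ALPHA = 0.5       # learning rate (assumed)
SYNC_EVERY = 10   # updates between target refreshes (assumed)

# Toy chain: states 0..3, state 3 is terminal.
# Action 0 moves right, action 1 stays; reaching state 3 gives reward 1.
def step(s, a):
    s_next = s + 1 if a == 0 else s
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done

rng = random.Random(0)
q = {s: [0.0, 0.0] for s in range(4)}   # evolving Q-table
q_target = copy.deepcopy(q)             # frozen copy used for targets

updates = 0
for episode in range(2000):
    s, done = 0, False
    while not done:
        a = rng.randrange(2)            # random behavior policy (off-policy)
        s_next, r, done = step(s, a)
        # The target is computed from the frozen copy, not from q itself.
        y = r if done else r + GAMMA * max(q_target[s_next])
        q[s][a] += ALPHA * (y - q[s][a])
        updates += 1
        if updates % SYNC_EVERY == 0:
            q_target = copy.deepcopy(q)  # periodic refresh of the target
        s = s_next
```

Between refreshes the regression target does not move, which is what makes each inner phase behave like ordinary supervised regression. The price is that new information reaches the targets only every `SYNC_EVERY` updates.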







Natural DQN overestimates Q-values.
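One way to see where the overestimation comes from is that the maximum over noisy estimates is biased upward, even when every individual estimate is unbiased. The sketch below assumes ten actions whose true Q-values are all zero and adds unit Gaussian noise. The numbers are illustrative, not taken from the lecture.

```python
import random

rng = random.Random(0)
NUM_ACTIONS = 10     # assumed number of actions
NUM_TRIALS = 5000    # number of independent noisy estimates

true_q = [0.0] * NUM_ACTIONS   # every action is truly worth 0
true_max = max(true_q)

total = 0.0
for _ in range(NUM_TRIALS):
    # Each estimate is unbiased: true value plus zero-mean noise.
    noisy_q = [q + rng.gauss(0.0, 1.0) for q in true_q]
    # The max picks whichever action happened to get the largest noise.
    total += max(noisy_q)

avg_max_estimate = total / NUM_TRIALS
print(f"true max = {true_max}, average max of noisy estimates = {avg_max_estimate:.2f}")
```

The average of the maxima comes out well above the true maximum of zero. The `max` in the Q-learning target introduces the same bias.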








Trouble arises in the left-or-right dilemma of steering around a tree to avoid hitting it.









CS294-112 Deep Reinforcement Learning, Fall semester (Berkeley): No. 6 Value functions introduction, No. 7 Advanced Q-learning
Original article: https://www.cnblogs.com/ecoflex/p/9094123.html