TY - GEN
T1 - Maximum Entropy Bayesian Actor Critic
AU - Homer, Steven Thomas
PY - 2019/11/6
Y1 - 2019/11/6
N2 - In recent years, Deep Reinforcement Learning has achieved human-like performance or better on a variety of benchmarks such as the Atari Arcade; however, Deep RL often has problems with sample efficiency and convergence brittleness. That is, to learn even the simplest tasks, Deep RL requires a huge number of meaningful samples and will only converge if the parameters are tuned just right. This paper seeks to ameliorate these problems of sample inefficiency and convergence brittleness by combining two different reinforcement learning paradigms: Bayesian RL and Maximum Entropy RL. Bayesian reinforcement learning utilizes Bayesian statistics to model the confidence in a given model, which has been shown to greatly increase sample efficiency. Maximum entropy RL is a technique that modifies the standard reward to promote more exploration by the agent. The hope is that combining the two retains the best of both properties and avoids the problems faced in deep RL altogether. This paper first derives a soft policy gradient that introduces an entropy-weighted term into the standard policy gradient, and then applies this to the Bayesian actor-critic paradigm, augmenting the parameter update rule to account for the entropy-weighted value function. After determining a closed-form solution of the gradient with the softmax policy, the method was implemented and evaluated on the Cartpole environment, signalling that there are avenues ripe for future research in this area.
AB - In recent years, Deep Reinforcement Learning has achieved human-like performance or better on a variety of benchmarks such as the Atari Arcade; however, Deep RL often has problems with sample efficiency and convergence brittleness. That is, to learn even the simplest tasks, Deep RL requires a huge number of meaningful samples and will only converge if the parameters are tuned just right. This paper seeks to ameliorate these problems of sample inefficiency and convergence brittleness by combining two different reinforcement learning paradigms: Bayesian RL and Maximum Entropy RL. Bayesian reinforcement learning utilizes Bayesian statistics to model the confidence in a given model, which has been shown to greatly increase sample efficiency. Maximum entropy RL is a technique that modifies the standard reward to promote more exploration by the agent. The hope is that combining the two retains the best of both properties and avoids the problems faced in deep RL altogether. This paper first derives a soft policy gradient that introduces an entropy-weighted term into the standard policy gradient, and then applies this to the Bayesian actor-critic paradigm, augmenting the parameter update rule to account for the entropy-weighted value function. After determining a closed-form solution of the gradient with the softmax policy, the method was implemented and evaluated on the Cartpole environment, signalling that there are avenues ripe for future research in this area.
UR - http://www.scopus.com/inward/record.url?scp=85075070688&partnerID=8YFLogxK
M3 - Conference paper
VL - 2491
T3 - CEUR Workshop Proceedings
BT - BNAIC/Benelearn 2019
T2 - BNAIC 2019
Y2 - 7 November 2019 through 8 November 2019
ER -