Maximum Entropy Bayesian Actor Critic

Research output: Chapter in Book/Report/Conference proceeding > Conference paper > Research

Abstract

In recent years, deep reinforcement learning (deep RL) has achieved human-like performance or better on a variety of benchmarks such as the Atari Arcade; however, it often suffers from poor sample efficiency and brittle convergence. That is, to learn even the simplest tasks, deep RL requires a huge number of meaningful samples and converges only if the parameters are tuned just right. This paper seeks to ameliorate these problems of sample inefficiency and convergence brittleness by combining two reinforcement learning paradigms: Bayesian RL and maximum entropy RL.

Bayesian reinforcement learning uses Bayesian statistics to model the confidence in a given model, which has been shown to greatly increase sample efficiency. Maximum entropy RL modifies the standard reward to promote more exploration by the agent. The hope is that combining the two retains the best properties of both and avoids the problems faced in deep RL altogether.
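
For reference, one common way to write the maximum entropy objective is sketched below. It is a standard textbook form and not necessarily the exact one used in the paper. Each reward is augmented with the entropy of the policy at the visited state, weighted by a temperature $\alpha$; the symbols $\alpha$, $\gamma$ and $\mathcal{H}$ are assumptions introduced here for illustration.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}
  \Big( r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s)
```

Setting $\alpha = 0$ recovers the standard discounted return, while larger values of $\alpha$ favour more stochastic policies and therefore more exploration.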

This paper first derives a soft policy gradient that introduces an entropy-weighted term to the standard policy gradient function, and then applies this to the Bayesian actor critic paradigm, augmenting the parameter update rule to account for the entropy-weighted value function. After a closed-form solution of the gradient with the softmax policy is determined, the method is implemented and evaluated on the CartPole environment, signalling that there are avenues ripe for future research in this area.
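
To make the idea of a closed-form, entropy-weighted gradient under a softmax policy concrete, here is a minimal sketch for a much simpler setting than the paper's: a single state with fixed action values and an entropy bonus. It is not the paper's Bayesian actor critic update. The function names, the temperature `alpha`, and the example numbers are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action preferences."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_objective(logits, q_values, alpha):
    """One-step objective (an assumption): E_pi[q(a)] + alpha * H(pi)."""
    probs = softmax(logits)
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)


def soft_objective_grad(logits, q_values, alpha):
    """Closed-form gradient of soft_objective with respect to the logits.

    For pi = softmax(theta):
      d E_pi[q] / d theta_k = pi_k * (q_k - E_pi[q])
      d H(pi)   / d theta_k = -pi_k * (log pi_k + H(pi))
    """
    probs = softmax(logits)
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    h = entropy(probs)
    grad = []
    for p, q in zip(probs, q_values):
        reward_term = p * (q - expected_q)
        # Guard against log(0) if a probability underflows to zero.
        entropy_term = -p * (math.log(p) + h) if p > 0.0 else 0.0
        grad.append(reward_term + alpha * entropy_term)
    return grad


def finite_difference_grad(logits, q_values, alpha, eps=1e-6):
    """Central finite differences, used only to sanity-check the closed form."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = soft_objective(up, q_values, alpha) - soft_objective(down, q_values, alpha)
        grad.append(diff / (2.0 * eps))
    return grad


if __name__ == "__main__":
    theta = [0.2, -0.5, 1.0]   # hypothetical action preferences
    q = [1.0, 0.0, 0.5]        # hypothetical action values
    print(soft_objective_grad(theta, q, alpha=0.1))
    print(finite_difference_grad(theta, q, alpha=0.1))
```

With `alpha = 0` the entropy term vanishes and the expression reduces to the ordinary softmax policy gradient for this one-step case. A positive `alpha` adds a pull toward the uniform policy, which is the exploration effect described in the abstract.
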
Original language: English
Title of host publication: BNAIC/Benelearn 2019
Number of pages: 12
Volume: 2491
Publication status: Published - 6 Nov 2019
Event: BNAIC 2019 - Brussels, Belgium
Duration: 7 Nov 2019 → 8 Nov 2019

Publication series

Name: CEUR Workshop Proceedings
ISSN (Print): 1613-0073

Conference

Conference: BNAIC 2019
Country: Belgium
City: Brussels
Period: 7/11/19 → 8/11/19
