Maximum Entropy Bayesian Actor Critic

Research output: Conference paper (Research)

Abstract

In recent years, Deep Reinforcement Learning has achieved human-level performance or better on a variety of benchmarks such as the Atari Arcade; however, Deep RL often suffers from sample inefficiency and convergence brittleness. That is, to learn even the simplest tasks, Deep RL requires a huge number of meaningful samples, and will only converge if the parameters are tuned just right. This paper seeks to ameliorate these problems of sample inefficiency and convergence brittleness by combining two different reinforcement learning paradigms: Bayesian RL and Maximum Entropy RL.

Bayesian reinforcement learning utilizes Bayesian statistics to model the confidence in a given model, which has been shown to greatly increase sample efficiency. Maximum entropy RL is a technique that modifies the standard reward to promote more exploration in the agent. Hopefully, combining the two will retain the best of both of these properties and avoid the problems faced in deep RL altogether.
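
To make the phrase "modifies the standard reward" concrete, a commonly used form of the maximum entropy objective is sketched below. The abstract does not state the paper's exact formulation, so the symbols here (the temperature \alpha, the discount factor \gamma, and the entropy \mathcal{H}) are assumptions introduced for illustration.

```latex
J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s)
```

Setting \alpha = 0 recovers the standard expected discounted return; larger values of \alpha reward the agent for keeping its policy stochastic, which is what promotes exploration.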

This paper first derives a soft policy gradient that introduces an entropy-weighted term to the standard policy gradient function, and then applies this to the Bayesian actor critic paradigm to augment the parameter update rule to account for the entropy-weighted value function. After determining a closed-form solution of the gradient with the softmax policy, the method was implemented and evaluated on the Cartpole environment, signalling that there are avenues ripe for future research in this area.
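
As a rough illustration of why a softmax policy admits a closed-form gradient, the sketch below computes, for a single sample, a score-function term plus an entropy term with respect to the action logits. It is a simplified entropy-bonus form under stated assumptions, not the paper's derivation or its Bayesian update rule; the function names, the `advantage` input, and the temperature `alpha` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def grad_log_prob(probs, action):
    """Closed form: d/dz_j log pi(action) = 1[j == action] - pi_j."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]


def grad_entropy(probs):
    """Closed form: d/dz_j H(pi) = -pi_j * (log pi_j + H(pi))."""
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]


def soft_gradient_sample(logits, action, advantage, alpha):
    """Per-sample gradient w.r.t. the logits (assumed form):
    advantage * grad log pi(action) + alpha * grad H(pi)."""
    probs = softmax(logits)
    g_logp = grad_log_prob(probs, action)
    g_ent = grad_entropy(probs)
    return [advantage * gl + alpha * ge for gl, ge in zip(g_logp, g_ent)]
```

For an environment with two discrete actions, such as Cartpole, `logits` would have length two; with `alpha` set to 0 the result reduces to the ordinary score-function term, and at a uniform policy the entropy term vanishes because entropy is already at its maximum.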
Original language: English
Title: BNAIC/Benelearn 2019
Number of pages: 12
Volume: 2491
Status: Published - 6 Nov 2019
Event: BNAIC 2019 - Brussels, Belgium
Duration: 7 Nov 2019 to 8 Nov 2019

Publication series

Name: CEUR Workshop Proceedings
ISSN (print): 1613-0073

Conference

Conference: BNAIC 2019
Country/Region: Belgium
City: Brussels
Period: 7/11/19 to 8/11/19
