TY - GEN
T1 - Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
AU - Steckelmacher, Denis
AU - Plisnier, Hélène
AU - Roijers, Diederik
AU - Nowé, Ann
PY - 2020
Y1 - 2020
AB - Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks.
UR - http://www.scopus.com/inward/record.url?scp=85084826247&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-46133-1_2
DO - 10.1007/978-3-030-46133-1_2
M3 - Conference paper
SN - 978-3-030-46132-4
VL - 11908
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 19
EP - 34
BT - Machine Learning and Knowledge Discovery in Databases
A2 - Brefeld, Ulf
A2 - Fromont, Elisa
A2 - Hotho, Andreas
A2 - Knobbe, Arno
A2 - Maathuis, Marloes
A2 - Robardet, Céline
PB - Springer
T2 - European Conference on Machine Learning 2019
Y2 - 16 September 2019 through 20 September 2019
ER -