Learning with options that terminate off-policy

Anna Harutyunyan, Peter Vrancx, Pierre-Luc Bacon, Doina Precup, Ann Nowé

Research output: Chapter in Book/Report/Conference proceeding › Conference paper

15 Citations (Scopus)

Abstract

A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides the option's behavior, and the termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy well, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just as is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
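The idea of evaluating with respect to a *target* termination condition, regardless of how options actually terminated, can be illustrated with a small sketch. The function below is a hypothetical one-step tabular update written for this summary, not the paper's exact Q(β) algorithm: a target termination function `beta_target` mixes the bootstrap between continuing the current option and terminating it (taking a max over options), so the learned values reflect the desired termination condition rather than the behavior's.

```python
import numpy as np

def termination_offpolicy_step(Q, s, o, r, s_next, gamma, beta_target, alpha):
    """Illustrative one-step update toward a target termination condition.

    With weight beta_target(s_next) the option is treated as terminating in
    s_next (bootstrap with a max over options); with the remaining weight it
    is treated as continuing (bootstrap with the same option o). The names
    and signature here are assumptions made for illustration.
    """
    continue_val = (1.0 - beta_target(s_next)) * Q[s_next, o]   # option keeps running
    terminate_val = beta_target(s_next) * np.max(Q[s_next])     # option ends, pick best
    target = r + gamma * (continue_val + terminate_val)
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```

Setting `beta_target` to 1 everywhere recovers an intra-option Q-learning-style target where options terminate at every step, while `beta_target` near 0 values long, persistent options; the point of the paper is that this choice can be made independently of the terminations actually executed by the behavior.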

Original language: English
Title of host publication: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
Publisher: AAAI Press
Pages: 3173-3182
Number of pages: 10
ISBN (Electronic): 9781577358008
Publication status: Published - 1 Jan 2018
Event: 32nd AAAI Conference on Artificial Intelligence - New Orleans, United States
Duration: 2 Feb 2018 - 7 Feb 2018

Conference

Conference: 32nd AAAI Conference on Artificial Intelligence
Abbreviated title: AAAI 2018
Country/Territory: United States
City: New Orleans
Period: 2/02/18 - 7/02/18
