Annealing-Pareto Multi-Objective Multi-Armed Bandit Algorithm

Saba Yahyaa, Madalina Drugan, Bernard Manderick

Research output: Chapter in Book/Report/Conference proceeding (Conference paper)

Abstract

In the stochastic multi-objective multi-armed bandit (MOMAB) problem, each arm generates a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is no longer a single optimal arm but a set of optimal arms, the Pareto front, determined by the Pareto dominance relation on the reward vectors, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly and evenly (exploitation). To trade off exploration against exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) algorithm can be used; they combine the KG-policy and the UCB1-policy, respectively, with the Pareto dominance relation.
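The Pareto dominance relation and the resulting Pareto front over mean reward vectors can be sketched as follows (illustrative code only, not from the paper; `dominates` and `pareto_front` are hypothetical helper names):

```python
import numpy as np

def dominates(u, v):
    """u Pareto-dominates v if u >= v in every objective and u > v in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(means):
    """Indices of arms whose mean reward vectors are not dominated by any other arm."""
    return [i for i, m in enumerate(means)
            if not any(dominates(other, m)
                       for j, other in enumerate(means) if j != i)]

# Example with 3 arms and 2 objectives: arm 1 is dominated by arm 0,
# so the Pareto front is {arm 0, arm 2}.
means = [[0.8, 0.3], [0.6, 0.2], [0.4, 0.9]]
print(pareto_front(means))  # -> [0, 2]
```

Note that all arms on the front are incomparable under Pareto dominance, which is why an MOMAB policy must also decide how to play *among* them.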
In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front.
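One way such a combination could look for multi-objective Bernoulli rewards is sketched below: sample one Beta-posterior draw per arm and objective, then pick uniformly among the arms whose sampled vectors are Pareto-optimal. This is a plausible reading of the idea, not the paper's exact algorithm; all names here are illustrative:

```python
import numpy as np

def pareto_thompson_select(successes, failures, rng):
    """successes, failures: (n_arms, n_objectives) counts of Bernoulli outcomes.
    Draw one posterior sample per arm/objective from Beta(s+1, f+1),
    then select uniformly among the sampled-Pareto-optimal arms."""
    theta = rng.beta(successes + 1.0, failures + 1.0)  # shape (n_arms, n_objectives)
    n = len(theta)
    front = [i for i in range(n)
             if not any(np.all(theta[j] >= theta[i]) and np.any(theta[j] > theta[i])
                        for j in range(n) if j != i)]
    return int(rng.choice(front))

# Toy usage: 3 arms, 2 objectives.
rng = np.random.default_rng(0)
succ = np.array([[8., 3.], [6., 2.], [4., 9.]])
fail = np.array([[2., 7.], [4., 8.], [6., 1.]])
arm = pareto_thompson_select(succ, fail, rng)
```

The posterior sampling itself provides the exploration, so no explicit exploration bonus is needed.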
We also propose the annealing-Pareto algorithm, which trades off exploration against exploitation by combining a decaying parameter $\epsilon_{t}$ with the Pareto dominance relation: the decaying parameter drives exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front.
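A minimal sketch of this explore/exploit structure, assuming a geometric decay schedule for $\epsilon_t$ (the paper's exact schedule and exploration rule may differ; function and parameter names are hypothetical):

```python
import numpy as np

def annealing_pareto_select(means, t, rng, eps0=1.0, decay=0.99):
    """With probability eps_t = eps0 * decay**t, explore a uniformly random arm;
    otherwise exploit by picking uniformly among the empirical Pareto front."""
    n = len(means)
    eps_t = eps0 * decay ** t  # annealed exploration probability
    if rng.random() < eps_t:
        return int(rng.integers(n))  # explore
    front = [i for i in range(n)
             if not any(np.all(means[j] >= means[i]) and np.any(means[j] > means[i])
                        for j in range(n) if j != i)]
    return int(front[rng.integers(len(front))])  # exploit the Pareto front

# As t grows, eps_t -> 0 and the policy plays Pareto-optimal arms almost always.
rng = np.random.default_rng(1)
means = np.array([[0.8, 0.3], [0.6, 0.2], [0.4, 0.9]])
arm = annealing_pareto_select(means, t=500, rng=rng, eps0=0.0)
```

With `eps0=0.0` the call above is purely exploiting, so the selected arm lies on the empirical Pareto front.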
We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems, and we conclude that annealing-Pareto is the best performing algorithm.
Original language: English
Title of host publication: Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)
Publisher: IEEE
Pages: 1-8
ISBN (Electronic): 978-1-4799-4552-8
ISBN (Print): 978-1-4799-4551-1
Publication status: Published - 2014
Event: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) - Orlando, FL, United States
Duration: 9 Dec 2014 - 12 Dec 2014

Publication series

Name: Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Conference

Conference: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)
Country/Territory: United States
City: Orlando, FL
Period: 9/12/14 - 12/12/14

Keywords

  • Multi-objective optimization
  • Multi-armed bandit problems
  • Annealing algorithm
