Annealing-Pareto Multi-Objective Multi-Armed Bandit Algorithm

Saba Yahyaa, Madalina Drugan, Bernard Manderick

Research output: Conference paper

Abstract

In the stochastic multi-objective multi-armed bandit (MOMAB), arms generate a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is not a single optimal arm but a set of optimal arms, the Pareto front, identified by applying the Pareto dominance relation to the reward vectors, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off between exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off between exploration and exploitation by using a decaying parameter $\epsilon_{t}$ in combination with the Pareto dominance relation: the decaying parameter drives exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems, and we conclude that annealing-Pareto is the best-performing algorithm.
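The two ingredients the abstract combines can be sketched briefly. Below is a minimal illustration, not the paper's implementation: `pareto_dominates` and `pareto_front` implement the standard Pareto dominance relation on (estimated) mean-reward vectors, and `select_arm` shows an annealing-style rule in which a decaying parameter $\epsilon_t$ controls random exploration while exploitation picks uniformly among the current Pareto-optimal arms. The function names, the $\epsilon_t$ schedule, and the uniform tie-breaking are illustrative assumptions.

```python
import random

def pareto_dominates(u, v):
    """True if mean-reward vector u Pareto-dominates v:
    u is >= v in every objective and > v in at least one."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def pareto_front(means):
    """Indices of arms whose mean vectors are not dominated by any other arm."""
    return [i for i, u in enumerate(means)
            if not any(pareto_dominates(v, u)
                       for j, v in enumerate(means) if j != i)]

def select_arm(means, t, epsilon0=1.0, decay=0.99):
    """Annealing-style selection (illustrative schedule): with decaying
    probability epsilon_t, explore a uniformly random arm; otherwise
    exploit by choosing uniformly among the estimated Pareto-optimal arms."""
    epsilon_t = epsilon0 * (decay ** t)
    if random.random() < epsilon_t:
        return random.randrange(len(means))  # explore
    return random.choice(pareto_front(means))  # exploit the Pareto front
```

For example, with estimated means `[(0.9, 0.1), (0.1, 0.9), (0.5, 0.5), (0.4, 0.4)]`, the first three arms are mutually incomparable and form the Pareto front, while the last is dominated by `(0.5, 0.5)`; as $t$ grows and $\epsilon_t$ decays, `select_arm` samples almost exclusively from that front.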
Original language: English
Title of proceedings: Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)
Publisher: IEEE
Pages: 1-8
Electronic ISBN: 978-1-4799-4552-8
Print ISBN: 978-1-4799-4551-1
Status: Published - 2014
Event: Unknown - Orlando, FL, United States
Duration: 9 Dec 2014 - 12 Dec 2014

Publication series

Name: Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

Conference

Conference: Unknown
Country/Region: United States
City: Orlando, FL
Period: 9/12/14 - 12/12/14
