Abstract
In the stochastic multi-objective multi-armed bandit (MOMAB) problem, each arm generates a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is no single optimal arm but a set of optimal arms, the Pareto front, identified through the Pareto dominance relation on reward vectors, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG) or the Pareto upper confidence bound (Pareto-UCB1) algorithm can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation.
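To make the dominance relation concrete, here is a minimal sketch (our illustration, not code from the paper) of Pareto dominance between reward vectors and of extracting the Pareto front from a set of mean-reward estimates; the helper names `dominates` and `pareto_front` are ours.

```python
import numpy as np

def dominates(u, v):
    """True if reward vector u Pareto-dominates v: u is at least as
    good in every objective and strictly better in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(vectors):
    """Indices of the non-dominated (Pareto optimal) vectors."""
    return [i for i, u in enumerate(vectors)
            if not any(dominates(v, u)
                       for j, v in enumerate(vectors) if j != i)]

# Four arms, two objectives: arms 0 and 2 are Pareto optimal, since
# neither dominates the other and each dominates one remaining arm.
means = [[0.9, 0.2], [0.5, 0.1], [0.3, 0.8], [0.2, 0.7]]
print(pareto_front(means))  # -> [0, 2]
```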
In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front.
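As a rough sketch of how such a rule could operate on multi-objective Bernoulli arms (a plausible reading of the idea, not the paper's exact pseudocode, reusing `pareto_front` from the sketch above): keep independent Beta posteriors per arm and objective, sample a mean vector for every arm each round, and pull an arm drawn uniformly from the Pareto front of the samples.

```python
import numpy as np
# assumes pareto_front from the sketch above

rng = np.random.default_rng(0)
n_arms, n_obj, horizon = 4, 2, 1000
true_means = rng.uniform(size=(n_arms, n_obj))  # hidden Bernoulli parameters
alpha = np.ones((n_arms, n_obj))                # Beta posterior: successes + 1
beta = np.ones((n_arms, n_obj))                 # Beta posterior: failures + 1

for t in range(horizon):
    theta = rng.beta(alpha, beta)               # one sampled mean vector per arm
    arm = rng.choice(pareto_front(theta))       # fair draw among sampled-optimal arms
    reward = rng.binomial(1, true_means[arm])   # 0/1 reward per objective
    alpha[arm] += reward                        # conjugate posterior update
    beta[arm] += 1 - reward
```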
We also propose the annealing-Pareto algorithm, which trades off exploration and exploitation by combining a decaying parameter $\epsilon_{t}$ with the Pareto dominance relation: the decaying parameter drives exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front.
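The paper's exact annealing schedule is not reproduced here; a minimal $\epsilon$-greedy-style sketch of the idea, under our own assumptions (a $1/\sqrt{t}$ decay and uniform exploration, again reusing `pareto_front`), could look as follows.

```python
import numpy as np
# assumes pareto_front from the first sketch

rng = np.random.default_rng(1)
n_arms, n_obj, horizon = 4, 2, 1000
true_means = rng.uniform(size=(n_arms, n_obj))    # hidden Bernoulli parameters
counts = np.ones(n_arms)                          # one initial pull per arm (assumed)
sums = rng.binomial(1, true_means).astype(float)  # reward sums from those pulls

for t in range(horizon):
    eps_t = 1.0 / np.sqrt(t + 2)                  # assumed decay schedule for epsilon_t
    if rng.random() < eps_t:
        arm = rng.integers(n_arms)                # explore: uniformly random arm
    else:
        est = sums / counts[:, None]              # empirical mean-reward vectors
        arm = rng.choice(pareto_front(est))       # exploit: fair draw from the front
    sums[arm] += rng.binomial(1, true_means[arm])
    counts[arm] += 1
```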
We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems and conclude that annealing-Pareto is the best-performing algorithm.
| Original language | English |
|---|---|
| Title | proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) |
| Publisher | IEEE |
| Pages | 1-8 |
| Electronic ISBN | 978-1-4799-4552-8 |
| Print ISBN | 978-1-4799-4551-1 |
| Status | Published - 2014 |
| Event | Unknown - Orlando, FL, United States Duration: 9 Dec 2014 → 12 Dec 2014 |
Publication series
| Name | proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) |
|---|---|
Conference
| Conference | Unknown |
|---|---|
| Country/Territory | United States |
| City | Orlando, FL |
| Period | 9/12/14 → 12/12/14 |