Abstract
In the stochastic multi-objective multi-armed bandit (MOMAB), each arm generates a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is no single optimal arm: under the Pareto dominance relation there is a set of optimal arms, the Pareto front, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly and evenly (exploitation).
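As a minimal sketch (illustrative, not taken from the paper), the Pareto front can be computed from the arms' mean-reward vectors with pairwise dominance checks; the function names below are hypothetical:

```python
import numpy as np

def dominates(u, v):
    """True if u Pareto-dominates v: at least as good in every
    objective and strictly better in at least one."""
    return np.all(u >= v) and np.any(u > v)

def pareto_front(means):
    """Indices of arms whose mean-reward vectors no other arm dominates."""
    return [i for i, u in enumerate(means)
            if not any(dominates(v, u) for j, v in enumerate(means) if j != i)]

# Example: 3 arms, 2 objectives. Arms 0 and 1 are incomparable
# (each is better on one objective); arm 2 is dominated by arm 1.
means = np.array([[0.8, 0.2], [0.5, 0.5], [0.4, 0.4]])
print(pareto_front(means))  # -> [0, 1]
```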
To trade off between exploration and exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) algorithm can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation.
In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front.
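A sketch of one round of Pareto Thompson sampling for Bernoulli rewards, assuming independent Beta(1, 1) priors per arm and objective and a uniform pick from the sampled front; these details are assumptions, not the paper's exact pseudocode:

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_front(vs):
    # Dominance-based Pareto front, repeated here from the first
    # sketch so this block is self-contained.
    dom = lambda u, v: np.all(u >= v) and np.any(u > v)
    return [i for i, u in enumerate(vs)
            if not any(dom(v, u) for j, v in enumerate(vs) if j != i)]

def pareto_thompson_step(successes, failures):
    """One round: sample a mean vector per arm from the Beta posteriors,
    then pull an arm drawn uniformly from the Pareto front of the samples.

    successes, failures: (n_arms, n_objectives) arrays of observed counts.
    """
    theta = rng.beta(successes + 1, failures + 1)  # posterior samples
    return rng.choice(pareto_front(theta))         # play a sampled-front arm

# Example: 3 arms, 2 objectives, a few observations so far.
s = np.array([[4., 1.], [2., 3.], [1., 1.]])
f = np.array([[1., 4.], [3., 2.], [4., 4.]])
print(pareto_thompson_step(s, f))
```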
We also propose the annealing-Pareto algorithm, which trades off between exploration and exploitation by combining a decaying parameter $\epsilon_{t}$ with the Pareto dominance relation: the decaying parameter drives exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front.
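A minimal sketch of the annealing idea, assuming an $\epsilon_{t}$-greedy-style rule with a $1/\sqrt{t}$ decay; both the schedule and the exact selection rule are assumptions rather than the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(1)

def pareto_front(vs):
    # Dominance-based Pareto front, as in the first sketch.
    dom = lambda u, v: np.all(u >= v) and np.any(u > v)
    return [i for i, u in enumerate(vs)
            if not any(dom(v, u) for j, v in enumerate(vs) if j != i)]

def annealing_pareto_step(mean_estimates, t):
    """One round: with probability eps_t pull a uniformly random arm
    (exploration); otherwise pull an arm from the Pareto front of the
    current mean estimates (exploitation).

    mean_estimates: (n_arms, n_objectives) empirical means.
    The 1/sqrt(t) schedule is an assumed decay, not the paper's choice.
    """
    eps_t = 1.0 / np.sqrt(t)
    if rng.random() < eps_t:
        return rng.integers(len(mean_estimates))    # explore
    return rng.choice(pareto_front(mean_estimates)) # exploit the front
```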
We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on problems with multi-objective Bernoulli reward distributions, and we conclude that annealing-Pareto is the best performing algorithm.
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) |
| Publisher | IEEE |
| Pages | 1-8 |
| ISBN (Electronic) | 978-1-4799-4552-8 |
| ISBN (Print) | 978-1-4799-4551-1 |
| Publication status | Published - 2014 |
| Event | 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Orlando, FL, United States, 9 Dec 2014 → 12 Dec 2014 |
Publication series

| Name | Proceedings of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) |
| --- | --- |
Conference

| Conference | 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL) |
| --- | --- |
| Country/Territory | United States |
| City | Orlando, FL |
| Period | 9/12/14 → 12/12/14 |
Keywords
- Multi-objective optimization
- Multi-armed bandit problems
- Annealing algorithm