Abstract
We propose a novel multi-objective reinforcement learning algorithm that learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and the solution concept. Our key insight is that a critic which learns a multi-variate distribution over the returns, combined with the rewards accumulated so far, allows us to optimize the utility function directly, even when it is non-linear. This vastly increases the range of problems that can be solved compared to single-objective methods or multi-objective methods that require linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.
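To make the mechanism concrete, here is a minimal sketch, not the paper's implementation: because a non-linear utility u satisfies E[u(R)] ≠ u(E[R]) in general, the critic must model the distribution of future returns rather than only their expectation. The particular utility function, the particle-based stand-in for the distributional critic, and all names below (`utility`, `future_return_particles`, `accrued_returns`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical non-linear utility over a 2-objective return vector,
# e.g. the first objective only pays off above a threshold.
def utility(returns):                       # returns: (..., 2) array
    treasure, fuel = returns[..., 0], returns[..., 1]
    return np.where(treasure >= 1.0, treasure + 0.1 * fuel, fuel)

# Stand-in for the distributional critic: particles approximating the
# multi-variate distribution of *future* returns from the current state
# (here just a fixed Gaussian for illustration).
rng = np.random.default_rng(0)
future_return_particles = rng.normal(loc=[1.2, -3.0],
                                     scale=[0.3, 1.0],
                                     size=(1000, 2))

# Rewards accumulated so far along the current trajectory.
accrued_returns = np.array([0.5, -1.0])

# Core idea: apply the utility to accrued + future returns, then average.
# Since u is non-linear, E[u(R)] != u(E[R]), so the full return
# distribution (not just its mean) is needed to score actions correctly.
expected_utility = utility(accrued_returns + future_return_particles).mean()
print(f"estimated expected utility: {expected_utility:.3f}")
```

Note the design point this illustrates: conditioning the utility on the accrued returns is what lets the agent optimize a non-linear utility directly, since the value of future rewards depends on what has already been collected.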
Original language | English
---|---
Article number | 23
Number of pages | 30
Journal | Autonomous Agents and Multi-Agent Systems
Volume | 37
Issue number | 2
DOIs |
Publication status | Published - 23 Apr 2023
Bibliographical note
Funding Information: Conor F. Hayes is funded by the University of Galway Hardiman Scholarship. This research was supported by funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" program.
Publisher Copyright:
© 2023, Springer Science+Business Media, LLC, part of Springer Nature.