Authors
- W. Bradley Knox*
- Stephane Hatgis-Kessell*
- Sigurdur Orn Adalgeirsson*
- Serena Booth*
- Anca Dragan*
- Peter Stone
- Scott Niekum*
* External authors
Venue
- AAAI 2024
Date
- 2024
Learning Optimal Advantage from Preferences and Mistaking it for Reward
Abstract
We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments---as used in reinforcement learning from human feedback (RLHF)---including those used to fine-tune ChatGPT and other contemporary language models. Most recent work on such algorithms assumes that human preferences are based only upon the reward accrued within those segments, which we call their partial return. But if this assumption is false because people base their preferences on information other than partial return, then what type of function is their algorithm learning from preferences? We argue that this function is better thought of as an approximation of the optimal advantage function, not as a partial return function as previously believed.
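As a minimal sketch of the distinction the abstract draws, the two preference models can be compared under a Bradley-Terry (logistic) choice rule: one scores each segment by its summed reward (partial return), the other by its summed optimal advantage. The numbers below are illustrative, not taken from the paper.

```python
import math

def bradley_terry(score_a, score_b):
    # Probability that segment A is preferred to segment B under a
    # Bradley-Terry / logistic choice model over segment scores.
    return 1.0 / (1.0 + math.exp(score_b - score_a))

# Two hypothetical trajectory segments, each a list of
# (reward, optimal_advantage) pairs per timestep (made-up values).
segment_a = [(0.0, 0.5), (1.0, 0.2)]
segment_b = [(1.0, -0.3), (0.0, -0.1)]

# Partial-return model: preferences assumed to depend on summed reward.
ret_a = sum(r for r, _ in segment_a)
ret_b = sum(r for r, _ in segment_b)
p_partial_return = bradley_terry(ret_a, ret_b)  # equal returns -> 0.5

# Advantage model: preferences depend on summed optimal advantage,
# so the same pair of segments need not look like a tie.
adv_a = sum(a for _, a in segment_a)
adv_b = sum(a for _, a in segment_b)
p_advantage = bradley_terry(adv_a, adv_b)
```

Here both segments accrue the same partial return, so the partial-return model predicts indifference, while the advantage-based model predicts a clear preference for segment A. Which model better matches human labelers is exactly the question the paper examines.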