Abstract

The dominant practice of AI alignment assumes (1) that preferences are an adequate representation of human values, (2) that human rationality can be understood in terms of maximizing the satisfaction of preferences (represented as a utility or reward function), and (3) that AI systems should be aligned with the preferences of one or more humans to ensure that they behave safely and in accordance with our values. This talk critically examines these assumptions, arguing that preferences should not be understood as the basis of human welfare or of aligned AI behavior. Instead, preference judgments provide only one source of data about the goals, values, and norms that humans truly care about, and AI should be aligned with the goals and normative standards that we agree are appropriate for each type of AI system. As an example, I will illustrate how we can build AI assistants that infer the goals and norms of human principals by modeling humans as taking actions on the basis of those goals and norms. Such assistants can rapidly adapt to context-specific goals and norms while complying with the meta-norms of safe and helpful assistance.
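
To give a flavor of the kind of inference described above, here is a minimal, purely illustrative sketch of Bayesian goal inference from observed actions, assuming a Boltzmann-rational model of the human. The goal space, action utilities, and rationality parameter `BETA` are hypothetical choices for illustration, not details from the talk; inferring norms alongside goals would follow an analogous pattern.

```python
import math

# Hypothetical goal and action spaces (illustrative only).
GOALS = ["make_coffee", "make_tea"]
ACTIONS = ["grab_mug", "grind_beans", "boil_water", "fetch_teabag"]

# Assumed utility of each action under each goal (higher = more useful).
UTILITY = {
    "make_coffee": {"grab_mug": 1.0, "grind_beans": 2.0, "boil_water": 1.5, "fetch_teabag": 0.0},
    "make_tea":    {"grab_mug": 1.0, "grind_beans": 0.0, "boil_water": 1.5, "fetch_teabag": 2.0},
}
BETA = 2.0  # Boltzmann rationality: higher means more reliably utility-maximizing.

def action_likelihood(action, goal):
    """P(action | goal) under a Boltzmann-rational model of the human."""
    weights = {a: math.exp(BETA * UTILITY[goal][a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def infer_goal(observed_actions, prior=None):
    """Posterior over goals after observing a sequence of human actions."""
    posterior = dict(prior) if prior else {g: 1.0 / len(GOALS) for g in GOALS}
    for action in observed_actions:
        posterior = {g: posterior[g] * action_likelihood(action, g) for g in GOALS}
        total = sum(posterior.values())
        posterior = {g: p / total for g, p in posterior.items()}
    return posterior

# After seeing the human grab a mug and grind beans, most posterior mass
# falls on "make_coffee"; an assistant could adapt its help accordingly,
# subject to overriding safety norms.
print(infer_goal(["grab_mug", "grind_beans"]))
```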