Abstract

Currently, answering a new question about a model requires an enormous amount of effort by experts. Researchers must formalize their question, formulate hypotheses about a model’s decision-making process, design datasets on which to evaluate model behavior, then use these datasets to refine and validate hypotheses. Consequently, intensive explanatory auditing is beyond the reach of most model users and providers, and applications of mechanistic interpretability are bottlenecked by the need for human labor. How can we usefully automate and scale model interpretation?

I will introduce Automated Interpretability Agents (AIAs) that, given a question about a model of interest, design and perform experiments on the model to answer the question. This paradigm encompasses both behavioral testing (as commonly applied in fairness and safety applications) and more basic, mechanistic research questions. AIAs are built from language models equipped with tools and compose interpretability subroutines into Python programs. They operationalize hypotheses about models as code, and update those hypotheses after observing model behavior on inputs for which competing hypotheses make different predictions. AIAs are designed modularly so that their toolkit can evolve as bottom-up work introduces new interpretability techniques and as users encounter new applications. I will present recent work showing that AIAs reach human-level performance on a variety of model understanding tasks. My hope is that this line of research helps lay the groundwork for a richer interface for interpretability: one that is iterative and modular, allows real-time testing of hypotheses, and scales to large and complex models.
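
To make the hypothesis-as-code loop concrete, here is a minimal, purely illustrative Python sketch. All names in it (subject_model, run_experiment, the candidate hypotheses, and the probe inputs) are hypothetical stand-ins for this abstract, not the actual AIA toolkit: an AIA would use a language model to propose the hypotheses and the discriminating inputs, and would iterate this loop rather than run it once.

```python
# Illustrative sketch only: hypothetical names, not the AIA implementation.
from dataclasses import dataclass
from typing import Callable, List

# A hypothesis about the model, operationalized as code: a predicate over inputs.
Hypothesis = Callable[[str], bool]

@dataclass
class Experiment:
    inputs: List[str]
    predictions: List[bool]   # what the hypothesis predicts
    observations: List[bool]  # what the model actually does

def subject_model(x: str) -> bool:
    """Hypothetical stand-in for the model under study (e.g., does a unit fire on x?)."""
    return "dog" in x or "puppy" in x

def run_experiment(hypothesis: Hypothesis, inputs: List[str]) -> Experiment:
    """Compare the hypothesis's predictions against observed model behavior."""
    predictions = [hypothesis(x) for x in inputs]
    observations = [subject_model(x) for x in inputs]
    return Experiment(inputs, predictions, observations)

def agreement(exp: Experiment) -> float:
    """Fraction of inputs on which prediction and observation agree."""
    return sum(p == o for p, o in zip(exp.predictions, exp.observations)) / len(exp.inputs)

# Two candidate hypotheses about the same unit.
h_animals: Hypothesis = lambda x: any(w in x for w in ["dog", "puppy", "cat", "horse"])
h_dogs: Hypothesis = lambda x: any(w in x for w in ["dog", "puppy"])

# Probe inputs chosen because the two hypotheses make different predictions on them,
# so the observed behavior discriminates between them.
probe_inputs = ["a cat on the mat", "a horse in a field", "a sleeping puppy", "a dog park"]

for name, h in [("animals", h_animals), ("dogs", h_dogs)]:
    exp = run_experiment(h, probe_inputs)
    print(name, agreement(exp))  # the agent keeps the better-supported hypothesis and refines it
```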

Video Recording