Abstract

As we push the limits of the size and performance of large language models, there is an increasing focus on how well we, as a community, can predict the performance of models with respect to scale. Although well-established literature exists on how pretraining performance scales with compute and data, the literature on how particular downstream benchmarks and capabilities scale is significantly muddier. In this work, we take a step back and ask: Why has predicting specific downstream capabilities with scale remained elusive? While many factors are certainly responsible, we identify how discontinuous metrics (e.g., exact string match, accuracy) explain the “unpredictable” performance of models, leading to inferences of epiphenomena such as so-called emergence. In contrast, we show how continuous metrics lead to more predictable benchmarks. This work is a first step towards a science of scaling-predictable and human-interpretable evaluation.

Video Recording