Abstract

Language model pretraining has been a remarkably strong recipe for cross-task and cross-domain generalization in NLP. However, these gains have come at the expense of control: we rarely control the training data for language models, and gaps between pretraining and our target evaluation lead to distribution shifts. We present two complementary approaches to narrowing this gap: algorithmically filtering data to focus training on the most benchmark-relevant parts of the distribution, and adapting to new domains by synthesizing domain-specific pretraining data at scale.
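
As a rough, hypothetical illustration of the first idea (benchmark-targeted data filtering), the sketch below scores each candidate pretraining document by its word n-gram overlap with a small set of target-domain examples and keeps only the highest-scoring fraction. The n-gram cosine heuristic and all names here (ngram_counts, select_top_fraction, keep_fraction) are illustrative assumptions, not the specific algorithm presented in the talk.

```python
# Minimal sketch of benchmark-relevance filtering (illustrative only):
# score candidate documents against a small target-domain profile and
# keep the top fraction for pretraining.
from collections import Counter
import math


def ngram_counts(text, n=2):
    """Lowercased word n-gram counts for a document."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_top_fraction(candidates, target_examples, keep_fraction=0.1):
    """Keep the keep_fraction of candidate docs most similar to the target set."""
    target_profile = Counter()
    for example in target_examples:
        target_profile.update(ngram_counts(example))
    scored = sorted(
        candidates,
        key=lambda doc: cosine_similarity(ngram_counts(doc), target_profile),
        reverse=True,
    )
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]


if __name__ == "__main__":
    raw_corpus = [
        "the stock market rallied after the earnings report",
        "def train(model, data): return model.fit(data)",
        "the patient presented with acute chest pain",
    ]
    # Hypothetical target set resembling the downstream benchmark's domain.
    target = ["the patient was diagnosed with pneumonia after chest imaging"]
    print(select_top_fraction(raw_corpus, target, keep_fraction=0.34))
```

In practice, filtering like this would run over a large web-scale corpus with a stronger relevance signal (for example, a learned classifier or importance weights) rather than raw n-gram overlap; the sketch only conveys the shape of the pipeline.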
