Data curation for LLM Pre-Training and Post-Training: Creating massive datasets for the AI open-source community with the Datacomp project.

Workshop

Joint IFML/MPG Symposium

Speaker(s)

Alex Dimakis (UT Austin)

Location

Calvin Lab Auditorium

Date

Monday, Nov. 18, 2024

Time

2:30 – 3:10 p.m. PT

Abstract

I will survey our recent work with the Datacomp community on curating the largest public datasets for multimodal models (Datacomp) and LLMs (DCLM), scaling to billions of images and trillions of tokens. I will then discuss why iterating on dataset curation, while holding the data pools and evaluations fixed, is a sound scientific method for data-centric AI. Finally, I will present our vision that dataset curation will become the way to create Small Specialized Models (SSMs) on internal datasets.

Attachment

IFML Slides - Alex Dimakis