Abstract

I will survey our recent work with the Datacomp community on curating the largest public datasets for multimodal models (Datacomp) and LLMs (DCLM), scaling to billions of images and trillions of tokens. I will then discuss why iterating on dataset curation while holding the data pool and evaluations fixed is a sound scientific method for data-centric AI. Finally, I will present our vision that dataset curation will become the way to create Small Specialized Models (SSMs) on internal datasets.
