Abstract
This talk presents our recent results on analyzing the trajectory and implicit bias of distributed training algorithms via SDE approximations. First, distributed deep learning requires a large batch size to fully exploit data parallelism, but how should the learning rate be tuned when the batch size changes? Our work derives SDE approximations for large-batch RMSprop and Adam and obtains the Square Root Scaling Rule (SRSR): learning rate ~ sqrt(batch size). Second, I will present results on understanding and improving the generalization of local gradient methods, including Local SGD. Local gradient methods are communication-efficient variants of standard data-parallel training, but Lin et al. (2020) observed that these imperfect surrogates can, surprisingly, generalize better. Our recent works analyze the implicit bias of Local SGD via a Slow SDE approximation and propose the Quadratic Synchronization Rule (QSR), which further improves generalization by dynamically setting the synchronization period ~ 1/(learning rate)^2 as the learning rate decays over time.
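As a rough illustration of the two scaling rules, here is a minimal sketch with made-up constants; the function names and the QSR coefficient are placeholders for exposition, not taken from the papers.

```python
import math

def scale_lr_srsr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Square Root Scaling Rule (SRSR) for RMSprop/Adam:
    when the batch size grows by a factor kappa, scale the learning
    rate by sqrt(kappa), i.e. learning rate ~ sqrt(batch size)."""
    kappa = new_batch / base_batch
    return base_lr * math.sqrt(kappa)

def sync_period_qsr(lr: float, coeff: float = 1e-4) -> int:
    """Quadratic Synchronization Rule (QSR) for Local SGD:
    synchronize every H ~ coeff / lr^2 local steps, so H grows
    as the learning rate decays. `coeff` is an illustrative constant."""
    return max(1, int(coeff / lr ** 2))

# Example: quadrupling the batch size doubles the learning rate,
# and halving the learning rate quadruples the synchronization period.
lr = scale_lr_srsr(base_lr=1e-3, base_batch=256, new_batch=1024)  # 2e-3
H_early = sync_period_qsr(lr=2e-3)  # 25 local steps between syncs
H_late = sync_period_qsr(lr=1e-3)   # 100 local steps between syncs
```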