Abstract
The main story of this talk is how theoretical ideas on randomized sampling algorithms became part of production code at Twitter. The specific context is finding pairs of similar items: a classic algorithmic problem that is an integral part of recommendation systems. In most incarnations, it boils down to finding high inner products among a large collection of vectors, or alternately high entries in a matrix product. Despite a rich literature on this topic (and despite Twitter's significant compute resources), none of the existing methods scaled to "industrial sized" inputs, which exceed hundreds of billions of non-zeros. I will talk about a distributed algorithm for this problem, that combines low-dimension projections (hashes) with path-sampling techniques (wedges). There is some cute math behind the algorithm, and we were able to run it in production on Twitter's recommendation system. Joint work with Aneesh Sharma (Twitter) and Ashish Goel (Stanford).