Abstract

We consider streaming algorithms for k-median and k-means clustering in a data stream for a stream of n points and an aspect ratio of Delta. We show that the number of words in the space can be made independent of n and Delta, giving an algorithm with an optimal constant number of words for any constant k, dimension d, and accuracy parameter epsilon. Moreover, our algorithm also achieves an amortized time of poly(log log (n*Delta)) to process each stream item in the unit cost RAM model. Our approach is based on a new compression of a merge and reduce tree and has a number of other applications. For example, we improve the space and update time for computing approximate subspace embeddings in a stream.

Joint work with Vincent Cohen-Addad, Liudeng Wang, and David P. Woodruff.

Attachment

Video Recording