Abstract
Frequency estimation in data streams is one of the classical problems in streaming algorithms. Following much research, there are now almost matching upper and lower bounds for the trade-off needed between the number of samples and the space complexity of the algorithm, when the data streams are adversarial. However, in the case where the data stream is given in a random order, or is stochastic, only weaker lower bounds exist. In this talk, I will present a tight lower bound for the space complexity of frequency estimation in the random order. To prove this lower bound, we consider the needle problem, which is a naturally hard problem for frequency estimation studied (Andoni et al. 2008, Crouch et al. 2016). Here, the goal is to distinguish between two distributions over data streams with $t$ samples. The first is uniform over a large enough domain. The second is a planted model; a secret ''needle'' is uniformly chosen, and then each element in the stream equals the needle with probability $p$, and otherwise is uniformly chosen from the domain. It is simple to design streaming algorithms that distinguish the distributions using space $s \approx 1/(p^2 t)$. We show this lower bound is actually almost tight. This talk is based on joint works with Shachar Lovett, Qian Li and Shuo Wang.