Abstract

Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of any query optimizer, and is often the main culprit when the optimizer chooses a poor plan.

In this talk, I will introduce LpBound, a pessimistic cardinality estimator that computes a guaranteed upper bound on the size of the query output. Besides common statistics used for cardinality estimation, such as the sizes of the input relations, the domain sizes of various columns, histograms, and the most common key values in join columns, LpBound also uses Lp norms on the degree sequences of the join columns. The computed cardinality bound is the optimal value of a linear program whose constraints encode these statistics and Shannon inequalities.
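As a minimal illustration of how Lp norms of degree sequences yield cardinality bounds (a sketch of my own, with hypothetical notation deg_R and deg_S, not the LpBound linear program itself): consider the two-way join Q(x,y,z) = R(x,y), S(y,z), and let deg_R(y) and deg_S(y) denote the number of tuples of R and S, respectively, with value y in the join column. Hölder's inequality bounds the join size by a product of Lp norms of the two degree sequences:

\[
  |Q| \;=\; \sum_{y} \mathrm{deg}_R(y)\,\mathrm{deg}_S(y)
  \;\le\; \|\mathrm{deg}_R\|_p \cdot \|\mathrm{deg}_S\|_q
  \qquad \text{for any } p, q \ge 1 \text{ with } \tfrac{1}{p} + \tfrac{1}{q} = 1.
\]

The choice (p, q) = (1, ∞) recovers the familiar bound |Q| ≤ |R| · max_y deg_S(y), while p = q = 2 gives the Cauchy-Schwarz bound ||deg_R||_2 · ||deg_S||_2. For general queries, LpBound derives such bounds systematically as the optimal value of the linear program mentioned above.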

I will also briefly report on an experimental evaluation of LpBound and of traditional, pessimistic, and learned cardinality estimators on a variety of queries from the JOB, STATS, and subgraph matching benchmarks. LpBound's overestimates stay within a much narrower range than the estimates of the other estimators, which can both underestimate and overestimate by orders of magnitude. When fed the LpBound cardinalities, PostgreSQL can evaluate queries faster than when using its own estimates.

(Joint work with Mahmoud Abo Khamis, Chris Mayer, Vasileios Nakos, Dan Suciu, and Haozhe Zhang.)