HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.[1] Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, but can only approximate the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 109 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.[1] HyperLogLog is an extension of the earlier LogLog algorithm,[2] itself deriving from the 1984 Flajolet–Martin algorithm.[3]

  1. ^ a b Flajolet, Philippe; Fusy, Éric; Gandouet, Olivier; Meunier, Frédéric (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm" (PDF). Discrete Mathematics and Theoretical Computer Science Proceedings. AH. Nancy, France: 137–156. CiteSeerX 10.1.1.76.4286. Retrieved 2016-12-11.
  2. ^ Durand, M.; Flajolet, P. (2003). "LogLog counting of large cardinalities." (PDF). In G. Di Battista and U. Zwick (ed.). Lecture Notes in Computer Science. Annual European Symposium on Algorithms (ESA03). Vol. 2832. Springer. pp. 605–617.
  3. ^ Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications" (PDF). Journal of Computer and System Sciences. 31 (2): 182–209. doi:10.1016/0022-0000(85)90041-8.