MinHash

In computer science and data mining, MinHash (or the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was introduced by Andrei Broder at a 1997 conference,[1] and was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results.[2] It has also been applied to large-scale clustering problems, such as clustering documents by the similarity of their sets of words.[1]

  1. Broder, Andrei Z. (1998), "On the resemblance and containment of documents", Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), IEEE, pp. 21–29, CiteSeerX 10.1.1.24.779, doi:10.1109/SEQUEN.1997.666900, ISBN 978-0-8186-8132-5, S2CID 11748509.
  2. Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael (1998), "Min-wise independent permutations", Proc. 30th ACM Symposium on Theory of Computing (STOC '98), New York, NY, USA: Association for Computing Machinery, pp. 327–336, CiteSeerX 10.1.1.409.9220, doi:10.1145/276698.276781, ISBN 978-0897919623, S2CID 465847.