Fingerprint (computing)

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes,^[1] just as human fingerprints uniquely identify people for practical purposes. Fingerprints may be used for data deduplication. The technique is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting.
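As a rough illustration of deduplication by fingerprint, the sketch below groups byte strings by a short digest. It assumes BLAKE2b truncated to 16 bytes as a stand-in fingerprint function; the names `fingerprint`, `deduplicate`, and the sample file names are hypothetical and not taken from any particular system.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes, size: int = 16) -> bytes:
    """Map arbitrarily long data to a short, fixed-size bit string."""
    # Assumption: truncated BLAKE2b stands in for the fingerprint function.
    return hashlib.blake2b(data, digest_size=size).digest()


def deduplicate(items: dict) -> dict:
    """Group item names by the fingerprint of their content."""
    groups = defaultdict(list)
    for name, data in items.items():
        groups[fingerprint(data)].append(name)
    return dict(groups)


# Hypothetical sample data: two of the three "files" have identical content.
files = {
    "report.txt": b"quarterly numbers" * 1000,
    "report_copy.txt": b"quarterly numbers" * 1000,
    "notes.txt": b"meeting notes",
}

groups = deduplicate(files)
duplicates = [names for names in groups.values() if len(names) > 1]
print(duplicates)
```

Only the 16-byte fingerprints are compared, never the full contents, which is the point of the technique. A system that cannot tolerate even an unlikely collision could confirm a match with a byte-by-byte comparison.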

Fingerprints are typically used to avoid the comparison and transmission of bulky data. For instance, a web browser or proxy server can efficiently check whether a remote file has been modified by fetching only its fingerprint and comparing it with that of the previously fetched copy.^[2]^[3]^[4]^[5]^[6]
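The cache check described above might be sketched as follows. This is a simplified model under stated assumptions: `Origin` is a hypothetical in-memory stand-in for a remote server, `Cache` is a hypothetical client, and SHA-256 is used only as a convenient fingerprint function.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    # Assumption: SHA-256 serves as the fingerprint function here.
    return hashlib.sha256(data).hexdigest()


class Origin:
    """Hypothetical in-memory stand-in for a remote server."""

    def __init__(self):
        self.files = {}

    def fetch_fingerprint(self, name: str) -> str:
        # Cheap request: only the short fingerprint crosses the network.
        return fingerprint(self.files[name])

    def fetch_body(self, name: str) -> bytes:
        # Expensive request: the whole file crosses the network.
        return self.files[name]


class Cache:
    """Hypothetical client that refetches a file only when its fingerprint changes."""

    def __init__(self, origin: Origin):
        self.origin = origin
        self.entries = {}  # name -> (fingerprint, body)
        self.full_transfers = 0

    def get(self, name: str) -> bytes:
        remote_fp = self.origin.fetch_fingerprint(name)
        cached = self.entries.get(name)
        if cached is not None and cached[0] == remote_fp:
            return cached[1]  # unchanged: reuse the local copy
        body = self.origin.fetch_body(name)
        self.full_transfers += 1
        self.entries[name] = (remote_fp, body)
        return body


origin = Origin()
origin.files["index.html"] = b"<p>version 1</p>"
cache = Cache(origin)

cache.get("index.html")  # first access: full transfer
cache.get("index.html")  # unchanged: fingerprint check only
origin.files["index.html"] = b"<p>version 2</p>"
cache.get("index.html")  # modified: full transfer again
print(cache.full_transfers)
```

In this model the second request costs one fingerprint lookup rather than a full download.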

Fingerprint functions may be seen as high-performance hash functions used to uniquely identify substantial blocks of data in applications where cryptographic hash functions may be unnecessary.
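To make the contrast with cryptographic hashes concrete, here is a minimal sketch of a fast, non-cryptographic fingerprint in the style of Karp-Rabin polynomial hashing. It is not Rabin's original scheme, which works with polynomials over GF(2); the fixed `BASE`, the choice of `MODULUS`, and the name `poly_fingerprint` are assumptions made for illustration.

```python
MODULUS = (1 << 61) - 1  # the Mersenne prime 2^61 - 1
BASE = 257               # assumption: fixed base; often chosen at random in practice


def poly_fingerprint(data: bytes, base: int = BASE, modulus: int = MODULUS) -> int:
    """Evaluate the bytes as a polynomial in `base`, reduced modulo a prime."""
    fp = 0
    for byte in data:
        # Adding 1 keeps zero bytes from vanishing, so b"\x00" and
        # b"\x00\x00" get different fingerprints.
        fp = (fp * base + byte + 1) % modulus
    return fp


print(poly_fingerprint(b"hello world") == poly_fingerprint(b"hello world"))
print(poly_fingerprint(b"hello world") == poly_fingerprint(b"hello worle"))
```

A function like this is cheap to compute and makes accidental collisions unlikely, but it offers no protection against an adversary who constructs colliding inputs on purpose. Resisting that kind of attack is what a cryptographic hash function is for.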

Special algorithms exist for audio fingerprinting and video fingerprinting.

[1] A. Z. Broder. "Some applications of Rabin's fingerprinting method". In Sequences II: Methods in Communications, Security, and Computer Science, pp. 143–152. Springer-Verlag, 1993.
[2] "Detecting duplicate and near-duplicate files". US Patent 6658423, issued December 2, 2003.
[3] A. Z. Broder (1998). "On the resemblance and containment of documents". Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171). IEEE Computer Society. pp. 21–27. CiteSeerX 10.1.1.24.779. doi:10.1109/SEQUEN.1997.666900. ISBN 978-0-8186-8132-5. S2CID 11748509.
[4] S. Brin, J. Davis, and H. Garcia-Molina (1995). "Copy Detection Mechanisms for Digital Documents". In: ACM International Conference on Management of Data (SIGMOD 1995), May 22–25, 1995, San Jose, California. Archived 2016-08-18 at the Wayback Machine. Retrieved 2019-01-11.
[5] L. Fan, P. Cao, J. Almeida, and A. Broder. "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol". IEEE/ACM Transactions on Networking, vol. 8, no. 3 (2000).
[6] U. Manber. "Finding Similar Files in a Large File System". Proceedings of the USENIX Winter Technical Conf. (1994).
