Similarity Hashing Based on Levenshtein Distances

Frank Breitinger; Georg Ziroff; Steffen Lange; Harald Baier

doi:10.1007/978-3-662-44952-3_10

Conference paper, 2014

Author affiliations

Frank Breitinger: Hochschule Darmstadt; Center for Advanced Security Research Darmstadt

Georg Ziroff: Hochschule Darmstadt

Steffen Lange: Hochschule Darmstadt

Harald Baier: Hochschule Darmstadt; Center for Advanced Security Research Darmstadt

Abstract

It is increasingly common in forensic investigations to use automated pre-processing techniques to reduce the massive volumes of data that are encountered. This is typically accomplished by comparing fingerprints (usually cryptographic hashes) of files against existing databases. In addition to finding exact matches of cryptographic hashes, it is necessary to find approximate matches corresponding to similar files, such as different versions of a given file.

This paper presents a new stand-alone similarity hashing approach called saHash, which has a modular design and operates in linear time. saHash is almost as fast as SHA-1 and more efficient than other approaches for approximate matching. The similarity hashing algorithm uses four sub-hash functions, each producing its own hash value. The four sub-hashes are concatenated to produce the final hash value. This modularity enables sub-hash functions to be added or removed, e.g., if an exploit for a sub-hash function is discovered. Given the hash values of two byte sequences, saHash returns a lower bound on the number of Levenshtein operations between the two byte sequences as their similarity score. The robustness of saHash is verified by comparing it with other approximate matching approaches such as sdhash.
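
The idea of a score that is a lower bound on the Levenshtein distance can be illustrated with a minimal Python sketch. This is not the saHash algorithm: the abstract does not name its four sub-hash functions, so the two parts used here (input length and a byte histogram) and the names toy_hash and lower_bound are assumptions chosen only for illustration. Each part yields its own bound, and the largest bound is reported.

```python
from collections import Counter


def levenshtein(a: bytes, b: bytes) -> int:
    """Exact Levenshtein distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, byte_a in enumerate(a, start=1):
        current = [i]
        for j, byte_b in enumerate(b, start=1):
            cost = 0 if byte_a == byte_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def toy_hash(data: bytes) -> tuple:
    """Hypothetical two-part hash (NOT saHash): length plus byte histogram.

    Computed in a single pass over the input, i.e. in linear time.
    """
    hist = Counter(data)
    return (len(data), tuple(hist.get(value, 0) for value in range(256)))


def lower_bound(hash_a: tuple, hash_b: tuple) -> int:
    """Lower bound on the Levenshtein distance, computed from hashes only."""
    len_a, hist_a = hash_a
    len_b, hist_b = hash_b
    # Part 1: every insertion or deletion changes the length by exactly one.
    length_bound = abs(len_a - len_b)
    # Part 2: one edit operation removes at most one surplus byte and
    # supplies at most one missing byte.
    surplus = sum(max(x - y, 0) for x, y in zip(hist_a, hist_b))
    deficit = sum(max(y - x, 0) for x, y in zip(hist_a, hist_b))
    # Each part is a valid bound on its own, so the maximum is valid too.
    return max(length_bound, surplus, deficit)


if __name__ == "__main__":
    x, y = b"kitten", b"sitting"
    print(lower_bound(toy_hash(x), toy_hash(y)), levenshtein(x, y))
```

Because the bound is derived from the hashes alone, two inputs can be compared without access to the original data. The bound can be far from the true distance: a byte sequence and its reversal share the same length and histogram, so this toy bound is 0 even when many edits are needed.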

Keywords

Fuzzy hashing, similarity digest, Levenshtein distance

Domains

Computer Science [cs]

Main file

978-3-662-44952-3_10_Chapter.pdf (1.13 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-01393767

Submitted on: Tuesday, November 8, 2016, 10:48:06 AM

Last modification on: Wednesday, September 7, 2022, 3:36:04 PM

Long-term archiving on: Tuesday, March 14, 2017, 11:33:50 PM

Dates and versions

hal-01393767 , version 1 (08-11-2016)

Licence

Attribution

Identifiers

HAL Id : hal-01393767 , version 1
DOI : 10.1007/978-3-662-44952-3_10

Cite

Frank Breitinger, Georg Ziroff, Steffen Lange, Harald Baier. Similarity Hashing Based on Levenshtein Distances. 10th IFIP International Conference on Digital Forensics (DF), Jan 2014, Vienna, Austria. pp.133-147, ⟨10.1007/978-3-662-44952-3_10⟩. ⟨hal-01393767⟩

Collections

IFIP-LNCS IFIP IFIP-AICT IFIP-TC IFIP-WG IFIP-TC11 IFIP-DF IFIP-WG11-9 IFIP-AICT-433
