UCSC-SOE-11-07: SimiHash: Hash-based Similarity Detection

Caitlin Sadowski, Greg Levin
02/20/2011 09:00 AM
Computer Science
Most hash functions are used to separate and obscure data, so that similar data hashes to very different keys. We propose to use hash functions for the opposite purpose: to detect similarities between data.

Detecting similar files and classifying documents is a well-studied problem, but typically involves complex heuristics or O(n^2) pair-wise comparisons. Using a hash function that hashes similar files to similar values, file similarity can be determined simply by comparing pre-sorted hash key values. The challenge is to find a similarity hash that minimizes false positives.

We have implemented a family of similarity hash functions with this intent. We have further enhanced their performance by storing the auxiliary data used to compute our hash keys. This data is used as a second filter after a hash key comparison indicates that two files are potentially similar. We use these tests to explore the notion of similarity.

UCSC-SOE-11-07