UCSC-SOE-11-07: SimiHash: Hash-based Similarity Detection

Authors Caitlin Sadowski, Greg Levin

Published On 02/20/2011 09:00 AM

Department Computer Science

Abstract Most hash functions are used to separate and obscure data, so that similar data hashes to very different keys. We propose to use hash functions for the opposite purpose: to detect similarities between data.

Detecting similar files and classifying documents is a well-studied problem, but typically involves complex heuristics or O(n^2) pair-wise comparisons. Using a hash function that hashes similar files to similar values, file similarity can be determined simply by comparing pre-sorted hash key values. The challenge is to find a similarity hash that minimizes false positives.

We have implemented a family of similarity hash functions with this intent. We have further enhanced their performance by storing the auxiliary data used to compute our hash keys. This data is used as a second filter after a hash key comparison indicates that two files are potentially similar. We use these tests to explore the notion of similarity.

Download

UCSC-SOE-11-07