Common use of Duplicate detection Clause in Contracts

Duplicate detection. (Near) duplicate detection is a difficult task because it is, in general, a quadratic problem: each new candidate document must be checked against all documents already in the corpus (e.g., by document similarity measures) before it is added. Although such methods are quite accurate, their speed becomes a serious problem in large document collections. Several authors have therefore proposed methods that reduce the time complexity to sub-quadratic: Shingling (▇▇▇▇▇▇, 1997), I-Match (▇▇▇▇▇▇▇▇▇ et al., 2002), Locality Sensitive Hashing (▇▇▇▇▇▇ et al., 1999) and SpotSigs (▇▇▇▇▇▇▇▇ et al., 2008). SpotSigs, which specifically targets duplicate detection for web crawling, represents each web page as a set of spot signatures. A spot signature is a chain of words that follow a frequent word, as attested in a corpus. Such signatures rarely occur in advertisements and navigational components of web pages, so they are built from the portions of a page with "real" content. SpotSigs then uses an efficient, self-tuning matching algorithm based on the Jaccard similarity of spot-signature sets to derive an optimal partitioning of the web page collection into buckets of potentially matching documents, reducing duplicate identification to a sub-quadratic problem. ▇▇▇▇▇▇▇▇ et al. (2008) report that SpotSigs outperformed the Shingling and I-Match algorithms in recall and precision, and Locality Sensitive Hashing in efficiency, on the TREC WT10g Web collection.
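
As an illustration of the idea only (not the authors' implementation), the sketch below builds spot signatures by following the words after a small, assumed set of frequent antecedent words ("the", "a", "is") and compares two documents by the Jaccard similarity of their signature sets. The antecedent set, chain length and spot distance are illustrative parameters, not the values used by ▇▇▇▇▇▇▇▇ et al. (2008).

```python
def spot_signatures(text, antecedents=frozenset({"the", "a", "is"}),
                    chain_len=2, distance=1):
    """For each occurrence of an antecedent word, follow the next
    non-antecedent words (spaced `distance` tokens apart) to form a
    chain of length `chain_len`; antecedent plus chain is one signature."""
    tokens = text.lower().split()
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        chain, j = [], i
        while len(chain) < chain_len and j + distance < len(tokens):
            j += distance
            if tokens[j] in antecedents:
                continue  # skip other antecedents inside the chain
            chain.append(tokens[j])
        if len(chain) == chain_len:
            sigs.add(tok + ":" + "-".join(chain))
    return sigs


def jaccard(a, b):
    """Jaccard similarity of two signature sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page1 = "the quick brown fox jumps over the lazy dog near a sunny river bank"
    page2 = "the quick brown fox leapt over the lazy dog near a muddy river bank"
    s1, s2 = spot_signatures(page1), spot_signatures(page2)
    print(jaccard(s1, s2))  # 0.5: the unchanged regions share their signatures
```

The sketch covers only signature extraction and the pairwise Jaccard comparison, not the partitioning of the collection into buckets of potentially matching documents described above.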

Appears in 2 contracts

Sources: Grant Agreement, Grant Agreement