Duplicate detection. The Web contains many duplicate pages, texts, and parts thereof. Ignoring this phenomenon and including duplicate documents (or their parts) in the corpus could have a negative effect on training the MT system. Duplicate detection involves identifying documents (or their parts) that already appear in the corpus and eliminating them. In the area of web page crawling, attention is focused on detecting near-duplicate pages: two pages with the same main content can differ in other parts (boilerplate), so algorithms that look only for exact duplicates would fail to identify them.
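As an illustration, the sketch below implements one common approach to near-duplicate detection: comparing word shingles (overlapping word n-grams) of two documents by Jaccard similarity. The shingle size n=5 and the 0.8 threshold are assumptions chosen for the example, not values prescribed here.

```python
import re

def shingles(text, n=5):
    """Split text into a set of overlapping word n-grams ("shingles")."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}  # very short document: treat it as one shingle
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc1, doc2, n=5, threshold=0.8):
    """Flag two documents as near duplicates when their shingle sets
    overlap above the (assumed) threshold."""
    return jaccard(shingles(doc1, n), shingles(doc2, n)) >= threshold

# Two pages with identical main content but slightly different boilerplate:
a = "The quick brown fox jumps over the lazy dog near the river bank."
b = "The quick brown fox jumps over the lazy dog near the river bank today."
print(is_near_duplicate(a, b))  # True, despite not being exact duplicates
```

In practice, pairwise comparison over millions of pages is too slow, so the Jaccard similarity of shingle sets is typically approximated with sketching techniques such as MinHash or SimHash.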