De-duplicator Sample Clauses

De-duplicator. The De-duplicator module described in section 2.1.9 is also available as a standalone web service, accessible from ▇▇▇▇://▇▇▇.▇▇▇▇.▇▇/soaplab2-axis/#ilsp.ilsp_deduplicatormd5_row.
The service has two mandatory parameters:
1. input: a file containing a list of URLs of the files to be de-duplicated.
2. inputType: the type of the files to be de-duplicated. These can be text files or TO1 XML files similar to the ones generated by the FMC.
The service also has two optional parameters:
1. minimumTokenLength: during the calculation of the page profile, all tokens whose length is equal to or shorter than this value are discarded. The default value is 2.
2. quantValue: tokens with a frequency (after quantization) below this value are discarded. The default value is 3.
The output is a text file containing a list of URLs pointing to the files that remain after de-duplication.
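To illustrate how the two optional parameters could act on a page profile, here is a minimal Python sketch. It is not the service's implementation: the function name page_profile_md5, the tokenizer, and the quantization rule (rounding each frequency down to a multiple of quantValue) are assumptions made for illustration only.

```python
import hashlib
import re
from collections import Counter


def page_profile_md5(text, minimum_token_length=2, quant_value=3):
    """Return an MD5 hex digest of a quantized word-frequency profile.

    Hypothetical helper: the tokenizer and quantization rule are assumptions,
    not the behaviour of the actual web service.
    """
    # Discard tokens whose length is equal to or shorter than the minimum.
    tokens = [
        tok
        for tok in re.findall(r"\w+", text.lower())
        if len(tok) > minimum_token_length
    ]
    counts = Counter(tokens)

    # Assumed quantization: round each frequency down to a multiple of quant_value.
    quantized = {tok: (n // quant_value) * quant_value for tok, n in counts.items()}

    # Discard tokens whose quantized frequency is below quant_value.
    kept = {tok: q for tok, q in quantized.items() if q >= quant_value}

    # Serialize deterministically (frequency descending, then token) and hash.
    profile = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    serialized = "\n".join(f"{tok} {freq}" for tok, freq in profile)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
```

Under these assumptions, two pages that differ only in rare or very short tokens produce the same digest, which is what lets a single hash comparison flag them as duplicates.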
De-duplicator. The Web contains many duplicate pages and duplicate parts of pages. For instance, ▇▇▇▇▇▇ et al. (2009) reported that, while building the WaCky corpora, the number of documents was reduced by more than 50% after de-duplication. Ignoring this phenomenon and including duplicate documents could harm the representativeness of the resulting corpus. Therefore, the De-duplicator examines the main content of the stored documents in order to detect and remove near-duplicates. This module employs the de-duplication strategy included in the Nutch framework, which involves the construction of a text profile based on quantized word frequencies, and an MD5 hash for each page (see section 3.2).
An additional step has been integrated into the final version of FMC for the detection and removal of (near) duplicates. Each document is represented as a list whose size equals the number of paragraphs of the document (excluding paragraphs that carry a crawlinfo attribute). The elements of the list are the MD5 hashes of the paragraphs. Each list is then checked against all other lists. For each candidate pair, the intersection of the two lists is calculated. If the ratio of the cardinality of the intersection to the cardinality of the shorter list exceeds a predefined threshold, the documents are considered near-duplicates and the shorter one is discarded.
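The paragraph-level step described above can be sketched in Python as follows. This is an illustrative reading of the text, not the FMC code: the function names, the input shape (a dict mapping a document id to its list of paragraphs, already stripped of crawlinfo paragraphs), and the threshold value of 0.8 are all assumptions, since the source only speaks of a predefined threshold.

```python
import hashlib


def paragraph_hashes(paragraphs):
    """Represent a document as the list of MD5 hashes of its paragraphs."""
    return [hashlib.md5(p.strip().encode("utf-8")).hexdigest() for p in paragraphs]


def find_near_duplicates(documents, threshold=0.8):
    """Return the ids of documents to discard as near-duplicates.

    `documents` maps a document id to its list of paragraph strings.
    The threshold of 0.8 is a placeholder, not a value from the source.
    """
    hashes = {doc_id: paragraph_hashes(paras) for doc_id, paras in documents.items()}
    ids = list(hashes)
    discarded = set()

    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            # Skip pairs involving a document that has already been discarded.
            if first in discarded or second in discarded:
                continue
            shorter, longer = sorted((first, second), key=lambda d: len(hashes[d]))
            if not hashes[shorter]:
                continue  # an empty document cannot be compared
            shared = len(set(hashes[shorter]) & set(hashes[longer]))
            # Ratio of shared paragraph hashes to the size of the shorter list.
            if shared / len(hashes[shorter]) > threshold:
                discarded.add(shorter)

    return discarded
```

Comparing every pair is quadratic in the number of documents, which matches the description that each list is checked against all other lists; a larger collection would likely need an index over paragraph hashes to stay tractable.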

Related to De-duplicator

  • Non-duplication In the event that the Executive shall perform services for the Bank or any other direct or indirect subsidiary or affiliate of the Company or the Bank, any compensation or benefits provided to the Executive by such other employer shall be applied to offset the obligations of the Company hereunder, it being intended that this Agreement set forth the aggregate compensation and benefits payable to the Executive for all services to the Company, the Bank and all of their respective direct or indirect subsidiaries and affiliates.

  • No Duplication The remedies provided in this Article 8 shall not be duplicative of any remedy available under the indemnification provisions of the Purchase Agreement.

  • No Duplicative Payment The Company shall not be liable under this Agreement to make any payment of amounts otherwise indemnifiable hereunder if and to the extent that Indemnitee has otherwise actually received such payment under any insurance policy, contract, agreement or otherwise.

  • Previously Reviewed Receivable; Duplicative Tests If any Review Receivable was included in a prior Review, the Asset Representations Reviewer will not conduct additional Tests on such Review Receivable, but will include the previously reported Test results in the Review Report for the current Review. If the same Test is required for more than one representation and warranty, the Asset Representations Reviewer will only perform the Test once for each Review Receivable, but will report the results of the Test for each applicable representation and warranty on the Review Report.

  • No Duplicative Payments It is intended that the provisions of this Agreement will not result in duplicative payment of any amount (including interest) required under this Agreement. The provisions of this Agreement shall be construed in the appropriate manner to ensure such intentions are realized.