Cleaner. The Cleaner module described in 2.1.3 is also available as a standalone web service accessible from ▇▇▇▇://▇▇▇.▇▇▇▇.▇▇/soaplab2-axis/#ilsp.ilsp_cleaner_row.

The service has one mandatory parameter:

1. The input parameter is the URL of the web document to be cleaned.

The Cleaner also accepts five optional parameters:

1. The outputType parameter sets the type of the output. It can be: i) a text file containing only the clean text, ii) an XML file containing the metadata of the web document and the clean text only, and iii) an XML file containing the metadata of the web document and the content of the web document annotated as boilerplate or text. Users can select the type of output according to their needs. For example, the first type might be useful for somebody who has already downloaded web documents and would like to apply de-duplication at document level using only the clean text of the downloaded web documents. The second type could be useful for someone who would like to extract metadata from the source web documents and keep only the clean text from these sources. If the user is interested in both boilerplate and clean text, the third type should be selected. Both the second and third types also provide structural information about the web document, by means of the attribute type with the values title, heading or listitem.

2. The methodsList parameter sets the method for removing boilerplate. Boilerpipe provides six methods: ArticleExtractor, ArticleSentencesExtractor, DefaultExtractor, KeepEverythingExtractor, LargestContentExtractor, and NumWordsRulesExtractor (default). Short descriptions of the methods are available at ▇▇▇▇://▇▇▇▇▇▇▇▇▇▇.▇▇▇▇▇▇▇▇▇▇.▇▇▇/svn/trunk/boilerpipe-core/javadoc/1.0/index.html. The attribute crawlinfo with value boilerplate will be added to every paragraph of the web document that has been classified as boilerplate. The remaining paragraphs constitute the clean text.

3. The minimumLength parameter defines the minimum accepted length, in tokens, of each paragraph of the clean text. Users not interested in short paragraphs can set the value of this parameter accordingly. The attribute crawlinfo with value ooi-length will be added to every paragraph of the clean text whose length is less than minimumLength. The default value is 10.

4. The language parameter sets the targeted language. The current list of ISO 639 codes for supported languages comprises en, el, es, fr, it and de. Selecting one of these languages implies that the user is only interested in content in this language. Therefore, the embedded language identifier will be applied to each “accepted” paragraph (i.e. each paragraph that has not been classified as boilerplate and has length over the minimumLength), and a crawlinfo attribute with value ▇▇▇-▇▇▇▇ will be added to every paragraph that is not in the targeted language. If there is no targeted language (default), the embedded language identifier will be applied to the main content (clean text) of the web document, and the ISO code of the identified language will fill the element <language>.

5. The termList parameter is a list of triplets (<relevance weight, term, topic-class>) that define the domain or the sub-domains. This parameter can be provided by uploading an existing file with a list of terms as described in section 2.1.10 above. The embedded text-to-topic classifier will be applied to the document and, if the document is classified as relevant to a sub-domain, the <subdomain> container will be filled accordingly. In addition, the Cleaner will search for these terms in each “accepted” paragraph. If one or more terms are found in a paragraph, the attribute topic will be added to this paragraph, and the terms found will be stored as the attribute value.
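
The paragraph-level effect of the minimumLength, language and termList parameters can be illustrated with a minimal Python sketch. It is not the Cleaner's implementation. The Paragraph class, the annotate function, the detect_language callable, whitespace tokenisation, case-insensitive substring matching of terms and the ";" separator are all assumptions made for illustration. The constant OUT_OF_LANGUAGE is a hypothetical stand-in for the crawlinfo value that is elided in the text above. A paragraph with exactly minimumLength tokens is treated here as accepted, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical placeholder: the real attribute value is elided in the source text.
OUT_OF_LANGUAGE = "out-of-language"

# (relevance weight, term, topic-class), as described for termList.
Term = Tuple[float, str, str]


@dataclass
class Paragraph:
    text: str
    is_boilerplate: bool = False      # assumed to be decided by the boilerplate remover
    crawlinfo: Optional[str] = None   # "boilerplate", "ooi-length" or OUT_OF_LANGUAGE
    topic: Optional[str] = None       # terms found in the paragraph


def annotate(
    paragraphs: List[Paragraph],
    minimum_length: int = 10,
    language: Optional[str] = None,
    term_list: Iterable[Term] = (),
    detect_language: Optional[Callable[[str], str]] = None,
) -> List[Paragraph]:
    """Annotate paragraphs in place, following the parameter descriptions above."""
    terms = [term.lower() for _, term, _ in term_list]
    for p in paragraphs:
        if p.is_boilerplate:
            p.crawlinfo = "boilerplate"
            continue
        # Assumption: tokens are approximated by whitespace-separated words.
        if len(p.text.split()) < minimum_length:
            p.crawlinfo = "ooi-length"
            continue
        # From here on the paragraph counts as "accepted".
        if language and detect_language and detect_language(p.text) != language:
            p.crawlinfo = OUT_OF_LANGUAGE
        # Assumption: a term is "found" if it occurs as a case-insensitive substring.
        found = [t for t in terms if t in p.text.lower()]
        if found:
            p.topic = ";".join(found)
    return paragraphs


paragraphs = [
    Paragraph("Home | About | Contact", is_boilerplate=True),
    Paragraph("Too short to keep"),
    Paragraph("Solar panels convert sunlight into electricity for homes "
              "and small businesses every day"),
]
result = annotate(paragraphs, term_list=[(100, "solar panels", "energy")])
for p in result:
    print(p.crawlinfo, p.topic)
```

In this sketch the first paragraph is marked as boilerplate, the second as too short, and the third is accepted and receives a topic attribute listing the matched term.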


Sources: Grant Agreement, Grant Agreement
