Cleaner. Apart from its textual content, a typical web page also contains “noise” elements such as navigation links, advertisements and disclaimers (often called boilerplate) that are of limited or no use for linguistic purposes. Such irrelevant parts should be removed or marked as such to ensure the production of good-quality language resources. For this task, FMC uses a modified version of Boilerpipe (Kohlschütter et al., 2010) that also extracts structural information such as titles, headings and list items. It also segments text into paragraphs by exploiting the presence of specific HTML tags such as <p>, <br> and <li>. Paragraphs judged to be boilerplate and/or detected as titles, headings, etc. are annotated accordingly (see subsection 2.1.8).
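The tag-based segmentation and annotation described above can be illustrated with a minimal Python sketch. It is not the Boilerpipe classifier and not the FMC implementation: the class `ParagraphSegmenter`, the function `segment`, the labels `title`, `heading`, `listitem` and `text`, and the 0.5 link-density threshold are all hypothetical, and link density is used only as a simple stand-in for a real boilerplate detector.

```python
from html.parser import HTMLParser

# Hypothetical mapping from block-level tags to paragraph labels.
TYPE_BY_TAG = {"p": "text", "li": "listitem", "title": "title"}
TYPE_BY_TAG.update({f"h{i}": "heading" for i in range(1, 7)})


class ParagraphSegmenter(HTMLParser):
    """Splits HTML into paragraphs at <p>, <br>, <li>, <title> and headings."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []    # one dict per paragraph
        self._words = []        # words of the paragraph being built
        self._link_words = 0    # how many of those words sit inside <a>
        self._type = "text"
        self._in_link = False

    def _flush(self):
        # Close the current paragraph, if it has any text.
        if self._words:
            density = self._link_words / len(self._words)
            self.paragraphs.append({
                "text": " ".join(self._words),
                "type": self._type,
                # Assumed heuristic: mostly-link paragraphs are boilerplate.
                "boilerplate": density > 0.5,
            })
        self._words = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
        elif tag == "br":
            self._flush()       # line break: new paragraph, same label
        elif tag in TYPE_BY_TAG:
            self._flush()
            self._type = TYPE_BY_TAG[tag]

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "br":
            self._flush()
        elif tag in TYPE_BY_TAG:
            self._flush()
            self._type = "text"

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def segment(html):
    """Return the annotated paragraphs found in an HTML string."""
    parser = ParagraphSegmenter()
    parser.feed(html)
    parser.close()      # deliver any text still buffered by the parser
    parser._flush()     # close a trailing paragraph with no end tag
    return parser.paragraphs
```

Calling `segment(page_source)` yields one dictionary per paragraph, so a later step can either drop the entries flagged as boilerplate or keep them with their annotation. The sketch does not skip `<script>` or `<style>` content, which a real cleaner would need to do.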