Web-page cleaning Sample Clauses

Web-page cleaning. The current state of the art in corpus cleaning is briefly described in Spousta et al. (2008): Interest in web page cleaning originated in the area of web mining and search engines (see e.
Web-page cleaning. Apart from its main textual content, a typical web page also contains "noise" such as navigation links, advertisements, and disclaimers (often called boilerplate), which is of limited or no use for training an MT system. Such irrelevant parts should be removed and only the main content kept in order to produce good-quality language resources. This is the most challenging task of the CNC and special attention will be paid to it in WP4.
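
As an illustration of what removing boilerplate can mean in practice, the following is a minimal sketch of a link-density heuristic, one simple and commonly used approach. It is not the method described in the clause or in Spousta et al. (2008). The names `BlockCollector` and `extract_main_text`, the set of block tags, and the thresholds `min_words` and `max_link_density` are assumptions made for this example only.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption; real pages need a longer list).
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "body"}
# Tags whose content is never page text.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the link text in each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # non-whitespace characters inside <a> tags
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block; whitespace is collapsed to single spaces.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len("".join(data.split()))


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and consist mostly of non-link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        total_chars = len(text.replace(" ", ""))
        link_density = link_chars / total_chars
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `extract_main_text(page_html)` returns the list of retained text blocks. Short blocks and blocks dominated by link text, which is typical of navigation menus and footers, are dropped. Fixed thresholds like these are crude, and production cleaners generally combine several features rather than relying on link density alone.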