Text normalization. The Web is highly heterogeneous, not only in content but also in form. Documents and pages available online come in different file formats (HTML, PDF, DOC, TXT, etc.) and text encodings (UTF-8, ISO-8859-x, etc.). Text normalization involves detecting the file format and text encoding of each downloaded web page and converting it into a unified format (plain text) and a unified text encoding (UTF-8).
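As an illustration only, the detection and conversion steps might be sketched as follows in Python, using just the standard library. The function names (`decode_bytes`, `html_to_text`, `normalize`) and the fallback order (UTF-8 first, then ISO-8859-1) are assumptions made for this sketch and are not taken from the agreement. Only HTML and plain-text input are covered; PDF and DOC files would need dedicated extractors, and a production system would likely use a proper encoding detector instead of a two-step fallback.

```python
import codecs
from html.parser import HTMLParser


def decode_bytes(raw: bytes) -> str:
    """Decode raw bytes to str: strip a UTF-8 BOM, try strict UTF-8, else ISO-8859-1."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: ISO-8859-1 maps every byte to a code point, so it never fails.
        return raw.decode("iso-8859-1")


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Reduce an HTML document to whitespace-normalized plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def normalize(raw: bytes, is_html: bool) -> bytes:
    """Return the document as plain text encoded in UTF-8."""
    text = decode_bytes(raw)
    if is_html:
        text = html_to_text(text)
    return text.encode("utf-8")
```

For example, `normalize(page_bytes, is_html=True)` would take the bytes of a downloaded ISO-8859-1 HTML page and return its visible text as UTF-8 bytes. Here the caller says whether the input is HTML; detecting the file format itself (for instance from the HTTP Content-Type header or the file signature) is left out of the sketch.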