Preparing Data. The SMT step of the pipeline receives sentence-aligned data (covered in the previous subsection). The data is then tokenised and lowercased using the Europarl tools. The following table gives the number of sentences at each stage of the preparation process:
• “Provided” is the number of sentences output by the aligner with no threshold applied.
• “Unique” is the number of sentences remaining after duplicate sentence pairs are removed.
• “Clean” is the number of sentences remaining after the threshold is applied, which removes sentence alignments with a confidence score below 0.4.
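The three counts described above can be illustrated with a minimal Python sketch. It is not the project's actual tooling: `AlignedPair`, `normalise`, and `prepare` are hypothetical names, `normalise` is only a stand-in for the Europarl tokeniser and lowercaser, and the choices to deduplicate after normalisation and to keep the first occurrence of a duplicate pair are assumptions.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class AlignedPair(NamedTuple):
    """One sentence alignment (hypothetical structure for illustration)."""
    source: str
    target: str
    score: float  # aligner confidence score


def normalise(text: str) -> str:
    # Stand-in for the Europarl tokeniser/lowercaser (assumption):
    # lowercases the text and collapses runs of whitespace.
    return " ".join(text.lower().split())


def prepare(
    pairs: Iterable[AlignedPair], threshold: float = 0.4
) -> Tuple[List[AlignedPair], Dict[str, int]]:
    # "Provided": everything the aligner produced, no threshold applied.
    provided = [
        AlignedPair(normalise(p.source), normalise(p.target), p.score)
        for p in pairs
    ]

    # "Unique": drop duplicate sentence pairs, keeping the first occurrence.
    seen = set()
    unique = []
    for p in provided:
        key = (p.source, p.target)
        if key not in seen:
            seen.add(key)
            unique.append(p)

    # "Clean": remove alignments whose confidence score is below the threshold.
    clean = [p for p in unique if p.score >= threshold]

    counts = {
        "Provided": len(provided),
        "Unique": len(unique),
        "Clean": len(clean),
    }
    return clean, counts
```

Calling `prepare(pairs)` on a list of `AlignedPair` values returns the cleaned pairs together with a dictionary holding the three counts, which corresponds to one row of the table.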