Word alignment. To avoid accidental co-occurrences of a ▇▇-▇▇ pair, the subcorpora were filtered by word alignment: only ▇▇-▇▇ word pairs which could be word-aligned were kept in the data4. GIZA++ was used for word alignment. All sentence pair candidates which could not be word-aligned were removed from the subcorpora. This operation removed another 280K sentences from the text base, leaving 2.68 million sentences for the following steps. The removed sentences would be worth examining more closely: they could result either from genuine accidental co-occurrences or from word alignment errors. More importantly, this step also removed entries, and whole packages, for which no word alignment could be found, either because they did not co-occur in any sentence pair or because they could not be word-aligned. Table 3-3 shows the remaining data sets.
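The filtering criterion described above can be illustrated with a minimal sketch. It is not the pipeline actually used: it assumes the aligner output has already been converted to "i-j" index links (source token index, target token index), it handles single-word entries only, and the function names and the English/German example sentences are hypothetical.

```python
from typing import List, Set, Tuple

Links = Set[Tuple[int, int]]                      # (source index, target index)
Candidate = Tuple[List[str], List[str], Links]    # source tokens, target tokens, links


def parse_links(line: str) -> Links:
    """Parse whitespace-separated 'i-j' links, e.g. '0-0 1-2' (assumed format)."""
    links = set()
    for link in line.split():
        i, j = link.split("-")
        links.add((int(i), int(j)))
    return links


def pair_is_aligned(src: List[str], tgt: List[str], links: Links,
                    src_word: str, tgt_word: str) -> bool:
    """True if some link connects an occurrence of src_word to one of tgt_word."""
    return any(
        i < len(src) and j < len(tgt)
        and src[i] == src_word and tgt[j] == tgt_word
        for i, j in links
    )


def filter_by_alignment(candidates: List[Candidate],
                        src_word: str, tgt_word: str) -> List[Candidate]:
    """Keep only the sentence pairs in which the word pair is word-aligned."""
    return [c for c in candidates
            if pair_is_aligned(c[0], c[1], c[2], src_word, tgt_word)]


# Hypothetical data: both sentence pairs contain "bank" and "Bank",
# but only in the first one are the two words linked by the alignment.
candidates = [
    ("the bank is closed".split(),
     "die Bank ist geschlossen".split(),
     parse_links("0-0 1-1 2-2 3-3")),
    ("the bank of the river".split(),
     "das Ufer neben der Bank".split(),
     parse_links("0-0 1-1 2-2 3-3")),
]

kept = filter_by_alignment(candidates, "bank", "Bank")
```

In the second sentence pair the two words merely co-occur, so the pair is dropped. A candidate dropped because of an alignment error would look identical from the filter's point of view, which is why the removed sentences cannot be classified without inspection.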