Parallel corpora. Machine translation systems generally use two separate parallel corpora: at training time, a large corpus is used for extracting translation rules and collecting statistics (usually rule counts), and at test time, the system is evaluated on a smaller held-out corpus. Systems that need to set parameters (including the one used in our experiments) also require an additional held-out tuning corpus, typically about the same size as the test set. Our MT training corpus was a 22 million word mixed-genre (though primarily newswire) corpus that had been previously assembled and processed to enforce a consistent tokenization scheme. The tuning and test sets were taken from the NIST MT04 and MT05 development sets, with some light processing to match the tokenization of the training data. All of these datasets were received from BBN as part of the DARPA ▇▇▇▇ (now BOLT) program. Details are in Table 2.2.
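To make the division of labor among the three corpora concrete, the sketch below shows one hypothetical way to load sentence-aligned source/target files for the training, tuning, and test splits through a single shared tokenizer, so that statistics collected at training time are gathered under the same tokenization applied at tuning and test time. The file names, the toy tokenizer, and the load_parallel helper are illustrative assumptions, not the actual pipeline described above.

    # Minimal sketch (not the dissertation's pipeline): load the three parallel
    # corpora through one shared tokenizer so training-time counts match the
    # tuning and test conditions. File names and tokenizer are assumptions.
    from pathlib import Path

    def tokenize(line: str) -> list[str]:
        # Toy tokenizer: lowercase and split on whitespace. A real system would
        # apply the same normalization used to build the training corpus.
        return line.lower().split()

    def load_parallel(src_path: Path, tgt_path: Path) -> list[tuple[list[str], list[str]]]:
        # Read sentence-aligned source/target files into tokenized sentence pairs.
        with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
            return [(tokenize(s), tokenize(t)) for s, t in zip(src, tgt)]

    if __name__ == "__main__":
        # Hypothetical file layout: one source and one target file per split.
        train = load_parallel(Path("train.zh"), Path("train.en"))  # rule extraction and counts
        tune  = load_parallel(Path("tune.zh"),  Path("tune.en"))   # parameter (weight) tuning
        test  = load_parallel(Path("test.zh"),  Path("test.en"))   # held-out evaluation
        print(f"train: {len(train)}  tune: {len(tune)}  test: {len(test)} sentence pairs")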