Baseline Setup. We trained a baseline system using the English-German Europarl and News Commentary data from the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR.[1] The German side of the parallel corpus was parsed using the BitPar parser.[2] Where a parse failed, the pair was discarded, leaving a total of 1,516,961 sentence pairs. These were aligned using GIZA++, and SCFG rules were extracted as described in section 3.1 using the Moses toolkit. The resulting grammar contained just under 140 million synchronous rules. We used all of the available monolingual German data to train three 5-gram language models (one each for the Europarl, News Commentary, and News data sets). These were interpolated using weights optimised against the development set, and the resulting language model was used in all experiments. We used the SRILM toolkit (Stolcke, 2002) with Kneser-Ney smoothing (Chen and Goodman, 1998). The baseline system's feature weights were tuned on the news-test2008 dev set (2,051 sentence pairs) using minimum error rate training (Och, 2003).

[1] http://www.statmt.org/wmt10/translation-task.html
[2] http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html
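To make the parse-based filtering step concrete, the sketch below (Python) keeps a sentence pair only when BitPar returned a parse for its German side. The file layout and the failure marker tested for are illustrative assumptions rather than details of our pipeline.

# Keep a sentence pair only if the German side received a BitPar parse.
# File names and the failure marker are assumptions for illustration.
def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

src = read_lines("corpus.en")            # English side, one sentence per line
tgt = read_lines("corpus.de")            # German side, one sentence per line
parses = read_lines("corpus.de.parses")  # one parse (or failure marker) per line

kept = [(e, d) for e, d, p in zip(src, tgt, parses)
        if p and not p.startswith("No parse")]  # assumed failure marker
print("kept %d of %d sentence pairs" % (len(kept), len(src)))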
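The interpolation weights were optimised against the development set; the standard way to do this (and what SRILM's compute-best-mix implements) is EM over the per-token probabilities each component model assigns to the dev text. A minimal sketch, assuming those per-token probabilities have already been extracted (e.g. from ngram -debug 2 -ppl output):

import numpy as np

# EM for linear-interpolation weights over K component LMs. `probs` holds
# each component model's probability for every token of the dev set,
# shape (n_tokens, K); extracting these per-token values is assumed done.
def interpolation_weights(probs: np.ndarray, iters: int = 50) -> np.ndarray:
    n, k = probs.shape
    lam = np.full(k, 1.0 / k)              # start from uniform weights
    for _ in range(iters):
        mix = probs @ lam                  # mixture probability per token
        post = probs * lam / mix[:, None]  # E-step: responsibility of each LM
        lam = post.mean(axis=0)            # M-step: average responsibilities
    return lam

# toy check: three "models" scoring five tokens
p = np.array([[.1, .2, .4], [.3, .1, .2], [.2, .2, .2],
              [.05, .3, .1], [.4, .1, .3]])
print(interpolation_weights(p))            # weights sum to one

Each EM iteration cannot decrease dev-set likelihood and the weights remain a proper distribution, so a fixed small number of iterations suffices in practice.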
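Minimum error rate training (Och, 2003) repeatedly re-decodes, merges n-best lists, and for each feature weight in turn finds the value minimising corpus-level error via an exact line search over the piecewise-linear envelope of hypothesis scores. The sketch below substitutes a coarse grid search for that exact line search and uses per-hypothesis error counts in place of BLEU sufficient statistics; it is a simplified stand-in, not the algorithm as we ran it.

import numpy as np

# Simplified one-weight-at-a-time tuning over fixed n-best lists.
# feats[s] is an (n_hyps, n_feats) array for sentence s; errors[s] gives
# each hypothesis's error count. All data structures here are assumptions.
def corpus_error(weights, feats, errors):
    total = 0.0
    for F, e in zip(feats, errors):
        total += e[np.argmax(F @ weights)]  # error of the 1-best under weights
    return total

def tune(feats, errors, n_feats, sweeps=5, grid=np.linspace(-1, 1, 41)):
    w = np.ones(n_feats)
    for _ in range(sweeps):
        for j in range(n_feats):            # optimise one weight at a time
            w[j] = min(grid, key=lambda v: corpus_error(
                np.concatenate([w[:j], [v], w[j + 1:]]), feats, errors))
    return w

Because the objective is piecewise constant in the weights, implementations typically add random restarts to escape the many local minima; the exact line search of Och (2003) also makes each coordinate step optimal rather than grid-limited.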