
Parallel Treebanks. The novel models presented here all use a parallel bilingual treebank for both training and evaluation. Our annotated parallel data all comes from the English Chinese Translation Treebank (▇▇▇▇ et al., 2007). This dataset consists of the documents from the Chinese Treebank v1.0 (▇▇▇ et al., 2000), all of which have since been translated into English. In addition, the translated English sentences have been manually annotated with gold-standard parse trees, using conventions similar (though not quite identical) to those of the Penn WSJ Treebank. Finally, the dataset was later annotated with gold-standard word alignments. Because the various components of this dataset (English translations, English trees, gold word alignments) were released piecemeal, there is no publicly available, cleanly sentence-aligned version of the corpus. We therefore had to perform some additional processing to make the data usable, discarding all sentences that did not align one-to-one or for which one or more annotations were missing. This filtering step left a total of 2749 sentence pairs from the 4180 sentences in the original data. The data was separated into training, development, and test sets according to the standard Chinese Treebank division. Details are in Table 2.3.
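The filtering step described above can be illustrated with a minimal sketch. It is not the processing code actually used: the `SentenceRecord` container, its field names, and the string representation of trees and alignments are all hypothetical, and we assume that a sentence "aligns one-to-one" exactly when it has been matched to a single English sentence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceRecord:
    """Hypothetical container for one Chinese sentence and its annotations."""
    chinese_tree: Optional[str]   # gold Chinese parse, or None if missing
    english_trees: List[str]      # gold parses of the matched English sentence(s)
    alignment: Optional[str]      # gold word alignment, or None if missing


def is_usable(record: SentenceRecord) -> bool:
    """Return True if the record aligns one-to-one and has every annotation."""
    # Assumption: one-to-one means exactly one matched English sentence.
    one_to_one = len(record.english_trees) == 1
    complete = record.chinese_tree is not None and record.alignment is not None
    return one_to_one and complete


def filter_corpus(records: List[SentenceRecord]) -> List[SentenceRecord]:
    """Discard every record that fails the usability check."""
    return [r for r in records if is_usable(r)]
```

Under these assumptions, a sentence translated as two English sentences, or one whose word alignment was never released, would be dropped, while fully annotated one-to-one pairs are retained.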