Computer Science Colloquium, 2003-2004

Alon Lavie
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
March 3rd, 2004

A Trainable Transfer-based MT Approach for Languages with Limited Resources

The past decade has seen significant advances in the area of Machine Translation (MT): the automatic computer-based translation between human languages. The need for MT in this era of the Internet and globalization is obvious. Much of the recent research progress in MT has been within corpus-based approaches such as Statistical Machine Translation (SMT). SMT is attractive because it is fully automated and requires orders of magnitude less human labor than traditional rule-based MT approaches. However, to achieve reasonable levels of translation performance, SMT requires very large volumes of sentence-aligned parallel text for the two languages, on the order of a million words or more. Such resources are currently available for only a small number of language pairs, limiting the practical application of SMT in the foreseeable future. Our MT research group at Carnegie Mellon has been working on a new MT approach that is specifically designed to enable rapid development of MT for languages with limited amounts of online resources. Our approach assumes the availability of a small number of bilingual speakers of the two languages, but these need not be linguistic experts. The bilingual speakers create a comparatively small corpus of word-aligned phrases and sentences (on the order of a few thousand sentence pairs) using a specially designed elicitation tool. From this data, the learning module of our system automatically infers hierarchical syntactic transfer rules, which encode how syntactic constituent structures in the source language transfer to the target language. The collection of transfer rules is then used in our run-time system to translate previously unseen source language text into the target language. I will describe the general principles underlying our approach and the current state of development of our research system.
I will present results from an experiment in which we developed a basic Hindi-to-English MT system over the course of two months, using extremely limited resources on the Hindi side. The results indicate that under these extremely limited training data conditions, when tested on unseen data, our XFER system significantly outperforms SMT. Finally, I will describe our current work on applying our XFER approach to the development of a basic Hebrew-to-English MT system.

Joint work with: Lori Levin, Jaime Carbonell, Katharina Probst, Erik Peterson, Stephan Vogel, and Ariadna Font-Llitjos.

Shuly Wintner

Last modified: Mon Jan 26 08:13:30 IST 2004