Computer Science Colloquium, University of Haifa

Abstract:

Transliteration is the process of converting terms written in one language into their approximate spelling or phonetic equivalents in another language. Transliteration is defined for a pair of languages, a source language and a target language. The two languages may differ in their orthographic systems and phonetic inventories. In the context of a Machine Translation system, one has to first identify which terms should be transliterated rather than translated, and then produce a proper transliteration for these terms.

We present a Hebrew-to-English transliteration method in the context of a Machine Translation system. Our method uses machine learning to determine which terms are to be transliterated rather than translated. The training corpus for this purpose includes only positive examples, acquired semi-automatically from a corpus of articles from the Hebrew press and web forums. Our classifier eliminates more than 38% of the errors made by a baseline method. The identified terms are then transliterated using a Statistical Machine Translation technique. The transliteration model was trained on a parallel corpus extracted from Wikipedia using a fairly simple method that requires minimal linguistic knowledge. The correct result is produced as the Top-1 result in more than 76% of the cases, and in 95% of the cases it is among the Top-5 results. We also demonstrate an improvement in the performance of a Hebrew-to-English MT system that uses our transliteration module.
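The Top-1 and Top-5 figures can be read as top-k accuracy: the share of terms whose correct transliteration appears among the first k ranked candidates. The following is a minimal sketch in Python of that measure, not the evaluation code used in this work; the function name `top_k_accuracy` and the toy candidate lists are assumptions made up for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of terms whose reference is among the first k candidates."""
    if len(ranked_candidates) != len(references):
        raise ValueError("one candidate list is required per reference")
    if not references:
        return 0.0
    hits = sum(
        1
        for candidates, reference in zip(ranked_candidates, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


# Hypothetical ranked outputs (best first) for four terms; invented data.
candidates = [
    ["london", "lundun", "londen"],
    ["nyu york", "new york", "nu york"],
    ["haifa", "heifa", "khaifa"],
    ["vashington", "washingtun", "uashington"],
]
references = ["london", "new york", "haifa", "washington"]

top1 = top_k_accuracy(candidates, references, k=1)
top3 = top_k_accuracy(candidates, references, k=3)
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")
```

A term counts as correct at rank k only if the reference string matches one of the first k candidates exactly, so Top-5 accuracy can never be lower than Top-1 accuracy on the same data.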

Computer Science Colloquium, 2008-2009

Lightly Supervised Transliteration for Machine Translation