November 23, Wednesday 14:15, Room 303, Jacobs Building

Title: Statistical Parsing in the Face of Language Diversity

Lecturer: Reut Tsarfaty

Lecturer homepage: stp.lingfil.uu.se/~tsarfaty

Affiliation: Department of Linguistics and Philology, Uppsala University


Syntactic parsing, the task of automatically analyzing the structure of natural language sentences, is considered a core Computational Linguistics/Natural Language Processing (CL/NLP) task, as it provides the first step towards utterance understanding, text summarization, machine translation and other applications. Statistical parsers are designed to automatically discover a set of relations between language-independent elements such as subject, predicate and object, based on the language-specific realization patterns observed in language data. A subject in English, for example, is realized in syntax, through word order; in German, by contrast, it is realized in morphology, through case affixes. This diversity in the realization of grammatical relations across languages has dramatic effects on parsing accuracy. Existing statistical parsing models demonstrate excellent performance on English, but when trained on data from other languages they often fail to yield comparable results. A research question thus emerges: what kinds of models are suitable for parsing these different languages?
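As a rough illustration of the statistical side of the task (a toy sketch, not the model discussed in the talk), the snippet below parses an invented sentence with a hand-written probabilistic grammar using NLTK's ViterbiParser; the grammar, its probabilities and the sentence are illustrative assumptions, standing in for statistics that would normally be estimated from a treebank.

    # Toy illustration only: rule probabilities are invented, not estimated.
    # Requires the NLTK package.
    import nltk

    toy_grammar = nltk.PCFG.fromstring("""
        S   -> NP VP    [1.0]
        NP  -> Det N    [0.6]
        NP  -> N        [0.4]
        VP  -> V NP     [1.0]
        Det -> 'the'    [1.0]
        N   -> 'dog'    [0.5]
        N   -> 'cat'    [0.5]
        V   -> 'sees'   [1.0]
    """)

    parser = nltk.ViterbiParser(toy_grammar)
    for tree in parser.parse("the dog sees the cat".split()):
        print(tree)          # the most probable parse under the toy grammar
        print(tree.prob())   # its probability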

In this talk I present the motivation, design and application of a Relational-Realizational (RR) parsing model designed to cope with cross-linguistic diversity by mapping grammatical relations to their morphosyntactic realization in a non-rigid, language-independent fashion. The model is defined over a formal grammar that inter-relates function, syntax and morphology. The model parameters encode complex interactions between these levels; for a particular language they are estimated from corpus statistics, thus capturing its language-specific behavior. I demonstrate the application of the RR model to parsing Hebrew and Swedish, showing significant improvements over previous results. I further employ these results to instantiate an explicit link between language technology and linguistic typology, whereby the search for a "universal grammar" for cross-linguistic description is equated with the development of a processing engine that learns different probability distributions from data in different domains. I suggest that exploring this link further will lead to advances on both the technological and the scientific fronts of CL/NLP, from better models for machine translation to modeling human language acquisition.
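To make the idea of separating grammatical relations from their realization concrete, the sketch below decomposes the probability of a single clause into three decisions: which relations are present, in what order they appear, and what morphosyntactic form each takes. It is a hand-invented simplification in this spirit, not the actual RR grammar or its estimated parameters; all probability tables are illustrative assumptions.

    # Hypothetical, hand-invented probability tables illustrating a
    # relational-realizational style factorization of a single clause;
    # none of these numbers come from the talk or from real treebank data.

    # Projection: which grammatical relations the clause contains.
    p_projection = {
        ("subject", "predicate", "object"): 0.7,
        ("subject", "predicate"): 0.3,
    }

    # Configuration: the order in which those relations are realized.
    p_configuration = {
        ("subject", "predicate", "object"): 0.8,   # e.g. an SVO pattern
        ("object", "predicate", "subject"): 0.2,   # e.g. an OVS pattern
    }

    # Realization: the morphosyntactic form each relation takes.
    p_realization = {
        ("subject", "NP-nominative"): 0.9,
        ("object", "NP-accusative"): 0.9,
        ("predicate", "finite-verb"): 1.0,
    }

    def clause_probability(relations, order, forms):
        """Probability of one clause as a product of the three decisions."""
        p = p_projection[relations] * p_configuration[order]
        for relation, form in zip(order, forms):
            p *= p_realization[(relation, form)]
        return p

    print(clause_probability(
        relations=("subject", "predicate", "object"),
        order=("subject", "predicate", "object"),
        forms=("NP-nominative", "finite-verb", "NP-accusative"),
    ))  # 0.7 * 0.8 * 0.9 * 1.0 * 0.9 = 0.4536

Estimating such tables separately for each language, rather than hard-wiring one ordering or one form per relation, is what lets a single model accommodate different realization patterns across languages.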

SHORT BIO: Reut Tsarfaty is a Post-Doctoral Research Fellow at the Computational Linguistics lab at Uppsala University in Sweden. She received her Ph.D. and M.Sc. from the Institute for Logic, Language and Computation (ILLC) at the University of Amsterdam, and her B.Sc. from the Computer Science Department at the Technion. Reut's research focuses on models and methodologies for cross-linguistic and cross-framework natural language parsing. Beyond syntactic parsing, Reut has also worked on modeling the morphological, syntactic and semantic interactions in natural language processing and on applying formal logic to natural language semantics. Reut is a recipient of the Dutch Science Foundation's prestigious MOSAIC award, and she is an internationally renowned expert on Parsing Morphologically Rich Languages (PMRL), a topic for which she currently serves as a guest editor of the Computational Linguistics journal.