ISCOL 2004


Abstracts

Hspell - the Free Hebrew Spell-Checker and Morphological Analyzer

Dan Kenigsberg and Nadav Har'El

Unlike for most European languages, it used to be impossible to obtain a free and legal list of all Hebrew words, let alone a free Hebrew spell-checker. To remedy this, we have developed Hspell - a freely available Hebrew spell-checker and morphological analyzer. Hspell takes the "synthesis" approach to spell-checking: it generates a huge list of all Hebrew inflections, compresses it, and checks texts by comparing each of the text's words against it. Since most Hebrew words may be prepended by a prefix, Hspell has to know which prefixes are legal for each inflection. We believe that this is the first time a complete prefix-matching algorithm has been presented. As Hspell synthesizes all inflections, it stores their stem and inflection attributes such as tense, plurality, gender, etc. This information can be accessed later, which makes Hspell a morphological analyzer. Currently, Hspell supports the strict Niqqud-less spelling standard devised by the Academy of the Hebrew Language.
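As a minimal illustration of the lookup scheme described above (not Hspell's actual data format or algorithm), one can imagine each inflected form carrying the set of prefixes that may legally attach to it; a candidate word is accepted if it splits into a legal prefix plus a known inflection. The toy lexicon and transliterated forms below are assumptions for exposition only.

```python
# Illustrative sketch: prefix-aware word lookup.
# Hypothetical toy lexicon: inflected form -> set of allowed prefixes
# (the empty prefix "" means the bare form is itself a valid word).
LEXICON = {
    "bayit": {"", "ha", "ve", "veha"},   # transliterated examples for clarity
    "halakh": {"", "ve", "she"},
}

def is_correct(word: str) -> bool:
    """Accept `word` if some split into prefix + stem is licensed by the lexicon."""
    for i in range(len(word) + 1):
        prefix, stem = word[:i], word[i:]
        if stem in LEXICON and prefix in LEXICON[stem]:
            return True
    return False

print(is_correct("habayit"))   # True: legal prefix "ha" + known form "bayit"
print(is_correct("shebayit"))  # False: "she" is not licensed for "bayit"
```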

The crux of Hspell - the complete list of words - has become part of the database of the Knowledge Center for the Hebrew Language, and the speller itself has been integrated into various software applications. We hope that further applied and theoretical linguistic work will be built on Hspell's infrastructure.

Master-Slave Dependency Model and its Application to Hebrew Understanding

Yan Tsitrin

In this work we present a new model for sentence analysis called the MASTER-SLAVE dependency model. In this formalism we represent an utterance's meaning as a tree-based structure existing in the three-dimensional SPaCe (Structural-Perceptual-Conceptual space). Such a structure is called a Meaning Tree or MOLECULE (Morphological-Lexical-Conceptual Entity). The tree's nodes stand for the concepts expressed in the utterance (we call them semantic nuclei) and its edges (or connectors) represent conceptual or functional relations between the nuclei. We say that the computer understands an utterance when it finds its meaning tree.
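A minimal sketch of how such a meaning tree might be represented in code; the class and field names are assumptions for exposition, not the authors' actual formalism.

```python
# Illustrative data structure for a meaning tree ("MOLECULE").
from dataclasses import dataclass, field
from typing import List

@dataclass
class Nucleus:
    """A semantic nucleus: one concept expressed in the utterance."""
    concept: str
    connectors: List["Connector"] = field(default_factory=list)

@dataclass
class Connector:
    """A labelled edge: a conceptual or functional relation to a slave node."""
    relation: str                      # e.g. "agent", "patient", "attribute"
    slave: Nucleus

# A toy meaning tree for "the boy read a book":
boy = Nucleus("boy")
book = Nucleus("book")
read = Nucleus("read", connectors=[Connector("agent", boy),
                                   Connector("patient", book)])
```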

We propose an algorithm for constructing such meaning structures for languages that have extreme morphological complexity and a word order much freer than that of English (such as Hebrew, Russian, etc.). We also show how these structures can be used in more general applications, such as Dialog Systems, Machine Translation Systems, Text Summarizers and Search Engines.

Knowledge Center for Processing Hebrew: product presentations

http://www.mila.cs.technion.ac.il/

Summarizing Jewish Law Articles Using Genetic Algorithms

Yaakov HaCohen-Kerner

People often need to make decisions based on different kinds of information. However, the explosion of information is hard to handle. Summaries allow people to decide whether or not to read the whole text. In addition, they can serve as brief substitutes for full documents. This talk describes the first summarization model for texts in Hebrew. The summarization is done by extracting the most relevant sentences. First, we formulated nine known summarization methods and two Hebrew-specific summarization methods. Then, we combined them into a hybrid method that achieves better results. The best results have been achieved by a genetic algorithm. To the best of our knowledge, our model is also the first to successfully use a genetic algorithm for summarization based on sentence extraction.
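One plausible reading of the hybrid approach is sketched below; it is an illustrative sketch under assumptions, not necessarily the authors' exact formulation. Each sentence is scored by several methods, a weight vector combines the scores, and a genetic algorithm searches for weights that maximize summary quality on training documents; the scoring methods and the `summary_quality` fitness function are hypothetical placeholders.

```python
# Illustrative sketch: a genetic algorithm tuning the weights of a hybrid
# sentence-extraction summarizer.
import random

NUM_METHODS = 11          # e.g. nine generic + two Hebrew-specific scorers
POP_SIZE, GENERATIONS = 30, 50

def hybrid_score(method_scores, weights):
    """Hybrid score of one sentence: weighted sum of the individual methods."""
    return sum(w * s for w, s in zip(weights, method_scores))

def extract_summary(doc_scores, weights, k=5):
    """Pick the k sentences with the highest hybrid scores."""
    ranked = sorted(range(len(doc_scores)),
                    key=lambda i: hybrid_score(doc_scores[i], weights),
                    reverse=True)
    return set(ranked[:k])

def fitness(weights, training_docs, summary_quality):
    """Average quality of the extracted summaries over a training set."""
    return sum(summary_quality(doc, extract_summary(doc["scores"], weights))
               for doc in training_docs) / len(training_docs)

def evolve(training_docs, summary_quality):
    pop = [[random.random() for _ in range(NUM_METHODS)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=lambda w: fitness(w, training_docs, summary_quality),
                 reverse=True)
        survivors = pop[:POP_SIZE // 2]
        children = []
        while len(survivors) + len(children) < POP_SIZE:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(NUM_METHODS)          # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                    # mutation
                child[random.randrange(NUM_METHODS)] = random.random()
            children.append(child)
        pop = survivors + children
    return pop[0]                                        # best weight vector
```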

Reader-based Exploration of Lexical Cohesion

Beata Klebanov

In this talk we present the motivation, set-up and initial results of a text annotation experiment with human subjects, aimed at eliciting people's intuitions about a certain type of shallow meaning relation we termed "anchoring", which bears on the phenomenon of lexical cohesion in text (Halliday and Hasan, "Cohesion in English", 1976).

The basic question presented to the subjects was: for every concept first mentioned in a text, which previously mentioned concepts help accommodate this concept easily into the evolving story (i.e., serve as "anchors" for the new concept), if indeed it is easily accommodated, based on common knowledge as you perceive it?

We reflect on the challenges of providing guidelines and analysing the results of such an experiment within the established framework of classification tasks, and discuss possible extensions of the framework.

Work under the supervision of Prof. Eli Shamir.

Learning and Inference with Structured Representations

Dan Roth, Department of Computer Science and the Beckman Institute, University of Illinois at Urbana-Champaign

Natural language decisions often involve assigning values to sets of variables where complex and expressive dependencies can influence, or even dictate, which assignments are possible. This is common in natural language tasks ranging from predicting the POS tags of words in their context -- governed by sequential constraints such as the requirement that no three consecutive words are verbs -- to semantic parsing -- governed by constraints such as the requirement that certain verbs have, somewhere in the sentence, three arguments of specific semantic types.

I will describe research on a framework that combines learning and inference for the problem of assigning globally optimal values to a set of variables with complex and expressive dependencies among them.

The inference process of assigning globally optimal values to mutually dependent variables is formalized as an optimization problem and solved as an integer linear programming (ILP) problem. Two general classes of training processes are presented. In the first, the inference process applied to derive a global assignment to the variables of interest is decoupled from the process of learning estimators of the variables' values; in the second, dependencies among the variables are incorporated into the learning process and directly induce estimators that yield a global assignment.
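A minimal sketch of posing such a global assignment problem as an ILP, here using the PuLP library. The per-variable scores and the single structural constraint are toy assumptions for exposition, not the models or constraints used in the talk.

```python
# Illustrative sketch: pick one label per variable so that the total local
# score is maximized, subject to a global structural constraint.
import pulp

scores = {                      # hypothetical local classifier scores
    ("w1", "VERB"): 0.9, ("w1", "NOUN"): 0.1,
    ("w2", "VERB"): 0.6, ("w2", "NOUN"): 0.4,
    ("w3", "VERB"): 0.7, ("w3", "NOUN"): 0.3,
}
variables = ["w1", "w2", "w3"]
labels = ["VERB", "NOUN"]

prob = pulp.LpProblem("global_inference", pulp.LpMaximize)
x = {(v, l): pulp.LpVariable(f"x_{v}_{l}", cat="Binary")
     for v in variables for l in labels}

# Objective: total score of the chosen labels.
prob += pulp.lpSum(scores[v, l] * x[v, l] for v in variables for l in labels)

# Each variable gets exactly one label.
for v in variables:
    prob += pulp.lpSum(x[v, l] for l in labels) == 1

# Toy structural constraint: the three consecutive words may not all be verbs.
prob += pulp.lpSum(x[v, "VERB"] for v in variables) <= 2

prob.solve()
assignment = {v: l for (v, l) in x if x[v, l].value() == 1}
print(assignment)   # e.g. {'w1': 'VERB', 'w2': 'NOUN', 'w3': 'VERB'}
```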

I will show how this framework generalizes existing approaches to the problem of learning structured representations, and discuss the advantages the two training paradigms have in different situations. Examples will be given in the context of semantic role labeling and of information extraction problems.

Scaling Web Based Acquisition of Entailment Relations

Idan Szpektor, Hristo Tanev, Ido Dagan and Bonaventura Coppola

Paraphrase recognition is a critical step for natural language interpretation. Accordingly, many NLP applications would benefit from high coverage knowledge bases of paraphrases. However, state-of-the-art paraphrase acquisition approaches show limited scalability. We present an unsupervised learning algorithm for Web-based extraction of entailment relations, an extended model of paraphrases. We focus on increased scalability and generality with respect to prior work, eventually aiming at a full scale knowledge base. Our current implementation of the algorithm takes as its input a verb lexicon and for each verb searches the Web for related syntactic entailment templates. Experiments show promising results with respect to the ultimate goal, achieving much better scalability than prior Web-based methods.
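A high-level sketch of the acquisition loop described above; all helper functions (web retrieval, template extraction, ranking) are hypothetical placeholders passed in as parameters, and this is not the authors' implementation.

```python
# Illustrative sketch: acquire candidate entailment templates for each verb.
def acquire_entailment_templates(verb_lexicon, search_web, extract_templates,
                                 rank_templates, top_k=20):
    """For each input verb, collect and rank candidate syntactic templates."""
    knowledge_base = {}
    for verb in verb_lexicon:
        snippets = search_web(verb)                   # sentences mentioning the verb
        candidates = []
        for snippet in snippets:
            candidates.extend(extract_templates(snippet, verb))
        knowledge_base[verb] = rank_templates(candidates)[:top_k]
    return knowledge_base
```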

Feature Generation for Text Categorization Using Hierarchical Web Knowledge Bases

Evgeniy Gabrilovich and Shaul Markovitch

Text categorization is useful in a wide variety of tasks, such as routing news and email to the appropriate corporate desk, identifying junk email, correctly handling intelligence reports, and many more. State-of-the-art systems for text categorization use some kind of induction algorithm in conjunction with word-based features. This approach is inherently limited in its ability to handle concepts that cannot be predicted by several easily distinguishable keywords.

We propose to enhance text categorization systems with constructed features that are based on domain-specific and commonsense knowledge. The approach capitalizes on a wide selection of large-scale hierarchical knowledge bases available today. Our initial setup utilizes the Open Directory Project (ODP) -- a hierarchy of 600,000 categories and 4 million URLs. To significantly extend the body of knowledge represented by the ODP, we crawl the sites linked from the directory, and collect a set of documents that represent their contents.

This huge collection of documents is then used to learn a classifier that maps pieces of text onto ODP categories. Later, when training for text categorization, text chunks of the training examples (called "contexts") are first classified using the learned ODP classifier, which generates a set of ODP categories to be used as additional features. These knowledge-based features can bring in knowledge that traditional text categorization systems have no access to. Feature generation for local contexts effectively performs word sense disambiguation, and to some degree resolves the problems of word polysemy and synonymy.
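A schematic sketch of the feature-generation step: the contexts of a document are classified by an ODP classifier and the predicted categories are added to the ordinary bag-of-words features. The `odp_classifier` argument, the context-extraction details and the "ODP:" feature prefix are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: augment word features with ODP category features.
from typing import Callable, List, Set

def generate_features(document: str,
                      contexts: List[str],
                      odp_classifier: Callable[[str], List[str]]) -> Set[str]:
    """Bag-of-words features plus ODP categories predicted for each context."""
    features = set(document.lower().split())          # ordinary word features
    for context in contexts:                          # e.g. short text chunks
        for category in odp_classifier(context):      # e.g. "Top/Science/Biology"
            features.add("ODP:" + category)           # knowledge-based feature
    return features
```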

Preliminary results are very encouraging. Our feature generator is able to discover generalizations that no other method is able to make. We believe that additional research on this new methodology can bring text categorization to unprecedented levels of performance.


ISCOL, http://cs.haifa.ac.il/~shuly/iscol/