Statistical and Learning Methods in Natural Language Processing, Spring 2004

Administration

Instructors: Ido Dagan, dagan@cs.biu.ac.il and Shuly Wintner, shuly@cs.haifa.ac.il.
Office hours: Sunday 15:00-16:00, Jacobs 43. Phone: (828)8180.
Times: Sunday 10:00-14:00.
Place: Education Building 3502.
Prerequisites: Computational Linguistics.
Grading: The final grade will be based on a final project (approximately 60%) and a short exam (approximately 40%).

Content

Textbook

There is no recommended textbook for this course, but some of the material can be found in Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze.

Syllabus

Introduction (4 hours)

Course Organization - learning approaches (Ido)
Organization according to application (Shuly)
- Segmentation
- Morphology
- Lexicon expansion
- Morphological disambiguation
- Syntax: parsing, shallow parsing, attachment, ...
- Semantics: word-sense disambiguation, categorization
Explanation of the problems; setting notation (Ido)
The source of difficulties and how statistical methods can help (Ido)

Supervised classification (6 hours)

Basic/earlier models: PP-attachment, decision list, target word selection (Ido)
Confidence interval (Ido)
Naive Bayes classification (Ido)
Simple smoothing -- add-constant (Ido)
Winnow (Ido)
Decision trees (Shuly)
Boosting (Ido)

Evaluation (2 hours) (Shuly)

Standard evaluation measures: accuracy, error, precision, recall, F-measure
Gold standards
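
As a rough illustration of the measures listed above, the following sketch computes accuracy over all items and precision, recall and F-measure for a single target label against a gold standard. The function names and the toy tag sequences are assumptions made for this example only; they are not part of the course material.

```python
from typing import Sequence, Tuple


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall_f(gold: Sequence[str], predicted: Sequence[str],
                       target: str) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure for one target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    # Define each measure as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: gold tags and the tags produced by some system.
gold_tags = ["N", "V", "N", "N"]
system_tags = ["N", "N", "V", "N"]
print(accuracy(gold_tags, system_tags))
print(precision_recall_f(gold_tags, system_tags, "N"))
```

Error is simply one minus accuracy. The F-measure here weights precision and recall equally; other weightings exist.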

Part-of-speech tagging (6-8 hours)

Hidden Markov Models and the Viterbi algorithm (Ido)
Smoothing -- Good-Turing, back-off (Ido)
Part-of-speech tagging for Hebrew (Shuly)
Unsupervised parameter estimation with Expectation Maximization (EM) algorithm (Ido)
Transformation-based learning (Ido)

Shallow parsing (2-4 hours)

Memory-based learning (Ido)
Applications for phrase boundary detection and named entity recognition, e.g. SNoW (Punyakanok and Roth, 2001) or TiMBL (Shuly)
Feature engineering (Shuly)

Statistical parsing and PCFG (2 hours)

Full parsing -- Probabilistic Context Free Grammar (PCFG) (Ido)
Hebrew (Itai, Winter, Sima'an et al.) (Shuly)

Reducing training data (2 hours) (Ido)

Selective sampling for training
Bootstrapping

Unsupervised learning (4 hours) (Ido)

Word association
Information theory measures
Distributional word similarity, similarity-based smoothing
Clustering

Specific applications (time permitting)

Review of various applications and the role of learning and statistical approaches (question answering, information extraction, speech recognition and language modelling, translation and bilingual alignment, weighted automata, identifying roots, ...)
Unsupervised learning of morphology (Shuly)

Projects

The final projects will be centered around a specific application: part-of-speech tagging for Hebrew. Teams of one or two students will address the problem using various methods:

HMM
Winnow
TBL
MBL

The projects will be handed out around the middle of the semester and will be due by the end of the summer.

Assignment 1 handed out Sunday, March 28, due Sunday, April 18th.

Final project

Note: The file gold90.txt contains analyses which were not present in analyzed90.txt. I therefore added all these analyses and created a new file, fix90.txt. Please use it instead. Note that these analyses are marked by a '+'; you can use this information in learning (obviously, all the analyses in fix90.txt which are marked by a '+' are the correct analyses!).

Also, some of the analyses in the learning file are invalid in the sense that they do not conform to Erel Segal's format. I cannot change that, but hopefully the number of such inputs is low (18 found so far). Thanks are due to Shlomo for pointing out both problems.

A description of the final project is available.

For development you should use three files: analyzed90.txt is a file containing the full morphological analyses of a text of 32217 words, which are given in the file text90.txt. The disambiguated analyses of this same text are given in the file gold90.txt. The three files are aligned, of course.

The format of the analyses is as given by Erel Segal's analyzer. See here for the details of the format.

You should use the sample corpus for development and report the performance of your system using 10-fold cross-validation on this text. Before final submission you will be given an additional, smaller test corpus, and will have to report performance on the test corpus as well.
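
One possible way to organize the 10-fold cross-validation is sketched below. It is only an illustration under stated assumptions: the corpus is treated as a list of aligned items (for example words) with one gold label each, the folds are contiguous so that neighbouring items stay together, and train_fn is a hypothetical function that takes training items and gold labels and returns a callable tagger. None of these names come from the project description.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k contiguous folds.

    Returns one (train_indices, test_indices) pair per fold.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    base, extra = divmod(n_items, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional item each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        folds.append((train, test))
        start += size
    return folds


def cross_validate(items: Sequence, gold: Sequence,
                   train_fn: Callable, k: int = 10) -> float:
    """Train on k-1 folds, test on the held-out fold, average the accuracies."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        # train_fn is an assumed interface: it returns a callable tagger.
        model = train_fn([items[i] for i in train_idx],
                         [gold[i] for i in train_idx])
        correct = sum(model(items[i]) == gold[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

Each item is used for testing exactly once, and the reported figure is the mean accuracy over the ten held-out folds. Whether to split by word, by sentence, or at random is a design choice that this sketch does not settle.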

Exam

The final exam will take place on Monday, June 28th, at 10:00, in Jacobs 506 (old 24D).

Slides

Bibliography list

Computational Linguistics, http://cs.haifa.ac.il/~shuly/teaching/04/statnlp/
Maintained by shuly@cs.haifa.ac.il. Last modified: Sun Jun 13 09:42:46 IDT 2004