Statistical and Learning Methods in Natural Language Processing, Spring 2004


Administration

Instructors:
Ido Dagan, dagan@cs.biu.ac.il and Shuly Wintner, shuly@cs.haifa.ac.il.
Office hours:
Sunday 15:00-16:00, Jacobs 43. Phone: (828)8180.
Times:
Sunday 10:00-14:00.
Place:
Education Building 3502.
Prerequisites:
Computational Linguistics.
Grading:
The final grade will be based on a final project (approximately 60%) and a short exam (approximately 40%).

Content

Textbook

There is no recommended textbook for this course, but some of the material can be found in Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schuetze.

Syllabus

Introduction (4 hours)
Supervised classification (6 hours)
Evaluation (2 hours) (Shuly)
Part-of-speech tagging (6-8 hours)
Shallow parsing (2-4 hours)
Statistical parsing and PCFG (2 hours)
Reducing training data (2 hours) (Ido)
Unsupervised learning (4 hours) (Ido)
Specific applications (time permitting)

Projects

The final projects will be centered around a specific application: part-of-speech tagging for Hebrew. Teams of one or two students will address the problem using various methods. The projects will be handed out around the middle of the semester and will be due by the end of the summer.

Assignment 1 was handed out on Sunday, March 28, and is due Sunday, April 18.

Final project

Note: The file gold90.txt contains analyses which were not present in analyzed90.txt. I therefore added all these analyses and created a new file, fix90.txt. Please use it instead of analyzed90.txt. The added analyses are marked by a `+'; you can use this information in learning (obviously, all the analyses in fix90.txt which are marked by a `+' are the correct analyses!)

Also, some of the analyses in the learning file are invalid in the sense that they do not conform to Erel Segal's format. I cannot change that, but hopefully the number of such inputs is low (18 found so far). Thanks are due to Shlomo for pointing out both problems.

A description of the final project is available.

For development you should use three files. analyzed90.txt contains the full morphological analyses of a text of 32217 words; the text itself is given in text90.txt. The disambiguated analyses of the same text are given in gold90.txt. The three files are aligned, of course.

The format of the analyses is as given by Erel Segal's analyzer. See here for the details of the format.

You should use the sample corpus for development and report the performance of your system using 10-fold cross-validation on this text. Before final submission you will be given an additional, smaller test corpus, and will have to report performance on the test corpus as well.
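As a rough illustration of the 10-fold cross-validation protocol, here is a minimal Python sketch. It assumes the corpus has already been read into two aligned lists, one of word tokens and one of gold analyses; reading the actual file format is not shown. The function names (k_fold_splits, cross_validate, train_most_frequent) are hypothetical, and the most-frequent-analysis baseline is only a placeholder for your own method. Splitting into contiguous token blocks is also an assumption; you may prefer to split on sentence boundaries.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Sequence, Tuple


def k_fold_splits(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_items-1 into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    Fold sizes differ by at most one item.
    """
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        splits.append((train, test))
        start += size
    return splits


def train_most_frequent(words: Sequence[str], gold: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical baseline: predict each word's most frequent gold analysis.

    Unseen words get the most frequent analysis in the training data.
    """
    per_word = defaultdict(Counter)
    for word, analysis in zip(words, gold):
        per_word[word][analysis] += 1
    fallback = Counter(gold).most_common(1)[0][0]
    table = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lambda word: table.get(word, fallback)


def cross_validate(words: Sequence[str], gold: Sequence[str],
                   train_fn: Callable = train_most_frequent, k: int = 10) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    accuracies = []
    for train_idx, test_idx in k_fold_splits(len(words), k):
        model = train_fn([words[i] for i in train_idx],
                         [gold[i] for i in train_idx])
        correct = sum(model(words[i]) == gold[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Calling cross_validate(words, gold) with your own training function in place of the baseline gives the averaged accuracy over the ten folds; the same training function can later be run once on the full development text and evaluated on the separate test corpus.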

Exam

The final exam will take place on Monday, June 28th, at 10:00, in Jacobs 506 (old 24D).

Slides

Bibliography list


Computational Linguistics, http://cs.haifa.ac.il/~shuly/teaching/04/statnlp/
Maintained by shuly@cs.haifa.ac.il. Last modified: Sun Jun 13 09:42:46 IDT 2004