Statistical and Learning Methods in Natural Language Processing,
Spring 2004
Administration
- Instructors:
- Ido Dagan (dagan@cs.biu.ac.il) and Shuly Wintner (shuly@cs.haifa.ac.il).
- Office hours:
- Sunday 15:00-16:00, Jacobs 43. Phone: (828)8180.
- Times:
- Sunday 10:00-14:00.
- Place:
- Education Building 3502.
- Prerequisites:
- Computational Linguistics.
- Grading:
- The final grade will be based on a final project
(approximately 60%) and a short exam
(approximately 40%).
Content
Textbook
There is no recommended textbook for this course, but some of the
material can be found in
Foundations of Statistical
Natural Language Processing, by Chris Manning and Hinrich
Schuetze.
Syllabus
- Introduction (4 hours)
- Course Organization - learning approaches (Ido)
- Organization according to application (Shuly)
- Segmentation
- Morphology
- Lexicon expansion
- Morphological disambiguation
- Syntax: parsing, shallow parsing, attachment, ...
- Semantics: word-sense disambiguation, categorization
- Explanation of the problems; setting notation (Ido)
- The source of difficulties and how statistical methods can help (Ido)
- Supervised classification (6 hours)
- Basic/earlier models: PP-attachment, decision list, target word selection (Ido)
- Confidence interval (Ido)
- Naive Bayes classification (Ido)
- Simple smoothing -- add-constant (Ido)
- Winnow (Ido)
- Decision trees (Shuly)
- Boosting (Ido)
- Evaluation (2 hours) (Shuly)
- Standard evaluation measures: accuracy, error, precision, recall, F-measure
- Gold standards
- Part-of-speech tagging (6-8 hours)
- Hidden Markov Models and the Viterbi algorithm (Ido)
- Smoothing -- Good-Turing, back-off (Ido)
- Part-of-speech tagging for Hebrew (Shuly)
- Unsupervised parameter estimation with the Expectation Maximization (EM)
algorithm (Ido)
- Transformation-based learning (Ido)
- Shallow parsing (2-4 hours)
- Memory-based learning (Ido)
- Applications to phrase boundary detection and named entity recognition,
e.g. SNoW (Punyakanok and Roth, 2001) or TiMBL (Shuly)
- Feature engineering (Shuly)
- Statistical parsing and PCFG (2 hours)
- Full parsing -- Probabilistic Context Free Grammar (PCFG) (Ido)
- Hebrew (Itai, Winter, Sima'an et al.) (Shuly)
- Reducing training data (2 hours) (Ido)
- Selective sampling for training
- Bootstrapping
- Unsupervised learning (4 hours) (Ido)
- Word association
- Information theory measures
- Distributional word similarity,
similarity-based smoothing
- Clustering
- Specific applications (time permitting)
- Review of various applications and the role of learning and statistical
approaches (question answering, information extraction, speech recognition and language modelling,
translation and bilingual alignment, weighted automata, identifying roots,
...)
- Unsupervised learning of morphology (Shuly)
Projects
The final projects will be centered around a specific application:
part-of-speech tagging for Hebrew. Teams of one or two students will
address the problem using various methods.
The projects will be handed out around the middle of the semester and
will be due by the end of the summer.
Assignment 1: handed out Sunday, March 28; due Sunday, April 18.
Final project
Note: The file gold90.txt contains analyses that were not present in analyzed90.txt. I therefore added all of these analyses and created a new file, fix90.txt. Please use it instead. The added analyses are marked with a `+'; you can use this information in learning (obviously, every analysis in fix90.txt that is marked with a `+' is a correct analysis).
Also, some of the analyses in the learning file are invalid, in the sense that they do not conform to Erel Segal's format. I cannot change that, but hopefully the number of such inputs is low (18 found so far). Thanks are due to Shlomo for pointing out both problems.
A description of the final project is available.
For development you should use three files. The file analyzed90.txt
contains the full morphological analyses of a text of 32217 words;
the text itself is given in the file text90.txt. The disambiguated
analyses of this same text are given in the file gold90.txt. The three
files are aligned, of course.
The format of the analyses is as given by Erel Segal's analyzer. See
here for the details of the format.
You should use the sample corpus for development and report the
performance of your system using 10-fold cross-validation on
this text. Before final submission you will be given an
additional, smaller test corpus, and will have to report
performance on the test corpus as well.
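The 10-fold cross-validation protocol mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not a required implementation: the function names (k_fold_indices, cross_validate, train_majority) are hypothetical, the corpus is assumed to be already loaded into two aligned Python lists (items and their gold labels), and folds are taken as contiguous blocks of the text. The majority-label baseline merely stands in for a real tagger.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_items) into k contiguous folds.

    Returns a list of (train_indices, test_indices) pairs, one per fold.
    Fold sizes differ by at most one item.
    """
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    # The first (n_items % k) folds get one extra item.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        splits.append((train, test))
        start += size
    return splits


def cross_validate(
    items: Sequence,
    gold: Sequence,
    train_fn: Callable[[list, list], Callable],
    k: int = 10,
) -> float:
    """Train on k-1 folds, test on the held-out fold, and average accuracy."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx], [gold[i] for i in train_idx])
        correct = sum(model(items[i]) == gold[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_majority(train_items: list, train_gold: list) -> Callable:
    """Hypothetical baseline: always predict the most frequent training label."""
    most_common = Counter(train_gold).most_common(1)[0][0]
    return lambda item: most_common
```

With a real system, train_fn would be replaced by the team's own training routine. If sentences rather than single words are used as the unit of splitting, passing lists of sentences keeps each sentence entirely inside one fold.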
Exam
The final exam will take place on Monday, June 28th, at 10:00, in Jacobs 506 (old 24D).
Slides
- Introduction (Shuly, PDF)
- Introduction (Ido, .ppt)
- Supervised classification (Ido, .ppt)
- Supervised classification (Shuly, PDF)
- Supervised classification (Ido, .ppt)
- Hebrew POS tagging (Shuly, PDF)
- Part of Speech tagging (Ido, .ppt)
- EM algorithm (Ido, .ppt)
- Shallow parsing (Shuly, PDF)
- Bootstrapping (Ido, .ppt)
- Unsupervised learning (Ido, .ppt)
Bibliography list
Computational Linguistics,
http://cs.haifa.ac.il/~shuly/teaching/04/statnlp/
Maintained by shuly@cs.haifa.ac.il.
Last modified: Sun Jun 13 09:42:46 IDT 2004