January 23rd, Wednesday 14:15, Room 303, Jacobs Building
The goal of discriminative learning is to train a system to optimize a desired measure of performance. In simple classification we seek a function that assigns a binary label to a single object and aims to minimize the error rate (correct or incorrect) on unseen data. In structured prediction we are interested in predicting a structured label, where the input is a complex object. Typically, each structured prediction task has its own measure of performance, or task loss, such as the word error rate in speech recognition, the BLEU score in machine translation, or the intersection-over-union score in object segmentation. Not only are these task losses much more involved than the binary error rate, but the label space of structured prediction is exponentially large. In the talk, I will present two algorithms, each designed to minimize a given task loss, and apply them to speech recognition.
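As a concrete illustration of a task loss (an aside, not part of the talk itself), the word error rate used in speech recognition is the edit distance between the reference and hypothesized word sequences, normalized by the reference length. A minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deletions
    for j in range(n + 1):
        d[0][j] = j  # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

def word_error_rate(ref_words, hyp_words):
    """Task loss for speech recognition: normalized edit distance."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Unlike the binary error rate, this loss decomposes over alignments rather than over independent labels, which is one reason task losses are harder to optimize directly.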
In the first part of the talk, I will present an algorithm which aims at achieving a high area under the receiver operating characteristic (ROC) curve. This measure of performance is often used to evaluate the detection of events over time. We will give theoretical bounds that relate the performance of the algorithm to the expected area under the ROC curve, and will demonstrate its effectiveness on the task of keyword spotting, i.e., the detection of all occurrences of a given word in a speech signal.
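For context, the empirical area under the ROC curve equals the Wilcoxon-Mann-Whitney statistic: the fraction of (positive, negative) score pairs that are ranked correctly, with ties counted as half. This pairwise view is what makes AUC amenable to discriminative training. A minimal sketch (a standard computation, not the talk's algorithm):

```python
def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC as the Wilcoxon-Mann-Whitney statistic:
    the fraction of positive/negative pairs ranked correctly
    (ties between scores count as one half)."""
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))
```

An AUC of 1.0 means every positive example (e.g., a speech segment containing the keyword) outscores every negative one; 0.5 corresponds to random ranking.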
In the second part of the talk, I will describe a new algorithm which aims to minimize a regularized task loss. The algorithm is derived by directly minimizing a generalization bound for structured prediction, which gives an upper bound on the expected task loss in terms of the empirical task loss. The resulting algorithm is iterative and easy to implement, and, to our knowledge, it is the only algorithm that can handle a non-separable task loss. We will present experimental results on the task of phoneme recognition, and will show that the algorithm achieves the lowest phoneme error rate (normalized edit distance) compared to other discriminative and generative models with the same expressive power.
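To give a flavor of how a task loss can enter training, a common device in discriminative structured learning is loss-augmented inference: at each step, decode with the task loss added to the model score, and update the weights toward the correct structure. The sketch below is a generic cost-augmented perceptron-style update on a toy problem, shown only as background; it is not the bound-derived algorithm of the talk, and `phi`, `task_loss`, and the toy data are illustrative assumptions.

```python
import numpy as np

def loss_augmented_update(w, phi, x, y_true, labels, task_loss, eta=0.1):
    """One perceptron-style update using loss-augmented inference:
    decode the most violating label (score + task loss), then move the
    weights toward the feature vector of the true label."""
    y_hat = max(labels, key=lambda y: w @ phi(x, y) + task_loss(y_true, y))
    if y_hat != y_true:
        w = w + eta * (phi(x, y_true) - phi(x, y_hat))
    return w

# Toy setup: two labels, joint features place x in the block of its label.
def phi(x, y):
    z = np.zeros(4)
    z[2 * y:2 * y + 2] = x
    return z

def task_loss(y_true, y):
    return 0.0 if y == y_true else 1.0  # zero-one loss for illustration

data = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
w = np.zeros(4)
for x, y in data:
    w = loss_augmented_update(w, phi, x, y, [0, 1], task_loss)

# Prediction uses plain (non-augmented) decoding.
preds = [max([0, 1], key=lambda y: w @ phi(x, y)) for x, _ in data]
```

In real structured tasks the `max` over labels is an exponential-size search carried out by dynamic programming, and the task loss (e.g., edit distance) must be folded into that search.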