Shaul Markovitch
May 9, 2007
Title: The Knowledgeable Computer: Enhancing Learning Algorithms with Wikipedia-Based Feature Generation
Abstract:
When humans approach the task of text categorization, they interpret the
specific wording of the document in the much larger context of their background
knowledge and experience. On the other hand, state-of-the-art information
retrieval systems are quite brittle: they traditionally represent documents as
bags of words, and are restricted to learning from individual word occurrences
in the (necessarily limited) training set. For instance, given the sentence
``Wal-Mart supply chain goes real time'', how can a text categorization system
know that Wal-Mart manages its stock with RFID technology? And having read that
``Ciprofloxacin belongs to the quinolones group'', how can a machine know that
the drug mentioned is an antibiotic produced by Bayer? We propose to enrich
document representation through automatic use of a vast compendium of human
knowledge---an encyclopedia. We apply machine learning techniques to Wikipedia,
the largest encyclopedia to date, which surpasses in scope many conventional
encyclopedias and provides a cornucopia of world knowledge. Each Wikipedia
article represents a concept, and documents to be categorized are represented in
the rich feature space of words and relevant Wikipedia concepts. Empirical
results confirm that this knowledge-intensive representation brings text
categorization to a qualitatively new level of performance across a diverse
collection of datasets.
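
To make the idea of a combined feature space concrete, the following is a minimal sketch, not the method presented in the talk. It assumes a tiny hand-written stand-in for the encyclopedia (TOY_ARTICLES), a simple term-frequency overlap score, and hypothetical helper names (tokenize, build_inverted_index, generate_features); the real system works on full Wikipedia and may weight and select concepts quite differently.

```python
import re
from collections import Counter

# Hypothetical toy "encyclopedia": concept title -> article text.
# These three entries are invented for illustration only.
TOY_ARTICLES = {
    "RFID": "rfid tags track stock in a supply chain wal-mart uses rfid",
    "Antibiotic": "an antibiotic such as ciprofloxacin kills bacteria "
                  "quinolones are antibiotics",
    "Bayer": "bayer is a company that produces ciprofloxacin",
}


def tokenize(text):
    """Lowercase the text and split it into words, keeping internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_inverted_index(articles):
    """Map each word to {concept: term frequency in that concept's article}."""
    index = {}
    for concept, text in articles.items():
        for word, count in Counter(tokenize(text)).items():
            index.setdefault(word, {})[concept] = count
    return index


def generate_features(document, index, top_k=2):
    """Return bag-of-words features plus the top_k best-matching concepts."""
    words = Counter(tokenize(document))
    scores = Counter()
    for word, count in words.items():
        for concept, weight in index.get(word, {}).items():
            # Assumed scoring: document count times article term frequency.
            scores[concept] += count * weight
    features = {f"word:{w}": c for w, c in words.items()}
    for concept, score in scores.most_common(top_k):
        features[f"concept:{concept}"] = score
    return features


index = build_inverted_index(TOY_ARTICLES)
print(generate_features("Wal-Mart supply chain goes real time", index))
```

In this sketch the Wal-Mart sentence gains a concept:RFID feature even though the word "RFID" never appears in it, which is the kind of background knowledge a plain bag of words cannot supply.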
This work was done in collaboration with Evgeniy Gabrilovich.