Inter-document similarities, language models, and ad hoc information retrieval
The abundance and variety of textual information in data repositories pose enormous challenges for systems that attempt to automatically extract information that pertains to users' needs. One of the most fundamental problems in the field of information search is ad hoc retrieval: finding documents in a repository (corpus) that are relevant to a query. We present a novel algorithmic framework for ad hoc retrieval in which information provided by document-based language models is enhanced by the incorporation of information drawn from clusters of similar documents. Our framework thus combines the strengths of two approaches: the integration of document-specific content with contextual information induced from corpus structure, and the use of statistical language models. We present several highly effective retrieval algorithms that are natural instantiations of this framework. In some settings, however, clustering the entire corpus might not be practical because the document set changes rapidly. We show that information induced from inter-document similarities within a small set of documents can still substantially improve the performance of an existing search engine. Specifically, we consider the task of re-ranking the documents retrieved by a search engine so as to obtain high precision at the top ranks. To this end, we present a graph-based approach that can be implemented on top of any search engine and that adapts principles from Web-search methods (e.g., PageRank, hubs and authorities) to settings with no hyperlink information. Our graph-formation method creates links based on relationships between language models. Centrality measures induced over such graphs are shown to be very effective for re-ranking the retrieved documents.
Furthermore, we demonstrate that these graph-based measures substantially outperform previously suggested measures that are based on document-specific characteristics, and we discuss the merits of our link induction method by comparing it with previously suggested notions of inter-document relationships (e.g., cosines in a vector space).
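
The graph-based re-ranking idea can be illustrated with a minimal sketch. The abstract does not specify the smoothing method, the similarity function, or the graph parameters, so everything below is an assumption made for illustration: Dirichlet-smoothed unigram models, links weighted by exp(-KL divergence) to each document's k nearest neighbours, and a PageRank-style power iteration. The names `smoothed_lm`, `kl_divergence`, and `centrality_scores` are hypothetical and are not taken from the original work.

```python
import math
from collections import Counter


def smoothed_lm(tokens, corpus_counts, corpus_len, mu):
    """Dirichlet-smoothed unigram model over the whole vocabulary (assumed choice)."""
    counts = Counter(tokens)
    denom = len(tokens) + mu
    return {w: (counts[w] + mu * c / corpus_len) / denom
            for w, c in corpus_counts.items()}


def kl_divergence(p, q):
    """KL(p || q); with mu > 0 every probability is strictly positive."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())


def centrality_scores(docs, k=2, mu=1.0, damping=0.85, iters=100):
    """Score tokenized documents by PageRank-style centrality.

    Assumes at least two documents, k >= 1, and mu > 0.
    """
    n = len(docs)
    corpus_counts = Counter(w for d in docs for w in d)
    corpus_len = sum(corpus_counts.values())
    models = [smoothed_lm(d, corpus_counts, corpus_len, mu) for d in docs]

    # Link each document to its k most similar documents; the weight
    # exp(-KL) is an illustrative similarity, normalized per source node.
    links = []
    for i in range(n):
        sims = [(math.exp(-kl_divergence(models[i], models[j])), j)
                for j in range(n) if j != i]
        top = sorted(sims, reverse=True)[:k]
        total = sum(s for s, _ in top)
        links.append([(j, s / total) for s, j in top])

    # Power iteration: a document is central if central documents link to it.
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i, out in enumerate(links):
            for j, w in out:
                new[j] += damping * scores[i] * w
        scores = new
    return scores
```

Given the tokenized documents returned by a search engine, a re-ranking would then sort document indices by descending score, for example `sorted(range(len(docs)), key=lambda i: -scores[i])`. A document that no other document selects as a near neighbour receives only the baseline score `(1 - damping) / n`, which is how an off-topic result is pushed down the list in this sketch.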