Database Selection in Distributed Information Retrieval: A Study of Multi-Collection Information Retrieval

Powell, Allison L., Department of Computer Science, University of Virginia
French, James C., Department of Computer Science, University of Virginia

The proliferation of online information resources increases the importance of effective and efficient information retrieval in a multi-collection environment. Multi-collection searching includes distributed searching as a special case but is defined more broadly here to incorporate searching partitioned content independently of its physical storage. It is cast in three parts: collection selection (also referred to as database selection), which decides where a query should be sent; query processing, which executes the query at each selected collection; and results merging, which combines the results from individual collections into a single coherent list for the searcher. We focus our attention on collection selection.
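The three-part decomposition can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names, the toy selection score (the summed document frequency of the query terms in each collection), the term-count document score, and the merge by raw score. It shows only how selection, query processing, and merging fit together.

```python
from collections import Counter


def build_summary(docs):
    """Per-collection summary: document frequency of each term (hypothetical)."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.lower().split()))
    return df


def select_collections(query_terms, summaries, k):
    """Collection selection: rank collections by a toy score and keep the top k."""
    scores = {
        name: sum(df.get(term, 0) for term in query_terms)
        for name, df in summaries.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


def search_collection(query_terms, docs):
    """Query processing: score documents by the count of query-term occurrences."""
    results = []
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        score = sum(tf[term] for term in query_terms)
        if score > 0:
            results.append((doc_id, score))
    return results


def merge_results(per_collection):
    """Results merging: combine per-collection lists into one list by raw score."""
    merged = [
        (score, name, doc_id)
        for name, results in per_collection.items()
        for doc_id, score in results
    ]
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, doc_id, score) for score, name, doc_id in merged]


def multi_collection_search(query, collections, k):
    """Run selection, query processing, and merging in sequence."""
    terms = query.lower().split()
    summaries = {name: build_summary(docs) for name, docs in collections.items()}
    selected = select_collections(terms, summaries, k)
    per_collection = {
        name: search_collection(terms, collections[name]) for name in selected
    }
    return selected, merge_results(per_collection)
```

Only the summaries are consulted during selection, so a collection that is not selected is never searched. Merging by raw score is the simplest possible choice and assumes scores are comparable across collections, which need not hold in practice.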

We compare a number of different collection selection approaches and examine the effect of collection selection on document retrieval performance. We consider multi-collection retrieval in six different test environments utilizing three document test beds. Considering collection selection in isolation, we find that effective collection selection can be achieved using limited information about each collection. We then turn our attention from selection alone to data item retrieval in a multi-collection environment, considering retrieval performance in the same six test environments. First, we find that good collection selection has the potential to result in better retrieval effectiveness than can be achieved in an equivalent single collection. Second, we find that good performance can be achieved when only a few collections are selected and that performance generally increases as more collections are selected. Finally, we find that when collection selection is employed, it may not be necessary to maintain collection-wide information (CWI), e.g., global idf; local information can be used to achieve equivalent performance. This means that multi-collection systems can be engineered with more autonomy and less cooperation. This work demonstrates that improvements in collection selection can lead to broader improvements in document retrieval performance.
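To make the distinction between collection-wide and local information concrete, the sketch below computes an inverse document frequency (idf) for a term in two ways: globally, over the union of all collections, and locally, from one collection's own documents. The smoothed formula log((N + 1) / (df + 1)) and all function names are assumptions chosen for illustration; the idf variant used in the thesis is not stated in this abstract.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df


def idf(term, df, n_docs):
    """Smoothed idf (an assumed variant): log((N + 1) / (df + 1))."""
    return math.log((n_docs + 1) / (df.get(term, 0) + 1))


def local_idf(term, docs):
    """idf computed from a single collection's own documents."""
    return idf(term, document_frequencies(docs), len(docs))


def global_idf(term, collections):
    """idf computed over the union of all collections (collection-wide information)."""
    all_docs = [text for docs in collections.values() for text in docs]
    return idf(term, document_frequencies(all_docs), len(all_docs))
```

Computing the global value requires statistics from every collection, whereas the local value needs nothing beyond what each collection already holds. That difference is what allows a system relying on local information to be built with less cooperation among collections.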

PHD (Doctor of Philosophy)
information retrieval, collection selection, online information
All rights reserved (no additional license for public reuse)
Issued Date: