Wei Dong will present his preFPO on Thursday December 2 at 2PM in Room 402. The members of his committee are: Kai Li, advisor; Moses Charikar and Andrea LaPaugh, readers; Tom Funkhouser and Olga Troyanskaya, nonreaders. Everyone is invited to attend his talk. His abstract follows below. ---------------------------------- Title: High-Dimensional Similarity Search for Large Datasets Images and other non-text feature-rich data are predominant in today's exponentially growing digital universe. How to organize such data at large scale for efficient content-based search is an increasingly important problem, which remains open after two decades of intensive research. The challenge is that feature-rich data are usually of high dimensionality, and the dimensionality curse indicates that as dimensionality grows, any search strategy is subject to an increasingly large candidate set (the portion of the dataset necessary to be examined), and eventually degenerate to brute-force scan. This thesis approaches the problem by addressing two aspects of candidate set size: the number of items and the size of each item. Specifically, we made the following contributions. First, we substantially improved Locality-Sensitive Hashing, the state of the art of high-dimensional similarity search. We developed an accurate performance model for automatic parameter tuning with sampled data, and a method to improve search results with an offline computed K-Nearest Neighbor (K-NN) graph. Second, we developed compact data representations optimized for similarity search tasks, where we only need to approximate distances in a roughly fixed small range. Further more, with an asymmetric distance estimation method to exploit the uncompressed query data, we are able to further compress the indexed dataset. Third, we developed an efficient method for offline K-NN graph construction, i.e. jointly solving K-NN search for every point, based on the idea that a K-NN graph approximation can be improved by comparing each point to its neighbors' neighbors. Finally, we demonstrated some of the proposed techniques by building a large-scale near-duplicate image search system, which serves more than 70 million web images on a single commodity server and responses to queries within a few seconds.
participants (1)
-
Melissa Lawson