[talks] W Dong preFPO

Mon Nov 29 09:35:19 EST 2010

Wei Dong will present his preFPO on Thursday December 2 
at 2PM in Room 402.  
The members of his committee are:  Kai Li, advisor; Moses Charikar
and Andrea LaPaugh, readers; Tom Funkhouser and Olga Troyanskaya,
nonreaders.  Everyone is invited to attend his talk.  His abstract
follows below.
----------------------------------

Title: High-Dimensional Similarity Search for Large Datasets

Images and other non-text feature-rich data are predominant in
today's exponentially growing digital universe.  How to organize
such data at large scale for efficient content-based search is
an increasingly important problem, which remains open after two
decades of intensive research.  The challenge is that feature-rich
data are usually of high dimensionality, and the dimensionality
curse indicates that as dimensionality grows, any search strategy
is subject to an increasingly large candidate set (the portion of
the dataset necessary to be examined), and eventually degenerate to
brute-force scan.  This thesis approaches the problem by addressing
two aspects of candidate set size: the number of items and the size
of each item.  Specifically, we made the following contributions.

First, we substantially improved Locality-Sensitive Hashing,
the state of the art of high-dimensional similarity search.
We developed an accurate performance model for automatic parameter
tuning with sampled data, and a method to improve search results
with an offline computed K-Nearest Neighbor (K-NN) graph.

Second, we developed compact data representations optimized for
similarity search tasks, where we only need to approximate distances
in a roughly fixed small range.  Further more, with an asymmetric
distance estimation method to exploit the uncompressed query data,
we are able to further compress the indexed dataset.

Third, we developed an efficient method for offline K-NN graph
construction, i.e. jointly solving K-NN search for every point,
based on the idea that a K-NN graph approximation can be improved
by comparing each point to its neighbors' neighbors.

Finally, we demonstrated some of the proposed techniques by building
a large-scale near-duplicate image search system, which serves
more than 70 million web images on a single commodity server and
responses to queries within a few seconds.