[talks] W Dong general exam

Melissa M Lawson mml at CS.Princeton.EDU
Thu May 3 10:19:06 EDT 2007

Wei Dong will present his research seminar/general exam on Wednesday May 9 at 
2PM in Room 301 (note room!)  The members of his committee are:  Kai Li (advisor), 
Moses Charikar, and Andrea LaPaugh.  Everyone is invited to attend his talk, and 
those faculty wishing to remain for the oral exam following are welcome to do so.
His final abstract and reading list appear below.

K-Nearest Neighbor (K-NN) search on high dimensional feature vectors is the principal
problem underlying content based similarity search systems. The recently developed
Locality Sensitive Hashing approach has advantages over traditional space partitioning
methods, which suffer from the curse of dimensionality. In this project we aim at
addressing some practical issues when applying LSH to a real system.

First,  we propose an LSH based compact representation of feature vectors (aka sketch).
We have been using main memory indices of feature vectors to reduce access time.  By
keeping sketches in place of the original feature vectors, we can significantly reduce the
memory overhead and maintain high precision and low query time.

Second, we study the geometric and statistical characteristics of different data sets and
identify some common properties with a set of associated parameters.  Basing on that we
build a performance model of LSH, making it possible to tune the parameters of LSH

Finally, we show that local geometric properties have a non-trivial impact on query
results.  A fixed set of parameters optimized for the global average can still yield poor
results on some of the queries.  We address this problem by adaptively guiding the search
process of each query using partial results seen so far.

Reading List

Text book:

Tanenbaum, Modern Operating Systems
Hector Garcia, Jeffrey Ullman, Jennifer Widom, Database Systems The Complete Book, Chapter


Bohm, C.; Berchtold, S. & Keim, D. A. (2001), 'Searching in high-dimensional spaces: Index
structures for improving the performance of multimedia databases', /ACM Comput. Surv./
*33*(3), 322--373.

Bawa, M.; Condie, T. & Ganesan, P. (2005),LSH forest: self-tuning indexes for similarity
search, /in /'WWW '05: Proceedings of the 14th international conference on World Wide
Web', ACM Press, New York, NY, USA, pp. 651--660.

Datar, M.; Immorlica, N.; Indyk, P. & Mirrokni, V. S.
(2004),Locality-sensitive hashing scheme based on p-stable distributions, /in /'SCG '04:
Proceedings of the twentieth annual symposium on Computational geometry', ACM Press, New
York, NY, USA, pp.

Indyk, P. & Motwani, R. (1998),Approximate nearest neighbors: towards removing the curse
of dimensionality, /in /'STOC '98: Proceedings of the thirtieth annual ACM symposium on
Theory of computing', ACM Press, New York, NY, USA, pp. 604--613.

Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K.(2006) Efficient Filtering with
Sketches in the Ferret Toolkit. In Proceedings of the 8th ACM SIGMM International Workshop
on Multimedia Information Retrieval (MIR). Santa Barbara, CA, USA. October 2006.

Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K.(2006) A Time-Space Efficient
Locality Sensitive Hashing Method for Similarity Search in High Dimensions.  Technical

Panigrahy, R. (2006),Entropy based nearest neighbor search in high dimensions, /in /'SODA
'06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm', ACM
Press, New York, NY, USA, pp. 1186--1195.

Weber, R.; Schek, H. & Blott, S. (1998),A Quantitative Analysis and Performance Study for
Similarity-Search Methods in High-Dimensional Spaces, /in /'VLDB '98: Proceedings of the
24rd International Conference on Very Large Data Bases', Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, pp. 194--205.

Henzinger, M. 2006. Finding near-duplicate web pages: a large-scale evaluation of
algorithms. In Proceedings of the 29th Annual international ACM SIGIR Conference on
Research and Development in information Retrieval

More information about the talks mailing list