Wei Dong will present his research seminar/general exam on Wednesday May 9 at 2PM in Room 301 (note room!) The members of his committee are: Kai Li (advisor), Moses Charikar, and Andrea LaPaugh. Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His final abstract and reading list appear below. ---------------------------------------- Abstract K-Nearest Neighbor (K-NN) search on high dimensional feature vectors is the principal problem underlying content based similarity search systems. The recently developed Locality Sensitive Hashing approach has advantages over traditional space partitioning methods, which suffer from the curse of dimensionality. In this project we aim at addressing some practical issues when applying LSH to a real system. First, we propose an LSH based compact representation of feature vectors (aka sketch). We have been using main memory indices of feature vectors to reduce access time. By keeping sketches in place of the original feature vectors, we can significantly reduce the memory overhead and maintain high precision and low query time. Second, we study the geometric and statistical characteristics of different data sets and identify some common properties with a set of associated parameters. Basing on that we build a performance model of LSH, making it possible to tune the parameters of LSH automatically. Finally, we show that local geometric properties have a non-trivial impact on query results. A fixed set of parameters optimized for the global average can still yield poor results on some of the queries. We address this problem by adaptively guiding the search process of each query using partial results seen so far. Reading List Text book: Tanenbaum, Modern Operating Systems Hector Garcia, Jeffrey Ullman, Jennifer Widom, Database Systems The Complete Book, Chapter 11-14 Papers: Bohm, C.; Berchtold, S. & Keim, D. A. (2001), 'Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases', /ACM Comput. Surv./ *33*(3), 322--373. Bawa, M.; Condie, T. & Ganesan, P. (2005),LSH forest: self-tuning indexes for similarity search, /in /'WWW '05: Proceedings of the 14th international conference on World Wide Web', ACM Press, New York, NY, USA, pp. 651--660. Datar, M.; Immorlica, N.; Indyk, P. & Mirrokni, V. S. (2004),Locality-sensitive hashing scheme based on p-stable distributions, /in /'SCG '04: Proceedings of the twentieth annual symposium on Computational geometry', ACM Press, New York, NY, USA, pp. 253--262. Indyk, P. & Motwani, R. (1998),Approximate nearest neighbors: towards removing the curse of dimensionality, /in /'STOC '98: Proceedings of the thirtieth annual ACM symposium on Theory of computing', ACM Press, New York, NY, USA, pp. 604--613. Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K.(2006) Efficient Filtering with Sketches in the Ferret Toolkit. In Proceedings of the 8th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR). Santa Barbara, CA, USA. October 2006. Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K.(2006) A Time-Space Efficient Locality Sensitive Hashing Method for Similarity Search in High Dimensions. Technical report. ftp://ftp.cs.princeton.edu/techreports/2006/759.pdf Panigrahy, R. (2006),Entropy based nearest neighbor search in high dimensions, /in /'SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm', ACM Press, New York, NY, USA, pp. 1186--1195. Weber, R.; Schek, H. & Blott, S. (1998),A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, /in /'VLDB '98: Proceedings of the 24rd International Conference on Very Large Data Bases', Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 194--205. Henzinger, M. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th Annual international ACM SIGIR Conference on Research and Development in information Retrieval