Huy Nguyen will present his preFPO on Friday, December 6, at 4:30 PM in Room 402.
The members of his committee are: Moses Charikar, advisor; Bernard Chazelle and
Piotr Indyk (MIT), readers; Sanjeev Arora and Mark Braverman, nonreaders. Everyone
is invited to attend his talk. His abstract follows below.
Title: Algorithms for Data in High Dimensions
In the last decade, the availability of large data sets in high dimensions has been both an opportunity and a challenge for algorithm design. Many new algorithms have been invented, and yet many challenges remain open. This thesis examines several methods for handling such data efficiently in terms of time and space, their limitations, and new techniques for getting beyond those limitations.
First, we look at embeddings, a powerful technique for reducing data in high dimensions to low dimensions. Tight or near-tight bounds are shown for sparse subspace embeddings, sparse Johnson-Lindenstrauss transforms, and sparse RIP matrices. We also touch upon applications of subspace embeddings in numerical linear algebra and in approximating large eigenvalues, in both the classical setting and the streaming setting. For instance, the algorithm implied by our bound for subspace embeddings runs in time linear in the sparsity of the input, plus an additive term that is optimal unless one can bypass generic matrix multiplication. To conclude the embedding section, we show a near-linear lower bound for dimension reduction in L1. This bound is close to optimal, as one can reduce the dimension to linear while approximately preserving all pairwise distances.
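To make the embedding step concrete, here is a minimal sketch of a sparse Johnson-Lindenstrauss transform in the standard style; it is not the thesis's construction, and the embedding dimension m and per-column sparsity s below are illustrative placeholders, not the bounds proved in the thesis. Each column of the embedding matrix has s random rows filled with random signs scaled by 1/sqrt(s), so applying it costs time proportional to s times the sparsity of the input.

    import numpy as np

    def sparse_jl_matrix(d, m, s, rng):
        # Each column gets s distinct random rows with random +-1 signs,
        # scaled by 1/sqrt(s); applying S to x costs O(s * nnz(x)) time.
        S = np.zeros((m, d))
        for j in range(d):
            rows = rng.choice(m, size=s, replace=False)
            S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
        return S

    # Embed n points from d dimensions down to m dimensions.
    rng = np.random.default_rng(0)
    n, d, m, s = 100, 1000, 200, 8           # illustrative sizes only
    X = rng.standard_normal((n, d))
    S = sparse_jl_matrix(d, m, s, rng)
    Y = X @ S.T                              # rows of Y are the embedded points
    # With high probability, pairwise L2 distances are roughly preserved.
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))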
Even though significant dimension reduction is not possible in L1, there is fortunately a technique called locality-sensitive hashing (LSH) that allows efficient approximate nearest neighbor search in L1. In the second section of the thesis, we give a new construction of LSH for L_p (1 < p < 2) that is close to the lower bound for LSH. The algorithm generalizes previous work for L2 via the uniform convexity property.
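For context, the classical LSH scheme for L1 and L2 is the p-stable construction of Datar, Immorlica, Indyk, and Mirrokni: project a point onto a random stable direction, shift, and quantize into buckets, so nearby points collide more often than far ones. The sketch below shows that standard scheme, not the thesis's new construction, and all parameters are illustrative.

    import numpy as np

    class PStableLSH:
        # Gaussian projections are 2-stable (for L2); Cauchy projections
        # are 1-stable (for L1). h(x) = floor((a.x + b) / w), concatenated
        # over num_hashes independent projections.
        def __init__(self, dim, num_hashes, bucket_width, p, rng):
            if p == 2:
                self.A = rng.standard_normal((num_hashes, dim))
            elif p == 1:
                self.A = rng.standard_cauchy((num_hashes, dim))
            else:
                raise ValueError("this sketch only handles p in {1, 2}")
            self.b = rng.uniform(0.0, bucket_width, size=num_hashes)
            self.w = bucket_width

        def hash(self, x):
            # Nearby points (in the chosen L_p norm) land in the same
            # bucket with higher probability than far points.
            return tuple(np.floor((self.A @ x + self.b) / self.w).astype(int))

    # Usage: index a point set, then probe the query's bucket.
    rng = np.random.default_rng(1)
    lsh = PStableLSH(dim=50, num_hashes=4, bucket_width=4.0, p=2, rng=rng)
    points = rng.standard_normal((1000, 50))
    table = {}
    for i, x in enumerate(points):
        table.setdefault(lsh.hash(x), []).append(i)
    query = points[0] + 0.01 * rng.standard_normal(50)
    candidates = table.get(lsh.hash(query), [])   # likely contains point 0
    print(candidates[:10])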
In the final section of the thesis, we overcome the aforementioned lower bound for LSH and design a new algorithm for approximate nearest neighbor search in L1 and L2 that is faster and uses less memory than any LSH. Unlike LSH, the algorithm exploits the structure of the whole point set, and can be viewed as a step toward understanding data-aware methods for nearest neighbor search, which are ubiquitous in practice.