Huy Nguyen will present his preFPO on Friday, December 6, at 4:30 PM in Room 402.
The members of his committee are: Moses Charikar, advisor; Bernard Chazelle and
Piotr Indyk (MIT), readers; Sanjeev Arora and Mark Braverman, nonreaders. Everyone
is invited to attend his talk. His abstract follows below.
Title: Algorithms for Data in High Dimensions
In the last decade, the availability of large data sets in high dimensions has been both an opportunity and a challenge for algorithm design. Many new algorithms have been invented, and yet many challenges remain open. This thesis examines several methods for handling such data efficiently in terms of time and space, their limitations, and new techniques for getting beyond those limitations.
First, we look at embeddings, a powerful technique for reducing data in high dimensions to low dimensions. Tight or near-tight bounds are shown for sparse subspace embeddings, sparse Johnson-Lindenstrauss transforms, and sparse RIP matrices. We also touch upon applications of subspace embeddings in numerical linear algebra and in approximating large eigenvalues, in both the classical setting and the streaming setting. For instance, the algorithm implied by our bound for subspace embeddings runs in time linear in the sparsity of the input, plus an additive term that is optimal unless one can bypass generic matrix multiplication. To conclude the embedding section, we show a near-linear lower bound for dimension reduction in L1. This bound is close to optimal, as one can reduce the dimension to linear while approximately preserving all pairwise distances.
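To make the embedding step concrete, here is a minimal sketch of a sparse Johnson-Lindenstrauss transform in the standard style; it is not the thesis's construction, and the embedding dimension m and per-column sparsity s below are illustrative placeholders, not the bounds proved in the thesis. Each column of the embedding matrix has s random rows filled with random signs scaled by 1/sqrt(s), so applying it costs time proportional to s times the sparsity of the input.

    import numpy as np

    def sparse_jl_matrix(d, m, s, rng):
        # Each column gets s distinct random rows with random +-1 signs,
        # scaled by 1/sqrt(s); applying S to x costs O(s * nnz(x)) time.
        S = np.zeros((m, d))
        for j in range(d):
            rows = rng.choice(m, size=s, replace=False)
            S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
        return S

    # Embed n points from d dimensions down to m dimensions.
    rng = np.random.default_rng(0)
    n, d, m, s = 100, 1000, 200, 8           # illustrative sizes only
    X = rng.standard_normal((n, d))
    S = sparse_jl_matrix(d, m, s, rng)
    Y = X @ S.T                              # rows of Y are the embedded points
    # With high probability, pairwise L2 distances are roughly preserved.
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))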
Even though significant dimension reduction is not possible in L1, there is fortunately a technique called locality-sensitive hashing (LSH) that allows efficient approximate nearest neighbor search in L1. In the second section of the thesis, we give a new construction of LSH for L_p (1 < p < 2) that is close to the lower bound for LSH. The algorithm generalizes previous work for L2 via the uniform convexity property.
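For context, the classical LSH scheme for L1 and L2 is the p-stable construction of Datar, Immorlica, Indyk, and Mirrokni: project a point onto a random stable direction, shift, and quantize into buckets, so nearby points collide more often than far ones. The sketch below shows that standard scheme, not the thesis's new construction, and all parameters are illustrative.

    import numpy as np

    class PStableLSH:
        # Gaussian projections are 2-stable (for L2); Cauchy projections
        # are 1-stable (for L1). h(x) = floor((a.x + b) / w), concatenated
        # over num_hashes independent projections.
        def __init__(self, dim, num_hashes, bucket_width, p, rng):
            if p == 2:
                self.A = rng.standard_normal((num_hashes, dim))
            elif p == 1:
                self.A = rng.standard_cauchy((num_hashes, dim))
            else:
                raise ValueError("this sketch only handles p in {1, 2}")
            self.b = rng.uniform(0.0, bucket_width, size=num_hashes)
            self.w = bucket_width

        def hash(self, x):
            # Nearby points (in the chosen L_p norm) land in the same
            # bucket with higher probability than far points.
            return tuple(np.floor((self.A @ x + self.b) / self.w).astype(int))

    # Usage: index a point set, then probe the query's bucket.
    rng = np.random.default_rng(1)
    lsh = PStableLSH(dim=50, num_hashes=4, bucket_width=4.0, p=2, rng=rng)
    points = rng.standard_normal((1000, 50))
    table = {}
    for i, x in enumerate(points):
        table.setdefault(lsh.hash(x), []).append(i)
    query = points[0] + 0.01 * rng.standard_normal(50)
    candidates = table.get(lsh.hash(query), [])   # likely contains point 0
    print(candidates[:10])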
In the final section of the thesis, we overcome the aforementioned lower bound for LSH and design a new algorithm for approximate nearest neighbor search in L1 and L2 that is faster and uses less memory than any LSH. Unlike LSH, the algorithm exploits the structure of the whole point set, and can be viewed as a step toward understanding data-aware methods for nearest neighbor search, which are ubiquitous in practice.