[talks] 11am Tue Apr 17 talk on large alphabet probability estimation in B205 in the E-Quad

Mon Apr 16 22:49:18 EDT 2007

Speaker: Narayana P. Santhanam, UC Berkeley
Date:    Tuesday, April 17, 2007
Time:    11:00am
Room:    B205 ~ EQuad 

Title:   New solution for old problems: Large alphabet probability
estimation

Abstract:

Modern advancements in communication, computation, and storage have made
possible complex systems like the Internet and have helped scientific
advances like the Genome project, none of which would have been conceivable
when I was born.

New advancements bring about new problems. Two aspects of some of these
problems have captured my interest.

First, a fair number of these problems require solutions for very large
alphabets. For instance, language models for speech recognition estimate
distributions over English words; thousands of genes are clustered by their
expression levels for applications in diagnosis and drug response prediction
using the limited number of samples that can be obtained from test subjects.

On the other hand, a lot of results in both statistics and information
theory assume that we operate in a regime where the data size is much
larger than the alphabet size. We are therefore forced to rework problems
where conventional approaches no longer apply.

Second, problems posed by different systems are interconnected. For example,
consider text compression on the one hand, along with language modeling for
speech recognition on the other. The former tries to compress as well as
the unknown underlying distribution would allow; the latter estimates word
probabilities associated with that same unknown distribution.

The talk will examine some recent results in the related areas of large
alphabet probability estimation and data compression. These results should
be seen as a first step towards new solutions for classification, entropy
estimation and inference problems arising from modern finance, biology, and
data mining.

Bio:

Narayana Santhanam obtained his MS and PhD from the University of
California, San Diego in 2003 and 2006, respectively. He currently holds a
postdoctoral position at the University of California, Berkeley.

He is the recipient of the 2003 Capocelli Prize for student-authored papers
at the Data Compression Conference and the 2006 IEEE Best Paper Award along
with Prof. Alon Orlitsky and Dr. Junan Zhang. His research interests include
large alphabet problems, the intersection of information theory and machine
learning, and their applications.