[Topic-models] probabilities for documents of different sizes

Laura Dietz dietz at informatik.hu-berlin.de
Wed Aug 20 05:43:00 EDT 2008

Sorry, I forgot to reply to the list...

-------- Original Message --------
Subject: Re: [Topic-models] probabilities for documents of different sizes
Date: Wed, 20 Aug 2008 11:16:51 +0200
From: Laura Dietz <dietz at informatik.hu-berlin.de>
To: Thomas G. Dietterich <tgd at eecs.oregonstate.edu>
References: <7E8FEF671B744E9398DE03059DF96334 at oregone1295e5a>


Several ideas came to mind. First, I think the geometric mean is
neither far-fetched nor a hack: it expresses the average
compatibility of the document's words with the learned topics.
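A minimal sketch of that length-normalized score (the per-word probabilities below are made-up placeholders, not output of any particular model):

```python
import numpy as np

# Hypothetical per-word likelihoods p(w_i | model) for one document.
word_probs = np.array([0.01, 0.04])

# Geometric mean = exp(mean log-probability): a per-word score that
# stays comparable across documents of different lengths.
geo_mean = np.exp(np.log(word_probs).mean())

# The geometric mean of 0.01 and 0.04 is sqrt(0.01 * 0.04) = 0.02.
```

Working in log space avoids underflow for long documents, where the raw product of word probabilities would vanish numerically.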

You could also repeatedly sample a k-word-long sub-document, calculate
the sub-document's likelihood, and average over your samples. This
would respect the sparseness prior, but it resorts to yet another way
of calculating a mean of word probabilities.
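The subsampling idea might look like this; the uniform without-replacement sampling and the sample count are my own assumptions, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_loglik(word_logprobs, k, n_samples=1000):
    """Average log-likelihood of length-k sub-documents drawn
    uniformly without replacement (a hypothetical scheme)."""
    samples = [rng.choice(word_logprobs, size=k, replace=False).sum()
               for _ in range(n_samples)]
    return float(np.mean(samples))
```

Because every document is scored on sub-documents of the same fixed length k, the resulting numbers are directly comparable across documents of different sizes.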

A generative process that directly models "document fits topics"
incorporating a background word distribution might be as follows.

For each document, draw a topic distribution theta (~ Dirichlet) and a
compatibility parameter lambda (~ Beta), representing a coin parameter.
For each word, flip the document's coin with bias lambda.
If the flip comes up 1, draw a topic from theta, and a word from that
topic's word distribution.
Otherwise, draw a word from a uniform word distribution (or one
representing the vocabulary universe).
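This generative process can be sketched as follows; the vocabulary size, topic count, and hyperparameter values are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(42)

V, K = 50, 4                              # vocabulary size, topic count (assumed)
phi = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions (stand-ins)

def generate_doc(n_words, alpha=1.0, beta_a=2.0, beta_b=2.0):
    theta = rng.dirichlet(np.full(K, alpha))  # per-document topic distribution
    lam = rng.beta(beta_a, beta_b)            # compatibility coin parameter
    words = []
    for _ in range(n_words):
        if rng.random() < lam:                # flip came up 1: topic word
            z = rng.choice(K, p=theta)
            words.append(int(rng.choice(V, p=phi[z])))
        else:                                 # flip came up 0: background word
            words.append(int(rng.integers(V)))  # uniform background distribution
    return words, lam
```

Documents with a large fraction of background words should push lambda toward 0 during inference, which is exactly what makes it usable as a fit score.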

Your inference process should be separated into estimating the topics
from the training documents, and then estimating lambda/theta for the
test documents while holding the word distributions fixed.
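For the second stage, a simple EM-style update works, since with phi fixed each word is a mixture of the K topics and the uniform background. This is only a sketch of one possible inference scheme, with placeholder topics:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K = 50, 4
phi = rng.dirichlet(np.ones(V), size=K)   # topics from training (held fixed)

def fit_lambda_theta(doc, n_iter=100):
    """EM for lambda and theta on one held-out document, phi fixed.
    Each word's responsibility is split between the K topics and
    the uniform background component."""
    theta = np.full(K, 1.0 / K)
    lam = 0.5
    topic_probs = phi[:, doc]                        # shape (K, N)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per word
        r_topic = lam * theta[:, None] * topic_probs  # (K, N)
        r_bg = (1 - lam) / V * np.ones(len(doc))      # (N,)
        norm = r_topic.sum(axis=0) + r_bg
        r_topic /= norm
        r_bg /= norm
        # M-step: re-estimate the coin and the topic proportions
        lam = r_topic.sum() / len(doc)
        theta = r_topic.sum(axis=1)
        theta /= theta.sum()
    return lam, theta
```

Note this is a point estimate; a fully Bayesian treatment would put the Dirichlet and Beta priors back in and sample or use variational updates instead.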

The learned coin parameter lambda represents a measure of compatibility
with the topics.

Hope this helps,

Laura Dietz, Dipl.-Inform.
Microsoft Research scholarship holder, IMPRS student
Research Group 2 (Machine Learning)
Max Planck Institut für Informatik
Campus E1.4, 66123 Saarbrücken, Germany
Room 429
Phone: +49 681 9325 529
E-mail: dietz at mpi-inf.mpg.de
