[Topic-models] topic models, complexity and additional knowledge
mimno at cs.umass.edu
Tue Mar 30 10:04:11 EDT 2010
On Tue, Mar 30, 2010 at 12:12:03PM +0200, Julien Velcin wrote:
> 1) What about the calculation of topic model complexity? As time goes
> by, the models seem to become more complex. Estimation becomes more
> difficult because the number of parameters to estimate grows. This
> complexity could be weighed against prediction quality (for instance,
> using the perplexity measure) and against machine runtime.
The standard LDA topic model is a powerful method, but nobody really
believes that it is a particularly good model of how people create
documents. Something more complicated must be happening, but I don't think
anyone has come up with a model that is conclusively better. The ability
to predict what words will appear together in unseen documents is an
important measure of model quality, but shouldn't be used as the one
final measure.
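To make the predictive measure concrete, here is a minimal sketch (not
from the original post) of how held-out perplexity is usually computed
from a model's per-word log-likelihood; the numbers are hypothetical:

```python
import math

def perplexity(log_likelihood, n_words):
    """Perplexity = exp(-average per-word log-likelihood).
    Lower is better. Assumes log_likelihood is the total natural-log
    probability the model assigns to the held-out words."""
    return math.exp(-log_likelihood / n_words)

# Hypothetical numbers, for illustration only:
# 10,000 held-out tokens with total log-likelihood -75,000
print(perplexity(-75000.0, 10000))
```

A uniform model over a 20,000-word vocabulary would score exactly
20,000, so values well below the vocabulary size indicate the model is
learning something about word co-occurrence.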
> 2) The hyperparameters Alpha and Beta are often set to constants, even
> though they can be estimated as well. If they're held constant, how
> restricted are the kinds of topics we can learn? The relation between
> Beta and the number k of topics doesn't seem as clear to me as some
> papers suggest.
There was a fairly extensive discussion of hyperparameters on this list
about a month ago. You might also look at our paper "Rethinking LDA"
(Wallach et al., NIPS, 2009). In practice, setting hyperparameters to
constants isn't particularly bad, but you may need to be more careful
about curating the vocabulary you use (e.g., removing common words).
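One simple way to curate a vocabulary, sketched below (the thresholds
are illustrative assumptions, not recommendations from the post), is to
drop words by document frequency and overall count:

```python
from collections import Counter

def curate_vocabulary(docs, max_doc_freq=0.5, min_count=2):
    """Drop words that appear in more than max_doc_freq of documents
    (too common, e.g. stopwords) or fewer than min_count times overall
    (too rare to estimate a topic assignment for)."""
    doc_freq = Counter()   # number of documents containing each word
    total = Counter()      # total occurrences of each word
    for doc in docs:
        words = doc.split()
        total.update(words)
        doc_freq.update(set(words))
    n_docs = len(docs)
    return {w for w in total
            if doc_freq[w] / n_docs <= max_doc_freq
            and total[w] >= min_count}

docs = ["the cat sat", "the dog ran", "the cat ran", "a bird sat"]
vocab = curate_vocabulary(docs)  # "the" is dropped as too common
```

With optimized hyperparameters this kind of filtering matters less,
since an asymmetric alpha can absorb very frequent words into a few
high-probability topics; with fixed hyperparameters it matters more.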
The Dirichlet prior on topic-word distributions (beta) controls your prior
expectation of how many distinct word types appear in any given topic.
Beta times the size of the vocabulary is the prior weight on a uniform
distribution over word types, measured in "word units". 0.01 is a
typical value; optimizing this parameter leads to values between 0.004
and 0.02, in my experience. Note that this parameter significantly
affects held-out likelihood.
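The "word units" arithmetic above is just beta times the vocabulary
size; a short sketch with a hypothetical vocabulary size:

```python
# Prior weight of the symmetric Dirichlet over topic-word distributions:
# beta * V pseudo-word observations, spread uniformly over the vocabulary.
beta = 0.01          # typical value, as noted above
vocab_size = 20000   # hypothetical vocabulary size for illustration
prior_weight = beta * vocab_size
print(prior_weight)  # 200.0 "word units" of uniform smoothing per topic
```

So with a 20,000-word vocabulary, beta = 0.01 acts like 200 uniformly
distributed pseudo-words added to each topic before any data is seen,
which is why larger vocabularies tolerate smaller beta values.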