[Topic-models] topic models, complexity and additional knowledge

David Mimno mimno at cs.umass.edu
Tue Mar 30 10:04:11 EDT 2010


On Tue, Mar 30, 2010 at 12:12:03PM +0200, Julien Velcin wrote:
> 1) What about the calculation of topic model complexity? As time
> goes by, the models seem to become more complex. The estimation task
> becomes harder because the number of parameters to estimate grows.
> This complexity could be weighed against prediction quality (for
> instance, using the perplexity measure) and against machine runtime.

The standard LDA topic model is a powerful method, but nobody really 
believes that it is a particularly good model of how people create 
documents. Something more complicated must be happening, but I don't think 
anyone has come up with a model that is conclusively better. The ability 
to predict which words will appear together in unseen documents is an 
important measure of model quality, but it shouldn't be treated as the 
sole metric.
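
For concreteness, held-out perplexity is just the exponentiated 
negative average per-token log likelihood. A minimal sketch in Python 
(the numbers are purely illustrative):

    import math

    def perplexity(heldout_log_likelihood, num_tokens):
        # Exponentiated negative average per-token log likelihood
        # on held-out documents; lower is better.
        return math.exp(-heldout_log_likelihood / num_tokens)

    # e.g. a log likelihood of -700000 nats over 100000 held-out tokens:
    print(perplexity(-700000.0, 100000))  # exp(7.0), about 1096.6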

> 2) The hyperparameters alpha and beta are often set to constants,
> even though they can be estimated too. If they're held constant, how
> restricted are the kinds of topics we can learn? The relation between
> beta and the number k of topics doesn't seem as clear to me as some
> papers claim.

There was a fairly extensive discussion of hyperparameters on this list 
about a month ago. You might also look at our paper "Rethinking LDA: 
Why Priors Matter" (Wallach et al., NIPS 2009). In practice, setting 
hyperparameters to constants isn't particularly bad, but you may need 
to be more careful about curating the vocabulary you use (e.g., removing 
very common words).
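
By "curating" I mean something like the following sketch: a simple 
document-frequency and raw-count cutoff (the thresholds here are 
illustrative, not recommendations):

    from collections import Counter

    def prune_vocabulary(docs, max_doc_freq=0.4, min_count=5):
        # docs: a list of tokenized documents (lists of strings).
        # Drop word types that occur in more than max_doc_freq of all
        # documents (stopword-like) or fewer than min_count times total.
        doc_freq, term_freq = Counter(), Counter()
        for doc in docs:
            term_freq.update(doc)
            doc_freq.update(set(doc))
        n = len(docs)
        keep = {w for w in term_freq
                if doc_freq[w] / n <= max_doc_freq
                and term_freq[w] >= min_count}
        return [[w for w in doc if w in keep] for doc in docs]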

The Dirichlet prior on topic-word distributions (beta) controls your 
prior expectation of how many distinct word types appear in any given 
topic. Beta times the size of the vocabulary is the prior weight placed 
on a uniform distribution over word types, measured in "word units". A 
typical value is 0.01; in my experience, optimizing this parameter leads 
to values between 0.004 and 0.02. Note that this parameter significantly 
affects held-out likelihood experiments.
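
To make the "word units" arithmetic concrete (the vocabulary size here 
is just an illustrative number):

    import numpy as np

    beta = 0.01          # symmetric Dirichlet parameter per word type
    vocab_size = 50000   # illustrative vocabulary size

    # beta * V is the total prior weight, in pseudo-word "units",
    # placed on a uniform distribution over word types.
    print(beta * vocab_size)  # 500.0

    # With beta << 1, a draw from the prior concentrates its mass
    # on relatively few word types, i.e. topics are a priori sparse.
    rng = np.random.default_rng(0)
    topic = rng.dirichlet(np.full(vocab_size, beta))
    print((topic > 1e-4).sum())  # a small fraction of the 50000 types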

-David

