[Topic-models] Intuition behind CTM and DTM

David Mimno mimno at cs.umass.edu
Mon Nov 24 09:29:20 EST 2008

On Mon, Nov 24, 2008 at 12:14:11AM -0700, Lei Tang wrote:
> 1. In correlated topic models, the topic proportions are sampled from a
> logistic normal distribution instead of a Dirichlet as in LDA. I don't quite
> understand the intuition behind this modeling choice. Why does the logistic
> normal distribution have such power?

There are two primary advantages:

First, covariance. (See John Aitchison's work for a more detailed 
discussion.) Let's say I have a corpus with three topics: sports (team, 
player, league), politics (weapons, trade, president), and negotiation 
(meeting, deadline, agreement). Both sports and politics occur with 
negotiation, but sports and politics rarely co-occur.

With a Dirichlet, all I can say is how often I expect each topic to occur 
(the values of the parameters in proportion to each other) and how much I 
expect any given document to follow those proportions (the sum of the 
parameters, where larger = less variance). With a logistic normal, I can 
set up a covariance matrix with positive covariance between sports and 
negotiation but negative covariance between sports and politics.
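
To make the Dirichlet limitation concrete, here is a small sketch that 
computes the closed-form covariance matrix of a Dirichlet. The helper name 
dirichlet_cov and the alpha values are my own illustrative assumptions. 
Every off-diagonal entry comes out negative no matter what alpha is, so 
there is no way to encode "sports and negotiation tend to appear together":

```r
## Closed-form covariance of a Dirichlet(alpha) draw.
## dirichlet_cov is a hypothetical helper name used only for illustration.
dirichlet_cov <- function(alpha) {
	a0 <- sum(alpha)
	## Var[i]    =  alpha[i] * (a0 - alpha[i]) / (a0^2 * (a0 + 1))
	## Cov[i, j] = -alpha[i] * alpha[j]        / (a0^2 * (a0 + 1))
	(diag(alpha, nrow = length(alpha)) * a0 - outer(alpha, alpha)) /
		(a0^2 * (a0 + 1))
}

## Assumed ordering: sports, politics, negotiation
alpha <- c(2, 1, 1)
cov_small <- dirichlet_cov(alpha)       # small parameter sum: more variance
cov_large <- dirichlet_cov(10 * alpha)  # same proportions, larger sum
```

Scaling alpha by 10 keeps the expected proportions fixed and shrinks every 
variance, which is the "larger = less variance" behavior described above.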

Second, there are very well studied models for time-series and 
spatio-temporal data in continuous spaces. These usually aren't applicable 
to count data like words, but if you can represent the word counts as 
derived from a real-valued hidden variable, Kalman filtering and dynamic 
linear models become available.
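
A rough sketch of that idea, covering only the generative side and not any 
actual inference: a real-valued vector follows a Gaussian random walk over 
time, and the word counts at each step are drawn from its softmax. The 
vocabulary size, number of time steps, step size, and document length are 
all assumptions picked for illustration. The hidden state beta is the kind 
of variable a Kalman filter or dynamic linear model would operate on.

```r
## Hypothetical sketch: one topic whose word distribution drifts over time.
set.seed(1)
n_words  <- 5    # vocabulary size (assumed)
n_steps  <- 10   # number of time slices (assumed)
drift_sd <- 0.3  # random-walk step size (assumed)
doc_len  <- 100  # words observed per time slice (assumed)

## map a real-valued vector onto the simplex
softmax <- function(x) {
	e <- exp(x - max(x))
	e / sum(e)
}

beta   <- matrix(0, nrow = n_steps, ncol = n_words)  # real-valued hidden state
counts <- matrix(0, nrow = n_steps, ncol = n_words)  # observed word counts
beta[1, ] <- rnorm(n_words)
for (t in 1:n_steps) {
	## Gaussian random walk in the continuous space
	if (t > 1) beta[t, ] <- beta[t - 1, ] + rnorm(n_words, sd = drift_sd)
	## counts are derived from the hidden variable through the softmax
	counts[t, ] <- rmultinom(1, size = doc_len, prob = softmax(beta[t, ]))
}
```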

Here are two R functions that might help give some intuition for the 
parameterization and the behavior of Dirichlets and logistic normals:

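## Dirichlet
rdirichlet <- function(alpha = c(1.0, 1.0, 1.0)) {
	n <- length(alpha)
	result <- rep(0, n)
	## draw one independent Gamma(alpha[i], 1) variate per component
	for (i in 1:n) {
		result[i] <- rgamma(1, alpha[i])
	}
	## normalizing the gammas gives a Dirichlet(alpha) sample
	result / sum(result)
}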

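## zero-mean logistic normal
rlogisticnorm <- function(covariance = matrix(c(2, 0.5, -0.5, 0.5, 2, 0.5, 
-0.5, 0.5, 2), nrow=3)) {
	n <- dim(covariance)[1]
	## multiply standard normals by the lower Cholesky factor so that the
	## Gaussian draw has the given covariance (covariance %*% rnorm(n) would
	## have covariance equal to the matrix squared)
	result <- exp(t(chol(covariance)) %*% rnorm(n))

	## normalize onto the simplex
	result / sum(result)
}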
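
As a usage sketch, the following draws many logistic normal samples at once 
and checks that the underlying Gaussian really has the covariance that was 
asked for. The matrix values, the topic ordering, and the use of a Cholesky 
factor to generate the correlated normals are my assumptions for this 
example:

```r
set.seed(42)
## Assumed ordering: sports, politics, negotiation.
## Negative covariance between sports and politics, positive covariance
## between negotiation and each of the other two. Values are illustrative.
sigma <- matrix(c( 2.0, -0.5, 0.5,
                  -0.5,  2.0, 0.5,
                   0.5,  0.5, 2.0), nrow = 3, byrow = TRUE)

n_draws <- 20000
z <- matrix(rnorm(3 * n_draws), nrow = 3)  # 3 x n_draws standard normals
x <- t(chol(sigma)) %*% z                  # correlated Gaussian draws

## exponentiate and normalize each draw; theta has one sample per row
theta <- t(exp(x)) / colSums(exp(x))

emp_cov <- cov(t(x))  # should be close to sigma
print(round(emp_cov, 2))
print(round(colMeans(theta), 2))
```

Note that the covariance is specified on the Gaussian before normalization. 
The proportions in theta still have to sum to one, so their own 
correlations will not match sigma entry for entry.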

