[Topic-models] Non-parametric topic models
Thibaut Thonet
thibaut.thonet at irit.fr
Mon Feb 20 12:23:24 EST 2017
Hi all,
I've got a question about non-parametric topic models. I'm wondering
whether the model described by the following generative process makes
any sense:
* For each topic k = 1, 2, ...
    - Draw phi_k ~ Dirichlet(beta)
* For each document d = 1, ..., D
    - Draw theta_d ~ GEM(alpha)
    - For each n = 1, ..., N_d
        + Draw z_{dn} ~ Discrete(theta_d)
        + Draw w_{dn} ~ Discrete(phi_{z_{dn}})
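To make the process concrete, here is a minimal sketch of it in Python. It is only an illustration under assumptions that are not part of the model above: the GEM draw is approximated by stick-breaking truncated at a finite number of topics, and the names `sample_gem`, `generate_corpus` and `truncation` are made up for this example.

```python
import numpy as np

def sample_gem(alpha, truncation, rng):
    """Approximate draw from GEM(alpha) via truncated stick-breaking."""
    sticks = rng.beta(1.0, alpha, size=truncation)
    sticks[-1] = 1.0  # assumption: close the stick so the weights sum to 1
    # Fraction of the stick still unbroken before each break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining

def generate_corpus(n_docs, doc_len, vocab_size, alpha, beta, truncation, seed=0):
    rng = np.random.default_rng(seed)
    # phi_k ~ Dirichlet(beta): one word distribution per topic, shared by all documents.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=truncation)
    docs = []
    for _ in range(n_docs):
        # theta_d ~ GEM(alpha), drawn independently for every document.
        theta = sample_gem(alpha, truncation, rng)
        # z_{dn} ~ Discrete(theta_d), then w_{dn} ~ Discrete(phi_{z_{dn}}).
        z = rng.choice(truncation, size=doc_len, p=theta)
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append((theta, z, w))
    return phi, docs
```

Note that nothing ties theta_d to theta_d' here: the only quantity shared across documents is the set of phi_k.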
This model resembles the stick-breaking construction of the Hierarchical Dirichlet
Process (described in Yee Whye Teh's 2006 paper), but the difference is
that theta_d is directly drawn from GEM(alpha) instead of being drawn
from a DP(alpha, theta_0) where theta_0 is a GEM-distributed base
measure shared across all documents. Under the CRP interpretation, this
is a sort of hybrid between the Chinese restaurant process and the
Chinese restaurant franchise: in this model, p(z_{dn} = k | z_{-dn}) is
proportional to n_{dk}^{-dn} if k is an existing topic and proportional
to alpha if k is a new topic.
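The assignment rule just stated can be written out as a small Python sketch. It covers only the prior term over topic assignments within one document; a full collapsed Gibbs step would, I believe, also multiply each entry by the likelihood of the word under that topic. The function name `crp_prior` and its argument layout are assumptions made for this illustration.

```python
import numpy as np

def crp_prior(doc_topic_counts, alpha):
    """Normalised prior over topic assignments for one token of document d.

    doc_topic_counts[k] is n_{dk}^{-dn}: the number of tokens in document d
    assigned to topic k, excluding the current token. The last entry of the
    returned vector is the probability of opening a new topic.
    """
    counts = np.asarray(doc_topic_counts, dtype=float)
    weights = np.append(counts, alpha)  # existing topics, then the new topic
    return weights / weights.sum()

# Example: three existing topics with counts 3, 0 and 1 in this document.
print(crp_prior([3, 0, 1], alpha=1.0))  # probabilities 0.6, 0.0, 0.2, 0.2
```

In the example, the second topic has no tokens in this document, so under this rule it gets zero prior mass in the document even if other documents use it.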
Although I feel that there is something conceptually wrong with this
model, I can't put my finger on the exact argument to prove it. My
intuition is that since each theta_d is drawn independently from a GEM,
topic indices should not be shareable across documents
(i.e., topic k in document j need not be coherent with topic k in
document j'). But since all documents use the same {phi_k}_k, which
are generated independently of the documents, it seems that this
model's Gibbs sampler should nonetheless 'work' in practice and produce
coherent topics.
What also puzzles me is that this 'easy' non-parametric extension to
parametric models (the example above is the 'easy' non-parametric
extension to LDA) is used in a few papers from top text mining
conferences (e.g., SIGIR, CIKM, WWW), which relate it to the CRP or HDP
even though it is in fact neither of them exactly.
Thanks in advance for any insight on what's theoretically wrong (or not)
with this model.
Best,
Thibaut