[Topic-models] Non-parametric topic models
Thibaut Thonet
thibaut.thonet at irit.fr
Tue Feb 21 10:57:11 EST 2017
Hi Wray,
Thanks a lot for your detailed and thorough answer. So I conclude from
what you said that the model I described isn't 'wrong', but it would
just (most likely) perform worse than, e.g., HDP-LDA or the NP-LDA from
your 2014 KDD paper. I'm nonetheless surprised that no work in the
literature has evaluated this model and compared it against hierarchical
non-parametric models and against vanilla LDA (symmetric-symmetric).
Although it would very likely yield a higher perplexity than
hierarchical non-parametric models, it seems that posterior inference
for that model (e.g., using direct assignment sampling) would be
time-wise about as efficient as that of vanilla LDA, since table counts
need not be sampled in that version, given its non-hierarchical nature.
So I'm curious whether its effectiveness (perplexity, topic coherence)
is better than that of vanilla LDA, or whether flat priors are instead
more penalizing in a non-parametric setting.
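
A minimal sketch of that sampling step, under stated assumptions: phi is
collapsed with a symmetric scalar beta, and the function and argument
names are hypothetical. It combines the CRP-style prior from the original
question (n_dk for an existing topic, alpha for a new one) with the
standard collapsed Dirichlet-multinomial word term. No table counts
appear anywhere in it.

```python
import numpy as np

def flat_crp_conditional(n_dk, n_kw, n_k, w, alpha, beta):
    """Normalized p(z_dn = k | rest) for the flat (non-hierarchical) model.

    n_dk : (K,) tokens of document d per topic, current token removed
    n_kw : (K, V) topic-word counts, current token removed
    n_k  : (K,) total tokens per topic, current token removed
    w    : word id of the current token
    Returns K + 1 probabilities; the last entry is the "new topic" slot.
    """
    K, V = n_kw.shape
    # Existing topics: document count times the collapsed word term
    # (assumes a symmetric Dirichlet(beta) prior on each phi_k).
    word_term = (n_kw[:, w] + beta) / (n_k + V * beta)
    existing = n_dk * word_term
    # New topic: prior weight alpha; an empty topic predicts each word
    # with probability beta / (V * beta) = 1 / V.
    new = alpha / V
    p = np.append(existing, new)
    return p / p.sum()
```

Read literally, a topic with n_dk = 0 gets zero weight in the "existing"
branch of this sketch, even if other documents use it.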
Best,
Thibaut
On 20/02/2017 at 23:08, Wray Buntine wrote:
> Hi Thibaut
>
> What's wrong with this is that it's not hierarchical.
> You allow the theta to be infinite, but you don't give them all a
> common parent.
> The main advantage of the HDP-LDA method is that it allows topics to
> have different proportions. You're doing that in a very controlled way
> with stick breaking, but with the HDP you get to better fit the overall
> topic proportions.
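
To spell out the "very controlled" point: assuming theta_d ~ GEM(alpha)
is built with the usual stick-breaking construction (an assumption about
the intended parameterisation), the prior mean of each weight is fixed
by alpha alone:

```latex
v_{dk} \sim \mathrm{Beta}(1, \alpha), \qquad
\theta_{dk} = v_{dk} \prod_{j<k} (1 - v_{dj}), \qquad
\mathbb{E}[\theta_{dk}] = \frac{1}{1+\alpha}
  \left( \frac{\alpha}{1+\alpha} \right)^{k-1}.
```

So a priori every document puts geometrically decaying mass on topic
indices 1, 2, ..., and there is no shared parameter that could adapt the
overall proportions to the corpus.
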
>
> The HDP-LDA is more or less equivalent to yours but with
> alpha ~ GEM(psi_0)
> * For each document d = 1, ..., D
> - Draw theta_d ~ Dirichlet(alpha*psi_1)
>
> NB. I'm taking a bit of liberty here with the Dirichlet, as alpha is
> an infinite vector, but just truncate it.
>
> This extra level means alpha is estimated, giving the topic proportions.
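
A minimal sketch of the truncated construction above, assuming NumPy and
a hypothetical truncation level K (the function names are illustrative
and not taken from any existing implementation). The shared parent alpha
is drawn once by truncated stick breaking, and every document then draws
theta_d around it.

```python
import numpy as np

def truncated_gem(concentration, K, rng):
    """Stick-breaking weights truncated at K atoms."""
    v = rng.beta(1.0, concentration, size=K)
    v[-1] = 1.0  # close the stick so the K weights sum to one
    # Length of stick remaining before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def draw_doc_topic_proportions(psi_0, psi_1, K, D, seed=0):
    rng = np.random.default_rng(seed)
    # Shared parent: alpha ~ GEM(psi_0), truncated at K topics.
    alpha = truncated_gem(psi_0, K, rng)
    # Per document: theta_d ~ Dirichlet(alpha * psi_1).
    theta = rng.dirichlet(alpha * psi_1, size=D)
    return alpha, theta

alpha, theta = draw_doc_topic_proportions(psi_0=2.0, psi_1=5.0, K=10, D=4)
```

Here psi_1 controls how tightly each theta_d concentrates around the
shared alpha; in a real sampler alpha would be estimated from the data
rather than drawn once from the prior.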
>
> This is rather similar to the Asymmetric-Symmetric LDA model in Mallet,
> which as it happens is *almost* truncated HDP-LDA, beats the pants off
> most HDP-LDA implementations in perplexity, and is 10-100 times faster
> than most. Experiments are reported in my KDD 2014 paper.
>
> So your model would be OK, and it would "fit" the number of topics,
> but a good implementation of the above *should* beat it.
> Implementations vary so much that YMMV.
>
> As an implementation note, I know of few contexts where Chinese restaurant
> processes, hierarchical or franchise, give competitive sampling
> algorithms.
>
> Finally, the more interesting model is this one:
>
> beta ~ GEM(mu_0,nu_0)
> * For each topic k = 1, 2, ...
> - Draw phi_k ~ PYP(beta,mu_1)
> alpha ~ GEM(psi_0,nu_0)
> * For each document d = 1, ..., D
> - Draw theta_d ~ Dirichlet(alpha*psi_1)
> - For each n = 1, ..., N_d
> + Draw z_{dn} ~ Discrete(theta_d)
> + Draw w_{dn} ~ Discrete(phi_{z_dn})
>
> NB. the two-parameter GEM is the vector version of the Pitman-Yor
> process, and the PYP is used on the word side to take advantage of the
> Zipfian behaviour of words.
>
> In this case alpha is the topic proportions, a latent vector that is
> estimated, and beta is the *background* word proportions, which again
> is latent and estimated.
> Algorithms based on Chinese restaurants simply give up given the size
> of the word vectors, but more modern algorithms work and do lovely
> estimates of "background", i.e., non-topical words, and make your topics
> in phi more interpretable as well as improving perplexity.
>
> Prof. Wray Buntine
> Course Director for Master of Data Science
> Monash University
> http://topicmodels.org
>
> On 21 February 2017 at 04:23, Thibaut Thonet <thibaut.thonet at irit.fr> wrote:
>
> Hi all,
>
> I've got a question about non-parametric topic models. I'm
> wondering whether the model described by the following generative
> process makes any sense:
> * For each topic k = 1, 2, ...
> - Draw phi_k ~ Dirichlet(beta)
> * For each document d = 1, ..., D
> - Draw theta_d ~ GEM(alpha)
> - For each n = 1, ..., N_d
> + Draw z_{dn} ~ Discrete(theta_d)
> + Draw w_{dn} ~ Discrete(phi_{z_dn})
>
> This resembles the stick-breaking version of the Hierarchical
> Dirichlet Process (described in Yee Whye Teh's 2006 paper), but
> the difference is that theta_d is directly drawn from GEM(alpha)
> instead of being drawn from a DP(alpha, theta_0) where theta_0 is
> a GEM-distributed base measure shared across all documents. Under
> the CRP interpretation, this is a sort of hybrid between the
> Chinese restaurant process and the Chinese restaurant franchise:
> in this model, p(z_{dn} = k | z_{-dn}) is proportional to
> n_{dk}^{-dn} if k is an existing topic and proportional to alpha
> if k is a new topic.
>
> Although I feel that there is something conceptually wrong with
> this model, I can't put my finger on the exact arguments to
> prove it. My intuition is that since each theta_d is independently
> drawn from a GEM, it should not be possible to share topic indices
> across documents (i.e., topic k in document j need not be
> coherent with topic k in document j'). But since all documents
> will use the same {phi_k}_k, which are generated independently
> of the documents, it seems that this model's Gibbs sampler should
> nonetheless 'work' in practice and produce coherent topics.
>
> What also puzzles me is that this 'easy' non-parametric extension
> to parametric models (I described the 'easy' non-parametric
> extension to LDA in this example) is used in a few papers from top
> text mining conferences (e.g., SIGIR, CIKM, WWW), relating it to
> CRP or HDP (whereas it in fact isn't exactly either of them)...
>
> Thanks in advance for any insight on what's theoretically wrong
> (or not) with this model.
>
> Best,
>
> Thibaut
>