[Topic-models] Non-parametric topic models

Wray Buntine wray.buntine at monash.edu
Tue Feb 21 17:04:42 EST 2017

On 22 February 2017 at 02:57, Thibaut Thonet <thibaut.thonet at irit.fr> wrote:

> Hi Wray,
>
> Thanks a lot for your detailed and thorough answer. So I conclude from
> what you said that the model I described isn't 'wrong', but it would just
> (most likely) perform worse than, e.g., HDP-LDA or the NP-LDA from your
> 2014 KDD paper. I'm nonetheless surprised that no work in literature
> evaluated this model and compared it against hierarchical non-parametric
> models and against vanilla LDA (symmetric-symmetric).
>
Well, the words "meta", "infinite" and "hierarchical" earn a lot of brownie
points during the paper review process ;-)
Actually, I think reviewers would treat it as too small an improvement to
make the big conferences.  The original HDP-LDA paper was a real
revolution in capability.

> Although it is indeed pretty sure that it would yield a higher perplexity
> than that of hierarchical non-parametric models, it seems that posterior
> inference for that model (e.g., using direct assignment sampling), would be
> time-wise about as efficient as that of vanilla LDA -- since table counts
> need not be sampled in that version, given its non-hierarchical nature. So
> I'm curious whether its effectiveness (perplexity, topic coherence) is
> better than that of vanilla LDA, or otherwise if flat priors are more
> penalizing in a non-parametric setting.
>
I think you are right, it should perform well.  It should be fast too, as
the marginal posterior is similar to the standard one
for the Dirichlet case.   But you would want to sample the concentration
parameter too.
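
For concreteness, here is a minimal sketch of the per-token conditional
for the flat model, following the rule quoted further down in this thread
(weight n_dk for a topic already used in the document, alpha for a new
one), multiplied by the usual collapsed Dirichlet word likelihood.  The
function name, the argument layout and the use of a symmetric scalar beta
are assumptions made for illustration, and alpha is held fixed here
rather than sampled.

```python
import numpy as np

def topic_conditional(n_dk, n_kw, n_k, w, alpha, beta, V):
    """Normalised p(z_dn = k | rest) over K existing topics plus one new topic.

    All counts must already exclude the token being resampled.
    n_dk: (K,) tokens in document d per topic
    n_kw: (K, V) topic-word counts; n_k: (K,) topic totals
    w: word id of the token; beta: symmetric Dirichlet parameter (assumed)
    """
    n_dk = np.asarray(n_dk, dtype=float)
    n_kw = np.asarray(n_kw, dtype=float)
    n_k = np.asarray(n_k, dtype=float)
    # Collapsed word likelihood under phi_k ~ Dirichlet(beta).
    word_lik = (n_kw[:, w] + beta) / (n_k + V * beta)
    # Existing topics: document count times word likelihood. No table counts.
    existing = n_dk * word_lik
    # New topic: alpha times the prior predictive of the word, 1/V.
    new = alpha / V
    weights = np.append(existing, new)
    return weights / weights.sum()

# Toy call: two topics, vocabulary of two words.
p = topic_conditional([2, 0], [[1, 1], [0, 2]], [2, 2],
                      w=0, alpha=1.0, beta=0.5, V=2)
print(p)
```

The last entry of the returned vector is the probability of opening a new
topic.  Note that a topic with n_dk = 0 gets zero weight in that document
under this rule, which is one place where the lack of a shared parent
shows up.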

> Best,
>
> Thibaut
>
> On 20/02/2017 at 23:08, Wray Buntine wrote:
>
> Hi Thibaut
>
> What's wrong with this is that it's not hierarchical.
> You allow the theta to be infinite, but you don't give them all a common
> parent.
> The main advantage of the HDP-LDA method is that it allows topics to have
> different
> proportions.  You're doing that in a very controlled way with stick
> breaking, but with the
> HDP you get to better fit the overall topic proportions.
>
> The HDP-LDA is more or less equivalent to yours but with
>        alpha ~ GEM(psi_0)
>       * For each document d = 1, ..., D
>         - Draw theta_d ~ Dirichlet(alpha*psi_1)
>
> NB.  using a bit of liberty here with the Dirichlet as alpha is an
> infinite vector, but
>           just truncate it
>
> This extra level means alpha is estimated, giving the topic proportions.
>
> This is rather similar to the Asymmetric-Symmetric LDA Model in Mallet,
> which
> as it happens is *almost* truncated HDP-LDA and beats the pants off most
> HDP-LDA implementations in perplexity and is 10-100 times faster than most.
> Experiments reported in my KDD 2014 paper.
>
> So your model would be OK, and it would "fit" the number of topics, but a
> good
> implementation of the above *should* beat it.  Implementations vary so
> much that YMMV.
>
> As an implementation note, I know of few contexts where Chinese restaurant
> processes, hierarchical or franchise, give competitive sampling algorithms.
>
> Finally, the more interesting model is this one:
>
> beta ~ GEM(mu_0,nu_0)
> * For each topic k = 1, 2, ...
>   - Draw phi_k ~ PYP(beta,mu_1)
> alpha ~ GEM(psi_0,nu_0)
> * For each document d = 1, ..., D
>   - Draw theta_d ~ Dirichlet(alpha*psi_1)
>   - For each n = 1, ..., N_d
>     + Draw z_{dn} ~ Discrete(theta_d)
>     + Draw w_{dn} ~ Discrete(phi_{z_dn})
>
> NB.  the two-parameter GEM is the vector version of the Pitman-Yor process,
>        and the PYP is used on the word side to take advantage of Zipfian
>        behaviour of words
>
> In this case alpha is the topic proportions, a latent vector that is
> estimated, and
> beta is the *background* word proportions which again is latent and
> estimated.
> Algorithms based on Chinese restaurants simply give up at the size of the
> word vectors, but more modern algorithms work and do lovely estimates of
> "background", i.e., non-topical, words and make your topics in phi more
> interpretable as well as improving perplexity.
>
> Prof. Wray Buntine
> Course Director for Master of Data Science
> Monash University
> http://topicmodels.org
>
> On 21 February 2017 at 04:23, Thibaut Thonet <thibaut.thonet at irit.fr>
> wrote:
>
>> Hi all,
>>
>> I've got a question about non-parametric topic models. I'm wondering
>> whether the model described by the following generative process makes any
>> sense:
>> * For each topic k = 1, 2, ...
>>   - Draw phi_k ~ Dirichlet(beta)
>> * For each document d = 1, ..., D
>>   - Draw theta_d ~ GEM(alpha)
>>   - For each n = 1, ..., N_d
>>     + Draw z_{dn} ~ Discrete(theta_d)
>>     + Draw w_{dn} ~ Discrete(phi_{z_dn})
>>
>> This resembles the stick-breaking version of the Hierarchical Dirichlet
>> Process (described in Yee Whye Teh's 2006 paper), but the difference is
>> that theta_d is directly drawn from GEM(alpha) instead of being drawn from
>> a DP(alpha, theta_0) where theta_0 is a GEM-distributed base measure shared
>> across all documents. Under the CRP interpretation, this is a sort of
>> hybrid between the Chinese restaurant process and the Chinese restaurant
>> franchise: in this model, p(z_{dn} = k | z_{-dn}) is proportional to
>> n_{dk}^{-dn} if k is an existing topic and proportional to alpha if k is a
>> new topic.
>>
>> Although I feel that there is something conceptually wrong with this
>> model, I fail to put my finger on the exact arguments to prove it. My
>> intuition is that since each theta_d is independently drawn from a GEM, the
>> topic indexes should not be able to be shared across documents (i.e., topic
>> k in document j need not be coherent with topic k in document j'). But
>> since all documents will use the same {phi_k}_k -- which are generated
>> independently from documents, it seems that this model's Gibbs sampler
>> should nonetheless 'work' in practice and produce coherent topics.
>>
>> What also puzzles me is that this 'easy' non-parametric extension to
>> parametric models (I described the 'easy' non-parametric extension to LDA
>> in this example) is used in a few papers from top text mining conferences
>> (e.g., SIGIR, CIKM, WWW), relating it to CRP or HDP (whereas it in fact
>> isn't exactly either of them)...
>>
>> Thanks in advance for any insight on what's theoretically wrong (or not)
>> with this model.
>>
>> Best,
>>
>> Thibaut
>>
>> _______________________________________________
>> Topic-models mailing list
>> Topic-models at lists.cs.princeton.edu
>> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>>
>
>
>