[Topic-models] Non-parametric topic models

Wray Buntine wray.buntine at monash.edu
Mon Feb 20 17:08:03 EST 2017

Hi Thibaut

What's wrong with this is that it's not hierarchical.
You allow the theta to be infinite, but you don't give them all a common
The main advantage of the HDP-LDA method is that it allows topics to have
proportions.  You're doing that in a very controlled way with stick
breaking, but with the HDP you get to better fit the overall topic
proportions.

The HDP-LDA is more or less equivalent to yours but with

alpha ~ GEM(psi_0)
* For each document d = 1, ..., D
  - Draw theta_d ~ Dirichlet(alpha*psi_1)

NB.  I'm taking a bit of liberty here with the Dirichlet, as alpha is an
       infinite vector, but in practice you just truncate it

This extra level means alpha is estimated, giving the overall topic proportions.
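
As a rough illustration of that extra level, here is a minimal Python/NumPy
sketch of the truncated generative step: one shared alpha from a stick-breaking
GEM, then a Dirichlet draw per document. The function names, the truncation
level K, the seed argument and the small floor on the Dirichlet parameters are
assumptions made for this sketch, not part of any particular HDP-LDA
implementation.

```python
import numpy as np

def truncated_gem(concentration, K, rng):
    """Stick-breaking draw from GEM(concentration), truncated to K sticks."""
    v = rng.beta(1.0, concentration, size=K)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    # Mass left over before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_doc_topic_proportions(psi_0, psi_1, K, D, seed=0):
    """Draw shared alpha ~ GEM(psi_0), then theta_d ~ Dirichlet(alpha*psi_1)."""
    rng = np.random.default_rng(seed)
    alpha = truncated_gem(psi_0, K, rng)  # shared by all documents
    # Assumption: floor the parameters so none is exactly zero numerically.
    params = np.maximum(alpha * psi_1, 1e-12)
    theta = rng.dirichlet(params, size=D)  # one row per document
    return alpha, theta

alpha, theta = sample_doc_topic_proportions(psi_0=2.0, psi_1=5.0, K=10, D=4)
print(alpha.round(3))
print(theta.round(3))
```

Every theta_d is centred on the same alpha, so topic k gets a corpus-wide
proportion; drawing each theta_d from its own GEM gives the documents no such
shared vector.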

This is rather similar to the Asymmetric-Symmetric LDA model in Mallet,
which as it happens is *almost* truncated HDP-LDA, beats the pants off most
HDP-LDA implementations in perplexity, and is 10-100 times faster than most.
Experiments reported in my KDD 2014 paper.

So your model would be OK, and it would "fit" the number of topics, but an
implementation of the above *should* beat it.  Implementations vary so much
that YMMV.

As an implementation note, I know of few contexts where Chinese restaurant
processes, hierarchical or franchise, give competitive sampling algorithms.

Finally, the more interesting model is this one:

beta ~ GEM(mu_0,nu_0)
* For each topic k = 1, 2, ...
  - Draw phi_k ~ PYP(beta,mu_1)
alpha ~ GEM(psi_0,nu_0)
* For each document d = 1, ..., D
  - Draw theta_d ~ Dirichlet(alpha*psi_1)
  - For each n = 1, ..., N_d
    + Draw z_{dn} ~ Discrete(theta_d)
    + Draw w_{dn} ~ Discrete(phi_{z_dn})

NB.  the two-parameter GEM is the vector version of the Pitman-Yor process,
       and the PYP is used on the word side to take advantage of Zipfian
       behaviour of words
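
For readers who have not seen the two-parameter GEM, the following is a
minimal sketch of its truncated stick-breaking construction, using the
standard Pitman-Yor form V_k ~ Beta(1 - discount, concentration + k*discount).
The names `discount`, `concentration` and the truncation level K are
assumptions for this sketch; which of the two arguments in GEM(mu_0,nu_0)
plays which role is not stated above, so no mapping is implied.

```python
import numpy as np

def truncated_gem2(discount, concentration, K, rng):
    """Truncated stick-breaking weights for the two-parameter GEM."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")
    k = np.arange(1, K + 1)
    # V_k ~ Beta(1 - discount, concentration + k * discount), k = 1..K
    v = rng.beta(1.0 - discount, concentration + k * discount)
    v[-1] = 1.0  # truncation: the last stick takes all remaining mass
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
# discount = 0 recovers the usual one-parameter GEM (Dirichlet process)
dp_weights = truncated_gem2(0.0, 1.0, K=1000, rng=rng)
py_weights = truncated_gem2(0.5, 1.0, K=1000, rng=rng)
# Compare how much mass the ten largest weights hold in each case
print(np.sort(dp_weights)[::-1][:10].sum())
print(np.sort(py_weights)[::-1][:10].sum())
```

With a positive discount the sorted weights decay as a power law rather than
geometrically, which is the Zipfian behaviour referred to above.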

In this case alpha is the topic proportions, a latent vector that is
estimated, and beta is the *background* word proportions, which again is
latent and estimated.
Algorithms based on Chinese restaurants simply give up with the size of the
word vectors, but more modern algorithms work and do lovely estimates of
"background", i.e., non-topical words and make your topics in phi more
interpretable as well as improving perplexity.

Prof. Wray Buntine
Course Director for Master of Data Science
Monash University

On 21 February 2017 at 04:23, Thibaut Thonet <thibaut.thonet at irit.fr> wrote:

> Hi all,
> I've got a question about non-parametric topic models. I'm wondering
> whether the model described by the following generative process makes any
> sense:
> * For each topic k = 1, 2, ...
>   - Draw phi_k ~ Dirichlet(beta)
> * For each document d = 1, ..., D
>   - Draw theta_d ~ GEM(alpha)
>   - For each n = 1, ..., N_d
>     + Draw z_{dn} ~ Discrete(theta_d)
>     + Draw w_{dn} ~ Discrete(phi_{z_dn})
> This resembles the stick-breaking version of the Hierarchical Dirichlet
> Process (described in Yee Whye Teh's 2006 paper), but the difference is
> that theta_d is directly drawn from GEM(alpha) instead of being drawn from
> a DP(alpha, theta_0) where theta_0 is a GEM-distributed base measure shared
> across all documents. Under the CRP interpretation, this is a sort of
> hybrid between the Chinese restaurant process and the Chinese restaurant
> franchise: in this model, p(z_{dn} = k | z_{-dn}) is proportional to
> n_{dk}^{-dn} if k is an existing topic and proportional to alpha if k is a
> new topic.
> Although I feel that there is something conceptually wrong with this
> model, I fail to put my finger on the exact arguments to prove it. My
> intuition is that since each theta_d is independently drawn from a GEM, the
> topic indexes should not be able to be shared across documents (i.e., topic
> k in document j need not be coherent with topic k in document j'). But
> since all documents will use the same {phi_k}_k -- which are generated
> independently from documents, it seems that this model's Gibbs sampler
> should nonetheless 'work' in practice and produce coherent topics.
> What also puzzles me is that this 'easy' non-parametric extension to
> parametric models (I described the 'easy' non-parametric extension to LDA
> in this example) is used in a few papers from top text mining conferences
> (e.g., SIGIR, CIKM, WWW), relating it to CRP or HDP (whereas it in fact
> isn't exactly either of them)...
> Thanks in advance for any insight on what's theoretically wrong (or not)
> with this model.
> Best,
> Thibaut