[Topic-models] Explanation about Polya urn model and LDA (Thibaut Thonet)

Gabriele Pergola gabriele.pergola at gmail.com
Tue Jul 11 19:13:44 EDT 2017

Hi Thibaut,

Clear enough?! You have been great!
It's one of the clearest explanations I've read so far.

Actually, before your answer, I was missing one point: the words whose
counts are increased by A_vw are already "under topic z". I wrongly thought
that words under different topics might also receive a frequency increment;
that would have entailed those words changing their topic assignments,
which in turn would change the proportion of words assigned to each topic
in a document (i.e., N_dz).
Of course, this does not occur if the words whose frequency is increased
were already under the same topic.

Speaking of which, could you suggest any works (if any exist) that have
explored the idea of assigning the newly sampled topic not only to the
current word but also to its related words?
(Assuming that this idea makes sense..)

Thank you so much for your help!

2017-07-07 17:00 GMT+01:00 <topic-models-request at lists.cs.princeton.edu>:

> ---------- Forwarded message ----------
> From: Thibaut Thonet <thibaut.thonet at irit.fr>
> To: Gabriele Pergola <gabriele.pergola at gmail.com>
> Cc: topic-models <topic-models at lists.cs.princeton.edu>
> Bcc:
> Date: Fri, 7 Jul 2017 08:11:44 +0200
> Subject: Re: [Topic-models] Explanation about Polya urn model and LDA
> Hi Gabriele,
> Let's first have a look at the original LDA's generative process under the
> simple Polya urn perspective. We consider two types of urns. Urns of the
> first type (that we will call theta-urns) are specific to documents (i.e.,
> D such urns) and each initially contains alpha balls of each of T different
> colors (T being the number of topics). Urns of the second type (that we
> will call phi-urns) are specific to topics (i.e., T such urns) and each of
> them initially contains beta balls of each of W different colors (W being the
> vocabulary size). For the n-th token in the d-th document, we first draw a
> ball from the d-th theta-urn. We observe its color z and set z_dn = z. We
> then apply the following replacement scheme: we put the ball back in the
> d-th theta-urn and add another ball of color z in that same urn. Secondly,
> we draw a ball from the z-th phi-urn. We observe its color w and set w_dn =
> w. And once again, we put the ball back in the z-th phi-urn and add another
> ball of color w in that urn.
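> For concreteness, this urn view of LDA's generative process can be sketched
> in a few lines of Python (a minimal illustration, not code from any paper;
> the function name and the symmetric alpha/beta priors are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn_lda(doc_lengths, T, W, alpha, beta):
    """Generate a corpus with LDA viewed as draws from Polya urns.

    Each document d has a theta-urn over T topic colors (alpha balls of
    each color initially); each topic z has a phi-urn over W word colors
    (beta balls of each color initially).
    """
    D = len(doc_lengths)
    theta_urns = np.full((D, T), alpha, dtype=float)  # document urns over topics
    phi_urns = np.full((T, W), beta, dtype=float)     # topic urns over words
    corpus = []
    for d, n_d in enumerate(doc_lengths):
        doc = []
        for _ in range(n_d):
            # Draw a topic ball from the d-th theta-urn; replace it and
            # add one extra ball of the same color (simple Polya urn).
            z = rng.choice(T, p=theta_urns[d] / theta_urns[d].sum())
            theta_urns[d, z] += 1
            # Draw a word ball from the z-th phi-urn, same replacement scheme.
            w = rng.choice(W, p=phi_urns[z] / phi_urns[z].sum())
            phi_urns[z, w] += 1
            doc.append((z, w))
        corpus.append(doc)
    return corpus, theta_urns, phi_urns
```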
> The generative process for Generalized Polya Urn LDA (GPU-LDA) is very
> similar. We still have document-specific theta-urns and topic-specific
> phi-urns. The only difference compared with LDA lies in the replacement
> scheme for phi-urns, after drawing a ball from the z-th phi-urn, observing
> w and setting w_dn = w. Instead of putting the ball back in the z-th
> phi-urn and adding another ball of color w, we add A_vw balls for each
> color v=1...W to the z-th phi-urn. Intuitively, this will increase the
> likelihood of subsequently observing words v that are related to w (i.e.,
> words v such that A_vw > 0) under topic z. The replacement scheme for
> theta-urns however remains the same as in LDA.
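> A single word draw under this generalized scheme might look as follows (my
> own sketch; A is the (W, W) promotion matrix with the A_vw notation used
> above, and the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def gpu_word_draw(phi_urn, A):
    """One word draw from a phi-urn under the generalized Polya urn scheme.

    phi_urn: length-W ball counts for one topic's urn (modified in place).
    A: (W, W) matrix; after drawing color w, A[v, w] balls of color v are
    added for every v, not just a single ball of color w as in simple LDA.
    """
    W = len(phi_urn)
    w = rng.choice(W, p=phi_urn / phi_urn.sum())
    phi_urn += A[:, w]  # add A_vw balls of every color v related to w
    return w
```

With A equal to the identity matrix, this reduces to the simple Polya urn
replacement of standard LDA.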
> To put it differently, in GPU-LDA, the replacement scheme for theta-urns
> follows that of a simple Polya urn while the replacement scheme for
> phi-urns follows that of a generalized Polya urn. This is the reason why
> N_dz is only increased or decreased by 1, while for all v=1...W, N_zv is
> increased or decreased by A_vw. In that case, N_dz still represents the
> number of tokens in document d which are assigned topic z, but N_zv is no
> longer equal to the number of tokens with word type v which are assigned
> topic z in the collection. N_zv is the total number of balls with color v
> that were previously added to the z-th phi-urn (excluding the initial beta
> number of balls from the count) using the GPU replacement scheme.
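> In Gibbs sampling terms, the count bookkeeping for one token then looks
> roughly like this (a sketch under the notation above, not the authors'
> code; the resampling step itself is elided):

```python
import numpy as np

def gpu_gibbs_counts_update(N_dz, N_zv, A, d, w, z_old, z_new):
    """Count updates for one token in GPU-LDA Gibbs sampling.

    N_dz changes by exactly 1, but N_zv changes by the whole column
    A[:, w], matching the generalized replacement scheme: drawing word w
    under a topic adds A_vw balls of every color v to that topic's urn.
    """
    # Remove the token's contribution under its old topic assignment.
    N_dz[d, z_old] -= 1
    N_zv[z_old] -= A[:, w]
    # ... resample z_new from the conditional distribution here ...
    # Add the contribution back under the new assignment.
    N_dz[d, z_new] += 1
    N_zv[z_new] += A[:, w]
```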
> Let me know if my explanation was clear enough.
> Best,
> Thibaut
> On 06/07/2017 at 17:01, Gabriele Pergola wrote:
> Hello!
> I came across the paper "Optimizing semantic coherence in topic models" by
> Mimno et al. 2011, where they present a modified version of Gibbs sampling
> following the generalized Polya-urn model.
> I couldn't find any code; it seems none was provided, so I decided to
> implement it myself.
> However, I have a problem. If you look at the pseudocode provided in the
> paper ("Algorithm 2"), the counter N_(z,d), which tracks how many words in
> a document are assigned to a topic, is only ever decremented and
> incremented by 1; but because of the Polya urn approach, more than one
> word in a document can be assigned to a topic at once (line 10).
> I wonder whether this counter should also be updated to account for all
> the words newly assigned to a topic during one iteration (line 10);
> otherwise, the count of how prominent a topic is in a document will be
> wrong.
> I look forward to an explanation.
> Best,
> Gabriele
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
