[Topic-models] Explanation about Polya urn model and LDA

Thibaut Thonet thibaut.thonet at irit.fr
Fri Jul 7 02:11:44 EDT 2017

Hi Gabriele,

Let's first have a look at the original LDA's generative process under 
the simple Polya urn perspective. We consider two types of urns. Urns of 
the first type (which we will call theta-urns) are specific to documents 
(i.e., there are D such urns), and each of them initially contains alpha 
balls of each of T colors (T being the number of topics). Urns of the 
second type (which we will call phi-urns) are specific to topics (i.e., 
there are T such urns), and each of them initially contains beta balls 
of each of W colors (W being the vocabulary size). For the n-th token in 
the d-th document, we first draw a ball from the d-th theta-urn. We 
observe its color z and set z_dn = z. We then apply the following 
replacement scheme: we put the ball back in the d-th theta-urn and add 
another ball of color z to that same urn. Secondly, we draw a ball from 
the z-th phi-urn. We observe its color w and set w_dn = w. And once 
again, we put the ball back in the z-th phi-urn and add another ball of 
color w to that urn.
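To make the bookkeeping concrete, here is a minimal Python sketch of 
this urn-based generative process. This is my own illustrative code, 
not from any LDA implementation; the function name and interface are 
made up, and urns are represented as per-color ball counts.

```python
import random

def polya_urn_lda(doc_lengths, T, W, alpha, beta, seed=0):
    """Generate a corpus via LDA's generative process, viewed as draws
    from simple Polya urns. Each urn is a list of per-color ball counts
    used as sampling weights."""
    rng = random.Random(seed)
    # theta-urns: one per document, alpha balls of each of T colors
    theta_urns = [[alpha] * T for _ in doc_lengths]
    # phi-urns: one per topic, beta balls of each of W colors
    phi_urns = [[beta] * W for _ in range(T)]
    corpus, topics = [], []
    for d, n_d in enumerate(doc_lengths):
        words_d, z_d = [], []
        for _ in range(n_d):
            # draw a ball from the d-th theta-urn; its color is topic z
            z = rng.choices(range(T), weights=theta_urns[d])[0]
            theta_urns[d][z] += 1  # put it back + add one more of color z
            # draw a ball from the z-th phi-urn; its color is word w
            w = rng.choices(range(W), weights=phi_urns[z])[0]
            phi_urns[z][w] += 1    # put it back + add one more of color w
            z_d.append(z)
            words_d.append(w)
        corpus.append(words_d)
        topics.append(z_d)
    return corpus, topics
```

Note that this scheme never materializes theta or phi: the urn draws 
are exactly the Dirichlet-multinomial predictive distributions, with 
theta and phi marginalized out.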

The generative process for Generalized Polya Urn LDA (GPU-LDA) is very 
similar. We still have document-specific theta-urns and topic-specific 
phi-urns. The only difference from LDA lies in the replacement scheme 
for phi-urns: after drawing a ball from the z-th phi-urn, observing w, 
and setting w_dn = w, instead of putting the ball back in the z-th 
phi-urn and adding another ball of color w, we add A_vw balls of each 
color v=1...W to the z-th phi-urn. Intuitively, this increases the 
likelihood of subsequently observing words v that are related to w 
(i.e., words v such that A_vw > 0) under topic z. The replacement 
scheme for theta-urns, however, remains the same as in LDA.

To put it differently, in GPU-LDA, the replacement scheme for theta-urns 
follows that of a simple Polya urn while the replacement scheme for 
phi-urns follows that of a generalized Polya urn. This is the reason why 
N_dz is only increased or decreased by 1, while for all v=1...W, N_zv is 
increased or decreased by A_vw. In that case, N_dz still represents the 
number of tokens in document d which are assigned topic z, but N_zv is 
no longer equal to the number of tokens with word type v which are 
assigned topic z in the collection. Instead, N_zv is the total number of 
balls of color v that were previously added to the z-th phi-urn 
(excluding the initial beta balls from the count) using the GPU 
replacement scheme.
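For the count bookkeeping this implies in a Gibbs sweep, here is an 
illustrative sketch (function and variable names are mine, not the 
paper's): the theta-urn count moves by 1, while the phi-urn counts move 
by a whole column of A.

```python
def gpu_gibbs_update_counts(N_dz, N_zv, d, w, old_z, new_z, A):
    """Count bookkeeping for one token of word type w in document d
    during a GPU-LDA Gibbs sweep. N_dz[d][z] counts tokens of document
    d assigned topic z (theta-urn side: simple urn, so +/- 1).
    N_zv[z][v] counts balls of color v added to the z-th phi-urn
    (GPU side, so +/- A[v][w] for every v)."""
    W = len(A)
    # remove the token's contribution under its old topic assignment
    N_dz[d][old_z] -= 1
    for v in range(W):
        N_zv[old_z][v] -= A[v][w]
    # ... in a real sampler, new_z would be drawn here from the
    # conditional distribution computed with the decremented counts ...
    # add the token's contribution under its new topic assignment
    N_dz[d][new_z] += 1
    for v in range(W):
        N_zv[new_z][v] += A[v][w]
```

This is exactly why the answer to the question below is that N_(z,d) 
moving by 1 is correct: only the phi-urn counts use the A_vw increments.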

Let me know if my explanation was clear enough.



On 06/07/2017 at 17:01, Gabriele Pergola wrote:
> Hello!
> I came across the paper "Optimizing semantic coherence in topic 
> models" by Mimno et al. 2011, where they present a modified version of 
> Gibbs sampling following the generalized Polya-urn model.
> I couldn't manage to find any code; it seems it was not provided, so 
> I decided to implement it myself.
> However, I have a problem. If you have a look at the pseudocode 
> provided in the paper ("Algorithm 2"), the counter N_(z,d) of how 
> many words in a document are assigned to a topic is decremented and 
> incremented only by 1; but because of the Polya urn approach, more 
> than one word in a document can be assigned to a topic at once (line 10).
> I wonder whether this counter should also be updated according to all 
> the words that have been assigned to a new topic during one iteration 
> (line 10); otherwise, the count of how prominent a topic is in a 
> document will be wrong.
> I look forward to some explanation.
> Best,
> Gabriele
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
