[Topic-models] Explanation about Polya urn model and LDA
Thibaut Thonet
thibaut.thonet at irit.fr
Fri Jul 7 02:11:44 EDT 2017
Hi Gabriele,
Let's first have a look at the original LDA's generative process under
the simple Polya urn perspective. We consider two types of urns. Urns of
the first type (that we will call theta-urns) are specific to documents
(i.e., D such urns) and each of them initially contains alpha balls of T
different colors (T being the number of topics). Urns of the second type
(that we will call phi-urns) are specific to topics (i.e., T such urns)
and each of them initially contains beta balls of W different colors (W
being the vocabulary size). For the n-th token in the d-th document, we
first draw a ball from the d-th theta-urn. We observe its color z and
set z_dn = z. We then apply the following replacement scheme: we put the
ball back in the d-th theta-urn and add another ball of color z in that
same urn. Secondly, we draw a ball from the z-th phi-urn. We observe its
color w and set w_dn = w. And once again, we put the ball back in the
z-th phi-urn and add another ball of color w in that urn.
The generative process for Generalized Polya Urn LDA (GPU-LDA) is very
similar. We still have document-specific theta-urns and topic-specific
phi-urns. The only difference compared with LDA lies in the replacement
scheme for phi-urns, after drawing a ball from the z-th phi-urn,
observing w and setting w_dn = w. Instead of putting the ball back in
the z-th phi-urn and adding another ball of color w, we add A_vw balls
for each color v=1...W to the z-th phi-urn. Intuitively, this will
increase the likelihood of subsequently observing words v that are
related to w (i.e., words v such that A_vw > 0) under topic z. The
replacement scheme for theta-urns however remains the same as in LDA.
To put it differently, in GPU-LDA, the replacement scheme for theta-urns
follows that of a simple Polya urn while the replacement scheme for
phi-urns follows that of a generalized Polya urn. This is the reason why
N_dz is only increased or decreased by 1, while for all v=1...W, N_zv is
increased or decreased by A_vw. In that case, N_dz still represents the
number of tokens in document d which are assigned topic z, but N_zv
isn't anymore equal to the number of tokens with word type v which are
assigned topic z in the collection. N_zv is the total number of balls
with color v that were previously added to the z-th phi-urn (excluding
the initial beta number of balls from the count) using the GPU
replacement scheme.
Let me know if my explanation was clear enough.
Best,
Thibaut
Le 06/07/2017 à 17:01, Gabriele Pergola a écrit :
> Hello!
>
> I came across the paper "Optimizing semantic coherence in topic
> models" by Mimno et al. 2011, where they present a modified version of
> Gibbs sampling following the generalized Polya-urn model.
>
> I couldn't manage to find any code, it seems was not provided; so, I
> decided to implement it by myself.
>
> However, I have got a problem. If you have look at the pseudocode
> provided in the paper ("Algorithm 2"), the counter N_(z,d) about how
> many words for a topic are present in a document is decremented and
> incremented only by 1; but because of the polya urn approach, more
> than one words in document can be assigned to a topic at once (line 10).
> I wonder if even this counter should be updated according to all the
> new words that have been assigned to a new topic during one iteration
> (line 10); otherwise, a fake value will be counted about how much a
> topic is prominent in a document.
>
> I look forward some explanation.
>
> Best,
> Gabriele
>
>
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.cs.princeton.edu/pipermail/topic-models/attachments/20170707/089f7c5f/attachment.html>
More information about the Topic-models
mailing list