[Topic-models] Explanation about Polya urn model and LDA (Thibaut Thonet) (Gabriele Pergola)

Gabriele Pergola gabriele.pergola at gmail.com
Wed Jul 19 04:31:14 EDT 2017


Hi Thibaut,
thank you again for your help!

I would like to ask you, or anyone else on the mailing list, whether you
are aware of any available implementation of LDA based on the Polya urn
model. Ideally, it would be the one presented in "Optimizing semantic
coherence in topic models" by Mimno et al. 2011.

Thank you.

Best,
Gabriele

>> ---------- Forwarded message ----------
>> From: Gabriele Pergola <gabriele.pergola at gmail.com>
>> To: topic-models at lists.cs.princeton.edu
>> Date: Wed, 12 Jul 2017 00:13:44 +0100
>> Subject: Re: [Topic-models] Explanation about Polya urn model and LDA
>> (Thibaut Thonet)
>> Hi Thibaut,
>>
>> Clear enough?! You have been great!
>> It's one of the clearest explanations I've read so far.
>>
>> Actually, before your answer, I had missed one point: the words whose
>> counts are increased by A_vw are already "under topic z". I wrongly
>> thought that words under different topics might also experience a
>> frequency increment; this would have entailed that those words change
>> their topic assignments, which in turn would change the proportion of
>> words assigned to a topic in a document (i.e., N_dz).
>> Of course, this does not occur if the words whose frequency is increased
>> were already under the same topic.
>>
>> Speaking of which, could you suggest any works (if any exist) that have
>> explored the idea of assigning the newly sampled topic not only to the
>> current word but also to its related words?
>> (Assuming this idea makes sense...)
>>
>> Thank you so much for your help!
>> Best,
>> Gabriele
>>
>>> ---------- Forwarded message ----------
>>> From: Thibaut Thonet <thibaut.thonet at irit.fr>
>>> To: Gabriele Pergola <gabriele.pergola at gmail.com>
>>> Cc: topic-models <topic-models at lists.cs.princeton.edu>
>>> Date: Fri, 7 Jul 2017 08:11:44 +0200
>>> Subject: Re: [Topic-models] Explanation about Polya urn model and LDA
>>>
>>> Hi Gabriele,
>>>
>>> Let's first have a look at the original LDA's generative process under
>>> the simple Polya urn perspective. We consider two types of urns. Urns of
>>> the first type (which we will call theta-urns) are specific to documents
>>> (i.e., there are D such urns), and each of them initially contains alpha
>>> balls of each of T different colors (T being the number of topics). Urns
>>> of the second type (which we will call phi-urns) are specific to topics
>>> (i.e., T such urns), and each of them initially contains beta balls of
>>> each of W different colors (W being the vocabulary size). For the n-th
>>> token in the d-th document, we first draw a ball from the d-th theta-urn.
>>> We observe its color z and set z_dn = z. We then apply the following
>>> replacement scheme: we put the ball back in the d-th theta-urn and add
>>> another ball of color z in that same urn. Secondly, we draw a ball from
>>> the z-th phi-urn. We observe its color w and set w_dn = w. And once
>>> again, we put the ball back in the z-th phi-urn and add another ball of
>>> color w in that urn.
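>>>
>>> As a rough sketch, that generative process could look as follows in
>>> Python (the dimensions, hyperparameter values, and variable names are
>>> illustrative assumptions, not code from the paper):
>>>
>>>   import numpy as np
>>>
>>>   D, T, W, N = 3, 2, 5, 10     # docs, topics, vocab size, tokens per doc
>>>   alpha, beta = 1.0, 0.5       # initial balls of each color per urn
>>>   rng = np.random.default_rng(0)
>>>
>>>   theta_urns = np.full((D, T), alpha)  # document urns over topic colors
>>>   phi_urns = np.full((T, W), beta)     # topic urns over word colors
>>>
>>>   for d in range(D):
>>>       for n in range(N):
>>>           # Draw a topic ball from the d-th theta-urn, then put it back
>>>           # along with one extra ball of the same color.
>>>           z = rng.choice(T, p=theta_urns[d] / theta_urns[d].sum())
>>>           theta_urns[d, z] += 1
>>>           # Draw a word ball from the z-th phi-urn, same replacement.
>>>           w = rng.choice(W, p=phi_urns[z] / phi_urns[z].sum())
>>>           phi_urns[z, w] += 1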
>>>
>>> The generative process for Generalized Polya Urn LDA (GPU-LDA) is very
>>> similar. We still have document-specific theta-urns and topic-specific
>>> phi-urns. The only difference compared with LDA lies in the replacement
>>> scheme for phi-urns, after drawing a ball from the z-th phi-urn, observing
>>> w and setting w_dn = w. Instead of putting the ball back in the z-th
>>> phi-urn and adding another ball of color w, we add A_vw balls for each
>>> color v=1...W to the z-th phi-urn. Intuitively, this will increase the
>>> likelihood of subsequently observing words v that are related to w (i.e.,
>>> words v such that A_vw > 0) under topic z. The replacement scheme for
>>> theta-urns however remains the same as in LDA.
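>>>
>>> Continuing the sketch above, only the phi-urn replacement changes. Here
>>> A is a hypothetical W x W relatedness matrix (in Mimno et al. 2011 it is
>>> derived from document co-occurrence statistics; the values below are
>>> made up):
>>>
>>>   A = np.eye(W)                # diagonal: the drawn ball is returned
>>>   A[0, 1] = A[1, 0] = 0.5      # pretend word types 0 and 1 are related
>>>
>>>   # Inside the loop, instead of phi_urns[z, w] += 1:
>>>   w = rng.choice(W, p=phi_urns[z] / phi_urns[z].sum())
>>>   phi_urns[z] += A[:, w]       # add A_vw balls of every color v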
>>>
>>> To put it differently, in GPU-LDA, the replacement scheme for theta-urns
>>> follows that of a simple Polya urn while the replacement scheme for
>>> phi-urns follows that of a generalized Polya urn. This is the reason why
>>> N_dz is only increased or decreased by 1, while for all v=1...W, N_zv is
>>> increased or decreased by A_vw. In that case, N_dz still represents the
>>> number of tokens in document d which are assigned topic z, but N_zv is
>>> no longer equal to the number of tokens with word type v that are
>>> assigned topic z in the collection. Rather, N_zv is the total number of
>>> balls of color v that were previously added to the z-th phi-urn through
>>> the GPU replacement scheme (excluding the initial beta balls from the
>>> count).
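>>>
>>> The corresponding count bookkeeping in one Gibbs update might look like
>>> this, in the spirit of Algorithm 2 of Mimno et al. 2011 (the function
>>> name and the simplified LDA-style conditional are my own illustrative
>>> assumptions; counts are kept as floats since A_vw need not be integer):
>>>
>>>   def gibbs_step(d, w, z_old, N_dz, N_zv, A, alpha, beta, rng):
>>>       T, W = N_zv.shape
>>>       # Remove the token's contribution: N_dz by 1, N_zv by A[:, w].
>>>       N_dz[d, z_old] -= 1
>>>       N_zv[z_old] -= A[:, w]
>>>       # Sample a new topic from the current urn compositions.
>>>       denom = N_zv.sum(axis=1) + W * beta
>>>       p = (N_dz[d] + alpha) * (N_zv[:, w] + beta) / denom
>>>       z_new = rng.choice(T, p=p / p.sum())
>>>       # Add the contribution back under the new topic.
>>>       N_dz[d, z_new] += 1
>>>       N_zv[z_new] += A[:, w]
>>>       return z_new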
>>>
>>> Let me know if my explanation was clear enough.
>>>
>>> Best,
>>>
>>> Thibaut
>>> On 06/07/2017 at 17:01, Gabriele Pergola wrote:
>>>
>>> Hello!
>>>
>>> I came across the paper "Optimizing semantic coherence in topic models"
>>> by Mimno et al. 2011, where they present a modified version of Gibbs
>>> sampling following the generalized Polya-urn model.
>>>
>>> I couldn't manage to find any code; it seems none was provided, so I
>>> decided to implement it myself.
>>>
>>> However, I have a problem. If you look at the pseudocode provided in
>>> the paper ("Algorithm 2"), the counter N_(z,d), which tracks how many
>>> words for a topic are present in a document, is decremented and
>>> incremented only by 1; but because of the Polya urn approach, more than
>>> one word in a document can be assigned to a topic at once (line 10).
>>> I wonder whether this counter should also be updated according to all
>>> the new words that have been assigned to a new topic during one
>>> iteration (line 10); otherwise, an inaccurate value will be recorded for
>>> how prominent a topic is in a document.
>>>
>>> I look forward to an explanation.
>>>
>>> Best,
>>> Gabriele
>>>
>> ---------- Forwarded message ----------
>> From: Thibaut Thonet <thibaut.thonet at irit.fr>
>> To: Gabriele Pergola <gabriele.pergola at gmail.com>
>> Cc: topic-models <topic-models at lists.cs.princeton.edu>
>> Date: Wed, 12 Jul 2017 14:31:30 +0200
>> Subject: Re: [Topic-models] Explanation about Polya urn model and LDA
>> Hi Gabriele,
>>
>> Glad that my explanation could help!
>>
>> I'm not aware of such work. But I'm also not sure one would want to
>> systematically assign the new topic to related words (in addition to the
>> current word), as it might be too constraining. It seems more natural to
>> only influence words towards a topic without compelling their assignment.
>> And this is actually what is done with GPU-LDA: when the count N_zv of a
>> related word v is increased, the topic z will be slightly favored (i.e.,
>> more likely to be assigned) for all tokens with word type v.
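>>
>> A tiny numeric illustration of that soft influence (hypothetical counts,
>> looking only at the word-likelihood term of the sampling formula):
>>
>>   import numpy as np
>>   beta, W = 0.5, 5
>>   N_zv_col = np.array([2.0, 2.0])  # counts of word v under topics 0, 1
>>   N_z = np.array([20.0, 20.0])     # total balls in each phi-urn
>>   p = (N_zv_col + beta) / (N_z + W * beta)
>>   print(p / p.sum())               # [0.5 0.5]: both topics equally likely
>>   N_zv_col[0] += 1.5               # GPU adds A_vw balls under topic 0
>>   N_z[0] += 1.5                    # the urn's total grows accordingly
>>   p = (N_zv_col + beta) / (N_z + W * beta)
>>   print(p / p.sum())               # topic 0 now favored, but not forced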
>>
>> Also, keep in mind that a word can have several meanings (e.g., 'bank'
>> as the financial institution or as the land bordering a river). So the
>> 'hard' constraint you want to enforce could, for example, lead to linking
>> all occurrences of the word 'river' (which is somewhat related to 'bank')
>> to the topic of finance. While I'm not saying this phenomenon won't occur
>> at all for GPU-LDA, my intuition is that it will be less prominent.
>>
>> Best,
>>
>> Thibaut
>>