[Topic-models] Does word-ordering matter in Gibbs sampling?

dan danwalkeriv at gmail.com
Wed Jul 26 13:23:11 EDT 2017


In theory it shouldn't matter: a Gibbs sampler with infinite time and
machine precision would eventually mix well, converge in distribution, and
sample from every region of the support in proportion to its probability
mass. In practice, I think you are right that the data ordering could
quickly push you into a local maximum that would be difficult (or
impossible, given finite time and machine precision) to ever exit. One
approach to mitigating this
problem would be to do a random sweep over the variables that you are
sampling. Another might be to use deterministic annealing. Charles Elkan
has a great description of how deterministic annealing works in the
context of EM for mixture models (
http://cseweb.ucsd.edu/~elkan/250Bwinter2011/mixturemodels.pdf). I tried
applying the same concepts to a Gibbs sampler in my dissertation work and
achieved some really promising results (
http://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=4529&context=etd).
The advantage of DA would be that it helps avoid all kinds of maxima, not
just those caused by scan order.
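
In case it helps make the random-sweep idea concrete, here is a rough
sketch of one collapsed Gibbs sweep for LDA that visits the tokens in a
random order and exposes an inverse-temperature knob as a crude stand-in
for the annealing idea (you would raise it toward 1 over iterations). The
array layout and the simple power-tempering are just my own illustration,
not code from the dissertation:

# Sketch: random-scan collapsed Gibbs sweep for LDA with optional tempering.
# All names and the tempering scheme are illustrative, not a reference
# implementation.
import numpy as np

def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta, V, inv_temp=1.0, rng=None):
    """One sweep over all (document, position) pairs in a random order.

    docs     : list of word-id lists, one per document
    z        : list of topic-assignment lists, same shape as docs
    ndk      : (D, K) document-topic counts
    nkw      : (K, V) topic-word counts
    nk       : (K,)   topic totals
    inv_temp : 1.0 is standard Gibbs; values < 1 flatten the conditional
               (a crude annealing schedule raises this toward 1 over sweeps).
    """
    rng = rng or np.random.default_rng()
    K = ndk.shape[1]
    # Random scan: visit every token exactly once per sweep, but in a fresh
    # random order, so the update order is not tied to word order in the text.
    positions = [(d, i) for d, doc in enumerate(docs) for i in range(len(doc))]
    rng.shuffle(positions)
    for d, i in positions:
        w, k_old = docs[d][i], z[d][i]
        # Remove the token's current assignment from the counts.
        ndk[d, k_old] -= 1
        nkw[k_old, w] -= 1
        nk[k_old] -= 1
        # Full conditional for this token's topic, optionally tempered.
        p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
        p = p ** inv_temp
        p /= p.sum()
        k_new = rng.choice(K, p=p)
        # Add the token back under its new assignment.
        ndk[d, k_new] += 1
        nkw[k_new, w] += 1
        nk[k_new] += 1
        z[d][i] = k_new

The important bit is just that the scan order is re-randomized every sweep,
so no single ordering of the words gets to dominate the early iterations.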

I also did a quick search and came across these relevant publications:
- Scan Order in Gibbs Sampling: Models in Which It Matters and Bounds on
  How Much (https://arxiv.org/pdf/1606.03432.pdf)
- Implementing Random Scan Gibbs Samplers
  (https://link.springer.com/article/10.1007/BF02736129)

--dan

On Tue, Jul 25, 2017 at 9:49 PM, Eric Kang <erickangnz at gmail.com> wrote:

> Hi everyone,
>
> My apologies if this is an uninformed question, but in Gibbs sampling for
> LDA inference, aren’t the various counts of word-topic assignments updated
> word-by-word? Doesn’t this make it somewhat dependent on word ordering? For
> example, if word_1 is strongly associated with topic_1 and word_2 is
> strongly associated with topic_2, if I see a document {word_1, word_1, …
> (100 times), word_2, word_2, … (100 times), word_2}, then by the time I
> start seeing word_2, wouldn’t the algorithm be more inclined to think that
> it should be assigned to topic_1, compared to a scenario where I see the
> document {word_1, word_2, word_1, word_2, …}?
>
> Thank you,
> Eric
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>