[Topic-models] asymmetric priors and IDF

Wray Buntine wray.buntine at monash.edu
Thu Jun 16 05:25:33 EDT 2016


Hi

Well, the standard asymmetric prior in LDA is set up to *learn* the weights
for the prior. To do this you need a pretty good algorithm, because you have
10,000 weights in the prior, one for each word.

You'll see this in Figure 2 of our KDD 2014 paper (Buntine and Mishra):
           http://dl.acm.org/citation.cfm?id=2623691
There is a vector $\beta$ that is learnt; it represents the prior probability
of words appearing in topics, i.e., the word weights are normalised.
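
If you just want to play with a learned word prior without implementing the
algorithm from the paper, here is a rough sketch using gensim's LdaModel with
eta='auto' (that is gensim's variational LDA, not our sampler; the toy
documents and settings below are just placeholders):

# Rough sketch, not the sampler from the KDD paper: gensim's variational LDA
# with eta='auto', which learns an asymmetric per-word prior playing roughly
# the role of the beta vector above. The toy documents are placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["neural", "network", "learning", "model"],
        ["bayesian", "prior", "learning", "model"],
        ["music", "movie", "star", "interview"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# alpha='auto' learns the document-topic prior, eta='auto' the topic-word one.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha='auto', eta='auto', passes=50)

# One learned weight per vocabulary word; large weights correspond to
# "background"-ish words.
for word_id, weight in sorted(enumerate(lda.eta), key=lambda x: -x[1])[:5]:
    print(dictionary[word_id], weight)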

What happens, as a result, is that $\beta$ becomes the "background" topic,
which is kind of the inverse of TFIDF:
     *)  you want "topical" words to appear in only a few topics, so they
          get lower weights, not higher
     *)  if you include stop words (unlike regular LDA preprocessing), it
          handles them very nicely, and generally makes them more likely
          (spread across all topics, so they are no longer differentiating
          and are effectively ignored during topic estimation)
     *)  it gives high weights to stop words, but not to all of them; some
          stop words *are* topical
     *)  in news articles, pronouns such as "I, me, mine" often occur in
          articles about movie and music stars, being self-centred ;-),
          so they become very topical!
     *)  regular words that are *not* topical get up-weighted

Here are the top background words for NIPS abstracts with stop words removed:

present,apply,base,thus,describe,provide,similar,take,include,second

You can see they are *not* stop words, but they are very plain: not topical
for NIPS. Moreover, not all of them are that common; it's that they appear
randomly throughout documents and cannot be associated with any given topic.
Here are the top background words for ABC (Australia) news articles about Japan:

japanese,world,include,group,take,first,country,united-states,day

So "japanese" would be a topical word in general news articles, but if the
article is tagged as "Japan", then it becomes non-topical.

The background topic, I find, is really informative about the collection,
and it's not something vanilla LDA can give you.  I now routinely
display it when I show topics.  See this:
           https://wordpress.com/post/topicmodels.org/577
In the figure, the background topic is in blue.

So, what I'm saying is, rather than trying to set the weights via TFIDF,
you are better off learning the weights.
Anyway, I'm not sure if MALLET does this well.  At least the original
2009 paper had a method that didn't do the word weights well,
according to their experiments.  When done properly, the
asymmetric prior blitzes the symmetric case, almost always.
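
To make the comparison concrete, here is a rough sketch of the two options,
again with gensim rather than our code; `corpus` and `dictionary` are as in
the earlier snippet, and the 0.01 scale on the IDF-based prior is a guess
that would need tuning:

# Option A: fix the topic-word prior from a (smoothed) inverse document
# frequency, one value per vocabulary word.
import numpy as np
from gensim.models import LdaModel

num_docs = len(corpus)
df = np.zeros(len(dictionary))
for bow in corpus:
    for word_id, _count in bow:
        df[word_id] += 1
idf = np.log((1.0 + num_docs) / (1.0 + df)) + 1.0   # smoothed, always > 0
eta_idf = 0.01 * idf / idf.mean()                    # scale is a guess

lda_idf = LdaModel(corpus, id2word=dictionary, num_topics=2, eta=eta_idf)

# Option B: let the model learn the asymmetric word prior instead.
lda_learned = LdaModel(corpus, id2word=dictionary, num_topics=2, eta='auto')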

Wray Buntine
http://topicmodels.org

On 15 June 2016 at 23:57, Julien Velcin <julien.velcin at univ-lyon2.fr> wrote:

> Dear topic modelers,
>
> I'm wondering whether someone has tried to use an asymmetric prior in LDA
> for p(w|z), based on the inverse document frequency (IDF). We can postulate
> that this kind of prior will lower the impact of stop words and, therefore,
> result in topics of higher quality.
>
> By the way, if this is a good idea, which packages allow one to (easily)
> set up asymmetric priors? For instance, MALLET is based on symmetric priors.
>
> Thank you,
>
> Julien
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>