[Topic-models] asymmetric priors and IDF

Julien Velcin julien.velcin at univ-lyon2.fr
Thu Jun 16 08:02:00 EDT 2016


Thank you, Wray, for this thorough and enlightening explanation. You've 
convinced me, and I'll try your mloss library to test automatic 
learning of the priors (possibly with burstiness).

By the way, I cannot view your topics, because access to the WordPress 
post is restricted.

Kind regards,

Julien

> Wray Buntine <wray.buntine at monash.edu>
> June 16, 2016 at 11:25 AM
> Hi
>
> Well, the standard asymmetric prior in LDA is set up to *learn* the 
> weights for the prior.
> To do this you need a pretty good algorithm because you have 10,000 
> weights in the prior,
> one for each word.
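>
> (As an illustrative sketch only, and not the code used for the paper:
> gensim's LdaModel can learn such per-word prior weights when its eta
> parameter is set to 'auto'.)
>
>     from gensim.corpora import Dictionary
>     from gensim.models import LdaModel
>
>     # toy tokenised documents, just to make the sketch self-contained
>     docs = [["neural", "network", "training", "data"],
>             ["bayesian", "prior", "inference", "data"],
>             ["neural", "inference", "prior", "model"]]
>     dictionary = Dictionary(docs)
>     corpus = [dictionary.doc2bow(d) for d in docs]
>
>     lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
>                    alpha='auto',   # learn asymmetric document-topic prior
>                    eta='auto',     # learn one prior weight per word
>                    passes=20)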
>
> You'll see this in Figure 2 of our KDD 2014 paper (Buntine and Mishra):
> http://dl.acm.org/citation.cfm?id=2623691
> There is a vector $\beta$ that is learnt; it represents the prior
> probability of words appearing in topics, i.e., the word weights are
> normalised.
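>
> Roughly, in the simplest Dirichlet form, each topic's word distribution
> is drawn as $\phi_k \sim \mathrm{Dirichlet}(b\,\beta)$ with
> $\sum_w \beta_w = 1$, so $\beta_w$ is the prior probability of word $w$
> under any topic and $b$ is a concentration parameter (the model in the
> paper is non-parametric, but $\beta$ plays the same role).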
>
> What happens, as a result, is that $\beta$ becomes the "background"
> topic, which is kind of the inverse of TF-IDF.
>      *)  you want "topical" words to appear in only a few topics, so
>          they get lower weights, not higher
>      *)  if you include stop words (unlike regular LDA preprocessing),
>          it handles them very nicely, generally making them more likely
>          a priori (spread across all topics), so they are no longer
>          differentiating and are effectively ignored during topic
>          estimation
>      *)  it gives high weights to stop words, though not to all of
>          them: some stop words *are* topical
>      *)  in news articles, pronouns such as "I, me, mine" often occur
>          in articles about movie and music stars, being self-centred
>          ;-), so they become very topical!
>      *)  regular words that are *not* topical get up-weighted
>
> Here are the top background words for NIPS abstracts, with stop words
> removed:
>     present,apply,base,thus,describe,provide,similar,take,include,second
> Note that they are *not* stop words, but they are very plain: not
> topical for NIPS.  Moreover, not all of them are that common; it's
> that they appear randomly throughout documents and cannot be
> associated with any given topic.
> Here are the top background words for ABC (Australia) news articles
> about Japan:
>     japanese,world,include,group,take,first,country,united-states,day
> So "japanese" would be a topical word in a general news article, but
> if the article is tagged as "Japan", then it becomes non-topical.
>
> The background topic, I find, is really informative about the collection,
> and it's not something vanilla LDA can give you.  I now routinely
> display it when I show topics.  See this:
> https://wordpress.com/post/topicmodels.org/577
> In the figure, the background topic is in blue.
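>
> (Continuing the illustrative gensim sketch above: with eta='auto' the
> learnt per-word weights end up in lda.eta, so a crude background word
> list can be printed directly.)
>
>     import numpy as np
>     top = np.argsort(-lda.eta)[:10]            # largest prior weight first
>     print([dictionary[int(i)] for i in top])   # the "background" words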
>
> So, what I'm saying is: rather than trying to set the weights via
> TF-IDF, you are better off learning the weights.
> Anyway, I'm not sure MALLET does this well.  At least the original
> 2009 paper had a method that didn't do the word weights well,
> according to their experiments.  When done properly, the asymmetric
> prior blitzes the symmetric case, almost always.
>
> Wray Buntine
> http://topicmodels.org
>
>
> Julien Velcin <julien.velcin at univ-lyon2.fr>
> June 15, 2016 at 3:57 PM
> Dear topic modelers,
>
> I'm wondering whether someone has tried to use an asymmetric prior in
> LDA for p(w|z), based on the inverse document frequency (IDF).  We can
> postulate that this kind of prior would lower the impact of stop words
> and, therefore, result in topics of higher quality.
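>
> For instance, I have in mind something like this rough sketch (written
> here against gensim's eta parameter, which I believe takes a per-word
> prior; idf_eta is just a placeholder name):
>
>     import numpy as np
>     from gensim.corpora import Dictionary
>     from gensim.models import LdaModel
>
>     # toy tokenised documents, only to make the sketch self-contained
>     docs = [["the", "neural", "network", "model"],
>             ["the", "bayesian", "prior", "model"],
>             ["the", "neural", "prior", "inference"]]
>     dictionary = Dictionary(docs)
>     corpus = [dictionary.doc2bow(d) for d in docs]
>
>     # document frequency of each word, then a smoothed IDF (always > 0)
>     df = np.zeros(len(dictionary))
>     for bow in corpus:
>         for w, _ in bow:
>             df[w] += 1
>     idf = np.log(1.0 + len(corpus) / df)
>
>     # use IDF as the per-word prior weight; the overall scale (0.01) is
>     # a guess that controls how strong the prior is
>     idf_eta = 0.01 * idf / idf.mean()
>
>     lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
>                    eta=idf_eta, passes=20)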
>
> By the way, if this is a good idea, which packages allow one to
> (easily) set up asymmetric priors?  For instance, MALLET is based on
> symmetric priors.
>
> Thank you,
>
> Julien


