[Topic-models] Inference problem in LDA

Laura Dietz dietz at mpi-inf.mpg.de
Wed Mar 2 23:55:12 EST 2011


LDA may find that two topics have an equal share in a document, say theta(t1)=0.5 and theta(t2)=0.5. How would you pick the largest then?

Even if theta(t1)=0.51 and theta(t2)=0.49, would you really want the document to be associated only with the cluster of t1?

LDA gives you a soft clustering instead.
And I think that is a good thing. 
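
A minimal Python sketch of the point above. The function name dominant_topic and the threshold min_margin are made up for illustration and are not part of any LDA toolkit; theta is assumed to be a plain list of topic proportions for one document.

```python
# Hypothetical helper for illustration only, not from any LDA library.
def dominant_topic(theta, min_margin=0.1):
    """Hard-assign a document to its largest topic, or refuse when ambiguous.

    theta      -- list of topic proportions for one document (sums to 1)
    min_margin -- assumed threshold: smallest gap between the two largest
                  proportions that we still accept as a clear winner
    Returns (topic_index, margin); topic_index is None if the gap is too small.
    """
    # Topic indices ordered from largest to smallest proportion.
    ranked = sorted(range(len(theta)), key=lambda t: theta[t], reverse=True)
    best = ranked[0]
    # Gap between the winner and the runner-up (whole mass if only one topic).
    margin = theta[best] - theta[ranked[1]] if len(ranked) > 1 else theta[best]
    if margin < min_margin:
        return None, margin
    return best, margin


for theta in ([0.5, 0.5], [0.51, 0.49], [0.8, 0.15, 0.05]):
    print(theta, dominant_topic(theta))
```

With this threshold, the 0.5/0.5 and 0.51/0.49 documents are left unassigned and only the third one gets a single topic. A plain argmax would silently pick t1 in all three cases, which hides exactly the ambiguity that the soft clustering exposes.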

You may want to rethink whether LDA is the right model for your purposes. Maybe you would prefer a different model that associates each document with only a single topic rather than a mixture over topics.

Laura



On Mar 2, 2011, at 9:20 PM, Li Qingshan <furtherli at gmail.com> wrote:

> Dear Laura Dietz,
> 
> You got my idea exactly. Problem a) has been resolved with the help of Alexandre. As for problem b), I mean using the LDA model to do some classification (clustering) work. That's why I want to choose the highest likelihood (in the paper's notation, arg max p(theta | w, alpha, beta)) among the k topics.
> 
> I don't quite understand what you mean by "as a document may have more
> than one dominating topic. Even words may have more than one dominating
> topic." Does "dominating" mean high probability? If so, can we always choose the topic with the highest probability among the k topics?
> 
> Is anything wrong with this reasoning?
> 
> Yours 
> Qingshan Li
> 
> 2011/3/2 Laura Dietz <dietz at mpi-inf.mpg.de>
> 
> Dear Li Qingshan,
> 
> I am not sure whether I understood your concern correctly. There are two
> concepts that are referred to with the term likelihood.
> 
> a) The training (or test, holdout, ...) likelihood of a corpus. This
> describes the quality of the inferred parameters and model.
> 
> b) The likelihood of a topic given a document, where we actually mean
> the probability of a topic given a document.
> 
> I got the impression that Alexandre, Heinrich, and ap.dat are talking
> about a) but your question refers to b). Is that right?
> 
> 
> 
> > On Wed, Mar 2, 2011 at 10:32, Li Qingshan <furtherli at gmail.com> wrote:
> >> So we can choose the topic with the
> >> highest likelihood as the topic of the document.
> 
> When you say likelihood of topic in a document, do you refer to the
> topic with the highest proportion in the document's topic mixture theta?
> 
> So you want to choose the topic_d := argmax_t theta_d(t)?
> 
> Let's call topic_d the topic that dominates the document. You can use it
> as a rough approximation if you want to find a hard clustering
> (partitioning) of documents.
> 
> But the LDA model is much richer than this, as a document may have more
> than one dominating topic. Even words may have more than one dominating
> topic.
> 
> Cheers,
> Laura