[Topic-models] Inference problem in LDA
Laura Dietz
dietz at mpi-inf.mpg.de
Wed Mar 2 23:55:12 EST 2011
LDA may find that two topics have an equal share in a document, say theta(t1)=0.5 and theta(t2)=0.5. How would you pick the largest then?
Even if theta(t1)=0.51 and theta(t2)=0.49, would you really want the document to be associated only with the cluster of t1?
LDA gives you a soft clustering instead.
And I think that is a good thing.
You may want to rethink whether LDA is the right model for your purposes. Maybe you would prefer a different model that associates each document with only a single topic rather than with a mixture over topics.
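
To make the tie and near-tie cases concrete, here is a minimal sketch in plain Python. It assumes theta is simply a list of topic proportions for one document; the function name dominating_topic and the min_margin threshold are made up for illustration and are not part of LDA or of any library. It returns a hard assignment only when the top topic clearly leads the runner-up.

```python
def dominating_topic(theta, min_margin=0.1):
    """Return the index of the topic with the largest share in theta,
    or None when it does not lead the runner-up by at least min_margin.

    theta      -- topic proportions of one document (assumed to hold >= 2 topics)
    min_margin -- hypothetical threshold, chosen here only for illustration
    """
    # Topic indices sorted by their proportion, largest first.
    ranked = sorted(range(len(theta)), key=lambda t: theta[t], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if theta[top] - theta[runner_up] < min_margin:
        return None  # tie or near-tie: keep the soft assignment instead
    return top


print(dominating_topic([0.5, 0.5]))          # None
print(dominating_topic([0.51, 0.49]))        # None
print(dominating_topic([0.8, 0.15, 0.05]))   # 0
```

What margin counts as "clear enough" depends on the application; the value 0.1 above is arbitrary.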
Laura
On Mar 2, 2011, at 9:20 PM, Li Qingshan <furtherli at gmail.com> wrote:
> Dear Laura Dietz,
>
> You got my idea exactly. Problem a) is resolved, with Alexandre's help. As for problem b), I mean using the LDA model to do some classification (clustering) work. That's why I want to choose the topic with the highest likelihood among the k topics (in the paper's notation, arg max p(theta | w, alpha, beta)).
>
> I don't quite understand what you mean by "as a document may have more
> than one dominating topic. Even words may have more than one dominating
> topic." Does "dominating" mean high probability? If so, can't we always choose the topic with the highest probability among the k topics?
>
> Is there anything wrong with this?
>
> Yours
> Qingshan Li
>
> 2011/3/2 Laura Dietz <dietz at mpi-inf.mpg.de>
>
> Dear Li Qingshan,
>
> I am not sure whether I understood your concern correctly. There are two
> concepts that are both referred to by the term likelihood.
>
> a) The training (or test, holdout, ...) likelihood of a corpus. This
> describes the quality of the inferred parameters and model.
>
> b) The likelihood of a topic given a document, where we actually mean
> the probability of a topic given a document.
>
> I got the impression that Alexandre, Heinrich, and ap.dat are talking
> about a) but your question refers to b). Is that right?
>
>
>
> > On Wed, Mar 2, 2011 at 10:32, Li Qingshan <furtherli at gmail.com> wrote:
> >> So we can choose the topic with the
> >> highest likelihood as the topic of the document.
>
> When you say likelihood of topic in a document, do you refer to the
> topic with the highest proportion in the document's topic mixture theta?
>
> So you want to choose the topic_d := argmax_t theta_d(t)?
>
> Let's call topic_d the topic that dominates the document. You can use it
> as a rough approximation if you want to find a hard cluster
> (partitioning) of documents.
>
> But the LDA model is much richer than this, as a document may have more
> than one dominating topic. Even words may have more than one dominating
> topic.
>
> Cheers,
> Laura
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>