[Topic-models] Comparing Topics to Real Classes

Flavio Altinier Maximiano DaSilva fa269 at cornell.edu
Sun Jul 31 22:05:46 EDT 2016


Hi Laura,

Thank you for your answer. In my dataset, documents may belong to more than
one class (in fact most of them do), so I'm trying to use a multilabel SVM
classifier. I ran the LDA model with 54 topics (the same as the number of
real classes), and the best accuracy I get with the SVM classifier on a
cross-validation set is around 15%. How many topics do you usually run the
model with?
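
For concreteness, my pipeline looks roughly like this (a scikit-learn
sketch of the setup I described, with placeholder names, not my exact
code):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def topic_multilabel_cv(docs, label_sets, n_topics=54):
        """docs: raw document strings; label_sets: one set of class
        labels per document (placeholders for my real data)."""
        counts = CountVectorizer().fit_transform(docs)
        # theta: documents x topics matrix of topic proportions
        theta = LatentDirichletAllocation(
            n_components=n_topics, random_state=0).fit_transform(counts)
        # binary indicator matrix, one column per class
        Y = MultiLabelBinarizer().fit_transform(label_sets)
        # one binary SVM per class
        clf = OneVsRestClassifier(LinearSVC())
        # "accuracy" here is subset accuracy: a document only counts as
        # correct if every one of its labels is predicted exactly
        return cross_val_score(clf, theta, Y, scoring="accuracy",
                               cv=5).mean()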

Thank you again,
Flavio.

On Sat, Jul 30, 2016 at 1:19 AM, Laura Dietz <dietz at cs.umass.edu> wrote:

> Hi Flavio,
>
> In my experience you get better classification performance when you:
> 1. derive document features from \theta_d (one feature per topic), then
> 2. use Z-score normalization on the features, and
> 3. train an SVM classifier.
>
> I highly recommend comparing against a baseline that uses only
> word-frequency features (one feature per word). Just a word of warning:
> this baseline often outperforms topic features.
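>
> In scikit-learn terms the comparison is roughly this (a sketch, not
> exact code -- it assumes single-label y and a \theta matrix you have
> already computed; adapt the classifier for your multilabel setting):
>
>     from sklearn.feature_extraction.text import CountVectorizer
>     from sklearn.model_selection import cross_val_score
>     from sklearn.pipeline import make_pipeline
>     from sklearn.preprocessing import StandardScaler
>     from sklearn.svm import LinearSVC
>
>     def compare_topic_vs_words(theta, docs, y):
>         # steps 1-3: one feature per topic, Z-scored, then an SVM
>         topic_clf = make_pipeline(StandardScaler(), LinearSVC())
>         # baseline: plain word-frequency features, no topic model
>         word_clf = make_pipeline(CountVectorizer(), LinearSVC())
>         return (cross_val_score(topic_clf, theta, y, cv=5).mean(),
>                 cross_val_score(word_clf, docs, y, cv=5).mean())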
>
> Would you let me know how well it worked?
>
> Cheers,
> Laura
>
>
>
> On 07/29/2016 09:53 PM, Flavio Altinier Maximiano DaSilva wrote:
>
> Good morning,
>
> I am facing the problem of correlating topics generated by LDA with real
> class labels. For now, I am trying to figure out the best way to do a
> one-to-one topic-class assignment (by running the model with the same
> number of topics as there are real labels).
>
> Say I have J real classes (and therefore also J topics). Right now, what
> I am doing is computing
>
> P(d_i | z_j) = \prod_k P(w_k \in d_i | z_j)
>
> for all i and j, where the d are documents, the z topics, and the w the
> words. I then populate a matrix \Pi of dimension J \times J with
>
> \Pi_{k, j} = \prod_m P(d_m \in c_k | z_j)
>
> where the c are the real class labels. Then, to make the assignments, I
> choose the highest value in \Pi, assign that class k to that topic j,
> choose the next highest value for the next assignment, and so on.
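>
> In code, the greedy assignment I mean looks roughly like this (a numpy
> sketch; I work in log-space so the products do not underflow, and the
> names are placeholders):
>
>     import numpy as np
>
>     def build_log_pi(log_lik, class_of, J):
>         """log_lik[m, j] = log P(d_m | z_j), the log of the word product
>         above; class_of[m] = k means document d_m is in class c_k."""
>         log_pi = np.zeros((J, J))
>         for m, k in enumerate(class_of):
>             log_pi[k] += log_lik[m]  # log of the product over documents
>         return log_pi
>
>     def greedy_assign(log_pi):
>         """Repeatedly take the largest remaining entry of Pi and pair
>         that class k with that topic j."""
>         log_pi = log_pi.copy()
>         assignment = {}
>         for _ in range(log_pi.shape[0]):
>             k, j = np.unravel_index(np.argmax(log_pi), log_pi.shape)
>             assignment[k] = j
>             log_pi[k, :] = -np.inf  # class k is taken
>             log_pi[:, j] = -np.inf  # topic j is taken
>         return assignment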
>
> The problem is, I believe I need normalization for that -- documents with
> many words put more weight on a class, and so do classes with many
> documents. What kind of normalization should I use in each step? Simply
> dividing the document likelihood by the number of words does not feel
> right. In other words, how do I normalize Equation (3) in the original LDA
> paper so that document likelihoods are comparable? And once I have that,
> how do I normalize the following equation in the paper?
>
> Thank you,
> Flavio.
>

