[Topic-models] Comparing Topics to Real Classes

Flavio Altinier Maximiano DaSilva fa269 at cornell.edu
Fri Jul 29 15:53:08 EDT 2016


Good morning,

I am facing the problem of correlating topics generated by LDA with real
class labels. For now, I am trying to figure out the best way to do a
one-to-one topic-class assignment (by running the model with as many
topics as there are real labels).

Say I have $J$ real classes (so also $J$ topics). Right now,
what I am doing is:


$$P(d_i \mid z_j) = \prod_{k} P(w_k \in d_i \mid z_j)$$

for all $i$ and $j$, where $d$ are the documents, $z$ the topics and $w$
the words. I then populate a matrix $\Pi$ of dimension $J \times J$:

$$\Pi_{k, j} = \prod_{m} P(d_m \in c_k \mid z_j)$$

where $c$ are the real class labels. Then, to make assignments, I
just choose the highest value of $\Pi$, assign that class $k$ to
that topic $j$, then choose the next highest value (among the classes and
topics not yet assigned) for the next assignment, and so on.
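
The procedure above can be sketched roughly as follows. This is only a
minimal illustration, not a reference implementation: the function names,
the `doc_term` count matrix and the `topic_word` matrix of P(w | z) are
hypothetical inputs, the product over words is assumed to run over every
token (so repeated words count repeatedly), and everything is done in log
space so the products do not underflow.

```python
import numpy as np

def doc_log_likelihoods(doc_term, topic_word):
    """log P(d_i | z_j) = sum_k n_ik * log P(w_k | z_j).

    doc_term:   (D, V) word counts per document (hypothetical input)
    topic_word: (J, V) rows are P(w | z_j), each summing to 1
    """
    log_phi = np.log(topic_word + 1e-12)  # small constant guards against log(0)
    return doc_term @ log_phi.T           # shape (D, J)

def class_topic_scores(doc_ll, labels, n_classes):
    """log Pi[k, j]: sum of log P(d_m | z_j) over the documents in class k."""
    scores = np.zeros((n_classes, doc_ll.shape[1]))
    for k in range(n_classes):
        scores[k] = doc_ll[labels == k].sum(axis=0)
    return scores

def greedy_assign(scores):
    """Repeatedly take the largest remaining entry and pair its class and topic."""
    scores = scores.astype(float).copy()
    assignment = {}  # topic index -> class index
    for _ in range(min(scores.shape)):
        k, j = np.unravel_index(np.argmax(scores), scores.shape)
        assignment[int(j)] = int(k)
        scores[k, :] = -np.inf  # class k is taken
        scores[:, j] = -np.inf  # topic j is taken
    return assignment
```

Calling `greedy_assign(class_topic_scores(doc_log_likelihoods(X, phi), y, J))`
returns a dictionary mapping each topic index to the class it was paired with.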

The problem is, I believe I need normalization for that -- documents with
lots of words put more weight on a class, as do classes with lots
of documents. What kind of normalization should I use in both steps?
Just dividing the document likelihood by the number of words does not feel
right. In other words, how should I normalize Equation (3) in the original LDA
paper so that document likelihoods are comparable? And once I have that,
how should I normalize the equation that follows it in the paper?
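
To make the length effect concrete, here is a small sketch of the simplest
candidate, the one that does not feel right to me: averaging the log-likelihood
per token within a document, then averaging over the documents of a class.
It is shown only as a baseline for comparison, not as a recommended answer.
The function names and the inputs `doc_term` and `topic_word` are
hypothetical, and every class is assumed to contain at least one document.

```python
import numpy as np

def per_token_log_likelihood(doc_term, topic_word):
    """Average log P(w | z_j) per token, so document length cancels out."""
    log_phi = np.log(topic_word + 1e-12)           # guard against log(0)
    doc_ll = doc_term @ log_phi.T                  # (D, J) summed log-likelihoods
    lengths = doc_term.sum(axis=1, keepdims=True)  # tokens per document
    return doc_ll / np.maximum(lengths, 1)         # avoid dividing by zero

def class_mean_scores(per_token_ll, labels, n_classes):
    """Mean per-token score over the documents of each class (rows: classes)."""
    return np.vstack([per_token_ll[labels == k].mean(axis=0)
                      for k in range(n_classes)])
```

With this averaging, a document and a ten-times-longer copy of it get the
same score, whereas their summed log-likelihoods differ by a factor of ten.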

Thank you,
Flavio.