[Topic-models] Help build an evaluated topic dataset

Michael Röder roeder at informatik.uni-leipzig.de
Fri Jun 24 07:47:49 EDT 2016

Dear all,
1. If you have only 200 topics, all of them should be evaluated. Make sure to show only one topic at a time (if you show more than one, you get different results). The topics should be shown in random order, and the participating human raters shouldn't be able to see whether a topic was created by M1 or M2. Every topic should be rated by a minimum number of persons (maybe 3, but the more the better). If you have more topics, you can sample them, but evaluating only 10 or 20 topics per approach is, in my opinion, not sufficient.
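The blinded setup in point 1 could be sketched as follows (a minimal Python sketch with hypothetical placeholder topic strings): pool the topics from both models, shuffle them, and keep the model labels in an answer key that the raters never see.

```python
# Minimal sketch of a blinded, randomized rating setup.
# The topic strings below are hypothetical placeholders.
import random

m1_topics = ["topic words A", "topic words B"]  # hypothetical M1 output
m2_topics = ["topic words C", "topic words D"]  # hypothetical M2 output

# Pool the topics with their (hidden) model labels and shuffle.
pool = [("M1", t) for t in m1_topics] + [("M2", t) for t in m2_topics]
random.shuffle(pool)

# Raters see only the shuffled topics, one at a time; the answer key
# mapping presentation order back to models stays hidden.
answer_key = {i: model for i, (model, _) in enumerate(pool)}
blinded = [topic for _, topic in pool]
print(blinded)
```

Each rater would then score the topics in `blinded` order, and the ratings are joined back to M1/M2 via `answer_key` only at analysis time.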
2. Apart from the obvious aggregation, i.e., calculating the arithmetic average and the variance of the average ratings of the single topics, it might be interesting to visualize the coherences. Simply sort the ratings for the single topics of both approaches in ascending order and plot them (x = topics (sorted), y = coherence scale). Comparing the two curves might lead to insights that cannot be gained from the average values, e.g., one of the approaches might generate fewer non-coherent topics, or it might create a larger set of very good topics.
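The aggregation and sorted-curve idea in point 2 can be sketched like this (all ratings below are hypothetical; for the actual plot, pass each sorted list to a plotting library such as matplotlib):

```python
# Sketch of point 2 with hypothetical average per-topic ratings:
# report mean and variance per model, then the sorted ratings that
# would form each curve (x = topic rank, y = coherence).
from statistics import mean, variance

ratings = {
    "M1": [2.3, 4.0, 3.0, 4.7, 3.3, 2.7, 3.7, 3.0],  # hypothetical
    "M2": [3.7, 4.3, 3.0, 4.7, 4.0, 3.3, 4.3, 3.7],  # hypothetical
}
for model, r in ratings.items():
    print(f"{model}: mean = {mean(r):.2f}, variance = {variance(r):.2f}")
    print(f"{model} curve points: {sorted(r)}")
```

Plotting the two sorted lists against topic rank gives the two curves to compare, e.g., whether one model's curve starts higher (fewer bad topics) or ends higher (more very good topics).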
3. I am not sure about this point. However, I think that a Kolmogorov–Smirnov test (https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) should be sufficient if we assume that the average ratings for the single topics are 2 samples from probability distributions and we want to check whether both have been sampled from the same reference distribution. If this is completely wrong or somebody has a better idea, please let us know!
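A minimal, standard-library-only sketch of the two-sample Kolmogorov–Smirnov statistic suggested in point 3, treating the per-topic average ratings of M1 and M2 as the two samples (the ratings are hypothetical; in practice `scipy.stats.ks_2samp` computes the same statistic together with a p-value):

```python
# Two-sample KS statistic: the maximum vertical distance between
# the empirical CDFs of the two rating samples.

def ks_statistic(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    points = sorted(set(sample1) | set(sample2))
    d = 0.0
    for x in points:
        cdf1 = sum(v <= x for v in sample1) / n1
        cdf2 = sum(v <= x for v in sample2) / n2
        d = max(d, abs(cdf1 - cdf2))
    return d

# Hypothetical average ratings (1-5 scale) for ten topics per model.
m1 = [2.3, 3.0, 3.3, 3.7, 4.0, 4.3, 2.7, 3.3, 3.0, 4.7]
m2 = [3.0, 3.7, 4.0, 4.3, 4.7, 3.3, 4.0, 4.3, 3.7, 5.0]
print(f"D = {ks_statistic(m1, m2):.2f}")  # prints "D = 0.40"
```

A large D (relative to the critical value for the sample sizes) would suggest the two rating distributions differ, i.e., the models are not equally coherent.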
only my 2 cents ;)
Best regards,
Michael Röder
From: Swapnil Hingmire <swapnil.hingmire at tcs.com>
Date: Mon, Jun 20, 2016 at 10:04 AM
Subject: Re: [Topic-models] Help build an evaluated topic dataset
To: David Mimno <mimno at cornell.edu>
Cc: "topic-models at lists.cs.princeton.edu" <topic-models at lists.cs.princeton.edu>


I have a few doubts regarding human evaluation of topic models.

Comparing two or more models:
    Let us assume that we have inferred two different topic models (M1 and M2) on the same corpus. As an example, let M1 denote vanilla LDA and M2 denote correspondence LDA (Corr-LDA). Now I would like to compare which model has inferred more coherent topics. My doubts are the following:
    1. How many topics of each model should be displayed to the user? (Let us say, we have inferred 100 topics on the NIPS corpus using both M1 and M2, should we show all the 100 topics of each model?)
    2. As mentioned by David, we are using a coherence scale from 1 to 5. How do we aggregate the coherence ratings of the topics inferred by M1 (or M2) into a single coherence score for M1 (or M2)?
    3. How can we say that the topics inferred by M2 are "significantly" more coherent than those inferred by M1 (or vice versa)?

I would appreciate a discussion of these doubts.

Thanks and Regards,
Swapnil Hingmire

-----topic-models-bounces at lists.cs.princeton.edu wrote: -----
To: "topic-models at lists.cs.princeton.edu" <topic-models at lists.cs.princeton.edu>
From: David Mimno 
Sent by: topic-models-bounces at lists.cs.princeton.edu
Date: 06/09/2016 08:59PM
Subject: [Topic-models] Help build an evaluated topic dataset

We need more examples of human-evaluated topic models. I trained a 50-topic model on questions and answers from the CrossValidated site, http://stats.stackexchange.com/. These are available freely from archive.org. Evaluate the topics here:
(Can you find the topic modeling topic?)
If I get enough non-troll responses, I'll post the documents, the Mallet state file, and the response spreadsheet on a github repo.
To create this form I went to http://scripts.google.com and used this code:
function createForm() {
  var form = FormApp.create('Topic Coherence')
      .setDescription("Each list of terms represents a topic. Evaluate each topic's coherence on a scale from 1 to 5. Does a topic contain terms that you would expect to see together on a page? Does it contain terms that would work together as search queries? Could you easily think of a short descriptive label? A Coherent topic (5) should be clear, consistent, and readily interpretable. A Problematic topic (3) should have some related words but might merge two unrelated concepts or contain several off-topic words. A Useless topic (1) should have no obvious connection between more than two or three words.");
  var topics = ["time series data model trend noise signal period change seasonal autocorrelation level arima structure analysis process spatial trends frequency lag",...,"distribution random normal distributions variables variance independent variable distributed sigma probability gaussian poisson case uniform process theorem function mixture sample"];
  topics.forEach(function (topic) {
    form.addScaleItem()
        .setTitle(topic)
        .setBounds(1, 5)
        .setLabels("Useless", "Coherent");
  });
  Logger.log('Published URL: ' + form.getPublishedUrl());
  Logger.log('Editor URL: ' + form.getEditUrl());
}
Topic-models mailing list
Topic-models at lists.cs.princeton.edu





