[Topic-models] Finding annotated datasets

Devashish Deshpande ashu.9412 at gmail.com
Tue Jun 28 16:56:43 EDT 2016

Dear Mr Roeder,

Thanks a lot for the help! I have entered the benchmark testing phase of my
project and will be working with the RTL-wiki first. However, I was
wondering if you have a script for preprocessing this dataset. Since I will
be comparing my project against Palmetto, I think the preprocessing should
be the same so that I can better pinpoint any faults in my project. Would it
be possible to share the script or the procedure you followed for the
preprocessing?

Thanking you,

On Thu, Jun 9, 2016 at 6:27 PM, Michael Röder <
roeder at informatik.uni-leipzig.de> wrote:

> Hi Devashish,
> unfortunately, the blog "http://topics.labs.bluekiwi.de/" no longer exists,
> and I am sorry for any inconvenience this might have caused.
> In the paper, a dataset is defined by three parts:
> 1. a corpus
> 2. topics that have been calculated using the corpus
> 3. human ratings for the topics
> You can find the topics (topics* files) and the human ratings (gold*
> files) used for our paper at:
> (I will add the link to
> the Palmetto web page).
> However, because of their license I am not allowed to upload the corpora.
> You would need them to recreate the upper part of the table. If you are
> interested in that part, please write me a mail and I can describe how you
> could get them.
> Since we did not create all the datasets ourselves, I would like to remind
> you to cite the creators/providers of each dataset where appropriate. You
> can find the references to their publications in our paper, in the section
> that describes the datasets.
> Cheers,
> Michael Röder
> ------------------------------
> From: *Devashish Deshpande* <ashu.9412 at gmail.com>
> Date: Wed, Jun 8, 2016 at 8:35 PM
> Subject: [Topic-models] Finding annotated datasets
> To: topic-models at lists.cs.princeton.edu
> Hey everyone,
> My name is Devashish Deshpande. I am a contributor to the Gensim open
> source topic modelling library in python and am currently working on a
> project to add the topic coherence pipeline as mentioned in this paper
> <http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf> and
> demonstrated in this code
> <https://github.com/AKSW/Palmetto/tree/master/src/main/java/org/aksw/palmetto>
> to gensim. You can find my open PR here
> <https://github.com/piskvorky/gensim/pull/710>.
> For the purpose of writing a blog post on this project and performing some
> benchmark testing, I wanted to reproduce table 2 from the above paper.
> However I was finding it hard to find the annotated datasets that were used
> for this. I did manage to find some links (e.g. the annotated movies dataset
> <http://topics.labs.bluekiwi.de/data/nips2013>, RTL NYT
> <https://catalog.ldc.upenn.edu/LDC2008T19>, genomics
> <http://ir.ohsu.edu/genomics>) but none of them seem to be working. Is
> there any other place where I can download any of these datasets from?
> Any help will be greatly appreciated!
> Thanks!
> Devashish
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models