[Topic-models] Question regarding automatic evaluation of topic models using PMI/NPMI
mimno at cornell.edu
Mon Apr 10 10:12:16 EDT 2017
The advantage of using a separate corpus (like Wikipedia) is that you have less risk of fooling yourself. You could think that you're doing a better job than you really are by testing on the training data, which is usually a really bad idea. In practice, though, this is unlikely. Topic models are trying to represent the complexity of human discourse with multinomial distributions. They rarely fit too well; they usually fit too poorly.
The advantage of using the source corpus to build word co-occurrence statistics is that you are guaranteed to have a representative sample. Wikipedia is good for general language, but most corpora will have specific elements that are different. If you're worried about overfitting, you might try using held-out cross-validation folds.
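[As a rough sketch of the estimation being discussed: the snippet below computes a topic's average NPMI using document-level co-occurrence counts, where `documents` can be the training corpus itself or an external reference corpus such as Wikipedia. Names and the document-level co-occurrence convention are illustrative assumptions, not the exact setup of Lau et al., who use sliding windows.]

```python
from itertools import combinations
from math import log

def npmi_coherence(topic_words, documents):
    """Mean NPMI over all word pairs in `topic_words`.

    PMI(wi, wj)  = log P(wi, wj) / (P(wi) P(wj))
    NPMI(wi, wj) = PMI(wi, wj) / (-log P(wi, wj))

    Probabilities are estimated as document frequencies over
    `documents` (a list of token lists). This corpus choice is
    exactly the question at hand: training data vs. Wikipedia.
    """
    n_docs = len(documents)
    doc_sets = [set(d) for d in documents]

    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in ds for w in words) for ds in doc_sets) / n_docs

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = p(wi, wj)
        if p_ij == 0:
            # Common convention: NPMI = -1 for never-co-occurring pairs.
            scores.append(-1.0)
            continue
        pmi = log(p_ij / (p(wi) * p(wj)))
        scores.append(pmi / -log(p_ij))
    return sum(scores) / len(scores)
```

Swapping `documents` between the training corpus and an external corpus then gives the two evaluation variants being compared.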
From: topic-models-bounces at lists.cs.princeton.edu <topic-models-bounces at lists.cs.princeton.edu> on behalf of Jocelyn Mazarura <jocelynmazarura at yahoo.com>
Sent: Friday, April 7, 2017 9:03:20 AM
To: topic-models at lists.cs.princeton.edu
Subject: [Topic-models] Question regarding automatic evaluation of topic models using PMI/NPMI
My question is inspired by the article Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality by Lau et al. (2014). (See full reference below.)
When estimating the joint and marginal probabilities for PMI/NPMI, is it OK to use the original data from which I extracted the topics, instead of another large corpus such as English Wikipedia, as they do in the original article?
Reference: Lau, J.H., Newman, D. and Baldwin, T., 2014, April. Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. In EACL (pp. 530-539).
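[For concreteness, the two scores in question, with probabilities estimated from whichever reference corpus is chosen, are the standard definitions:]

```latex
\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}, \qquad
\mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log P(w_i, w_j)}
```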