[Topic-models] Question regarding automatic evaluation of topic models using PMI/NPMI

Arian Pasquali arianpasquali at gmail.com
Fri Apr 7 13:06:59 EDT 2017

Hi Jocelyn
I don't see a problem with that.
I did exactly that in my master's thesis.

In fact I see it as a complementary coherence score.

When you use the original dataset to calculate PMI/NPMI, you get what is
called "intrinsic coherence".
If you use an external reference dataset, such as Wikipedia, you end up with
the "extrinsic coherence" score that the authors suggest.
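
To make the distinction concrete, here is a minimal sketch in Python. It is
not the code from my thesis or from Lau et al. The function name
npmi_coherence, the toy corpora, and the choice of document-level
co-occurrence counts are assumptions made for illustration; other counting
schemes (for example sliding windows) are possible. The only thing that
changes between the intrinsic and extrinsic score is which corpus the
probabilities are estimated from.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, reference_docs):
    """Mean NPMI over all word pairs of one topic.

    Probabilities are estimated as document frequencies in reference_docs
    (a list of token lists). This is an illustrative assumption, not the
    exact estimation procedure of any particular paper.
    """
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # The pair never co-occurs: use the lower bound of NPMI.
            scores.append(-1.0)
        elif p12 == 1.0:
            # NPMI is 0/0 here; treating it as 0 is an assumption of this sketch.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Hypothetical toy data: the corpus the topics were trained on ...
training_docs = [["cat", "dog"], ["cat", "dog"], ["fish", "water"], ["fish", "water"]]
# ... and a separate external reference corpus (standing in for Wikipedia).
external_docs = [["cat", "dog"], ["cat"], ["dog"], ["water"]]

topic = ["cat", "dog"]
intrinsic = npmi_coherence(topic, training_docs)  # same data as the model
extrinsic = npmi_coherence(topic, external_docs)  # external reference data
```

With a real model you would pass the top-N words of each topic and average
the scores over topics.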

You can find the thesis here,
if you are interested.
In chapter 4 I compare the intrinsic and extrinsic scores, using 20
Newsgroups to model topics and Wikipedia as the reference corpus.
I also compared the results with human evaluation collected through
Mechanical Turk.
In the end the results are similar.

If you need to choose one or the other, just keep in mind the resources you
have available for your particular use case.

kind regards
Arian Pasquali

---------- Forwarded message ----------
> From: Jocelyn Mazarura <jocelynmazarura at yahoo.com>
> To: "topic-models at lists.cs.princeton.edu" <
> topic-models at lists.cs.princeton.edu>
> Cc:
> Bcc:
> Date: Fri, 7 Apr 2017 13:03:20 +0000 (UTC)
> Subject: [Topic-models] Question regarding automatic evaluation of topic
> models using PMI/NPMI
> Hi
> My question is inspired by the article *Machine Reading Tea Leaves:
> Automatically Evaluating Topic Coherence and Topic Model Quality* by
> Lau et al. (2014). (See full reference below.)
> When estimating the joint and marginal probabilities for the PMI/NPMI, is
> it OK to use the same data from which I extracted the topics, instead of
> another large corpus such as English Wikipedia, as they do in the
> original article?
> *Reference:* Lau, J.H., Newman, D. and Baldwin, T., 2014, April. Machine
> Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic
> Model Quality. In *EACL* (pp. 530-539).
> Kind regards
> Jocelyn Mazarura

*Arian Rodrigo Pasquali*
Laboratório de Inteligência Artificial e Apoio à Decisão
Laboratory of Artificial Intelligence and Decision Support

Campus da FEUP
Rua Dr Roberto Frias
4200-465 Porto

T +351 22 040 2963
F +351 22 209 4050
arian.r.pasquali at inesctec.pt