[Topic-models] [Topic modeling] LDA for Clustering Text Documents

Durgesh Bhagat durgeshit at gmail.com
Tue Jun 14 01:40:55 EDT 2016


Dear All,

We have text data containing 1000 documents and 93 clusters (only a few,
approximately 30 documents, belong to multiple clusters). We used the Gensim
implementation of LDA to find topics after removing stop words and without
stemming. We assigned each document to the topic with the highest
contribution.
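
A minimal sketch of the assignment step described above, not the exact code that was run. It assumes each document's topic distribution is available as a list of (topic_id, probability) pairs, which is the form Gensim's LdaModel.get_document_topics returns; the helper name dominant_topic and the example distributions are hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the id of the topic with the highest probability.

    doc_topics is a list of (topic_id, probability) pairs.
    Returns None when the list is empty.
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]


# Hypothetical per-document topic distributions (illustrative values only).
example_distributions = [
    [(0, 0.10), (3, 0.75), (7, 0.15)],
    [(2, 0.55), (5, 0.45)],
    [],
]

# One hard cluster label per document.
assignments = [dominant_topic(d) for d in example_distributions]
```

Note that a hard assignment like this discards the mixed membership that LDA models, so a document whose two strongest topics are close (as in the second example) is still forced into a single cluster.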

We used Purity, Rand-Index, Jaccard-coefficient, Precision, and
Recall for cluster evaluation.
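
For readers unfamiliar with these measures, here is a minimal sketch of how pair counts (TP, TN, FP, FN) and purity are commonly computed from gold and predicted labels. It assumes every document has exactly one gold label, so the documents that belong to multiple clusters would need separate handling; the function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(gold, pred):
    """Count document pairs by agreement between gold and predicted clusters."""
    tp = tn = fp = fn = 0
    for (g1, p1), (g2, p2) in combinations(zip(gold, pred), 2):
        same_gold = g1 == g2
        same_pred = p1 == p2
        if same_pred and same_gold:
            tp += 1  # together in both clusterings
        elif same_pred:
            fp += 1  # together in prediction only
        elif same_gold:
            fn += 1  # together in gold only
        else:
            tn += 1  # apart in both clusterings
    return tp, tn, fp, fn


def purity(gold, pred):
    """Fraction of documents that carry the majority gold label of their cluster."""
    by_cluster = {}
    for g, p in zip(gold, pred):
        by_cluster.setdefault(p, Counter())[g] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(gold)


# Toy labels for illustration only.
gold = ["a", "a", "a", "b", "b", "c"]
pred = [0, 0, 1, 1, 1, 2]

counts = pair_counts(gold, pred)
score = purity(gold, pred)
```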

Considering all features, and a few selected features, we get the following
performance:

                   |   TP |     TN |   FP |   FN | Purity | Rand-Index | Jaccard-coefficient
-------------------+------+--------+------+------+--------+------------+--------------------
All unigram words  |  908 | 238298 | 6886 | 4894 |   0.16 |      0.95  |                0.07
Selected features  | 1688 | 240946 | 3539 | 4105 |   0.10 |      0.969 |                0.18
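
As a consistency check, the Rand-Index and Jaccard-coefficient can be recomputed from the pair counts reported above using their standard definitions. This is a sketch under that assumption; the function name pair_metrics is hypothetical, and precision and recall are computed here with the usual pairwise definitions because they are not listed in the table.

```python
def pair_metrics(tp, tn, fp, fn):
    """Derive pairwise clustering metrics from pair counts."""
    total = tp + tn + fp + fn
    return {
        "rand_index": (tp + tn) / total,
        "jaccard": tp / (tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Pair counts copied from the two rows of the table.
all_unigrams = pair_metrics(tp=908, tn=238298, fp=6886, fn=4894)
selected = pair_metrics(tp=1688, tn=240946, fp=3539, fn=4105)

for name, metrics in (("All unigram words", all_unigrams),
                      ("Selected features", selected)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because TN dominates the pair counts when there are many clusters, the Rand-Index stays close to 1 even when few same-cluster pairs are recovered, so the Jaccard-coefficient, precision, and recall are the more informative figures here.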



My questions are:

1) Am I using LDA in the wrong context?
2) Do we have to use other variants of LDA?
3) Will LDA not work well for such a small data set?
4) Do we need to use some other evaluation metrics?


Thank you all in advance!

Regards,

Durgesh Kumar