[Topic-models] Questions regarding the training set likelihood of LDA
Pradipto Das
pdas3 at buffalo.edu
Sat Mar 27 23:49:42 EDT 2010
Hi Liangjie,
You should evaluate topic fit on held-out test data, not on the training
set (ideally in conjunction with cross-validation). On the training set
you would get the lowest perplexity when the number of topics equals the
number of words in the vocabulary - the model is effectively saying "hey!
I have a perfect understanding of the training data, but I can't say
much about unseen data." That is over-fitting, and in that case inference
on test data will be very poor.
Pradipto
Hong, Liangjie wrote:
> Hi all,
>
> I'm a newbie to LDA and find this a very interesting tool for text
> mining. My question is related to training set likelihood. What I am
> trying is to calculate the log likelihood (e.g., log P(w|z) ) during
> the Gibbs Sampling process and to see the convergence. I calculated it
> as Equation (2) in Griffiths and Steyvers' "Finding scientific
> topics". It indeed converges during the process. However, when I want
> to recover the model selection process of what they did in the same
> paper, where the topic number was determined by varying the number of
> topics T and see where the optimum value of log P(z|w) appears, the
> log likelihood, in my case, is ALWAYS smaller than T gets smaller. For
> example, in my experiments, log P(w|z) for 50 topics is always larger
> than 100 topics.
>
> Does log P(w|z) get smaller as T gets larger in general? Can
> we use log P(w|z) to determine the number of topics, like what
> Griffiths et al. did in the paper?
>
> In order to give more information to the problem, here is the equation
> I used ( Equation 2 in "Finding scientific topics") in pseudo Latex code:
>
> P(w|z) = ( \Gamma(W \beta) / \Gamma(\beta)^W )^T \prod_{j=1}^{T}
> ( \prod_{w} \Gamma(n_{j}^{(w)} + \beta) ) / \Gamma(n_{j}^{(\cdot)} + W \beta)
>
> Here is the C code I wrote for the calculation, where lgamma is the log
> gamma function, n_t_k[t][k] is the number of times term t is assigned to
> topic k, and n_k[k] is the total number of terms assigned to topic k:
> -------------------------------------------------------------------------------
> for (int k = 0; k < TOPIC_NUM; k++) {
>     /* per-topic Dirichlet normalizing constant */
>     training_likelihood += lgamma(TERM_NUM * beta)
>                          - TERM_NUM * lgamma(beta);
>     /* numerator: one Gamma term per vocabulary word */
>     for (int i = 0; i < TERM_NUM; i++) {
>         training_likelihood += lgamma(n_t_k[i][k] + beta);
>     }
>     /* denominator: total counts for this topic */
>     training_likelihood -= lgamma(n_k[k] + TERM_NUM * beta);
> }
> -------------------------------------------------------------------------------
>
>
> Thanks in advance; I would appreciate it if someone could shed some
> light on this problem.
>
>
> Liangjie
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>
--
Pradipto Das
PhD Candidate
CSE Dept.
SUNY Buffalo
www.buffalo.edu/~pdas3