[Topic-models] Questions regarding the training set likelihood of LDA

Pradipto Das pdas3 at buffalo.edu
Sat Mar 27 23:49:42 EDT 2010


Hi Liangjie,
You should measure the fit on a held-out test set, not on the training 
set (in conjunction with cross-validation). On the training set you 
could get the lowest perplexity when the number of topics equals the 
number of words in the vocabulary - the model's way of saying "hey! I 
have a perfect understanding of the training data, but I can't say much 
about the unseen data." That is over-fitting, and in that case 
inference on the test data will be very poor.
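
To make this concrete, here is a minimal C sketch of choosing T by 
held-out perplexity. The log-likelihood values are made-up placeholders 
standing in for whatever your inference code produces on the test 
documents; perplexity = exp(-log P(w) / N), so lower is better:
-------------------------------------------------------------------------------
#include <math.h>
#include <stdio.h>

int main(void) {
    /* candidate topic counts and placeholder held-out log likelihoods
       (substitute the values from your own held-out inference) */
    const int T_candidates[] = {25, 50, 100, 200};
    const double heldout_loglik[] = {-812000.0, -795000.0, -798000.0, -810000.0};
    const double n_tokens = 100000.0;  /* number of held-out tokens */

    int best = 0;
    for (int i = 0; i < 4; i++) {
        /* perplexity = exp(-log P(w_heldout) / N); lower is better */
        double perp = exp(-heldout_loglik[i] / n_tokens);
        printf("T = %3d   held-out perplexity = %10.2f\n", T_candidates[i], perp);
        if (heldout_loglik[i] > heldout_loglik[best])
            best = i;
    }
    printf("pick T = %d (lowest held-out perplexity)\n", T_candidates[best]);
    return 0;
}
-------------------------------------------------------------------------------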

Pradipto

Hong, Liangjie wrote:
> Hi all,
>
> I'm a newbie to LDA and find it a very interesting tool for text 
> mining. My question concerns the training set likelihood. What I am 
> trying to do is compute the log likelihood (i.e., log P(w|z)) during 
> Gibbs sampling and watch it converge. I computed it using Equation (2) 
> in Griffiths and Steyvers' "Finding scientific topics", and it does 
> indeed converge during sampling. However, when I try to reproduce the 
> model selection procedure from the same paper, where the number of 
> topics is determined by varying T and looking for the optimum of 
> log P(w|z), the log likelihood in my case ALWAYS gets larger as T gets 
> smaller. For example, in my experiments, log P(w|z) for 50 topics is 
> always larger than for 100 topics.
>
> Does log P(w|z) generally get smaller as T gets larger? And can 
> log P(w|z) be used to determine the number of topics, as Griffiths 
> et al. did in the paper?
>
> To give more information on the problem, here is the equation I used 
> (Equation 2 in "Finding scientific topics") in pseudo-LaTeX:
>
> P(w|z) = ( \Gamma(W\beta) / \Gamma(\beta)^{W} )^{T}
>     \prod_{j=1}^{T} \frac{ \prod_{w} \Gamma(n_{j}^{(w)} + \beta) }
>                          { \Gamma(n_{j}^{(\cdot)} + W\beta) }
>
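> Taking logs (which is what the code below accumulates term by term), 
> this becomes:
>
> log P(w|z) = T ( \log\Gamma(W\beta) - W \log\Gamma(\beta) )
>     + \sum_{j=1}^{T} [ \sum_{w} \log\Gamma(n_{j}^{(w)} + \beta)
>                        - \log\Gamma(n_{j}^{(\cdot)} + W\beta) ]
>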
> Here is the C code I wrote for the calculation, where lgamma is the 
> log-gamma function, n_t_k[t][k] is the number of times term t is 
> assigned to topic k, and n_k[k] is the total number of terms assigned 
> to topic k:
> -------------------------------------------------------------------------------
> for (int k = 0; k < TOPIC_NUM; k++) {
>     /* normalization: log Gamma(W*beta) - W * log Gamma(beta), once per topic */
>     training_likelihood += lgamma(TERM_NUM * beta) - TERM_NUM * lgamma(beta);
>     /* numerator: sum over the vocabulary of log Gamma(n_j^(w) + beta) */
>     for (int i = 0; i < TERM_NUM; i++) {
>         training_likelihood += lgamma(n_t_k[i][k] + beta);
>     }
>     /* denominator: log Gamma(n_j^(.) + W*beta) */
>     training_likelihood -= lgamma(n_k[k] + TERM_NUM * beta);
> }
> -------------------------------------------------------------------------------
>
>
> Thanks in advance; I would appreciate it if someone could shed some 
> light on this problem.
>
>
> Liangjie
>

-- 
Pradipto Das
PhD Candidate
CSE Dept.
SUNY Buffalo

www.buffalo.edu/~pdas3


