[Topic-models] Likelihood of a held out set using DTM code

Ruth Jemma White R.J.White at pgr.reading.ac.uk
Tue Oct 25 16:29:48 EDT 2016

Dear Topic Modellers,

Apologies if this has been addressed previously but I couldn't find a reference to it.

I am using dynamic topic models (DTMs) to look at changes in routines over time in a dataset of activities of daily living. I have successfully applied S. Gerrish and D. Blei's C++ implementation of DTMs (https://github.com/blei-lab/dtm) to my dataset. I would now like to validate my results quantitatively, following the approach described in Section 4 of the original paper: D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006. <http://www.cs.columbia.edu/%7Eblei/papers/BleiLafferty2006a.pdf> In the paper, the predictive power of the DTM is compared with that of LDA (estimated from different subsets of the data).

My understanding is that a model is fitted to the first N-1 years of data and that this model is then used to find the log likelihood of the final year's articles. I believe that implementing this for my dataset requires the 'time' mode of the C++ implementation. However, I cannot find any examples of how to use this mode or of which flags it needs.
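For concreteness, the split I have in mind can be sketched in Python as below. This is only an illustration under stated assumptions, not something taken from the DTM code: it assumes the documents in the -mult.dat file are ordered by time slice, and that the -seq.dat file holds the number of time slices followed by the number of documents in each slice, as described in the DTM README. The function names split_last_slice and format_seq are my own (hypothetical). Whether the 'time' mode expects its input in this form is exactly what I am unsure about.

```python
from typing import List, Tuple


def split_last_slice(
    mult_lines: List[str], seq_counts: List[int]
) -> Tuple[List[str], List[int], List[str], List[int]]:
    """Split a corpus into the first N-1 time slices and the final slice.

    mult_lines: one document per entry, in -mult.dat line format,
                assumed to be ordered by time slice.
    seq_counts: number of documents in each time slice.
    """
    if len(seq_counts) < 2:
        raise ValueError("need at least two time slices to hold one out")
    if sum(seq_counts) != len(mult_lines):
        raise ValueError("slice counts do not add up to the number of documents")

    n_heldout = seq_counts[-1]
    n_train = len(mult_lines) - n_heldout

    train_docs = mult_lines[:n_train]
    heldout_docs = mult_lines[n_train:]
    return train_docs, seq_counts[:-1], heldout_docs, [n_heldout]


def format_seq(seq_counts: List[int]) -> str:
    """Render slice counts in the assumed -seq.dat layout:
    first the number of slices, then one document count per line."""
    lines = [str(len(seq_counts))] + [str(c) for c in seq_counts]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy corpus: five documents spread over three time slices.
    mult = ["2 0:1 3:2", "1 1:4", "1 2:1", "2 0:2 1:1", "1 3:5"]
    seq = [2, 2, 1]
    train_docs, train_seq, heldout_docs, heldout_seq = split_last_slice(mult, seq)
    print(format_seq(train_seq), end="")
    print(format_seq(heldout_seq), end="")
```

The training part would then be written out as the training corpus files and the final slice as the held-out corpus files.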

I have tried the following, assuming that a model has already been estimated on the corpus test-mult.dat and saved in model_run/lda-seq, and that test_heldout-mult.dat contains only documents from the single time slice for which a prediction of the log likelihood is desired:
dtm-win64 ./main \
--ntopics=5 \
--mode=time \
--corpus_prefix=example/test \
--rng_seed=0 \
--heldout_corpus_prefix=example/test_heldout \
--heldout_time=1 \
--lda_model_prefix=example/model_run/lda-seq/ \
--outname=example/model_heldout

But after successfully reading both the held-out corpus and existing model, I got this error:
gsl: ../gsl/gsl_vector_double.h:193: ERROR: index out of range
Default GSL error handler invoked.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

I also tried changing the following flags, on the assumption that the original full corpus is required and that --heldout_time specifies the time slice from which point onwards the documents are used as the held-out set:
--heldout_corpus_prefix=example/test \
--heldout_time=12 \

This gives the same error at the same point.

If anyone has any experience with this or an example of how to perform inference for a held-out set for DTMs I would very much appreciate your help and advice.

Many thanks,
Ruth White

PhD Student, University of Reading, UK