[Topic-models] HDP tuning

Radim Rehurek RadimRehurek at seznam.cz
Mon Mar 5 13:07:24 EST 2012


I'm glad to report that we tried with more data, and the "online HDP" results were much better: English Wikipedia, https://gist.github.com/1979640 .

I also added a lemmatizer and did more careful vocab pruning, leaving only the 50k most frequent nouns, verbs, adjectives and adverbs, after filtering out lemmas that appeared in more than 10% of docs.
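For readers who want to replicate the pruning step, here is a minimal, illustrative sketch of document-frequency filtering plus a top-k frequency cut (the function name and the toy corpus are hypothetical, not from the actual pipeline):

```python
# Illustrative sketch (not the exact code used) of the vocabulary pruning
# described above: drop lemmas that appear in too large a fraction of
# documents, then keep only the most frequent survivors.
from collections import Counter

def prune_vocab(docs, no_above=0.1, keep_n=50000):
    n_docs = len(docs)
    doc_freq = Counter()   # in how many documents each lemma occurs
    term_freq = Counter()  # total occurrences of each lemma
    for doc in docs:
        doc_freq.update(set(doc))
        term_freq.update(doc)
    # drop lemmas appearing in more than `no_above` of all documents
    kept = [t for t in term_freq if doc_freq[t] / n_docs <= no_above]
    # keep only the `keep_n` most frequent of the rest
    kept.sort(key=lambda t: -term_freq[t])
    return set(kept[:keep_n])

# toy corpus: "the" occurs in every document and gets filtered out
docs = [["the", "a", "b"], ["the", "a", "c"], ["the", "b", "d"],
        ["the", "c", "e"], ["the", "d", "f"], ["the", "e", "g"],
        ["the", "f", "h"], ["the", "g", "i"], ["the", "h", "j"],
        ["the", "i", "j"]]
vocab = prune_vocab(docs, no_above=0.2, keep_n=5)
```

In gensim, much the same effect is available via `Dictionary.filter_extremes(no_above=0.1, keep_n=50000)`; the part-of-speech restriction to nouns, verbs, adjectives and adverbs additionally needs a tagger or lemmatizer.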

Thanks, everyone, for your help and comments. There are still parameters to tune, etc., but these results are already much more cheerful :)

Best,
Radim



> ------------ Original message ------------
> From: Chong Wang <chongw at CS.Princeton.EDU>
> Subject: Re: [Topic-models] HDP tuning
> Date: 18.2.2012 04:35:10
> ----------------------------------------
> the script i used,
> ---- script ----
> 
> #! /bin/bash
> for corpus in nyt
> do
>   if [ $corpus == nature ]; then
>     D=334922
>     W=4253
>   elif [ $corpus == nyt ]; then
>     D=1728305
>     W=8000
>   else
>     D=3611558
>     W=7702
>   fi
>   for kappa in 0.8
>   do
>     for tau in 1
>     do
>       for batchsize in 500
>       do
>         ./qsub.sh /usr/local/python/current/bin/python run_online_hdp.py \
>           --max_time=129600 --corpus_name=$corpus --tau=$tau --max_iter=-1 --D=$D \
>           --kappa=$kappa --K=20 --data_path=../data/${corpus}/mult-train-split-0* \
>           --test_data_path=../data/${corpus}/mult-test.dat --T=300 --W=$W \
>           --var_converge=0.0001 --directory=../results-2012-feb --alpha=1.0 --gamma=1.0 \
>           --batchsize=$batchsize --save_lag=500 --pass_ratio=0.5
>       done
>     done
>   done
> done
> 
> ----- script ------
> 
> 
> --
> Chong Wang
> chongw at cs.princeton.edu
> http://www.cs.princeton.edu/~chongw
> Computer Science Department
> Princeton University
> 
> 
> 
> On Fri, Feb 17, 2012 at 10:11 PM, Chong Wang <chongw at cs.princeton.edu> wrote:
> > in addition, i ran it for a total of 129600 seconds (36 hours).
> >
> > best
> > chong
> > --
> > Chong Wang
> > chongw at cs.princeton.edu
> > http://www.cs.princeton.edu/~chongw
> > Computer Science Department
> > Princeton University
> >
> >
> >
> > On Fri, Feb 17, 2012 at 9:32 PM, Chong Wang <chongw at cs.princeton.edu> wrote:
> >> Hi, Radim
> >>
> >> what new york times data do you use? mine is
> >> New York Times:
> >>
> >> The New York Times (NYT) dataset contains about 1.8 million documents, with about 461 million tokens and a vocabulary size of 8,000. These articles are from the years 1987 to 2007.
> >>
> >> the vocab is at http://www.cs.princeton.edu/~chongw/nyt-vocab.dat
> >>
> >> for kappa=0.8, tau0=1, and batchsize=500, i got topics like this
> >> http://www.cs.princeton.edu/~chongw/nyt-topics.txt
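For context on what kappa and tau0 above control (this is from the online variational inference literature, not stated in this thread): each mini-batch update is weighted by the step size rho_t = (tau0 + t)^(-kappa). A quick sketch:

```python
# Step-size schedule used by online variational inference:
#   rho_t = (tau0 + t) ** (-kappa)
# kappa in (0.5, 1] ensures the stochastic updates converge; a larger
# tau0 down-weights the earliest mini-batches, and a larger kappa makes
# the step size decay faster.
def rho(t, tau0=1.0, kappa=0.8):
    return (tau0 + t) ** (-kappa)

early, late = rho(0), rho(1000)  # step size shrinks as training proceeds
```

With tau0=1 and kappa=0.8 as in the script, the very first mini-batch gets full weight and later batches are discounted polynomially, which is why batch size and total run time interact with these two parameters.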
> >>
> >>
> >> i also found that if the number of documents is small, the online algorithm doesn't work well.
> >>
> >> best
> >> chong
> >>
> >>
> >> ----- Original Message -----
> >> From: "Radim Rehurek" <RadimRehurek at seznam.cz>
> >> To: "Chong Wang" <chongw at CS.Princeton.EDU>
> >> Cc: topic-models at lists.cs.princeton.edu
> >> Sent: Monday, February 13, 2012 3:37:45 AM
> >> Subject: Re: [Topic-models] HDP tuning
> >>
> >> Oh, Mr. Chong himself -- I love the topic-models list :)
> >>
> >>> Thanks for the email. That is a little strange. (it could be a bug.) I
> >>> have been trying on
> >>> some new york times data and it does seem to produce reasonable
> >>> results, like
> >>> http://www.cs.princeton.edu/~chongw/nyt-topics.txt
> >>
> >> But these topics look amazing.
> >> Jonathan says he tried both his adapted code (Python) and your original code, with equally strange results, on the NYT dataset. So the difference must lie in the data preprocessing.
> >>
> >> Can you give more info on how to replicate your result? What particular dictionary did you use, and what training parameters? I'd like to get to the bottom of this, so I can include HDP in gensim with a clear conscience.
> >>
> >> Best,
> >> Radim
> >>
> >>
> >>>
> >>> But if you look at the tail topics, you might see some junk, since
> >>> they are either shrunk by the model or not yet well converged. You can
> >>> take a look at final.topics to see whether a given topic is actually
> >>> active or not.
> >>>
> >>> thanks
> >>> best
> >>> Chong
> >>>
> >>> --
> >>> Chong Wang
> >>> chongw at cs.princeton.edu
> >>> http://www.cs.princeton.edu/~chongw
> >>> Computer Science Department
> >>> Princeton University
> >>>
> >>>
> >>>
> >>> On Sun, Feb 12, 2012 at 2:43 PM, Radim Rehurek <RadimRehurek at seznam.cz> wrote:
> >>> > Hello,
> >>> >
> >>> > Jonathan, one of gensim's users, adapted and tried the HDP method of Chong Wang et al. [1] on a medical dataset. The results don't seem very promising (for a tiny comparison of LDA and HDP over the same corpus: https://github.com/piskvorky/gensim/pull/73#issuecomment-3891114 ).
> >>> >
> >>> > Any pointers on how to tweak this method, or what to expect from it in practice? Has anyone tried HDP on their corpora, so we can cross-check the results?
> >>> >
> >>> > Cheers,
> >>> > Radim
> >>> >
> >>> >
> >>> > [1] Wang, Paisley, Blei: Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011).
> >>> > _______________________________________________
> >>> > Topic-models mailing list
> >>> > Topic-models at lists.cs.princeton.edu
> >>> > https://lists.cs.princeton.edu/mailman/listinfo/topic-models
> >>>
> >>>
> >>>
> 
> 
> 

