[Topic-models] Speeding up LDA
blei at CS.Princeton.EDU
Wed Oct 15 18:19:28 EDT 2008
hi laura and all,
along the same lines, i've had good luck with trimming terms that
occur in many documents. these are essentially stop words.
a good method for selecting the vocabulary is to
(a) remove terms occurring in fewer than 0.1% of all documents
(b) remove terms occurring in more than 90% of all documents
criterion (a) will trim the vocabulary.
criterion (b) will not trim it much, but if you are using gibbs
sampling it will save time during inference: a topic assignment must
be sampled for each instance of each term, and the terms trimmed in
(b) tend to be very frequent. the difference is less marked with
variational methods, because the within-document word count does not
affect their complexity.
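as a minimal sketch (not the code of any particular package; docs,
min_df, and max_df are assumed names, with docs a list of token
lists), pruning by document frequency might look like:

from collections import Counter

def prune_vocabulary(docs, min_df=0.001, max_df=0.90):
    # keep terms appearing in at least min_df and at most max_df
    # of all documents -- criteria (a) and (b) above
    n_docs = len(docs)
    df = Counter()              # document frequency per term
    for doc in docs:
        df.update(set(doc))     # count each term once per document
    keep = {t for t, c in df.items()
            if min_df * n_docs <= c <= max_df * n_docs}
    # drop pruned terms from every document
    return [[t for t in doc if t in keep] for doc in docs], keep

docs = [["the", "topic", "model"], ["the", "gibbs", "sampler"]]
pruned, vocab = prune_vocabulary(docs)  # "the" (100% of docs) is removed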
there have been a number of recent papers on speeding up gibbs
sampling for LDA, as others have pointed out. variational methods are
easy to parallelize.
On Oct 15, 2008, at 5:25 PM, Laura Dietz wrote:
> Hi Gregg,
> LDA is a dimensionality reduction on its own, so I am not sure about
> the benefit of combining it with yet another dimensionality
> reduction. I would guess that you will not gain much by stemming,
> either. Both issues have already been discussed on this list; you can
> find the threads in the archive.
> In order to speed things up, I would go for a different approach.
> Words that occur in only a single document do not contribute to the
> problem in
> an LDA sense and can be removed from the data. You could reduce the
> number of words by increasing a threshold X in "a word must occur in
> at least X different documents to be considered". Note that this is
> different from "a word must occur at least Y times in the corpus".
> I have not tried something like this. It was just an idea one might
> play with. (@list, if this is not a good idea in general, I would
> like to know.)
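a minimal sketch of the document-frequency vs. corpus-frequency
distinction laura draws above (hypothetical helpers; docs is again
assumed to be a list of token lists):

from collections import Counter

def document_frequency(docs):
    # number of distinct documents containing each word
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def corpus_frequency(docs):
    # total occurrences of each word across the whole corpus
    cf = Counter()
    for doc in docs:
        cf.update(doc)
    return cf

docs = [["lda", "lda", "lda"], ["topic", "model"]]
print(document_frequency(docs)["lda"])  # 1: appears in one document
print(corpus_frequency(docs)["lda"])    # 3: appears three times in total
# with the threshold "must occur in at least X = 2 different documents",
# "lda" would be dropped despite its high corpus count.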