[Topic-models] Speeding up LDA

David Blei blei at CS.Princeton.EDU
Wed Oct 15 18:19:28 EDT 2008


hi laura and all,

along the same lines, i've had good luck with trimming terms that  
occur in many documents.  these are essentially stop words.

a good method for selecting the vocabulary is to

(a) remove terms occurring in fewer than 0.1% of all documents
(b) remove terms occurring in more than 90% of all documents

criterion (a) will substantially trim the vocabulary, since rare terms
account for most of the distinct word types.

criterion (b) will not trim the vocabulary much, but if you are using
gibbs sampling it will save time during inference.  the reason is that a
topic assignment must be sampled for each instance of each term, and the
terms trimmed in (b) tend to be very frequent.  the difference is less
marked with variational methods, because the within-document word count
does not affect their complexity.
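
for concreteness, here is a minimal sketch of this kind of pruning in
plain python (the function name and the exact 0.001 / 0.9 cutoffs are
only illustrative):

    # sketch: prune a vocabulary by document frequency
    from collections import Counter

    def prune_vocabulary(docs, min_frac=0.001, max_frac=0.9):
        """docs is a list of token lists; returns the set of terms to keep."""
        n_docs = len(docs)
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc))  # count each term at most once per document
        # keep a term only if it passes both (a) and (b)
        return {term for term, df in doc_freq.items()
                if min_frac * n_docs <= df <= max_frac * n_docs}

    # keep = prune_vocabulary(docs)
    # pruned_docs = [[w for w in d if w in keep] for d in docs]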

there have been a number of recent papers on speeding up gibbs  
sampling for LDA, as others have pointed out.  variational methods are  
easy to parallelize.

best,
dave


On Oct 15, 2008, at 5:25 PM, Laura Dietz wrote:

> Hi Gregg,
>
> LDA is a dimensionality reduction on its own, so I am not sure about
> the benefit of combining it with yet another dimensionality reduction. I
> would guess that you will not gain much from stemming either. Both issues
> have already been discussed on this list, and you can find them in the
> archive.
>
> In order to speed things up, I would go for a different approach. Words
> that occur only in a single document do not contribute to the problem in
> an LDA sense and can be removed from the data. You could reduce the
> number of words by increasing a threshold X in "a word must occur in at
> least X different documents to be considered". Note that it is slightly
> different from "a word must occur at least Y times in the corpus".
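>
> To make the distinction concrete, here is a quick illustrative sketch
> (Python; "docs" would be a list of token lists and "word" a single term,
> both just placeholders):
>
>     # document frequency: in how many distinct documents does the word occur?
>     doc_freq = sum(1 for doc in docs if word in doc)
>     # corpus frequency: how many times does the word occur in total?
>     corpus_freq = sum(doc.count(word) for doc in docs)
>     # "occurs in at least X different documents" thresholds doc_freq,
>     # not corpus_freq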
>
> I have not tried something like this. It was just an idea one might play
> with. (@list, if this is not a good idea in general, I would like to
> know...)
>
> Cheers,
> Laura


