[Topic-models] sparse word vectors and LDA

Mike Mansour mnmansour91 at gmail.com
Tue Jun 7 13:15:41 EDT 2016


Greetings Mike, 

I have played around with Gaussian LDA, which operates on continuous word embeddings. I think you could build dense word vectors encoded in a ‘topic space’, akin to how word embeddings are encoded in a word space.  I have written a paper improving on the original method and implemented it in Python.  PM me for a deeper discussion. 

Perhaps you could generate new word vectors by evaluating each continuous topic distribution’s probability density at a word’s embedding, and using those density values as the vector’s components. 
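A minimal sketch of that idea, assuming each topic is a multivariate Gaussian over the embedding space as in Gaussian LDA (the topic means, covariances, and word embedding below are made-up placeholders, not values from my implementation):

import numpy as np
from scipy.stats import multivariate_normal

def topic_space_vector(embedding, topic_means, topic_covs):
    # One component per topic: the density of that topic's Gaussian
    # evaluated at the word's embedding.
    densities = np.array([
        multivariate_normal.pdf(embedding, mean=mu, cov=cov)
        for mu, cov in zip(topic_means, topic_covs)
    ])
    # Normalize so the vector can be read as a distribution over topics.
    return densities / densities.sum()

# Toy usage with made-up parameters: 5 Gaussian topics over 50-dim embeddings.
rng = np.random.default_rng(0)
k, d = 5, 50
topic_means = rng.normal(size=(k, d))
topic_covs = [np.eye(d) for _ in range(k)]
word_embedding = rng.normal(size=d)
print(topic_space_vector(word_embedding, topic_means, topic_covs))

Normalizing the densities is just one choice; you could also keep the raw density values if you want magnitudes to carry information.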

While sparsity has its advantages, dense representations give you a more tractable dimensionality that still captures the latent meaning.  LDA sounds like a good tool for this.  Do you have a particular use case in mind?
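For the sparse construction you describe below, here is a rough sketch of one way it could look with gensim's LdaModel (the toy corpus and the uniform-prior normalization are my assumptions, not part of your question). Using per-topic probabilities instead of hard 1s also avoids identical vectors for similar words such as "cat" and "dog":

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus; replace with your own tokenized documents.
docs = [["cat", "dog", "pet"], ["stock", "market", "trade"], ["dog", "bone", "pet"]]
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

k = 2  # number of topics
lda = LdaModel(bow_corpus, num_topics=k, id2word=dictionary, passes=10, random_state=0)

# Topic-term matrix: shape (k, vocab_size); row t is p(word | topic t).
topic_term = lda.get_topics()

def word_topic_vector(word):
    # k-dimensional vector for the word: its column of the topic-term
    # matrix, normalized. Reading this as p(topic | word) assumes a
    # uniform prior over topics.
    col = topic_term[:, dictionary.token2id[word]]
    return col / col.sum()

print(word_topic_vector("dog"))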

<><><><><><><><><><><><><><>
Michael Mansour
Data Scientist
IBM Blockchain Labs
(650) 773-7974
Twitter: @sourmansweet
> On May 27, 2016, at 6:54 PM, Kowalski, Radoslaw <radoslaw.kowalski.14 at ucl.ac.uk> wrote:
> 
> Hi Michael,
> 
> Use the lda2vec library for the Python programming language. It does what you want. My personal recommendation with regard to lda2vec is that you run it on a Linux system.
> 
> All the best,
> Radoslaw
> 
> 
> Radoslaw Kowalski
> PhD Student
> ______________________________
> Consumer Data Research Centre
> UCL Department of Political Science
> ______________________________
> T:  020 3108 1098 x51098
> E:  radoslaw.kowalski.14 at ucl.ac.uk
> W:  www.cdrc.ac.uk
> Twitter: @CDRC_UK
> 
> 
> From: topic-models-bounces at lists.cs.princeton.edu <topic-models-bounces at lists.cs.princeton.edu> on behalf of Michael Klachko <michaelklachko at gmail.com>
> Sent: 28 May 2016 00:48:44
> To: topic-models at lists.cs.princeton.edu
> Subject: [Topic-models] sparse word vectors and LDA
>  
> Hello, 
> 
> I'm new to topic modeling, and I'm currently exploring different ways to construct word vectors. 
> 
> One way is to use a topic modeling algorithm: run LDA on a large corpus of text, and identify k topics. Then, build k-dimensional vectors for every word, so that every position in a vector corresponds to a topic. If word X belongs to topic Z then the vector for X will have "1" at position Z. At the end, we will have sparse vectors of length k. 
> 
> I have a few questions:
> 
> 1. Does this make sense?
> 2. Has it been tried? 
> 3. Is LDA the best algorithm for this?
> 4. How can I modify LDA so that, instead of "1"s in the vector, I would have real numbers representing the probabilities of the word belonging to each topic? (again, I'm not sure if this makes sense in the context of LDA...). One reason for this is to avoid having identical vectors for similar words, such as "cat" and "dog". 
> 5. How would such sparse vectors compare to vectors generated with word2vec? 
> 6. Is it possible to somehow make sure that related topics would correspond to positions in the vector that are nearby?
> 
> Thanks!
> 
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models



