[Topic-models] sparse word vectors and LDA

Dat Quoc Nguyen datquocnguyen at gmail.com
Wed Jun 8 05:09:01 EDT 2016


Hi Michael,

I am not sure LDA is the best choice for this. However, several approaches
have been proposed that use LDA outputs (here, topic-word assignments) to
improve the Word2Vec Skip-gram model:

Improving short text classification by learning vector representations of
both words and hidden topics <doi:10.1016/j.knosys.2016.03.027>.
*Knowledge-Based Systems*, 2016.

Contextual Text Understanding in Distributional Semantic Space
<http://research.microsoft.com/pubs/255396/contextual_embedding.pdf>.
*CIKM 2015*.

Topical Word Embeddings
<http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9314>.
*AAAI 2015*.

The experimental results in the AAAI 2015 and CIKM 2015 papers show that the
proposed approaches outperform the Word2Vec Skip-gram model on some
evaluation tasks.
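
On your question about replacing the hard "1"s with probabilities: one
simple option is to read each word's column of the topic-word matrix off a
trained LDA model and renormalize it (this gives corpus-level probabilities;
a per-document version would use the document's topic mixture instead). A
minimal sketch with gensim, where the toy corpus and the names "texts" and
"topic_vector" are only illustrative:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy corpus; in practice "texts" would be your tokenized documents.
    texts = [["cat", "dog", "pet"], ["stock", "market", "trade"]]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    k = 2  # number of topics
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)

    # get_topics() returns a (k x vocabulary size) matrix of p(word | topic).
    topic_word = lda.get_topics()

    def topic_vector(word):
        # k-dimensional vector for the word: its column of p(word | topic),
        # renormalized to sum to 1 (p(topic | word) under a uniform prior).
        col = topic_word[:, dictionary.token2id[word]]
        return col / col.sum()

    print(topic_vector("cat"))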

Although it may not be directly related to your questions on constructing
word vectors, you might also want to look at some works that use word
vectors to improve LDA:

Improving Topic Models with Latent Feature Word Representations
<https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/582/158>.
*Transactions of the Association for Computational Linguistics*, 2015.

Gaussian LDA for Topic Models with Word Embeddings
<http://rajarshd.github.io/papers/acl2015.pdf>. *ACL 2015*.

Best,

Dat.


On Wed, Jun 8, 2016 at 3:15 AM, Mike Mansour <mnmansour91 at gmail.com> wrote:

> Greetings Mike,
>
> I have played around with Gaussian LDA, which uses continuous word
> embeddings. I think you could build dense word vectors encoded in a ‘topic
> space’, akin to how word embeddings are encoded in a word space. I have
> written a paper improving on the original method and implemented it in
> Python. PM me for a deeper discussion.
>
> Perhaps you could generate new word vectors by evaluating the pdf of a
> word's embedding under each continuous topic distribution and using those
> values.
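>
> A rough sketch of that idea, assuming the topic means and covariances come
> from a Gaussian LDA fit (all of the names here are illustrative):
>
>     import numpy as np
>     from scipy.stats import multivariate_normal
>
>     def topic_space_vector(word_embedding, topic_means, topic_covs):
>         # word_embedding: d-dimensional embedding of the word.
>         # topic_means: (k, d) array of topic means; topic_covs: list of k
>         # (d, d) covariance matrices, both taken from the fitted model.
>         densities = np.array([
>             multivariate_normal(mean=m, cov=c).pdf(word_embedding)
>             for m, c in zip(topic_means, topic_covs)
>         ])
>         # Optionally renormalize so the k values sum to 1.
>         return densities / densities.sum()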
>
> While sparsity has its advantages, dense representations offer a more
> tractable dimensionality while still capturing latent meaning. LDA sounds
> like a good tool for this. Do you have a particular use case in mind?
>
> <><><><><><><><><><><><><><>
> Michael Mansour
> Data Scientist
> IBM Blockchain Labs
> (650) 773-7974
> Twitter: @sourmansweet
>
> On May 27, 2016, at 6:54 PM, Kowalski, Radoslaw <
> radoslaw.kowalski.14 at ucl.ac.uk> wrote:
>
> Hi Michael,
>
> Use the lda2vec library for Python. It does what you want. My personal
> recommendation with regard to lda2vec is to run it on a Linux system.
>
> All the best,
> Radoslaw
>
>
>
> *Radoslaw Kowalski*
> PhD Student
> ______________________________
> *Consumer Data Research Centre*
> UCL Department of Political Science
> ______________________________
> T:  020 3108 1098 x51098
> E:  radoslaw.kowalski.14 at ucl.ac.uk
> W:  www.cdrc.ac.uk
> Twitter: @CDRC_UK
> ------------------------------
>
>
> *From:* topic-models-bounces at lists.cs.princeton.edu <
> topic-models-bounces at lists.cs.princeton.edu> on behalf of Michael Klachko
> <michaelklachko at gmail.com>
> *Sent:* 28 May 2016 00:48:44
> *To:* topic-models at lists.cs.princeton.edu
> *Subject:* [Topic-models] sparse word vectors and LDA
>
> Hello,
>
> I'm new to topic modeling, and I'm currently exploring different ways to
> construct word vectors.
>
> One way is to use a topic modeling algorithm: run LDA on a large corpus of
> text and identify k topics. Then build a k-dimensional vector for every
> word, so that each position in the vector corresponds to a topic. If word X
> belongs to topic Z, then the vector for X has a "1" at position Z. At the
> end, we have sparse vectors of length k.
>
> I have a few questions:
>
> 1. Does this make sense?
> 2. Has it been tried?
> 3. Is LDA the best algorithm for this?
> 4. How can I modify LDA so that instead of "1"s in the vector I would have
> real numbers representing the probabilities of the word belonging to topics
> in a given document? (Again, I'm not sure whether this makes sense in the
> context of LDA...) One reason for this is to avoid having identical vectors
> for similar words, such as "cat" and "dog".
> 5. How would such sparse vectors compare to vectors generated with
> word2vec?
> 6. Is it possible to somehow ensure that related topics correspond to
> nearby positions in the vector?
>
> Thanks!
>
> _______________________________________________
> Topic-models mailing list
> Topic-models at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>
>
> <><><><><><><><><><><><><><>
> Michael Mansour
> Data Scientist & Graduate Student @ Galvanize
> IBM Blockchain Labs
> (650) 773-7974
> Twitter: @sourmansweet
>
>