[Topic-models] sparse word vectors and LDA

Kowalski, Radoslaw radoslaw.kowalski.14 at ucl.ac.uk
Thu Jun 9 05:02:51 EDT 2016


Hi again Michael,


I think you can find answers to some of your questions in a presentation about the lda2vec package, at least when it comes to the relationship between data sparsity and data visualization. My intuition is that you're trying to build a model with features that mirror those of lda2vec, but with a different tool (e.g. gensim). The presentation is available at: http://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec-57135994


By the way, convolutional neural networks are not very suitable for text data analysis. An LSTM recurrent neural network is the standard best practice, though of course the choice of tools depends on your project requirements.


All the best,

Radoslaw



Radoslaw Kowalski
PhD Student
______________________________
Consumer Data Research Centre
UCL Department of Political Science
______________________________
T:  020 3108 1098 x51098
E:  radoslaw.kowalski.14 at ucl.ac.uk
W:  www.cdrc.ac.uk
Twitter: @CDRC_UK

________________________________
From: topic-models-bounces at lists.cs.princeton.edu <topic-models-bounces at lists.cs.princeton.edu> on behalf of Michael Klachko <michaelklachko at gmail.com>
Sent: 08 June 2016 22:27:14
To: Dat Quoc Nguyen
Cc: topic-models at lists.cs.princeton.edu
Subject: Re: [Topic-models] sparse word vectors and LDA

Thank you everyone for the answers. I believe what I was asking for originally is available in Gensim:
http://comments.gmane.org/gmane.comp.ai.gensim/591

However, as you pointed out, it might be better to use a hybrid approach (combining global and local statistics for word embedding).

Perhaps I should describe what I'm trying to do:

1. Generate word vectors for a corpus.
2. Break the corpus into documents (e.g. pages, articles, or chapters).
3. Represent each document as a matrix where each word vector is a column (see the sketch after this list).
4. Treat these matrices as images.
5. Do supervised or unsupervised learning on these images, for example, run them through a convolutional NN, or an autoencoder to extract high-level features.
6. Hope that the high-level features are transformation invariant, and robust enough to use in a search for similar documents, or to help with translation.
7. The features can also be used as better topic indicators; maybe they could even be used to build better word vectors!
8. For any given high-level feature, we can artificially generate an input (text) that maximally activates the corresponding neuron (inspired by http://arxiv.org/abs/1112.6209). This means we could produce computer-generated text representing a high-level concept such as "love", "betrayal", or "faith", or something more specific. Thinking about this task, it becomes clear that common stop words need to be represented somehow (if I understand correctly, current word embedding methods simply throw them away).
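To make steps 1-4 concrete, here is a minimal sketch using gensim's Word2Vec. The toy corpus, vector size, and fixed document length are illustrative assumptions, not recommendations.

import numpy as np
from gensim.models import Word2Vec

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "common", "pets"],
]

# Step 1: word vectors for the corpus.
w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1, epochs=20)

def doc_to_image(tokens, model, max_len=32):
    """Steps 2-4: one document -> a (vector_size x max_len) matrix with one
    word vector per column, zero-padded or truncated to max_len."""
    image = np.zeros((model.vector_size, max_len), dtype=np.float32)
    for j, tok in enumerate(tokens[:max_len]):
        if tok in model.wv:
            image[:, j] = model.wv[tok]
    return image

images = np.stack([doc_to_image(d, w2v) for d in docs])
print(images.shape)  # (num_docs, 50, 32): document "images" ready for step 5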

Given the above goals, which method of generating word vectors would be most suitable for a visual text representation? If we slice a real image into columns of pixels, what kind of vectors would you say those columns are? I've seen some evidence that sparse vectors are better than dense vectors if we want to interpret the semantic meaning of their values (https://arxiv.org/abs/1506.02004). I'm not sure how important this semantic interpretability (and therefore sparsity) is for a visual representation of texts.

Currently I'm working on building a convolutional autoencoder to test these ideas, but given the size of inputs, this will be very computationally intensive.
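For step 5, here is a minimal convolutional autoencoder sketch in PyTorch (an illustration of the idea, not the work in progress described above); the input shape assumes the 50 x 32 document matrices from the earlier sketch.

import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 50x32 -> 25x16
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 25x16 -> 13x8
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=(0, 1)),   # 13x8 -> 25x16
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=(1, 1)),   # 25x16 -> 50x32
        )

    def forward(self, x):
        features = self.encoder(x)           # high-level features (step 5)
        return self.decoder(features), features

model = ConvAutoencoder()
batch = torch.randn(4, 1, 50, 32)            # a batch of 4 document "images"
recon, feats = model(batch)
loss = nn.functional.mse_loss(recon, batch)  # reconstruction objective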


Regards,
Michael

On Wed, Jun 8, 2016 at 2:09 AM, Dat Quoc Nguyen <datquocnguyen at gmail.com> wrote:
Hi Michael,

I am not sure LDA is the best choice for this, but several approaches have been proposed that use LDA outputs (here, topic-word assignments) to improve the Word2Vec Skip-gram model:

Improving short text classification by learning vector representations of both words and hidden topics. Knowledge-Based Systems, 2016.
Contextual Text Understanding in Distributional Semantic Space. CIKM 2015. http://research.microsoft.com/pubs/255396/contextual_embedding.pdf
Topical Word Embeddings. AAAI 2015. http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9314

The experimental results in AAAI 2015 and CIKM 2015 show that the proposed approaches do better than Word2Vec Skip-gram on some evaluation tasks.

Although it may not be related to your questions on constructing word vectors, you might also want to look at some work that uses word vectors to improve LDA:

Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, 2015. https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/582/158
Gaussian LDA for Topic Models with Word Embeddings. ACL 2015. http://rajarshd.github.io/papers/acl2015.pdf

Best,

Dat.


On Wed, Jun 8, 2016 at 3:15 AM, Mike Mansour <mnmansour91 at gmail.com> wrote:
Greetings Mike,

I played around with Gaussian LDA, which uses continuous word embeddings. I think you could build dense word vectors encoded in a ‘topic space’, akin to how word embeddings are encoded in a word space.  I have written a paper improving on the original method and implemented it in Python.  PM me for a deeper discussion.

Perhaps you could generate new word vectors by evaluating each continuous topic distribution's pdf at a word's embedding and using those values; a sketch of this idea follows below.
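A rough sketch of that suggestion, assuming topic means and covariances taken from an already-fitted Gaussian LDA model (the toy topics below are made up for illustration):

import numpy as np
from scipy.stats import multivariate_normal

def topic_pdf_vector(word_vec, topic_means, topic_covs, log=True):
    """word_vec: (d,) embedding; topic_means: list of (d,) means;
    topic_covs: list of (d, d) covariances. Returns a (k,) vector of the
    (log-)density of the word under each topic's Gaussian."""
    scores = []
    for mu, cov in zip(topic_means, topic_covs):
        dist = multivariate_normal(mean=mu, cov=cov)
        scores.append(dist.logpdf(word_vec) if log else dist.pdf(word_vec))
    return np.array(scores)

# Toy example: 3 made-up topics in a 5-dimensional embedding space.
rng = np.random.default_rng(0)
means = [rng.normal(size=5) for _ in range(3)]
covs = [np.eye(5) for _ in range(3)]
print(topic_pdf_vector(rng.normal(size=5), means, covs))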

While sparsity has its advantages, dense representations give a more tractable dimensionality while still capturing the latent meaning.  LDA sounds like a good tool for this.  Do you have a particular use case in mind?

<><><><><><><><><><><><><><>
Michael Mansour
Data Scientist
IBM Blockchain Labs
(650) 773-7974
Twitter: @sourmansweet
On May 27, 2016, at 6:54 PM, Kowalski, Radoslaw <radoslaw.kowalski.14 at ucl.ac.uk> wrote:

Hi Michael,

Use the lda2vec library for Python; it does what you want. My personal recommendation is that you run lda2vec on a Linux system.

All the best,
Radoslaw


Radoslaw Kowalski
PhD Student
______________________________
Consumer Data Research Centre
UCL Department of Political Science
______________________________
T:  020 3108 1098 x51098
E:  radoslaw.kowalski.14 at ucl.ac.uk
W:  www.cdrc.ac.uk
Twitter: @CDRC_UK
________________________________


From: topic-models-bounces at lists.cs.princeton.edu <topic-models-bounces at lists.cs.princeton.edu> on behalf of Michael Klachko <michaelklachko at gmail.com>
Sent: 28 May 2016 00:48:44
To: topic-models at lists.cs.princeton.edu
Subject: [Topic-models] sparse word vectors and LDA

Hello,

I'm new to topic modeling, and I'm currently exploring different ways to construct word vectors.

One way is to use a topic modeling algorithm: run LDA on a large corpus of text, and identify k topics. Then, build k-dimensional vectors for every word, so that every position in a vector corresponds to a topic. If word X belongs to topic Z then the vector for X will have "1" at position Z. At the end, we will have sparse vectors of length k.
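A minimal sketch of this construction with gensim's LdaModel (the toy corpus and the hard argmax assignment are assumptions for illustration):

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["cat", "dog", "pet", "animal"],
    ["stock", "market", "trade", "price"],
    ["dog", "animal", "vet", "pet"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
k = 2
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=50)

# P(word | topic) for every topic and vocabulary word: shape (k, |V|).
topic_word = lda.get_topics()

def one_hot_topic_vector(word):
    """Sparse k-dimensional vector: 1 at the topic that assigns the word
    the highest probability, 0 elsewhere."""
    vec = np.zeros(k)
    vec[np.argmax(topic_word[:, dictionary.token2id[word]])] = 1.0
    return vec

print(one_hot_topic_vector("cat"))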

I have a few questions:

1. Does this make sense?
2. Has it been tried?
3. Is LDA the best algorithm for this?
4. How could LDA be modified so that, instead of "1"s in the vector, I would have real numbers representing the probabilities of the word belonging to each topic in a given document? (Again, I'm not sure this makes sense in the context of LDA...) One reason for this is to avoid identical vectors for similar words, such as "cat" and "dog"; see the sketch after this list.
5. How would such sparse vectors compare to vectors generated with word2vec?
6. Is it possible to somehow ensure that related topics correspond to nearby positions in the vector?
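For question 4, one hedged possibility (an illustration only, and a corpus-level simplification that ignores the per-document aspect): convert the LDA topic-word matrix into P(topic | word) via Bayes' rule with an assumed uniform prior over topics, so that similar words get similar but not identical vectors.

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["cat", "dog", "pet"], ["stock", "market", "price"], ["dog", "vet", "pet"]]
dictionary = Dictionary(docs)
lda = LdaModel(corpus=[dictionary.doc2bow(d) for d in docs],
               id2word=dictionary, num_topics=2, passes=50)

topic_word = lda.get_topics()                     # (k, |V|): P(word | topic)
word_topic = topic_word / topic_word.sum(axis=0)  # column-normalize -> P(topic | word)

def soft_topic_vector(word):
    """k-dimensional vector of topic-membership probabilities for a word."""
    return word_topic[:, dictionary.token2id[word]]

print(soft_topic_vector("cat"), soft_topic_vector("dog"))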

Thanks!


<><><><><><><><><><><><><><>
Michael Mansour
Data Scientist & Graduate Student @ Galvanize
IBM Blockchain Labs
(650) 773-7974
Twitter: @sourmansweet


_______________________________________________
Topic-models mailing list
Topic-models at lists.cs.princeton.edu
https://lists.cs.princeton.edu/mailman/listinfo/topic-models




