[Topic-models] sparse word vectors and LDA

Michael Klachko michaelklachko at gmail.com
Wed Jun 8 17:19:34 EDT 2016


Thank you everyone for the answers. I believe what I was asking for
originally is available in Gensim:
http://comments.gmane.org/gmane.comp.ai.gensim/591

However, as you pointed out, it might be better to use a hybrid approach
(combining global and local statistics for word embedding).

Perhaps I should describe what I'm trying to do:

1. Generate word vectors for a corpus.
2. Break the corpus into documents (e.g. pages, articles, or chapters).
3. Represent each document as a matrix where each column is a word's vector.
4. Treat these matrices as images (see the sketch after this list).
5. Do supervised or unsupervised learning on these images, for example, run
them through a convolutional NN, or an autoencoder to extract high-level
features.
6. Hope that the high-level features are transformation invariant, and
robust enough to use in a search for similar documents, or to help with
translation.
7. The features could also be used as better topic indicators; maybe they
could even be used to build better word vectors!
8. For any given high-level feature, we could artificially generate an input
(text) that maximally activates the corresponding neuron (inspired by
http://arxiv.org/abs/1112.6209). This means we could produce
computer-generated text that represents a high-level concept, such as
"love", "betrayal", "faith", or something more specific. Thinking about this
task, it becomes clear that common stop words need to be represented somehow
(if I understand correctly, word embedding pipelines typically just throw
them away).
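
To make steps 1-4 concrete, here is a rough Python sketch of what I have in
mind (illustrative only: the toy documents are made up and gensim >= 4.0 is
assumed):

    import numpy as np
    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenized "documents" (step 2).
    docs = [
        "the cat sat on the mat".split(),
        "dogs and cats are similar animals".split(),
    ]

    # Step 1: train word vectors on the corpus.
    model = Word2Vec(sentences=docs, vector_size=64, window=5, min_count=1)

    # Steps 3-4: stack a document's word vectors column by column into a
    # fixed-size 2-D array that can be treated as a one-channel image.
    def doc_to_image(tokens, model, max_len=100):
        mat = np.zeros((model.wv.vector_size, max_len), dtype=np.float32)
        vecs = [model.wv[w] for w in tokens if w in model.wv]
        for j, v in enumerate(vecs[:max_len]):
            mat[:, j] = v
        return mat  # shape: (embedding_dim, max_len), one column per word

    img = doc_to_image(docs[0], model)  # e.g. shape (64, 100)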

Given the above goals, which method of generating word vectors would be
most suitable for visual text representation? If we slice a real image into
columns of pixels, what kind of vectors would you say those columns are?
I've seen some evidence that sparse vectors are better than dense vectors
if we want to interpret the semantic meaning of their values:
https://arxiv.org/abs/1506.02004. I'm not sure how important this semantic
interpretability (and therefore sparsity) is for a visual representation of
text.

Currently I'm working on building a convolutional autoencoder to test these
ideas, but given the size of the inputs, this will be very computationally
intensive.
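
For reference, the kind of architecture I have in mind looks roughly like the
following PyTorch sketch (the layer sizes are arbitrary, and it assumes
document "images" shaped like the 64 x 100 arrays above):

    import torch
    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder: two strided convolutions halve each spatial dimension.
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
            )
            # Decoder: mirror the encoder with transposed convolutions.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                                   padding=1, output_padding=1),
                nn.ReLU(),
                nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                                   padding=1, output_padding=1),
            )

        def forward(self, x):
            z = self.encoder(x)  # high-level feature maps (steps 5-6)
            return self.decoder(z), z

    model = ConvAutoencoder()
    x = torch.randn(8, 1, 64, 100)  # a batch of 8 document "images"
    recon, features = model(x)
    loss = nn.functional.mse_loss(recon, x)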


Regards,
Michael



On Wed, Jun 8, 2016 at 2:09 AM, Dat Quoc Nguyen <datquocnguyen at gmail.com>
wrote:

> Hi Michael,
>
> I am not sure LDA is the best choice for this. But several approaches have
> been proposed that use LDA outputs (here, topic-word assignments) to
> improve the Word2Vec Skip-gram model:
>
> Improving short text classification by learning vector representations of
> both words and hidden topics. *Knowledge-Based Systems*, 2016.
> Contextual Text Understanding in Distributional Semantic Space
> <http://research.microsoft.com/pubs/255396/contextual_embedding.pdf>. *CIKM
> 2015*.
> Topical Word Embeddings
> <http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9314>. *AAAI
> 2015*.
>
> The experimental results in AAAI 2015 and CIKM 2015 show that the proposed
> approaches do better than Word2Vec Skip-gram on some evaluation tasks.
>
> Although it may not be directly related to your questions on constructing
> word vectors, you might also want to look at some work that uses word vectors to
> improve LDA:
>
> Improving Topic Models with Latent Feature Word Representations
> <https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/582/158>
> . *Transactions of the Association for Computational Linguistics*, 2015.
> Gaussian LDA for Topic Models with Word Embeddings
> <http://rajarshd.github.io/papers/acl2015.pdf>. *ACL 2015.*
>
> Best,
>
> Dat.
>
>
> On Wed, Jun 8, 2016 at 3:15 AM, Mike Mansour <mnmansour91 at gmail.com>
> wrote:
>
>> Greetings Mike,
>>
>> I played around with Gaussian LDA, which uses continuous word embeddings. I
>> think you could build dense word vectors encoded in a ‘topic space’, akin
>> to how word embeddings are encoded in a word space.  I have written a paper
>> improving on the original method and implemented it in Python.  PM me for a
>> deeper discussion.
>>
>> Perhaps you could generate new word vectors by evaluating the PDF of a
>> word's embedding under each continuous topic distribution, and using
>> those values.
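>>
>> A rough sketch of that idea (my own illustration, not code from Gaussian
>> LDA; topic_means and topic_covs stand in for whatever Gaussian parameters
>> the model has learned):
>>
>>     import numpy as np
>>     from scipy.stats import multivariate_normal
>>
>>     def topic_pdf_vector(word_vec, topic_means, topic_covs):
>>         # One value per topic: density of the word's embedding under
>>         # that topic's Gaussian.
>>         return np.array([multivariate_normal.pdf(word_vec, mean=m, cov=c)
>>                          for m, c in zip(topic_means, topic_covs)])
>>
>>     # Toy usage with made-up parameters (50-d embeddings, 10 topics).
>>     d, k = 50, 10
>>     means = [np.random.randn(d) for _ in range(k)]
>>     covs = [np.eye(d) for _ in range(k)]
>>     vec = topic_pdf_vector(np.random.randn(d), means, covs)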
>>
>> While sparsity has its advantages, the dense representations allow for a
>> more tractable dimensionality that includes the latent meaning.  LDA sounds
>> like a good tool for this.  Do you have a particular use case in mind?
>>
>> <><><><><><><><><><><><><><>
>> Michael Mansour
>> Data Scientist
>> IBM Blockchain Labs
>> (650) 773-7974
>> Twitter: @sourmansweet
>>
>> On May 27, 2016, at 6:54 PM, Kowalski, Radoslaw <
>> radoslaw.kowalski.14 at ucl.ac.uk> wrote:
>>
>> Hi Michael,
>>
>> Use the lda2vec library for Python. It does what you want. My personal
>> recommendation with regard to lda2vec is that you run it on a Linux
>> system.
>>
>> All the best,
>> Radoslaw
>>
>>
>>
>> *Radoslaw Kowalski*
>> PhD Student
>> ______________________________
>> *Consumer Data Research Centre*
>> UCL Department of Political Science
>> ______________________________
>> T:  020 3108 1098 x51098
>> E:  radoslaw.kowalski.14 at ucl.ac.uk
>> W:  www.cdrc.ac.uk
>> Twitter: @CDRC_UK
>> ------------------------------
>>
>>
>> *From:* topic-models-bounces at lists.cs.princeton.edu <
>> topic-models-bounces at lists.cs.princeton.edu> on behalf of Michael
>> Klachko <michaelklachko at gmail.com>
>> *Sent:* 28 May 2016 00:48:44
>> *To:* topic-models at lists.cs.princeton.edu
>> *Subject:* [Topic-models] sparse word vectors and LDA
>>
>> Hello,
>>
>> I'm new to topic modeling, and I'm currently exploring different ways to
>> construct word vectors.
>>
>> One way is to use a topic modeling algorithm: run LDA on a large corpus
>> of text, and identify k topics. Then, build k-dimensional vectors for every
>> word, so that every position in a vector corresponds to a topic. If word X
>> belongs to topic Z then the vector for X will have "1" at position Z. At
>> the end, we will have sparse vectors of length k.
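>>
>> A minimal sketch of this construction using gensim's LdaModel
>> (illustrative only: the toy corpus is made up, gensim >= 4.0 is assumed,
>> and gensim's per-term topic weights are used as real-valued entries rather
>> than 0/1):
>>
>>     import numpy as np
>>     from gensim.corpora import Dictionary
>>     from gensim.models import LdaModel
>>
>>     docs = [["cat", "dog", "pet"], ["stock", "market", "trade"],
>>             ["dog", "walk", "park"]]
>>     dictionary = Dictionary(docs)
>>     corpus = [dictionary.doc2bow(d) for d in docs]
>>     k = 2
>>     lda = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
>>
>>     def topic_vector(word):
>>         # k-dimensional vector; entry z is the topic-z weight gensim
>>         # assigns to this word (real-valued rather than 0/1).
>>         vec = np.zeros(k)
>>         word_id = dictionary.token2id[word]
>>         for topic_id, weight in lda.get_term_topics(
>>                 word_id, minimum_probability=0.0):
>>             vec[topic_id] = weight
>>         return vec
>>
>>     print(topic_vector("dog"))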
>>
>> I have a few questions:
>>
>> 1. Does this make sense?
>> 2. Has it been tried?
>> 3. Is LDA the best algorithm for this?
>> 4. How to modify LDA so that instead of "1"s in the vector I would have
>> real numbers representing probabilities of the word belonging to topics in
>> this document? (again, I'm not sure if this makes sense in the context of
>> LDA...). One reason for this is to avoid having identical vectors for
>> similar words, such as "cat" and "dog".
>> 5. How would such sparse vectors compare to vectors generated with
>> word2vec?
>> 6. Is it possible to somehow make sure that related topics correspond to
>> nearby positions in the vector?
>>
>> Thanks!
>>
>> _______________________________________________
>> Topic-models mailing list
>> Topic-models at lists.cs.princeton.edu
>> https://lists.cs.princeton.edu/mailman/listinfo/topic-models
>>
>>
>> <><><><><><><><><><><><><><>
>> Michael Mansour
>> Data Scientist & Graduate Student @ Galvanize
>> IBM Blockchain Labs
>> (650) 773-7974
>> Twitter: @sourmansweet
>>
>>
>>
>

