[Topic-models] Questions on workflow for producing best topic models

Brian Feeny brianfeeny at fas.harvard.edu
Sat Nov 23 18:26:18 EST 2013

Hi, I am currently working with a group of three others.  We are looking at some review data and want to produce topic models using LDA based on what is being talked about.

My biggest concern right now is feeding LDA the text that matters.  I have noticed that if I feed it all words, a lot of them end up being adjectives, which by themselves relay some sort of sentiment rather than information about the content.  I have looked at many models out there, and a lot of them seem to focus on just nouns.  Is that how most people do this — do they just extract nouns?  And if so, how do they do that?  I am currently running my text through NLTK's pos_tag(), but that does take a while.

Here is my current workflow, and I am looking for any recommendations on how I may improve:

Ingest text of all reviews as a list of documents
for each document:
	lower case the document
	tokenize the document (current approach is naive, and just tokenizes all words, not by sentence or anything, just using split() )
	pos_tag each word in the document (once again looking at entire document, not per sentence)
	keep only the nouns NN, NNP, NNS
	remove punctuation
	spell correct (currently using Enchant / pyenchant)
	remove stopwords
	lemmatize (using NLTK WordNetLemmatizer() )

The output is now just the nouns for each document.

I build bigrams of these and add them back into the document, so I have the original document of unigrams and also bigrams.

I then feed this into LDA.  Currently using gensim in batch mode.

Some of my concerns with my workflow:
	Should I be using a tokenizer that does per sentence and then per word tokenization?
	I am guessing I should spell correct BEFORE pos tagging
	What parts of speech are topic models generally built on? And how do you get them: do you use pos_tag, a whitelist of nouns, etc.?
	pos tagging is a bit slow, which is one reason I am looking at better ways of extracting what matters, perhaps a naive regex?
	Any recommendations on stemmers/lemmatizers that may work better than WordNetLemmatizer() for topic modeling?

I appreciate any help and direction on how I may be able to improve my workflow.
