[Topic-models] stemming

David Mimno mimno at cs.umass.edu
Wed Aug 27 11:08:40 EDT 2008

On Tue, Aug 26, 2008 at 11:08:08PM -0400, Neal Parikh wrote:
> I've noticed that in the papers I've read on topic modeling, people  
> tend not to stem words before running the model. Is there any reason  
> for this?

Just to elaborate on what Laura said, here are some examples of topics 
trained on Pliny's Natural History in Latin, which is a highly inflected 
language:

1. vino aqua aceto oleo medetur
2. serpentes venena serpentium ictus venenum
3. similis nomen vocant vocatur magnitudine
4. italia italiae hispania insula asia
5. arborum arboribus arbor arbores folia
6. lapide lapis lapides lapidem marmore
7. divo augusto regi tantae peditum
8. pinxit picturae pictura imagines tabula
9. foliis radice folia vocant nascitur
10. coloris colore colorem candida nigra
11. varro auctor m tradit mucianus
12. m varrone avctoribvs cornelio nigro
13. piscium mari pisces aquatilium appellantur

(out of 200 topics, numbers just for reference)

Those who know Latin will recognize words in topics 5 and 6 as the 
commonly used inflections of the nouns "tree" and "stone". The model has 
collected them together based on context, whereas a stemmer might look only 
at individual strings in isolation, possibly introducing ambiguities. The 
"tree" topic is specifically about fruit, by the way -- there's also a 
topic about wood that contains most of the same inflected forms of 
"arbor". This split suggests to me that leaving the words unstemmed has 
not interfered with the model's semantic distinctions either.
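
To make the contrast with string-based stemming concrete, here is a minimal 
sketch in Python. It is not a real Latin stemmer: crude_stem is a 
hypothetical stand-in that just keeps the first four letters of each word, 
and the word lists are topics 5 and 6 copied from above. It only shows 
which words a purely string-based rule would lump together.

```python
from collections import defaultdict

# Topics 5 and 6, copied from the list in this message.
TOPICS = {
    5: "arborum arboribus arbor arbores folia".split(),
    6: "lapide lapis lapides lapidem marmore".split(),
}


def crude_stem(word, length=4):
    """Hypothetical stand-in for a stemmer: keep the first `length` letters."""
    return word[:length]


def group_by_stem(words):
    """Map each crude stem to the words that share it, in input order."""
    groups = defaultdict(list)
    for word in words:
        groups[crude_stem(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    for number, words in TOPICS.items():
        print(number, group_by_stem(words))
```

The prefix rule sees only the strings themselves, so it cannot tell the 
"fruit" uses of "arbor" from the "wood" uses; the topic model separated 
those by context.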

Another interesting phenomenon is in topics 11 and 12. Pliny cites M. 
Varro frequently, using either the nominative (subject) form ("Varro 
reports", topic 11) or the ablative form in a passive construction ("as 
reported by Varro", topic 12). He doesn't use both forms together in the 
same paragraph often enough for the model to put them into the same topic. 
Depending on one's goals, this may or may not be a useful property...
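
As a rough illustration of what stemming would change here, the sketch 
below lists crude stems that occur in both topic 11 and topic 12. The 
five-letter prefix rule and the function names are assumptions made for 
this example, not anything used to train the model.

```python
from collections import defaultdict

# Topics 11 and 12, copied from the list in this message.
TOPICS = {
    11: "varro auctor m tradit mucianus".split(),
    12: "m varrone avctoribvs cornelio nigro".split(),
}


def crude_stem(word, length=5):
    """Hypothetical stand-in for a stemmer: keep the first `length` letters."""
    return word[:length]


def stems_spanning_topics(topics):
    """Return {stem: {topic: [words]}} for stems found in more than one topic."""
    seen = defaultdict(dict)
    for number, words in topics.items():
        for word in words:
            seen[crude_stem(word)].setdefault(number, []).append(word)
    return {stem: by_topic for stem, by_topic in seen.items() if len(by_topic) > 1}


if __name__ == "__main__":
    for stem, by_topic in stems_spanning_topics(TOPICS).items():
        print(stem, by_topic)
```

Under this rule "varro" and "varrone" collapse into one token, which would 
erase the nominative/ablative split between the two topics. The strings 
"auctor" and "avctoribvs" stay apart because of the u/v spelling 
difference, which a prefix rule does not normalize.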


