# [Topic-models] LDA-based document retrieval

Stephen Thomas sthomas at cs.queensu.ca
Fri Aug 26 21:03:07 EDT 2011

Topic Modelers,

I want to perform document retrieval using LDA. I need to compute the
similarity of a query and each document in the corpus. According to
Wei and Croft 2006 (Equation 5), as well as Steyvers and Griffiths
2007 (Equation 21.9), I should compute this conditional probability
as:

P(Q | D) = \prod_{q \in Q} P(q | D)

Isn't this a bit harsh? Since this is a product, we are saying that if
any one of the query's words gets a score of 0 (i.e., P(q | D) = 0 for
some q), then the entire similarity score is 0. If the query had 10
words, 9 of which were highly relevant and 1 of which was irrelevant,
we would assign the query a score of 0? Am I understanding this
correctly?

Why isn't this a summation instead of a product? (We can normalize by
query length.)

Steve

@inproceedings{wei2006lda,
title={LDA-based document models for ad-hoc retrieval},
author={Wei, X. and Croft, W.B.},
booktitle={Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information retrieval},
pages={178--185},
year={2006},
organization={ACM}
}

@article{steyvers2007probabilistic,
title={Probabilistic topic models},
author={Steyvers, M. and Griffiths, T.},
journal={Handbook of latent semantic analysis},
volume={427},
number={7},
pages={424--440},
year={2007},
publisher={Lawrence Erlbaum Associates}
}