[talks] S Gerrish general exam

Fri Jan 15 07:57:05 EST 2010

Sean Gerrish will present his research seminar/general exam on Wednesday, January 20, at
11AM in room 402. The members of his committee are: David Blei (advisor), Ed Felten, and
Christiane Fellbaum. Everyone is invited to attend his talk, and those faculty wishing to
remain for the oral exam are welcome to do so. His abstract and reading list follow below.
--------------------------------------

Modeling Influence in Text Corpora

   Identifying the most influential documents in a corpus is an
important problem in many fields, ranging from information science and
historiography to text summarization and news aggregation. A
traditional method of assessing the impact of an article is to analyze
the citations to it. The impact factor of a journal, for example, is
based largely on academic citation analysis, and Google’s successful
PageRank algorithm is based on hyperlink citations between web pages.

   Often, however, citation information is not present: certain legal
documents, news stories, blog posts, sermons, and emails, for example,
may all lack such citation metadata, even though there is a clear notion
of influence among articles in these collections.

   In this talk, I will describe an unsupervised method for
determining the influence of a document. Our intuition is that
language changes over time, and that influential documents contribute
to this change. I will formalize this intuition with a probabilistic
graphical model and present an algorithm that takes a sequence of
documents as input and computes an "influence" vector for each
document. I will also discuss evaluation of this method and describe
when it provides a citation-free measure of bibliometric impact.

Books / chapters:
1. Selected chapters from Pattern Recognition and Machine Learning, Bishop
Chapters: 1; 2; 3.1-3.2; 4.3-4.5; 7.1; 8; 9; 10; 11.1-11.4; 12.1-12.3;
14.1, 14.3

2. An Introduction to Probabilistic Graphical Models (unpublished
manuscript), by Michael I. Jordan, 2002
Chapters: 2, 3, 4, 5, 8, 9, 11

Papers:
1. D. Blei, A. Ng, and M. Jordan.  "Latent Dirichlet Allocation",
JMLR, 3:993-1022, 2003.
2. D. Blei and J. Lafferty. "Dynamic Topic Models", in ICML, 2006.
3. M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul.  "An
introduction to variational methods for graphical models", Machine
Learning, 37(2):183-233, 1999.
4. R. Nallapati and W. Cohen. "Link-PLSA-LDA: A new unsupervised model
for topics and influence of blogs", in Proceedings of the
International Conference on Weblogs and Social Media (ICWSM), 2008.
5. A. Gelman, X.-L. Meng, and H. Stern. "Posterior predictive
assessment of model fitness via realized discrepancies", Statistica
Sinica, 6:733-807, 1996.
6. G. E. P. Box. "Sampling and Bayes' Inference in Scientific
Modelling and Robustness", Journal of the Royal Statistical Society,
143(4):383-430, 1980.
7. L. Dietz, S. Bickel, and T. Scheffer. "Unsupervised prediction of
citation influences", in Proceedings of the 24th International
Conference on Machine Learning. Corvallis, Oregon, USA, June 2007.
8. D. Cohn and T. Hofmann. "The missing link - a probabilistic model
of document content and hypertext connectivity", in T. Leen et al.,
eds, Advances in Neural Information Processing Systems 13.
9. T. Griffiths and M. Steyvers. "Finding Scientific Topics", Proc
Natl Acad Sci USA, 101 Suppl 1:5228-5235, April 2004.
10. A. Budanitsky and G. Hirst. "Evaluating WordNet-based Measures of
Lexical Semantic Relatedness", Computational Linguistics, 32(1):13-47,
2006.
11. J. Clinton, S. Jackman, and D. Rivers. "The Statistical Analysis of
Roll Call Data", American Political Science Review, 98(2), 2004.