
Sean Gerrish will present his research seminar/general exam on Wednesday, January 20, at 11 AM in room 402. The members of his committee are David Blei (advisor), Ed Felten, and Christiane Fellbaum. Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam are welcome to do so. His abstract and reading list follow below.

--------------------------------------

Modeling Influence in Text Corpora

Identifying the most influential documents in a corpus is an important problem in many fields, ranging from information science and historiography to text summarization and news aggregation. A traditional method of assessing the impact of an article is to analyze the citations to it. The impact factor of a journal, for example, is based largely on academic citation analysis, and Google's successful PageRank algorithm is based on hyperlink citations between web pages. Often, however, citation information is not present: certain legal documents, news stories, blog posts, sermons, and email, for example, might all lack such citation metadata, even though there is a clear notion of influence among articles in these collections.

In this talk, I will describe an unsupervised method for determining the influence of a document. Our intuition is that language changes over time, and that influential documents contribute to this change. I will formalize this intuition with a probabilistic graphical model and present an algorithm that takes a sequence of documents as input and computes an "influence" vector for each document. I will also discuss evaluation of this method and describe when it provides a citation-free measure of bibliometric impact.

Books / chapters:

1. Selected chapters from Pattern Recognition and Machine Learning, by Christopher M. Bishop.
   Chapters: 1; 2; 3.1-3.2; 4.3-4.5; 7.1; 8; 9; 10; 11.1-11.4; 12.1-12.3; 14.1, 14.3
2. An Introduction to Probabilistic Graphical Models (unpublished manuscript), by Michael I. Jordan, 2002.
   Chapters: 2, 3, 4, 5, 8, 9, 11

Papers:

1. D. Blei, A. Ng, and M. Jordan. "Latent Dirichlet Allocation", JMLR, 3:993-1022, 2003.
2. D. Blei and J. Lafferty. "Dynamic Topic Models", in ICML, 2006.
3. M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. "An introduction to variational methods for graphical models", Machine Learning, 37(2):183-233, 1999.
4. R. Nallapati and W. Cohen. "Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs", in Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2008.
5. A. Gelman, X. L. Meng, and H. Stern. "Posterior predictive assessment of model fitness via realized discrepancies", Statistica Sinica, 6:733-807, 1996.
6. G. E. P. Box. "Sampling and Bayes' Inference in Scientific Modelling and Robustness", Journal of the Royal Statistical Society, 143(4):383-430, 1980.
7. L. Dietz, S. Bickel, and T. Scheffer. "Unsupervised prediction of citation influences", in Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, June 2007.
8. D. Cohn and T. Hofmann. "The missing link - a probabilistic model of document content and hypertext connectivity", in T. Leen et al., eds., Advances in Neural Information Processing Systems 13.
9. T. Griffiths and M. Steyvers. "Finding Scientific Topics", Proc Natl Acad Sci USA, 101 Suppl 1:5228-5235, April 2004.
10. A. Budanitsky and G. Hirst. "Evaluating WordNet-based Measures of Lexical Semantic Relatedness", Computational Linguistics, 32(1):13-47, 2006.
11. J. Clinton, S. Jackman, and D. Rivers. "The Statistical Analysis of Roll Call Data", American Political Science Review, 98(2), 2004.