Chong Wang will present his research seminar/general exam on Wednesday Jan 21
at 2PM in Room 402. The members of his committee are: David Blei (advisor), Fei-Fei Li,
and Rob Schapire. Everyone is invited to attend his talk, and those faculty wishing to
remain for the oral exam following are welcome to do so. His abstract and reading
list follow below.
--------------------------------------------
Learning topic models from multiple corpora
Abstract
In this talk, we consider the problem of learning topic models from multiple corpora.
Examples of multi-corpora data include news articles from different times or different
locations, and scientific papers from different conferences or different years. The
majority of topic models, however, ignore this additional information. Simply
combining multiple corpora into one large corpus or treating each corpus individually
cannot support analysis of the high-level relations among multiple corpora,
e.g., are topics similar or different from time to time or from location to location? How
are they related to each other?
I will describe two new approaches for learning topic models from multiple corpora:
the continuous-time dynamic topic model (cDTM) and the Markov topic model (MTM). In cDTMs,
documents from the same time point are considered as a corpus. The cDTM extends dynamic
topic models (DTMs) by using Brownian motion to model the latent topics through time. We
derive an efficient variational approximate inference algorithm that takes advantage of
the sparsity of observations in text, a property that lets us easily handle many time
points. Thus, the cDTM can discover topic evolution at a much finer time resolution.
In MTMs, papers from the same conference are treated as a corpus. Then we apply Gaussian
(Markov) random fields to model the correlations of different corpora. MTMs capture both
the internal topic structure within each corpus and the relationships between topics
across the corpora. In addition, we will show that cDTMs and DTMs can be formulated as special
cases of MTMs. Quantitative results and qualitative discoveries (interesting topic
patterns) will also be presented.
Books:
1) Pattern Recognition and Machine Learning, by Christopher M. Bishop, Springer, 2006
chapters: 1, 2, 3.1-3.5, 4, 5, 8, 9, 10, 11.1-11.3, 12.1-12.2, 13, 14.
2) An Introduction to Probabilistic Graphical Models (unpublished manuscript), by Michael
I. Jordan, 2002
chapters: 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15
3) (Optional) Artificial Intelligence: A Modern Approach, by Stuart Russell and Peter
Norvig, Prentice Hall Series in Artificial Intelligence, 2003
chapters: 3.1-3.5, 4.1-4.3, 13, 14, 15.1-15.5, 18.1-18.3, 20.1-20.5, 23.2-23.3
Papers:
1) D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
2) D. Blei and J. Lafferty. Dynamic topic models. In ICML, 2006.
3) T. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl Acad Sci USA, 101
Suppl 1:5228-5235, April 2004.
4) C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for
machine learning. Machine Learning, 50(1):5-43, January 2003.
5) M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational
methods for graphical models. Machine Learning, 37(2):183-233, 1999.
6) Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of
the American Statistical Association, 2006.
7) L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene
categories. In CVPR, 2005.
8) H. Rue and T. Follestad. Gaussian Markov random field models with applications
in spatial statistics. Preprint, 2003.
9) D. Blei and J. Lafferty. Correlated topic models. In NIPS, 2005.
10) R. Kalman. A new approach to linear filtering and prediction problems. Transactions of
the ASME: Journal of Basic Engineering, 82:35-45, 1960.
11) L. Ruschendorf. Convergence of the iterative proportional fitting procedure. The
Annals of Statistics, 23(4):1160-1174, 1995.
12) X. Wang and A. McCallum. Topics over time: a non-Markov continuous-time model of
topical trends. In KDD, 2006.