Tim Lee will present his research seminar/general exam on Monday, May 10 at 2PM in Room 402. The members of his committee are: Ed Felten (advisor), David Blei, and Mike Freedman. Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam following are welcome to do so. His abstract and reading list follow below.

-------------------------------

Abstract:

Recent years have seen increasing interest in the problem of web-enabled government transparency. Last year, I helped create RECAP, a Firefox plugin that helps users build a free, open repository of federal court records. One of the key challenges in building the RECAP archive is privacy. Parties to court cases are supposed to redact sensitive information such as Social Security numbers and bank account numbers, but they often fail to do so, creating a potential privacy problem if these documents are made available for free on the web.

With more than 2 million documents in our repository, there are too many documents for manual human inspection. And releasing the documents without filtering them first could compromise the privacy of Americans who are the subjects of those records.

In my talk I will describe my use of machine learning techniques to identify documents requiring redaction. Starting with 5,926 documents in the RECAP archive that had already been redacted by human beings and 17,021 randomly selected non-redacted documents, I built several classifiers: one using logistic regression, and several others using combinations of boosting and topic models. The latter classifiers proved highly accurate, with the best having an area under the ROC curve of 0.9735.

This technique has immediate application to the judicial redaction problem. Because redacted documents are highly similar to documents that should have been redacted, a classifier trained on the former will also be good at finding the latter.
These classifiers dramatically reduce the amount of human labor required to find documents with sensitive information in our RECAP archive, as well as in the much larger PACER archive. Variants of this technique could have wide-ranging applications, including protecting attorney-client privilege during the discovery process and protecting national security when releasing some kinds of executive branch documents.

--

Reading List:

Textbooks

[1] Stuart Russell and Peter Norvig, "Artificial Intelligence: A Modern Approach." Chapters 3-6, 13-15, 18-21

[2] Christopher M. Bishop, "Pattern Recognition and Machine Learning." Chapters 3, 4

Papers

[3] Mark Steyvers and Tom Griffiths, "Probabilistic Topic Models."
http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.p...

[4] David M. Blei and John D. Lafferty, "Topic Models."
http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf

[5] Robert E. Schapire and Yoram Singer, "BoosTexter: A Boosting-based System for Text Categorization." Machine Learning, 2000.
http://www.cis.upenn.edu/~mkearns/finread/boostexter.pdf

[6] David Robinson, Harlan Yu, William Zeller and Edward W. Felten, "Government Data and the Invisible Hand." Yale Journal of Law and Technology, Fall 2008.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1138083

[7] Peter A. Winn, "Judicial Information Management in an Electronic Age: Old Standards, New Challenges." Federal Courts Law Review.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1438674

[8] Peter W. Martin, "Online Access to Court Records - from Documents to Data, Particulars to Patterns." Villanova Law Review, vol. 53, no. 5 (2008)
http://scholarship.law.cornell.edu/cgi/viewcontent.cgi?article=1092&context=lsrp_papers
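
Note on the metric quoted in the abstract: the area under the ROC curve can be read as the probability that a randomly chosen redacted document receives a higher classifier score than a randomly chosen non-redacted one. The Python sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the six example scores are made up for illustration; they are not taken from the RECAP work, and this is not necessarily how the figure in the abstract was computed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise (Mann-Whitney) comparison.

    labels: sequence of 1 (positive, e.g. redacted) or 0 (negative).
    scores: classifier scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    # Count positive/negative pairs that are ranked correctly.
    # A tie counts as half a correct pair.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for six documents (1 = needs redaction).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise loop is quadratic in the number of documents, which is fine for a toy example; on a corpus the size of the one described above, a sort-based computation would be the more practical choice.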