[talks] T Lee general exam
mml at CS.Princeton.EDU
Tue May 4 16:23:10 EDT 2010
Tim Lee will present his research seminar/general exam on Monday May 10 at 2PM in Room
The members of his committee are: Ed Felten (advisor), David Blei, and Mike Freedman.
Everyone is invited to attend his talk, and those faculty wishing to remain for the oral exam
are welcome to do so. His abstract and reading list follow below.
Recent years have seen increasing interest in the problem of web-
enabled government transparency. Last year, I helped create RECAP, a
Firefox plugin that helps users build a free, open repository of
federal court records. One of the key challenges in building the RECAP
archive is privacy. Parties to court cases are supposed to redact
sensitive information such as Social Security numbers and bank account
numbers, but they often fail to do so, creating a potential privacy
problem if these documents are made available for free on the web.
With more than 2 million documents in our repository, there are too
many to inspect manually. And releasing the
documents without filtering them first could compromise the privacy of
Americans who are the subjects of those records.
In my talk I will describe my use of machine learning techniques to
identify documents requiring redaction. Starting with 5,926 documents
in the RECAP archive that had already been redacted by human beings
and 17,021 randomly selected non-redacted documents, I built several
classifiers: one using logistic regression, and several others using
combinations of boosting and topic models. The latter classifiers
proved highly accurate, with the best having an area under the ROC
curve of 0.9735.
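
For readers unfamiliar with the metric quoted above, the area under the ROC curve can be read as the probability that a randomly chosen redacted document receives a higher classifier score than a randomly chosen non-redacted one. The abstract does not say how the figure was computed, so the following is only a minimal sketch of one standard way to do it (the rank-sum identity). The function name roc_auc and the toy labels and scores are made up for illustration and are not taken from the RECAP experiment.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for a redacted document, 0 for a non-redacted one.
    scores: classifier scores, higher meaning "more likely to need redaction".
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores the average rank of their group.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one document of each class")

    # U counts the (positive, negative) pairs ranked in the right order,
    # with ties counted as half a pair.
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: two redacted (1) and two non-redacted (0) documents.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 0.75: 3 of the 4 pairs are ordered correctly
```

A value of 1.0 means every redacted document outscores every non-redacted one, and 0.5 is what random scoring would give.
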
This technique has immediate application to the judicial redaction
problem. Because redacted documents are highly similar to documents
that should have been redacted, a classifier trained on the former
will also be good at finding the latter. These classifiers
dramatically reduce the amount of human labor required to find
documents with sensitive information in our RECAP archive, as well as
in the much larger PACER archive. Variants of this technique could
have wide-ranging applications, including protecting attorney-client
privilege during the discovery process and protecting national
security when releasing some kinds of executive branch documents.
 Stuart Russell and Peter Norvig, "Artificial Intelligence: A
Modern Approach." Chapters 3-6, 13-15, 18-21
 Christopher M. Bishop, "Pattern Recognition and Machine Learning",
Chapters 3, 4
 Mark Steyvers and Tom Griffiths, "Probabilistic Topic Models."
 David M. Blei and John D. Lafferty, "Topic Models."
 Robert E. Schapire and Yoram Singer. "BoosTexter: A Boosting-based
System for Text Categorization." Machine Learning, 2000.
 David Robinson, Harlan Yu, William Zeller and Edward W. Felten,
"Government Data and the Invisible Hand." Yale Journal of Law and
Technology, Fall 2008. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1138083
 Peter A. Winn, "Judicial Information Management in an Electronic
Age: Old Standards, New Challenges." Federal Courts Law Review.
 Peter W. Martin, "Online Access to Court Records - from Documents
to Data, Particulars to Patterns." Villanova Law Review, vol. 53, no.