[talks] T Lee general exam

Melissa Lawson mml at CS.Princeton.EDU
Tue May 4 16:23:10 EDT 2010

Tim Lee will present his research seminar/general exam on Monday, May 10, at 2PM in Room
The members of his committee are: Ed Felten (advisor), David Blei, and Mike Freedman.
Everyone is invited to attend his talk, and those faculty wishing to remain for the oral
exam following are welcome to do so. His abstract and reading list follow below.


Recent years have seen increasing interest in the problem of web- 
enabled government transparency. Last year, I helped create RECAP, a  
Firefox plugin that helps users build a free, open repository of  
federal court records. One of the key challenges in building the RECAP  
archive is privacy. Parties to court cases are supposed to redact  
sensitive information such as Social Security numbers and bank account  
numbers, but they often fail to do so, creating a potential privacy  
problem if these documents are made available for free on the web.  
With more than 2 million documents in our repository, there are far  
too many for manual inspection. And releasing the  
documents without filtering them first could compromise the privacy of  
Americans who are the subjects of those records.
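As a rough illustration of the kind of sensitive strings at issue, a naive pattern-matching pass over a document's text might look like the sketch below. The patterns and the helper name are illustrative assumptions for this sketch, not RECAP's actual detection code, and the abstract's point is precisely that such surface matching alone does not scale to millions of documents.

```python
import re

# Illustrative patterns only (an assumption for this sketch, not RECAP's code):
# SSNs in the common dashed form, and long digit runs that could be account numbers.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ACCOUNT_RE = re.compile(r"\b\d{9,17}\b")  # many US account numbers fall in this range

def flag_candidates(text):
    """Return candidate sensitive strings found in a document's text."""
    return SSN_RE.findall(text) + ACCOUNT_RE.findall(text)

sample = "Plaintiff's SSN is 123-45-6789; account 000123456789 at First Bank."
print(flag_candidates(sample))  # → ['123-45-6789', '000123456789']
```

A pass like this can surface obvious candidates, but it misses unredacted sensitive text that does not match a fixed pattern, which motivates the learning-based approach described next.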

In my talk I will describe my use of machine learning techniques to  
identify documents requiring redaction. Starting with 5,926 documents  
in the RECAP archive that had already been redacted by human beings  
and 17,021 randomly-selected non-redacted documents, I built several  
classifiers: one using logistic regression, and several others using  
combinations of boosting and topic models. The latter classifiers  
proved highly accurate, with the best having an area under the ROC  
curve of 0.9735.
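As a toy-scale sketch of the simplest classifier mentioned above (logistic regression over word-count features, scored by area under the ROC curve), the following assumes an illustrative six-word vocabulary and four hand-made "documents" rather than the talk's actual corpus of 5,926 redacted and 17,021 non-redacted documents; all names and hyperparameters here are assumptions for illustration.

```python
import math

# Toy vocabulary and bag-of-words featurizer (illustrative assumption).
VOCAB = ["ssn", "account", "redacted", "motion", "hearing", "order"]

def featurize(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=200):
    # Plain gradient descent on the logistic (log) loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            g = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def roc_auc(scores, labels):
    # AUC = probability that a random positive outscores a random negative.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy training set: 1 = contains sensitive information needing redaction.
docs = ["motion to seal ssn 123-45-6789", "account number listed ssn",
        "scheduling order for hearing", "order granting motion"]
labels = [1, 1, 0, 0]
X = [featurize(d) for d in docs]
w, b = train_logreg(X, labels)
scores = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
print(round(roc_auc(scores, labels), 3))
```

The boosted topic-model classifiers the abstract credits with the 0.9735 AUC would replace the raw word counts here with topic proportions and the single linear model with an ensemble of boosted weak learners; this sketch only shows the shared skeleton of training a scorer and evaluating it by ROC AUC.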

This technique has immediate application to the judicial redaction  
problem. Because redacted documents are highly similar to documents  
that should have been redacted, a classifier trained on the former  
will also be good at finding the latter. These classifiers  
dramatically reduce the amount of human labor required to find  
documents with sensitive information in our RECAP archive, as well as  
in the much larger PACER archive. Variants of this technique could  
have wide-ranging applications, including protecting attorney-client  
privilege during the discovery process and protecting national  
security when releasing some kinds of executive branch documents.


Reading List:


[1] Stuart Russell and Peter Norvig, "Artificial Intelligence: A  
Modern Approach," Chapters 3-6, 13-15, 18-21.

[2] Christopher M. Bishop, "Pattern Recognition and Machine Learning,"  
Chapters 3-4.


[3] Mark Steyvers and Tom Griffiths, "Probabilistic Topic Models."

[4] David M. Blei and John D. Lafferty, "Topic Models."

[5] Robert E. Schapire and Yoram Singer. "BoosTexter: A Boosting-based  
System for Text Categorization." Machine Learning, 2000.

[6] David Robinson, Harlan Yu, William Zeller and Edward W. Felten,  
"Government Data and the Invisible Hand." Yale Journal of Law and  
Technology, Fall 2008. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1138083

[7] Peter A. Winn, "Judicial Information Management in an Electronic  
Age: Old Standards, New Challenges." Federal Courts Law Review.

[8] Peter W. Martin, "Online Access to Court Records - from Documents  
to Data, Particulars to Patterns." Villanova Law Review, vol. 53, no.  
5 (2008)
