[talks] Q Xi general exam

Melissa M Lawson mml at CS.Princeton.EDU
Mon May 5 13:06:53 EDT 2008


Qian Xi will present her research seminar/general exam on Tuesday May 13 at 2PM in 
Room 402.  The members of her committee are: David Walker (advisor), David Blei, 
and Andrea LaPaugh.  Everyone is invited to attend her talk, and those faculty wishing 
to remain for the oral exam that follows are welcome to do so.  Her abstract and reading 
list follow below.
---------------------------
Ad hoc data sources are semi-structured data sources that are widely used in many research
fields, but lack useful data analysis and transformation tools. The goal of the PADS
system is to help programmers manipulate ad hoc data sources by providing automatic
support for the generation of data processing tools. The tool generation infrastructure
has four phases:
(1) a simple tokenization phase
(2) a rough data format inference phase
(3) a data format refinement phase
(4) a tool generation phase
One serious limitation of the current implementation involves the tokenization phase. In
particular, the current system is unable to handle ambiguous tokens.

In this research, we tackle the token ambiguity problem with a more general approach. Two
probabilistic models, a character-by-character Hidden Markov Model and a hierarchical
model, are established by extracting statistical information from a collection of training
data sources, labeled using a tool generated from hand-written descriptions. We show how
to incorporate the learned Markov models into the rest of the algorithm to enhance format
inference. We will explain the algorithms and implementations behind our new system and we
will evaluate its success using a number of measures.
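
To make the idea of a character-by-character Hidden Markov Model concrete, here is a
minimal sketch, not the system described in the talk. It assumes three hypothetical token
labels ("int", "word", "sep"), three character classes as observations, and hand-picked
probabilities; in the abstract's setting these numbers would instead be learned from
labeled training data. Viterbi decoding then picks the most likely label for each
character.

```python
import math

# Hypothetical token labels (hidden states); not taken from the announcement.
STATES = ("int", "word", "sep")


def char_class(ch):
    """Map a character to a coarse observation class."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


# Made-up probabilities for illustration only; a real model would estimate
# them from labeled training data.
START = {"int": 0.4, "word": 0.4, "sep": 0.2}
TRANS = {
    "int": {"int": 0.7, "word": 0.1, "sep": 0.2},
    "word": {"int": 0.1, "word": 0.7, "sep": 0.2},
    "sep": {"int": 0.4, "word": 0.4, "sep": 0.2},
}
EMIT = {
    "int": {"digit": 0.90, "alpha": 0.05, "other": 0.05},
    "word": {"digit": 0.20, "alpha": 0.75, "other": 0.05},
    "sep": {"digit": 0.05, "alpha": 0.05, "other": 0.90},
}


def viterbi(text):
    """Return the most likely label for each character of text."""
    if not text:
        return []
    obs = [char_class(c) for c in text]
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    for ch, label in zip("ab 12", viterbi("ab 12")):
        print(repr(ch), label)
```

Because the labels are scored jointly along the whole string, a character that could
belong to more than one token type is resolved by its neighbors rather than by a fixed
tokenization rule.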

Books:

- Benjamin C. Pierce: Types and Programming Languages, Chapters 1-11, 13-24.

- Stuart J. Russell, Peter Norvig: Artificial Intelligence, A Modern Approach (2nd
edition), Chapters 13, 15 (15.1-15.3, 15.6)

- Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapter 8, Chapter 13
(13.1-13.2)

Papers:

- Kathleen Fisher, David Walker, Kenny Q. Zhu, and Peter White. From Dirt to Shovels:
Fully Automatic Tool Generation from Ad Hoc Data. Proceedings of the 35th annual ACM
SIGPLAN-SIGACT symposium on Principles of programming languages, 2008.

- Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper Induction for Information
Extraction. International Joint Conference on Artificial Intelligence, 1997.

- David Pinto, Andrew McCallum, Xing Wei and W. Bruce Croft. Table Extraction Using
Conditional Random Fields. Proceedings of the 2003 annual national conference on Digital
government research.

- Geert Jan Bex, Frank Neven, Thomas Schwentick, and Karl Tuyls. Inference of concise DTDs
from XML data. Proceedings of the 32nd international conference on Very Large Data Bases,
2006.

- Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic Segmentation of Text
into Structured Records. Proceedings of the 2001 ACM SIGMOD international conference on
Management of data, 2001.

- Arvind Arasu and Hector Garcia-Molina. Extracting Structured Data From Web Pages.
Proceedings of the ACM SIGMOD international conference on Management of data, pages
337-348, 2003.

- E. M. Gold. Language Identification in the Limit. Information and Control, 1967.

- Stephen Soderland. Learning Information Extraction Rules for Semistructured and Free
Text. Machine Learning, 1999.


