
Qian Xi will present her research seminar/general exam on Tuesday, May 13 at 2PM in Room 402. The members of her committee are: David Walker (advisor), David Blei, and Andrea LaPaugh. Everyone is invited to attend her talk, and those faculty wishing to remain for the oral exam following are welcome to do so. Her abstract and reading list follow below.

---------------------------

Ad hoc data sources are semi-structured data sources that are widely used in many research fields but lack useful data analysis and transformation tools. The goal of the PADS system is to help programmers manipulate ad hoc data sources by providing automatic support for the generation of data processing tools. The tool generation infrastructure has multiple parts:

(1) a simple tokenization phase
(2) a rough data format inference phase
(3) a data format refinement phase
(4) a tool generation phase

One serious limitation of the current implementation involves the tokenization phase. In particular, the current system is unable to handle ambiguous tokens. In this research, we tackle the token ambiguity problem with a more general approach. Two probabilistic models, a character-by-character Hidden Markov Model and a hierarchical model, are established by extracting statistical information from a collection of training data sources, labeled using a tool generated from hand-written descriptions. We show how to incorporate the learned Markov models into the rest of the algorithm to enhance format inference. We will explain the algorithms and implementations behind our new system, and we will evaluate its success using a number of measures.

Books:

- Benjamin C. Pierce: Types and Programming Languages, Chapters 1-11, 13-24.
- Stuart J. Russell, Peter Norvig: Artificial Intelligence, A Modern Approach (2nd edition), Chapters 13 and 15 (15.1-15.3, 15.6).
- Christopher M. Bishop: Pattern Recognition and Machine Learning, Chapters 8 and 13 (13.1-13.2).

Papers:

- Kathleen Fisher, David Walker, Kenny Q. Zhu, and Peter White. From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data. Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, 2008.
- Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos. Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence, 1997.
- David Pinto, Andrew McCallum, Xing Wei and W. Bruce Croft. Table Extraction Using Conditional Random Fields. Proceedings of the 2003 annual national conference on Digital government research.
- Geert Jan Bex, Frank Neven, Thomas Schwentick, and Karl Tuyls. Inference of concise DTDs from XML data. Proceedings of the 32nd international conference on Very Large Data Bases, 2006.
- Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic Segmentation of Text into Structured Records. Proceedings of the 2001 ACM SIGMOD international conference on Management of data, 2001.
- Arvind Arasu and Hector Garcia-Molina. Extracting Structured Data From Web Pages. Proceedings of the ACM SIGMOD international conference on Management of data, pages 337-348, 2003.
- E. M. Gold. Language Identification in the Limit. Information and Control, 1967.
- Stephen Soderland. Learning Information Extraction Rules for Semistructured and Free Text. Machine Learning, 1999.
Melissa M Lawson