[talks] Q Xi preFPO

Thu Mar 31 09:40:10 EDT 2011

Qian Xi will present her preFPO on Tuesday April 5 at 1PM in Room 301 (note room).  The members of her committee are 
David Walker, advisor; Andrew Appel and Yitzhak Mandelbaum (AT&T), readers; Brian Kernighan and Andrea LaPaugh, nonreaders. 
Everyone is invited to attend her talk. Her abstract follows below.

Title: Generation of Data Processing Tools for Context-free and Context Dependent Grammars

Abstract: 

An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. Ad hoc data exists in many domains, but processing it is challenging. The goal of this research is to find a generic, reliable and efficient method to extract the structural information from ad hoc data files, which is the key step to complete before generating auxiliary data processing tools, such as parsers, pretty printers, query engines, XML converters etc.

In this dissertation, we first focus on the context-free grammars in ad hoc data sources. We design a markup language called Anne that programmers can use to annotate raw text with a set of grammatical directives. The Anne system then uses a combination of user annotations and the raw data itself to extract a context-free grammar from which the system further generates a PADS description, a Yakker description, and an XML representation of the data. To thoroughly understand Anne, we formalize a semantic theory to capture the core features of the language. This semantic theory includes the editing process, which translates a raw, unannotated document into an annotated one, and the grammar extraction process, which generates a context-free grammar from the annotated document. When reasoning about the relationship between the raw data and the generated grammar, we also find an interesting connection to the field of classic relevance logic. This relevance analysis allows us to prove important theorems concerning the expressiveness of the language. We implement the system and conduct two major experiments measuring the amount of effort needed to annotate documents and the quality of the generated grammars.

Then we extend the coverage of Anne to context dependent grammars by adding new features in the syntax and semantic rules of the formalization, including variable bindings, data-dependent constraint, and transformations. We demonstrate the relevance analysis is a robust theory by proving all the principle theorems still hold after this extension.

Our experience of using dependent parser generator shows that arbitrary dependent grammars can be costly to parse, and thus we are motivated to build more efficient parsers. We apply an analysis of dependent grammars to find particular subparts that do not rely on general dependent parsing and can therefore use conventional optimization techniques. We give formal definitions of the analysis and use it to guide implementation. The results of the implementation demonstrate that the efficiency of parsing is improved by the analysis.