[colloquium] TODAY: Automatic Generation of Data-Processing Tools
TITLE: Automatic Generation of Data-Processing Tools

SPEAKERS: Yitzhak Mandelbaum and David Walker
Department of Computer Science, Princeton University

TIME: Monday, November 14, 2005
Seminar begins at 12:30 p.m. (lunch provided ~12:20)

LOCATION: Room 302

ABSTRACT:
An ad hoc data format is any non-standard data format for which parsing, querying, analysis, or transformation tools are not readily available. Despite the increasing use of standard data formats such as XML, ad hoc data sources continue to arise in numerous industries such as finance, health care, transportation, and telecommunications, as well as in scientific domains such as computational biology and chemistry. The absence of tools for processing ad hoc data formats complicates the daily data-management tasks of data analysts, who may have to cope with numerous ad hoc formats even within a single application.

Common characteristics of ad hoc data complicate the building of tools to perform even basic data-processing tasks. For example, documentation of ad hoc formats is often incomplete or inaccurate, making it difficult to define a database schema or to build a reliable parser. In addition, the data itself often contains numerous kinds of errors, which can thwart standard database loaders.

In this talk we will describe PADS, a system for automatic generation of data-processing tools. PADS allows programmers to write simple, high-level descriptions of their data format. Descriptions include information on both the physical layout of the data within a file and semantic constraints such as the range of allowed values and correlations between different parts of the data. Once the data has been properly described, the PADS compiler can generate a suite of programming libraries and stand-alone tools. In particular, the PADS compiler generates a parser library capable of detecting and recovering from data errors and a printing library for the format.
On top of these basic libraries, PADS provides generic tools that can translate ad hoc data into XML, format the data in a canonical form, query the data using the semi-structured query language XQuery as if it were XML (without actually incurring the overhead of translation to XML), and generate a statistical summary of data characteristics such as the range of values in different data fields and the number of errors in each field.

** PICASso:
** Program in Integrative Information, Computer and Application Sciences
** www.cs.princeton.edu/picasso
**

SIGN UP FOR THE PICASSO MAILING LIST: If you would like to be kept informed of computationally-oriented events in (and around) Princeton, please SUBSCRIBE to the PICASso mailing list by visiting https://lists.cs.princeton.edu/mailman/listinfo/picasso. This page also contains information on how to UNSUBSCRIBE.

PLEASE FORWARD THIS MESSAGE TO OTHER COMPUTATIONALLY-ORIENTED RESEARCHERS WHO MAY BE INTERESTED IN THESE EVENTS, OR FUTURE PROGRAMS. THANKS!
participants (1)
-
Steven Kleinstein