[talks] G Ottoni preFPO

Mon Sep 24 14:21:30 EDT 2007

 Guilherme Ottoni will present his preFPO on Monday October 1 
at 11AM in Room 301.  The members of his committee are:  David 
August, advisor; Sharad Malik (EE) and Vivek Sarkar (Rice), readers; 
Margaret Martonosi and David Walker, nonreaders.  Everyone is 
invited to attend his talk.  His abstract follows below.
--------------------

Global Instruction Scheduling for Multi-Threaded Architectures

Recently, the microprocessor industry has moved toward chip multiprocessor (CMP) designs
as a means of utilizing the increasing transistor counts in the face of physical and
micro-architectural limitations. Despite this move, CMPs do not directly improve the
performance of single-threaded codes, a characteristic of most applications. In effect,
the move to CMPs has shifted even more the task of improving performance from the hardware
to the software.

Since developing parallel applications has long been recognized as significantly harder
than developing sequential ones, it is very desirable to have automatic tools to extract
parallelism from sequential applications. Unfortunately, automatic parallelization has
only been successful in the restricted domain of scientific and data-parallel
applications, which usually have regular array-based memory accesses and little control
flow. In order to support parallelization of general-purpose applications, computer
architects have proposed CMPs with light-weight, fine-grained (scalar) communication
mechanisms.  Despite such support, most existing compiler multi-threading techniques have
generally demonstrated little effectiveness in extracting parallelism from non-scientific
applications. The main reason for this is that such techniques are mostly restricted to
extracting parallelism within local, straight-line regions of code. As observed in several
limit studies, local regions of code have limited parallelism, and control dependence
analysis and multiple control units are necessary to extract most of the parallelism
opportunities.

In this talk, I will describe a general compiler framework to perform Global
Multi-Threaded (GMT) instruction scheduling, based on a Program Dependence Graph (PDG)
representation, graph partitioning algorithms, and a novel Multi-Threaded Code Generation
(MTCG) algorithm. Based on this framework, two techniques to perform GMT instructions
scheduling are proposed. The first one, Decoupled Software Pipelining (DSWP), extracts
pipeline parallelism from arbitrary loop nests. The second technique, GREMIO, is
applicable to arbitrary (intra-procedural) code regions, and can also exploit other forms
of parallelism.  Finally, I will present several communication optimizations to improve
the MTCG algorithm, and thus all GMT instruction scheduling techniques based on it. All
these techniques are implemented in the VELOCITY compiler and evaluated on an accurate CMP
simulator built on top of validated Itanium 2 core models. The results demonstrate the
effectiveness of the proposed compiler techniques, with speedups on a number of real
applications written in C.