
Guilherme Ottoni will present his preFPO on Monday, October 1, at 11AM in Room 301. The members of his committee are: David August, advisor; Sharad Malik (EE) and Vivek Sarkar (Rice), readers; Margaret Martonosi and David Walker, nonreaders. Everyone is invited to attend his talk. His abstract follows below.

--------------------

Global Instruction Scheduling for Multi-Threaded Architectures

Recently, the microprocessor industry has moved toward chip multiprocessor (CMP) designs as a means of utilizing increasing transistor counts in the face of physical and micro-architectural limitations. Despite this move, CMPs do not directly improve the performance of single-threaded codes, which constitute most applications. In effect, the move to CMPs has shifted the task of improving performance even further from the hardware to the software. Since developing parallel applications has long been recognized as significantly harder than developing sequential ones, automatic tools that extract parallelism from sequential applications are highly desirable. Unfortunately, automatic parallelization has succeeded only in the restricted domain of scientific and data-parallel applications, which usually have regular array-based memory accesses and little control flow.

To support parallelization of general-purpose applications, computer architects have proposed CMPs with light-weight, fine-grained (scalar) communication mechanisms. Despite such support, most existing compiler multi-threading techniques have demonstrated little effectiveness in extracting parallelism from non-scientific applications. The main reason is that these techniques are mostly restricted to extracting parallelism within local, straight-line regions of code. As several limit studies have observed, local regions of code have limited parallelism, and control dependence analysis and multiple control units are necessary to exploit most of the parallelism opportunities.
In this talk, I will describe a general compiler framework to perform Global Multi-Threaded (GMT) instruction scheduling, based on a Program Dependence Graph (PDG) representation, graph partitioning algorithms, and a novel Multi-Threaded Code Generation (MTCG) algorithm. Based on this framework, two techniques to perform GMT instruction scheduling are proposed. The first, Decoupled Software Pipelining (DSWP), extracts pipeline parallelism from arbitrary loop nests. The second, GREMIO, is applicable to arbitrary (intra-procedural) code regions and can also exploit other forms of parallelism. Finally, I will present several communication optimizations that improve the MTCG algorithm, and thus all GMT instruction scheduling techniques based on it. All these techniques are implemented in the VELOCITY compiler and evaluated on an accurate CMP simulator built on top of validated Itanium 2 core models. The results demonstrate the effectiveness of the proposed compiler techniques, with speedups on a number of real applications written in C.
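To make the pipeline-parallelism idea behind DSWP concrete, here is a minimal, hypothetical sketch (not from the talk): a sequential loop is split into two stages that run on separate threads and communicate values through a queue, which stands in for the fine-grained scalar communication mechanism mentioned above. The function names and data are illustrative only.

```python
import queue
import threading

def work(x):
    # Placeholder for the expensive per-item computation in the loop body.
    return x * x

def producer(items, q):
    # Stage 1: the sequential traversal, which carries the loop dependence.
    for x in items:
        q.put(x)
    q.put(None)  # sentinel marking the end of the stream

def consumer(q, out):
    # Stage 2: the per-item computation, overlapped with the traversal.
    total = 0
    while True:
        x = q.get()
        if x is None:
            break
        total += work(x)
    out.append(total)

items = list(range(10))
q = queue.Queue()
out = []
t1 = threading.Thread(target=producer, args=(items, q))
t2 = threading.Thread(target=consumer, args=(q, out))
t1.start(); t2.start()
t1.join(); t2.join()
print(out[0])  # same result as the sequential loop: 285
```

The key property, as in DSWP, is that communication flows in one direction only (stage 1 to stage 2), so the stages form a pipeline rather than a cyclic dependence, and the traversal and the computation can proceed concurrently.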