[talks] A Soviani preFPO

Melissa Lawson mml at CS.Princeton.EDU
Tue May 11 14:19:48 EDT 2010


Adrian Soviani will present his preFPO on Monday, May 17, at 4PM in
Room 402.  The members of his committee are: J.P. Singh, advisor;
Kai Li and David August, readers; Brian Kernighan and Ken Steiglitz,
nonreaders.  Everyone is invited to attend his talk.  His abstract
follows below.
---------------------------------------------

Building scalable parallel applications remains difficult due to
implementation complexity, poor performance portability, and the
difficulty of understanding hidden costs and bottlenecks. Specific
machine architectures and software layers further influence
application efficiency and performance transparency, making
optimization a time-consuming burden. The parallel programming model
has a great impact on addressing these issues and on reaching a good
tradeoff between implementation effort and application efficiency.

This thesis presents a hybrid SPMD / coarse-grain dataflow
programming model (CGD) that describes data and task parallelism at
a high level. CGD applications are specified as dependencies between
computation modules and data distributions, while communication and
synchronization are added automatically and optimized for specific
architectures. We claim that the CGD model makes application
development and design space exploration simpler than message
passing while providing similar or better performance.
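
The abstract does not show CGD's concrete notation, but the general
idea of a coarse-grain dataflow specification can be illustrated
with a toy, self-contained C++ sketch; the DataflowGraph class, the
module names, and the FT-like stages below are invented for
illustration and are not CGD's actual API:

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Toy coarse-grain dataflow graph: a module runs once all of
    // its declared input modules have completed.
    struct DataflowGraph {
        struct Module {
            std::vector<std::string> deps;
            std::function<void()> body;
        };
        std::map<std::string, Module> modules;

        void add(const std::string& name, std::vector<std::string> deps,
                 std::function<void()> body) {
            modules[name] = Module{std::move(deps), std::move(body)};
        }

        // Naive sequential topological execution; a real runtime
        // would run independent modules in parallel and insert the
        // required communication between them.
        void run() {
            std::map<std::string, bool> done;
            bool progress = true;
            while (progress) {
                progress = false;
                for (auto& entry : modules) {
                    if (done[entry.first]) continue;
                    bool ready = true;
                    for (const auto& d : entry.second.deps)
                        if (!done[d]) ready = false;
                    if (ready) {
                        entry.second.body();
                        done[entry.first] = true;
                        progress = true;
                    }
                }
            }
        }
    };

    int main() {
        DataflowGraph g;
        g.add("fft_x", {}, [] { std::cout << "1D FFTs along x\n"; });
        g.add("transpose", {"fft_x"},
              [] { std::cout << "global transpose\n"; });
        g.add("fft_y", {"transpose"},
              [] { std::cout << "1D FFTs along y\n"; });
        g.run();
    }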

Results on the 128-CPU SGI Altix 4700 show that our optimized CGD FT
is 27% faster than the original NPB 2.3 MPI implementation, the
optimized CGD stencil has a 41% advantage over handwritten MPI, and
the CGD Barnes-Hut 1M-particle benchmark is 15% faster than the
pthreads SPLASH-2 implementation. CGD takes advantage of its
dataflow semantics and explicit distribution rules to automatically
schedule computations and to insert, aggregate, and overlap
communication, simplifying application development and optimization.
For example, the CGD NPB FT implementation requires 85 lines of
dataflow and 700 lines of sequential C++ code, while the original
MPI implementation has 1260 lines of code and exhibits poorer
performance portability.
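
For contrast, the kind of handwritten nonblocking communication that
a CGD runtime would insert and overlap automatically looks roughly
like the following stencil halo exchange (a minimal sketch using
standard MPI nonblocking calls; the 1D decomposition, buffer names,
and sizes are assumptions, not the benchmark's actual code):

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1024;  // local rows per rank, invented size
        std::vector<double> u(N, 1.0), halo_lo(1), halo_hi(1);
        int lo = (rank == 0) ? MPI_PROC_NULL : rank - 1;
        int hi = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        // Post the halo exchange, then overlap it with interior work.
        MPI_Request reqs[4];
        MPI_Irecv(halo_lo.data(), 1, MPI_DOUBLE, lo, 0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(halo_hi.data(), 1, MPI_DOUBLE, hi, 1,
                  MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(u.data(), 1, MPI_DOUBLE, lo, 1,
                  MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&u[N - 1], 1, MPI_DOUBLE, hi, 0,
                  MPI_COMM_WORLD, &reqs[3]);

        // Interior update needs no halo values, so it can proceed
        // while the exchange is in flight.
        double interior = 0.0;
        for (int i = 1; i < N - 1; ++i)
            interior += 0.5 * (u[i - 1] + u[i + 1]);

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        // Boundary cells would now be updated using halo_lo/halo_hi.

        MPI_Finalize();
        return 0;
    }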


