[parsec-users] Parsec 2.0, M5 Simulations, Linux Idle loop.

Rick Strong rstrong at cs.ucsd.edu
Wed Sep 16 14:58:36 EDT 2009

Dear all,

I am currently a Ph.D. student at UCSD studying computer architecture for 
multicore systems and its interaction with the OS. My goal for the last half 
year has been to run PARSEC 2.0 on the M5 simulator, using the Alpha ISA, 
for many-core architectures.

I have most of the benchmarks compiled and ready to go, but I find that 
IPC is lower than I would expect. The attached figure shows IPC 
for 2, 4, 8, 16, and 32 cores for a hypothetical 
22nm process technology running at 3.5 GHz with an out-of-order processor 
modeling the Alpha EV6. The IPC seems fine for 2 cores, but as more 
cores are added, an alarming amount of time is spent in the idle loop of 
the Linux kernel, which puts the processor to sleep through a quiesce 
instruction. The amount of time spent sleeping is shown in 
profile_quiesce.png, which is also attached (this statistic is gathered in 
a gprof-like manner). The input set used was simsmall, and 
I started simulation measurement at the beginning of the Region of 

Many things could be going wrong, but the problem seems to 
be related to a lack of work available to be scheduled on the idle 
cores. Some possible causes include:
(1) The Linux scheduler has not load balanced the parallel application, 
leaving some cores unscheduled.
(2) The threads are stalling on a barrier and the core has nothing left 
to do.
(3) Poor startup performance. I see this occur when I run the 
benchmarks for simsmall on an x86 Nehalem architecture, where the 8 
virtual CPUs never get up to 100% utilization.
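
For causes (1) and (3) on a native Linux machine such as the Nehalem 
system, per-CPU utilization can be estimated by differencing two 
snapshots of /proc/stat. The following is a minimal Python sketch of 
that idea, not a tested M5 workflow. The function names are 
hypothetical, and counting iowait as idle time is an assumption.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-CPU line."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu0 user nice system idle iowait ...".
        # Skip the aggregate "cpu" line and every non-CPU line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Use at most the first 8 counters; guest time is already part of user.
        ticks = [int(v) for v in fields[1:9]]
        # Assumption: idle and iowait both count as "not busy".
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        counters[fields[0]] = (total - idle, total)
    return counters


def utilization(before, after):
    """Return {cpu_name: busy fraction} between two parsed snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        result[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def sample_utilization(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and compare."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return utilization(before, after)
```

Calling sample_utilization() repeatedly while a benchmark runs would 
show whether some CPUs stay near zero utilization for the whole run 
or only during startup.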

This leads to the following questions for the PARSEC team, as 
I am hoping your experience and expert knowledge can direct my 
instrumentation more effectively.

(1) Have you noticed that Linux scheduler load balancing takes a long 
time relative to the execution time of simsmall?

(2) Is there an easy way to determine whether a PARSEC benchmark is 
indeed scheduled and running on all available cores?

(3) Does simsmall contain enough work to saturate core utilization, or is 
it just too small? If it is too small, which input size is optimal?

(4) Are there known reasons why the PARSEC benchmark suite would not 
play nice with the Alpha architecture running a Linux kernel, for those 
benchmarks compiled using pthreads (I am purposely leaving out OpenMP)?

(5) Is there a way to easily test the barrier stall hypothesis?
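
One possible check, sketched here in Python under stated assumptions: 
if each thread could log a timestamp whenever it arrives at a barrier, 
the time a thread waits in one barrier episode is the latest arrival 
time minus its own arrival time. The log format and function names 
below are hypothetical, and the sketch assumes the barrier releases 
all threads as soon as the last one arrives.

```python
def barrier_wait_times(episodes):
    """Sum per-thread waiting time over all barrier episodes.

    `episodes` is a list with one dict per barrier episode, mapping
    thread id -> arrival timestamp (hypothetical log format).
    """
    waits = {}
    for arrivals in episodes:
        # Assumption: the barrier opens when the last thread arrives.
        release = max(arrivals.values())
        for thread_id, arrived in arrivals.items():
            waits[thread_id] = waits.get(thread_id, 0) + (release - arrived)
    return waits


def wait_fractions(episodes, elapsed):
    """Return each thread's barrier waiting time as a fraction of `elapsed`."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    totals = barrier_wait_times(episodes)
    return {thread_id: total / elapsed for thread_id, total in totals.items()}
```

If the per-thread wait fractions roughly matched the fraction of time 
the cores spend in the idle loop, that would support the barrier stall 
hypothesis; a large gap would point to the scheduler or startup instead.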

Thanks in advance,
-Richard Strong


