[parsec-users] Parsec 2.0, M5 Simulations, Linux Idle loop.

Christian Bienia cbienia at CS.Princeton.EDU
Fri Sep 18 11:03:19 EDT 2009

Hi Rick,

Here are a few thoughts about the idle time:

1.) In some cases the idle time is clearly affected by startup cost. In
particular, programs with a pipeline (like dedup and ferret) need to fill
the whole pipeline before all threads have work to do. You should fast
forward far enough before you start the detailed simulation.

2.) Wait time at barriers will have some impact. Of all the programs,
streamcluster is affected the most because it makes very liberal use of
barriers. For PARSEC 2.1 I wrote a pthread barrier drop-in replacement that
implements a barrier with a mutex and a condition variable and spins for a
while before it blocks. This way the number of context switches is minimized
and you should see less idle time.
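The idea can be sketched roughly as follows. This is not the actual PARSEC
2.1 code, just a minimal illustration in C with pthreads; SPIN_LIMIT is a
made-up tuning knob:

```c
#define _GNU_SOURCE
#include <pthread.h>

#define SPIN_LIMIT 100000  /* hypothetical knob: iterations to spin before blocking */

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int      count;       /* threads that still have to arrive */
    int      nthreads;    /* total number of participants */
    unsigned generation;  /* incremented each time the barrier opens */
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int n) {
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    b->count = n;
    b->nthreads = n;
    b->generation = 0;
}

void spin_barrier_wait(spin_barrier_t *b) {
    pthread_mutex_lock(&b->mutex);
    unsigned gen = b->generation;
    if (--b->count == 0) {
        /* last thread: open the barrier and wake any blocked waiters */
        b->generation++;
        b->count = b->nthreads;
        pthread_cond_broadcast(&b->cond);
        pthread_mutex_unlock(&b->mutex);
        return;
    }
    pthread_mutex_unlock(&b->mutex);

    /* spin for a while, hoping the remaining threads arrive soon */
    for (long i = 0; i < SPIN_LIMIT; i++)
        if (__atomic_load_n(&b->generation, __ATOMIC_ACQUIRE) != gen)
            return;

    /* give up and block, so the core is free for other work */
    pthread_mutex_lock(&b->mutex);
    while (b->generation == gen)
        pthread_cond_wait(&b->cond, &b->mutex);
    pthread_mutex_unlock(&b->mutex);
}
```

The spin phase covers the common case where the last thread arrives almost
immediately, so most barrier episodes finish without any context switch; only
a genuinely late arrival makes the others block.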

3.) Some PARSEC users told me that the Linux scheduler has issues with
higher numbers of cores (other operating systems are probably affected by
this to varying degrees as well). It causes too much thread migration, which
limits scalability because the caches need to be warmed up again after every
migration. You need to pin the individual threads to the CPUs they're
supposed to run on. The thread affinity code of the hooks library is not
enough because it only assigns a subset of CPUs to the program as a whole,
which are then mapped to threads by the scheduler. So it's the same
situation again; past a certain core count you really need to pin each
thread to its own core.
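On Linux this kind of per-thread pinning can be done with
pthread_setaffinity_np; here is a minimal sketch (the simple one-to-one core
numbering is an assumption, real code should respect the machine's
topology):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to exactly one core. Returns 0 on success.
 * Linux-specific: pthread_setaffinity_np is a nonstandard GNU extension. */
int pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Each worker thread would call something like pin_to_core(thread_id) as its
first action, so the mapping from threads to cores stays fixed for the whole
run instead of being left to the scheduler.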

4.) As Major already pointed out, you can use blackscholes as a test. The
program is so simple that it should always scale linearly. If it doesn't,
then chances are good that there's an issue with your methodology.

5.) Simsmall is too small for 8 cores. With 8 cores you should at least use
simmedium. Generally, always use the biggest input set that you can afford
to run.


-----Original Message-----
From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Rick
Sent: Wednesday, September 16, 2009 5:46 PM
To: Major Bhadauria
Cc: PARSEC Users
Subject: Re: [parsec-users] Parsec 2.0, M5 Simulations, Linux Idle loop.

Major Bhadauria wrote:
> Blackscholes does not have much interaction between threads, so it 
> seems unlikely there's thread contention or threads stuck at barriers, 
> I'm not familiar enough with your simulator to know if all the threads 
> are running on all available cores or if you're mapping all the 
> threads to run on just two cores and the other cores are sleeping.
The simulator is booting the Linux operating system with the Linux 
scheduler handling thread movement. Thus, if you have a methodology for 
running the benchmarks on a vanilla Linux kernel, I can duplicate it. The 
only problem here is that the amount of time I can simulate the benchmarks 
is limited, so I would rather have the threads communicate to me when they 
have all been scheduled to a different core and then take a checkpoint. Is 
there a way to get a new ROI after the threads have all been scheduled to 
each available core?
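One way I could imagine doing this (just a sketch, not an existing M5 or
PARSEC facility) is to have every worker thread record which core it is
running on via sched_getcpu(), and fire the checkpoint once every core has
been observed hosting a thread:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

static atomic_ulong seen_cpus;  /* bitmask of cores observed running a worker */

/* Each worker calls this periodically; returns 1 once every one of the
 * ncores cores has hosted at least one worker thread (assumes ncores < 64). */
int all_cores_busy(int ncores) {
    unsigned long bit  = 1UL << sched_getcpu();
    unsigned long mask = atomic_fetch_or(&seen_cpus, bit) | bit;
    return mask == (1UL << ncores) - 1UL;
}
```

A thread that sees all_cores_busy() return 1 could then invoke the
simulator's checkpoint mechanism, so the ROI starts only after the
scheduler has actually spread the threads out.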

In addition, I have compiled parsec 2.0 with option -c gcc-hooks and 
have not set any of the OS scheduling affinity variables, under the 
assumption that the benchmarks will attempt to use all of the cores. 
Should I be setting affinity? Also, I do see performance improvements 
for 2, 4, 8, 16, and 32 cores, which indicates that the cores are being 
utilized.
> Do you have the same behavior with larger input sets? simsmall is 
> really tiny, you should try the largest size that can finish in a 
> reasonable amount of time, perhaps sim-medium (for the larger 32 core 
> simulation)?
I have tried a simulation of 1e9 total instructions with sim-large, which 
takes around 10 hours to finish and corresponds to around 25-275 ms of 
execution in simulation time. I did not see much difference in performance 
for this short execution. One possible explanation might be that I am 
seeing startup effects of the ROI.
> It's unclear what the IPC graph shows; is it the IPC for all the procs 
> combined? I'm assuming all the instructions are useful insns, since at 
> sync points the kernel puts the cores to sleep?
The IPC is the summed IPC for all cores in the system. It is possible 
that not all instructions are useful instructions; whether the parallel 
threads release the processor while waiting for more work depends on the 
parallelization model used. If a parallel thread goes to sleep and the 
idle thread has no other threads to schedule, then the Linux kernel 
executes a quiesce instruction, which puts the core to sleep.
> Rick Strong wrote:
>> I have attached the pictures this time. Hopefully, they make it to 
>> the mailing list.
>> -Rick
>> Rick Strong wrote:
>>> Dear all,
>>> I am a current Ph.D. student at UCSD studying computer architecture 
>>> for multicore systems and its interaction with the OS. My goal for 
>>> the last half of the year has been to run Parsec-2.0 on the M5 
>>> simulator for the Alpha ISA for many-core architectures.
>>> I have most of the benchmarks compiled and ready to go but I find 
>>> that IPC is smaller than what I would expect. The figure attached 
>>> shows IPC for 2 cores, 4 cores, 8 cores, 16 cores and 32 cores for a 
>>> hypothetical 22nm process technology running @ 3.5GHz in an 
>>> Out-of-Order processor modeling the Alpha EV6.  The IPC seems fine 
>>> for 2 cores, but as more cores are added an alarming amount of time 
>>> is spent in the idle loop of the linux kernel which puts the 
>>> processor to sleep through a quiesce instruction ... you may find 
>>> the amount of time spent sleeping in the attached 
>>> profile_quiesce.png (this stat is gathered in a gprof-like manner). 
>>> The input set used was simsmall, and I started simulation 
>>> measurement at the beginning of the Region of Interest.
>>> There are many things that can be going wrong but the problem seems 
>>> to be related to a lack of work available to be scheduled on the 
>>> idle cores. Some possible causes include:
>>> (1) The linux scheduler has not load balanced the parallel 
>>> application leaving some cores unscheduled.
>>> (2) The threads are stalling on a barrier and the core has nothing 
>>> left to do.
>>> (3) Poor startup performance. I see this occur when I simulate the 
>>> benchmarks for simsmall on a x86 nehalem architecture where the 8 
>>> virtual cpu's never get up to 100% utilization.
>>> This introduction brings the following questions for the parsec 
>>> team, as I am hoping your experience and expert knowledge can direct 
>>> my instrumentation more effectively.
>>> (1) Have you noticed Linux scheduler load balancing taking a 
>>> disproportionately large fraction of the execution time with simsmall?
>>> (2) Is there an easy way to determine that the parsec benchmark is 
>>> indeed scheduled and running on all available cores?
>>> (3) Does simsmall contain enough work to saturate core utilization 
>>> or is it just too small? If so, which sim size is optimal?
>>> (4) Are there known reasons why the parsec benchmark suite would not 
>>> play nice with the Alpha architecture running a linux kernel for 
>>> those benchmarks compiled using pthreads (I am purposely leaving out 
>>> OpenMP)?
>>> (5) Is there a way to easily test the barrier stall hypothesis?
>>> Thanks in advance,
>>> -Richard Strong
>>> _______________________________________________
>>> parsec-users mailing list
>>> parsec-users at lists.cs.princeton.edu
>>> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
