[parsec-users] Parsec 2.0, M5 Simulations, Linux Idle loop.
cbienia at CS.Princeton.EDU
Wed Sep 23 22:12:18 EDT 2009
1.) Pipeline fillup
You could add a magic instruction to the code of the last pipeline stage,
right after it dequeues items from its input queue. This way you'll be
notified that the last stage has started to work. You should still fast
forward by some number of instructions to guarantee warmup, and also that
the other threads for this particular pipeline stage have started to work.
2.) Linux scheduler
Unfortunately I don't know which Linux scheduler tries to migrate threads
excessively. I suspect they are all affected, and probably also many
schedulers of other operating systems.
From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Rick
Sent: Wednesday, September 23, 2009 6:18 PM
To: PARSEC Users
Subject: Re: [parsec-users] Parsec 2.0, M5 Simulations, Linux Idle loop.
Christian Bienia wrote:
> Hi Rick,
> Here are a few thoughts about the idle time:
> 1.) In some cases the idle time is clearly affected by the startup
> cost. In particular, programs with a pipeline (like dedup and ferret)
> need to fill the whole pipeline before all threads have something to
> do. You should fast forward enough before you start the detailed
> simulation.
I see. Is there a good way of measuring this pipeline fillup? For instance,
are there obvious points in the code that might suggest the time to start
measuring?
> 2.) Wait time at barriers will have some impact. Of all the programs
> streamcluster is affected the most because it makes very liberal use
> of barriers. For PARSEC 2.1 I wrote a pthread barrier drop-in
> replacement that implements a barrier with a mutex and a condition
> variable which spins a while before it blocks. This way the number of
> context switches is minimized and you should see less idle time.
That is very cool. My results confirm that streamcluster is indeed spending
a lot of time in context switches compared to the other benchmarks (see
attached graph sys_kern_swap_contexts). I hope that I will see better
performance for this benchmark with this modification in PARSEC 2.1.
> 3.) Some PARSEC users told me that the Linux scheduler has issues with
> higher numbers of cores (other operating systems are probably also
> affected by this to varying degrees). It causes too much thread
> migration, which limits scalability because the caches need to be
> warmed up after every context switch. You need to pin the individual
> threads to the CPUs they're supposed to run on. The thread affinity
> code of the hooks library is not enough because it only assigns a
> subset of CPUs to the program, which are then mapped to threads by the
> scheduler. So it's the same situation again; past a certain number of
> cores you really need to map threads to individual cores yourself.
This is helpful and I suspected that this might be the case. I guess the
easiest thing to do here is to add some instrumentation code that keeps
track of when the kernel migrates threads between cores to see if I am
seeing lots of migrations. I have one question about the linux scheduler
note that you made. When users had problems with the linux scheduler, were
they referring to the Completely Fair Scheduler (CFS) or the O(1) scheduler?
> 4.) As Major already pointed out, you can use blackscholes as a test.
> The program is so simple it should always scale linearly. If it
> doesn't then chances are good that there's an issue with the methodology.
This is also helpful as it will give me a sanity test for the results.
> 5.) Simsmall is too small for 8 cores. With 8 cores you should at
> least use simmedium. Generally, always use the biggest input set that
> you can afford to run.
This may be a vital oversight in my results so far. I have tried simlarge
with 10X the execution time (10e9 total instructions) and each core is
getting greater than 0.9 IPC, which is comforting for up to 8 cores. I am
still waiting on the other results.
Thanks for the help,
> -----Original Message-----
> From: parsec-users-bounces at lists.cs.princeton.edu
> [mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Rick
> Sent: Wednesday, September 16, 2009 5:46 PM
> To: Major Bhadauria
> Cc: PARSEC Users
> Subject: Re: [parsec-users] Parsec 2.0, M5 Simulations, Linux Idle loop.
> Major Bhadauria wrote:
>> Blackscholes does not have much interaction between threads, so it
>> seems unlikely there's thread contention or threads stuck at
>> barriers, I'm not familiar enough with your simulator to know if all
>> the threads are running on all available cores or if you're mapping
>> all the threads to run on just two cores and the other cores are
> The simulator is booting the linux operating system with the linux
> scheduler handling thread movement. Thus, if you have a methodology
> for running the benchmarks on a vanilla linux kernel, then I can
> duplicate that. The only problem here is that the amount of time I can
> simulate the benchmarks is limited, so I would rather have the threads
> communicate to me when they are all scheduled to a different core and
> then take a checkpoint. Is there a way to get a new ROI after the
> threads have all been scheduled to each available core?
> In addition, I have compiled parsec 2.0 with option -c gcc-hooks and
> have not set any of the OS scheduling affinity variables under the
> assumption that the benchmarks will attempt to use all of the cores.
> Should I be setting affinity? Also, I do see performance improvements
> for 2, 4, 8, 16, and 32 cores, which indicates that the cores are
> being used.
>> Do you have the same behavior with larger input sets? simsmall is
>> really tiny, you should try the largest size that can finish in a
>> reasonable amount of time, perhaps sim-medium (for the larger 32 core
> I have tried a 1e9 total instruction execution simulation with
> sim-large, which takes around 10 hours to finish and is around
> 25-275ms of execution in simulation time. I did not see much
> difference in performance for this small execution. One possible
> explanation might be that I am seeing startup effects of the ROI?
>> It's unclear what the IPC graph shows. Is it IPCs for all the procs
>> combined? I'm assuming all the instructions are useful insns, since
>> at sync points the kernel puts the cores to sleep?
> The IPC is the summed IPC for all cores in the system. It is possible
> that not all instructions are useful instructions; whether the parallel
> threads release the processor while waiting for more work depends on
> the parallelization model used.
> If the parallel thread went to sleep and there were no other threads
> left to schedule, then the linux kernel's idle loop executes a quiesce
> instruction, which puts the core to sleep.
>> Rick Strong wrote:
>>> I have attached the pictures this time. Hopefully, they make it to
>>> the mailing list.
>>> Rick Strong wrote:
>>>> Dear all,
>>>> I am a current Ph.D. student at UCSD studying computer architecture
>>>> for multicore systems and its interaction with the OS. My goal for
>>>> the last half year has been to run Parsec-2.0 on the M5 simulator
>>>> for the Alpha ISA on many-core architectures.
>>>> I have most of the benchmarks compiled and ready to go but I find
>>>> that IPC is smaller than what I would expect. The figure attached
>>>> shows IPC for 2 cores, 4 cores, 8 cores, 16 cores and 32 cores for
>>>> a hypothetical 22nm process technology running @ 3.5GHz in an
>>>> Out-of-Order processor modeling the Alpha EV6. The IPC seems fine
>>>> for 2 cores, but as more cores are added an alarming amount of time
>>>> is spent in the idle loop of the linux kernel, which puts the
>>>> processor to sleep through a quiesce instruction. You may find
>>>> the amount of time spent sleeping in profile_quiesce.png, which was
>>>> also attached (this stat is gathered in a gprof-like manner). The
>>>> input set that was being used was simsmall and I started simulation
>>>> measurement at the beginning of the Region of Interest.
>>>> There are many things that can be going wrong but the problem seems
>>>> to be related to a lack of work available to be scheduled on the
>>>> idle cores. Some possible causes include:
>>>> (1) The linux scheduler has not load balanced the parallel
>>>> application leaving some cores unscheduled.
>>>> (2) The threads are stalling on a barrier and the core has nothing
>>>> left to do.
>>>> (3) Poor startup performance. I see this occur when I simulate the
>>>> benchmarks with simsmall on an x86 Nehalem architecture, where the 8
>>>> virtual CPUs never reach 100% utilization.
>>>> This introduction brings the following questions for the parsec
>>>> team, as I am hoping your experience and expert knowledge can
>>>> direct my instrumentation more effectively.
>>>> (1) Have you noticed that linux scheduler load balancing takes a
>>>> significant fraction of the execution time with simsmall?
>>>> (2) Is there an easy way to determine that the parsec benchmark is
>>>> indeed scheduled and running on all available cores?
>>>> (3) Does simsmall contain enough work to saturate core utilization
>>>> or is it just too small? If it is too small, which input size is optimal?
>>>> (4) Are there known reasons why the parsec benchmark suite would
>>>> not play nice with the Alpha architecture running a linux kernel
>>>> for those benchmarks compiled using pthreads (I am purposely
>>>> leaving out OpenMP)?
>>>> (5) Is there a way to easily test the barrier stall hypothesis?
>>>> Thanks in advance,
>>>> -Richard Strong
>>>> parsec-users mailing list
>>>> parsec-users at lists.cs.princeton.edu