[parsec-users] Fluid animate

Jim Dempsey jim at quickthreadprogramming.com
Thu Jun 10 13:58:17 EDT 2010


Major,
 
I was surprised by the degradation under oversubscription as well. There may
be an internal issue with the thread pool scheduler when oversubscription is
reached. It will take some time with the debugger to isolate the problem.
 
The QuickThread system uses a task model in which a pool of threads is
established and feeds off a task queuing system. The system is designed to run
from 1 thread up to a full subscription of threads. Oversubscription is not
recommended but should be tolerated. I think the issue may be related to
scheduling at oversubscription as opposed to a cache issue.
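
For readers unfamiliar with the model, here is a minimal generic sketch of a
pool of threads feeding off a task queue. It is not the QuickThread
implementation, just the general shape of the approach, using standard C++11
threads (class and function names here are illustrative only):

// Generic sketch of a thread pool fed by a task queue -- not QuickThread code.
// Build with: g++ -std=c++11 -pthread taskpool_sketch.cpp
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });   // fixed pool of worker threads
    }
    ~TaskPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();              // drain remaining tasks, then join
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (done_ && q_.empty()) return;
                task = std::move(q_.front());
                q_.pop();
            }
            task();   // execute outside the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

int main() {
    TaskPool pool(4);                       // pool sized to the hardware, not oversubscribed
    for (int i = 0; i < 8; ++i)
        pool.submit([i] { std::printf("task %d\n", i); });
}   // destructor drains the queue and joins the workers

The point of the model is that the worker count is fixed to match the
hardware; the number of tasks, not the number of threads, scales with the
problem.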
 
QuickThread is an affinity-pinned system. At 17 threads, the expected result
(for this application) would have been approximately 2x the 16-thread time:
 
   ((16 threads time) x 16/17) x 2
 
or, written another way, as per-hardware-thread timelines:
 
--- time
------------------------------------------------------------------------>
   ((16 threads time) x 16/17) + ((16 threads time) x 16/17)
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17) 
   ((16 threads time) x 16/17)
 
18 threads should have taken approximately the same time as 17 threads:
 
--- time
------------------------------------------------------------------------>
   ((16 threads time) x 16/18) + ((16 threads time) x 16/18)
   ((16 threads time) x 16/18) + ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18) 
   ((16 threads time) x 16/18)
 
And the run time should decrease from 17 threads up to 32 threads, which
should yield approximately the same run time as 16 threads, extended by any
adverse cache interactions:
 
--- time
------------------------------------------------------------------------>
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32) 
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32)  
   ((16 threads time) x 16/32) + ((16 threads time) x 16/32) 
 
The code in this test application attempts to distribute the work evenly. So
going from full subscription (16 threads) to full subscription + 1 (17
threads) would effectively produce 2x a slightly faster run time (as
extrapolated from the under-subscription to full-subscription curve).
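
As a back-of-the-envelope check (my sketch, not QuickThread code), the
expectation above can be modeled like this: each of N software threads gets
1/N of the work, one chunk takes (16-thread time) x 16/N, and with affinity
pinning the critical path is ceil(N/16) chunks run back to back on the most
heavily loaded hardware thread:

// Hypothetical model of the expected oversubscription timings discussed
// above: evenly divided work, affinity-pinned threads, no cache effects.
// The 14.398s constant is the measured 16-thread ROI time from the table
// further down in this thread.
#include <cstdio>

int main() {
    const double t16 = 14.398;     // measured 16-thread time (seconds)
    const int hw = 16;             // hardware threads on the machine
    const int counts[] = {16, 17, 18, 32};

    for (int n : counts) {
        double chunk = t16 * hw / n;       // one thread's share of the work
        int waves = (n + hw - 1) / hw;     // ceil(n / hw) chunks on the busiest HW thread
        std::printf("%2d threads: ~%5.1f s  (%d x %.1f s chunks)\n",
                    n, waves * chunk, waves, chunk);
    }
    return 0;
}

Under that model, 17 threads would land around 27s and 32 threads back near
the 16-thread time, so the measured 41s at 17 threads (and especially 553s at
18) is well outside what even distribution alone would predict.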
 
The QuickThread system is currently in beta test, so there could be an
oversight in the code with respect to oversubscription.
 
Note: for applications with a blend of I/O and compute tasks, QuickThread has
two classes of thread pools: compute and I/O. The general rule of thumb is to
not oversubscribe the compute-class pool. I/O-class threads are not affinity
pinned and would not have the issue observed here.
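
For readers who have not used affinity pinning, here is a minimal generic
sketch of the distinction (plain pthreads on Linux, not the QuickThread API):
a compute worker pins itself to one logical CPU, while an I/O worker is left
unpinned so the OS may migrate it freely while it blocks.

// Generic illustration only -- QuickThread manages this internally.
// Build with: g++ -pthread affinity_sketch.cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void* compute_worker(void* arg) {
    int cpu = *static_cast<int*>(arg);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Pin this thread to one logical CPU so it always runs on the same
    // hardware thread (and keeps its cache working set local).
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    // ... compute work would run here ...
    return nullptr;
}

static void* io_worker(void*) {
    // No affinity call: the OS may schedule this thread on any CPU, which
    // is fine for threads that spend most of their time blocked on I/O.
    return nullptr;
}

int main() {
    int cpu0 = 0;
    pthread_t c, io;
    pthread_create(&c, nullptr, compute_worker, &cpu0);
    pthread_create(&io, nullptr, io_worker, nullptr);
    pthread_join(c, nullptr);
    pthread_join(io, nullptr);
    return 0;
}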
 
Jim
 
 


From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Major
Bhadauria
Sent: Thursday, June 10, 2010 12:22 PM
To: PARSEC Users
Subject: Re: [parsec-users] Fluid animate


Thanks Jim, the note 1-3 results look quite interesting.

The 18-thread degradation level is very surprising; it might be an anomaly.
It'd be great if you could later verify that there was no other activity on
the system.

Regards,

-Major


2010/6/10 Jim Dempsey <jim at quickthreadprogramming.com>


Chris and others:
 
I got some time on a Dell R610 with dual Intel Xeon 5570 processors.
The readers of this mailing list might find it of interest.
 
Results from running fluidanimate with in_500K.fluid and 100 iterations.
Run times using the QuickThread threading toolkit:
 
Threads
1  Total time spent in ROI:         92.494s  1.0000x
2  Total time spent in ROI:         48.265s  1.9164x
3  Total time spent in ROI:         35.771s  2.5857x
4  Total time spent in ROI:         28.770s  3.2149x
5  Total time spent in ROI:         23.912s  3.8681x
6  Total time spent in ROI:         21.912s  4.2212x
7  Total time spent in ROI:         20.918s  4.4217x
8  Total time spent in ROI:         18.428s  5.0192x
9  Total time spent in ROI:         18.897s  4.8946x * note 1
10 Total time spent in ROI:         18.396s  5.0279x
11 Total time spent in ROI:         18.002s  5.1380x
12 Total time spent in ROI:         17.991s  5.1411x
13 Total time spent in ROI:         17.946s  5.1540x
14 Total time spent in ROI:         16.071s  5.7553x
15 Total time spent in ROI:         16.057s  5.7604x
16 Total time spent in ROI:         14.398s  6.4241x
17 Total time spent in ROI:         41.042s  2.2536x ** note 2
18 Total time spent in ROI:        553.489s  0.1671x ** note 3
 
Each processor has 4 cores with HyperThreading, for a total of 8 cores and 16
hardware threads.
fluidanimate is a floating-point and memory-access intensive application.
 
Note 1:
On this configuration, QuickThread distributes work to the cores first, then
back-fills the HyperThread siblings second.
The result is a fairly steady slope from 1 thread to 8 threads (the full set
of cores), then a shallower slope as the HT siblings are filled in.
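
To illustrate what that placement policy might look like (a hypothetical
sketch, not QuickThread's actual scheduler, and assuming logical CPUs 2c and
2c+1 are the HyperThread siblings of physical core c, which is
machine-dependent):

// Hypothetical "cores first, HT siblings second" placement order for the
// 8-core / 16-hardware-thread machine described above.  The sibling
// numbering (2c, 2c+1) is an assumption; real layouts vary.
#include <cstdio>
#include <vector>

int main() {
    const int cores = 8;
    std::vector<int> order;
    for (int c = 0; c < cores; ++c) order.push_back(2 * c);      // one HW thread per physical core
    for (int c = 0; c < cores; ++c) order.push_back(2 * c + 1);  // then back-fill the HT siblings
    // Worker i is pinned to order[i]: with 8 or fewer workers every worker
    // has a physical core to itself; workers 9..16 share cores with a sibling.
    for (size_t i = 0; i < order.size(); ++i)
        std::printf("worker %2zu -> logical CPU %2d\n", i, order[i]);
    return 0;
}

This is why the speedup curve flattens after 8 threads: the additional
workers only add HT sibling capacity, not whole cores.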
 
Note 2:
At 17 threads we have oversubscription of threads. Note the adverse effect
on cache.
 
Note 3:
At 18 threads, the adverse effect on cache appears to compound dramatically
(553s vs. 41s at 17 threads).
Additional run data would provide some insight, as would profiling.
 
The above results were from one set of test runs on a remote system; in other
words, I could not verify that no other activity was present on the system.
 
Jim Dempsey

 




