[parsec-users] multithread vs single thread

Yungang Bao ybao at CS.Princeton.EDU
Tue Apr 19 16:57:30 EDT 2011


I have run blackscholes on Simics and got results similar to yours (see the logs below). As Jim mentioned, the reason the cycle count decreases is most likely the reduced number of cache transactions, together with the growing number of cache lines shared among multiple CPUs.

But I have no idea why the number of cache transactions declines. It is probably due not to blackscholes itself but to libm. When I used OProfile to collect performance counters on a real x86_64 Linux machine, I found that over 60% of the cycles were spent in libm, in mathematical functions such as sqrt, log and exp.
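
For context, the per-option work in blackscholes is essentially one evaluation of the Black-Scholes formula, which leans heavily on libm. A minimal sketch of that computation (not the PARSEC source; variable names are illustrative, and cndf() uses erf() for brevity where the benchmark uses its own polynomial approximation built on exp()):

#include <math.h>

/* Cumulative normal distribution; every call costs an erf()/exp()-class
   libm evaluation. */
static double cndf(double x)
{
    return 0.5 * (1.0 + erf(x / sqrt(2.0)));
}

/* Black-Scholes call price: each option priced requires log(), sqrt()
   and exp(), which is why libm dominates the profile. */
double bs_call(double S, double K, double r, double v, double T)
{
    double d1 = (log(S / K) + (r + 0.5 * v * v) * T) / (v * sqrt(T));
    double d2 = d1 - v * sqrt(T);
    return S * cndf(d1) - K * exp(-r * T) * cndf(d2);
}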


========================================
1p
========================================
<con0>[HOOKS] Entering ROI
Leaving ROI
CPU 0 : 4322123714 cycles
TOTAL CYCLE COUNT: 4322123714

Cache Info: cache
-----------
Number of cache lines : 131072
Cache line size       : 64 bytes
Total cache size      : 8192 kbytes
Associativity         : 4
Index                 : physical
Tag                   : physical
Write allocate        : yes
Write policy          : write-back
Replacement policy    : lru

Connected to CPUs     : cpu0 
Next level cache      : <the trans-staller 'staller'>

Read penalty          : 0 cycle
Read-next penalty     : 0 cycle
Write penalty         : 0 cycle
Write-next penalty    : 0 cycle


Cache statistics: cache
-----------------

                Total number of transactions:      18271484

                     Device data reads (DMA):           925
                    Device data writes (DMA):          1880

                      Uncacheable data reads:           308
                     Uncacheable data writes:           278
             Uncacheable instruction fetches:             0

                      Copy back transactions:        188885

                                   Load Hits:      10243066
                                 Load Misses:         76241
                               Load Accesses:      10319307
                              Load Miss Rate:          0.74%

                                  Store Hits:       7759393
                                Store Misses:        189393
                              Store Accesses:       7948786
                             Store Miss Rate:          2.38%

                                   Inst Hits:             0
                                 Inst Misses:             0
                               Inst Accesses:             0
                              Inst Miss Rate:          0.00%

                                  Total Hits:      18002459
                                Total Misses:        265634
                              Total Accesses:      18268093
                             Total Miss Rate:          1.45%

               Lines read shared by   0 CPUs:         52500
               Lines read shared by   1 CPUs:         78572

              Lines write shared by   0 CPUs:         36106
              Lines write shared by   1 CPUs:         94966

            Lines totally shared by   0 CPUs:         21580
            Lines totally shared by   1 CPUs:        109492

         Loads from lines shared by   0 CPUs:         87443
         Loads from lines shared by   1 CPUs:      10231864

          Stores to lines shared by   0 CPUs:        197858
          Stores to lines shared by   1 CPUs:       7750928

    True loads from lines shared by   0 CPUs:             0
    True loads from lines shared by   1 CPUs:             0

     True stores to lines shared by   0 CPUs:             0
     True stores to lines shared by   1 CPUs:             0

                    Loads not issued by CPUs:             0
                   Stores not issued by CPUs:             0

Aborting, simulation of ROI complete
[cpu0] v:0x00000000ff3806e8 p:0x000ddb986e8  magic (sethi 0x40000, %g0)


========================================
2p
========================================
<con0>[HOOKS] Entering ROI
Leaving ROI
CPU 0 : 1832838877 cycles
CPU 1 : 1832838532 cycles
TOTAL CYCLE COUNT: 3665677409

Cache Info: cache
-----------
Number of cache lines : 131072
Cache line size       : 64 bytes
Total cache size      : 8192 kbytes
Associativity         : 4
Index                 : physical
Tag                   : physical
Write allocate        : yes
Write policy          : write-back
Replacement policy    : lru

Connected to CPUs     : cpu0 cpu1 
Next level cache      : <the trans-staller 'staller'>

Read penalty          : 0 cycle
Read-next penalty     : 0 cycle
Write penalty         : 0 cycle
Write-next penalty    : 0 cycle


Cache statistics: cache
-----------------

                Total number of transactions:      14201827

                     Device data reads (DMA):           911
                    Device data writes (DMA):          1002

                      Uncacheable data reads:           298
                     Uncacheable data writes:           303
             Uncacheable instruction fetches:             0

                      Copy back transactions:        125705

                                   Load Hits:       9026930
                                 Load Misses:         50217
                               Load Accesses:       9077147
                              Load Miss Rate:          0.55%

                                  Store Hits:       5007494
                                Store Misses:        114672
                              Store Accesses:       5122166
                             Store Miss Rate:          2.24%

                                   Inst Hits:             0
                                 Inst Misses:             0
                               Inst Accesses:             0
                              Inst Miss Rate:          0.00%

                                  Total Hits:      14034424
                                Total Misses:        164889
                              Total Accesses:      14199313
                             Total Miss Rate:          1.16%

               Lines read shared by   0 CPUs:         60682
               Lines read shared by   1 CPUs:         48447
               Lines read shared by   2 CPUs:         21943

              Lines write shared by   0 CPUs:         47337
              Lines write shared by   1 CPUs:         67560
              Lines write shared by   2 CPUs:         16175

            Lines totally shared by   0 CPUs:         28367
            Lines totally shared by   1 CPUs:         68890
            Lines totally shared by   2 CPUs:         33815

         Loads from lines shared by   0 CPUs:         62300
         Loads from lines shared by   1 CPUs:       1283534
         Loads from lines shared by   2 CPUs:       7731313

          Stores to lines shared by   0 CPUs:        122437
          Stores to lines shared by   1 CPUs:       2234123
          Stores to lines shared by   2 CPUs:       2765606

    True loads from lines shared by   0 CPUs:          5189
    True loads from lines shared by   1 CPUs:         18650
    True loads from lines shared by   2 CPUs:       2504735

     True stores to lines shared by   0 CPUs:          3542
     True stores to lines shared by   1 CPUs:          7823
     True stores to lines shared by   2 CPUs:         70696

                    Loads not issued by CPUs:             0
                   Stores not issued by CPUs:             0

Aborting, simulation of ROI complete
[cpu1] v:0x00000000ff3806e8 p:0x001f6f986e8  magic (sethi 0x40000, %g0)

========================================
4p
========================================
<con0>[HOOKS] Entering ROI
Leaving ROI
CPU 0 : 807572000 cycles
CPU 1 : 807572000 cycles
CPU 2 : 807571578 cycles
CPU 3 : 807571521 cycles
TOTAL CYCLE COUNT: 3230287099

Cache Info: cache
-----------
Number of cache lines : 131072
Cache line size       : 64 bytes
Total cache size      : 8192 kbytes
Associativity         : 4
Index                 : physical
Tag                   : physical
Write allocate        : yes
Write policy          : write-back
Replacement policy    : lru

Connected to CPUs     : cpu0 cpu1 cpu2 cpu3 
Next level cache      : <the trans-staller 'staller'>

Read penalty          : 0 cycle
Read-next penalty     : 0 cycle
Write penalty         : 0 cycle
Write-next penalty    : 0 cycle


Cache statistics: cache
-----------------

                Total number of transactions:       6341940

                     Device data reads (DMA):           184
                    Device data writes (DMA):           448

                      Uncacheable data reads:           162
                     Uncacheable data writes:           152
             Uncacheable instruction fetches:             0

                      Copy back transactions:         58868

                                   Load Hits:       4288398
                                 Load Misses:         25724
                               Load Accesses:       4314122
                              Load Miss Rate:          0.60%

                                  Store Hits:       1976251
                                Store Misses:         50621
                              Store Accesses:       2026872
                             Store Miss Rate:          2.50%

                                   Inst Hits:             0
                                 Inst Misses:             0
                               Inst Accesses:             0
                              Inst Miss Rate:          0.00%

                                  Total Hits:       6264649
                                Total Misses:         76345
                              Total Accesses:       6340994
                             Total Miss Rate:          1.20%

               Lines read shared by   0 CPUs:         83741
               Lines read shared by   1 CPUs:         31076
               Lines read shared by   2 CPUs:          6004
               Lines read shared by   3 CPUs:          2820
               Lines read shared by   4 CPUs:          7431

              Lines write shared by   0 CPUs:         75648
              Lines write shared by   1 CPUs:         42393
              Lines write shared by   2 CPUs:          5083
              Lines write shared by   3 CPUs:          2768
              Lines write shared by   4 CPUs:          5180

            Lines totally shared by   0 CPUs:         58952
            Lines totally shared by   1 CPUs:         50357
            Lines totally shared by   2 CPUs:          8988
            Lines totally shared by   3 CPUs:          4048
            Lines totally shared by   4 CPUs:          8727

         Loads from lines shared by   0 CPUs:         36913
         Loads from lines shared by   1 CPUs:        710306
         Loads from lines shared by   2 CPUs:        363456
         Loads from lines shared by   3 CPUs:        316635
         Loads from lines shared by   4 CPUs:       2886812

          Stores to lines shared by   0 CPUs:         57111
          Stores to lines shared by   1 CPUs:       1060647
          Stores to lines shared by   2 CPUs:        223041
          Stores to lines shared by   3 CPUs:        162582
          Stores to lines shared by   4 CPUs:        523491

    True loads from lines shared by   0 CPUs:          7031
    True loads from lines shared by   1 CPUs:         14542
    True loads from lines shared by   2 CPUs:         15708
    True loads from lines shared by   3 CPUs:         22776
    True loads from lines shared by   4 CPUs:        490066

     True stores to lines shared by   0 CPUs:          4285
     True stores to lines shared by   1 CPUs:          5019
     True stores to lines shared by   2 CPUs:          5464
     True stores to lines shared by   3 CPUs:          6523
     True stores to lines shared by   4 CPUs:         34113

                    Loads not issued by CPUs:             0
                   Stores not issued by CPUs:             0

Aborting, simulation of ROI complete
[cpu2] v:0x00000000ff3806e8 p:0x003dab986e8  magic (sethi 0x40000, %g0)


Yungang

----- Original Message -----
From: "Mahmood Naderan" <nt_mahmood at yahoo.com>
To: "PARSEC Users" <parsec-users at lists.cs.princeton.edu>
Sent: Tuesday, April 19, 2011 1:46:58 PM
Subject: Re: [parsec-users] multithread vs single thread

What I understand from your response is that the difference in execution time (elapsed cycles) depends on the cache hit ratio and on compiler optimizations. They are not the only factors, but I will discuss these two.

The systems on which I ran the single-thread and the 4-thread versions are exactly the same except for the number of processors.
Also, I ran the precompiled SPARC binaries provided by the developers.

./blackscholes 1 in_64K.txt prices.txt
./blackscholes 4 in_64K.txt prices.txt

Sorry, I forgot to attach the log. You can find the logs now.
I would be glad if someone could test this example and compare the results.

> (except for those controlling threading).
What are they?

// Naderan *Mahmood;



----- Original Message ----
From: Jim Dempsey <jim at quickthreadprogramming.com>
To: PARSEC Users <parsec-users at lists.cs.princeton.edu>
Sent: Tue, April 19, 2011 8:42:20 PM
Subject: Re: [parsec-users] multithread vs single thread

Elapsed time, and potentially cycles, can be reduced by higher L1 and L2
cache hit ratios. If the single-thread version has lower cache hit ratios,
then its counts will be higher. Also check that your optimization
switches are the same (except for those controlling threading).

Jim 

-----Original Message-----
From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Mahmood
Naderan
Sent: Tuesday, April 19, 2011 10:28 AM
To: PARSEC Users
Subject: Re: [parsec-users] multithread vs single thread

>Perhaps the 7.5B cycles includes the cycles outside the ROI (region of 
>interest).
 
No. Please find the attached log.

What I want to know is: how does the program's elapsed cycle count change
when the number of threads is increased?
That is, do we expect a reduction in the cycle count? What I observed is that,
for a given input size (64K in blackscholes), running the benchmark with 4
threads gives a lower cycle count (850M cycles per core) than a single thread
(7.5B cycles).

How can one explain that reduction?
Thanks,
// Naderan *Mahmood;



----- Original Message ----
From: Jim Dempsey <jim at quickthreadprogramming.com>
To: PARSEC Users <parsec-users at lists.cs.princeton.edu>
Sent: Tue, April 19, 2011 6:59:17 PM
Subject: Re: [parsec-users] multithread vs single thread

Perhaps the 7.5B cycles includes the cycles outside the ROI (region of
interest).

Try calling RDTSC from each thread to read the simulated clock counter.
Place the calls at the start and end of the do-work function that all
threads pass through. Each thread can compute its own number of clock
ticks, then issue an interlocked add to produce a summation. This will give
you the simulated clock ticks but not the instruction counts. If you want
instruction counts, you will have to use the performance counters
via RDPMC (you will have to do some googling to find an example).
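
A rough sketch of that scheme, assuming x86 and GCC (the __rdtsc() intrinsic and __sync_fetch_and_add() are the moving parts; run_options_chunk() is a hypothetical stand-in for the per-thread workload):

#include <stdint.h>
#include <x86intrin.h>                     /* __rdtsc() */

extern void run_options_chunk(void *arg);  /* hypothetical per-thread work */

static uint64_t total_ticks = 0;           /* shared summation */

void *do_work(void *arg)
{
    uint64_t start = __rdtsc();            /* read TSC at entry */

    run_options_chunk(arg);

    uint64_t ticks = __rdtsc() - start;    /* this thread's tick count */
    __sync_fetch_and_add(&total_ticks, ticks);  /* interlocked add */
    return NULL;
}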

Simics may have a C function call that you can insert into your code to
control the collection of counter data. If it does, that may be the way
to go.
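
Simics's mechanism for this is the "magic instruction" (the sethi 0x40000, %g0 in the logs above is exactly that; the PARSEC hooks use it to mark the ROI). A sketch of bracketing a region with it, assuming the Simics header is on the target's include path and that a simulator-side script is attached to the magic breakpoints to start and stop counter collection:

#include <simics/magic-instruction.h>  /* ships with Simics */

extern void do_the_work(void);         /* hypothetical region of interest */

void timed_region(void)
{
    MAGIC(1);      /* traps to the simulator, e.g. to start counters */
    do_the_work();
    MAGIC(2);      /* traps again, e.g. to stop counters and dump stats */
}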

Jim Dempsey

-----Original Message-----
From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of Mahmood
Naderan
Sent: Tuesday, April 19, 2011 12:49 AM
To: PARSEC
Subject: [parsec-users] multithread vs single thread

Hi,
When I run blackscholes with one thread, the total cycles reported by Simics
are about 7.5B. However, when I run it with 4 threads, each core runs for
850M cycles (3.3B cycles in total). Is that normal? Does this
multithreading represent parallel execution?

thanks, 

// Naderan *Mahmood;

_______________________________________________
parsec-users mailing list
parsec-users at lists.cs.princeton.edu
https://lists.cs.princeton.edu/mailman/listinfo/parsec-users