[parsec-users] Freqmine Question Help

Joseph Greathouse jlgreath at umich.edu
Wed Aug 8 17:49:13 EDT 2012


Hi Raghav,

I did not include an output file in my runs because none is used when 
the benchmark is run through the parsecmgmt script (see the *.runconf 
files in pkgs/apps/freqmine/parsec, or run parsecmgmt and note the 
line, e.g., "[PARSEC] Running 'time {directories}/freqmine 
kosarak_250k.dat 220':").

You're right that the printing functions (printSet and printset, which 
are called from e.g. powerset, which is called by every thread in the 
OMP parallel region) do not wait on any threads to finish. In fact, they 
are not called in a thread-safe manner in this program. If you try to 
use an output file when OMP_NUM_THREADS>1, your output will be garbled 
as multiple threads will try to write into the fprintf buffer at the 
same time.
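
To make the failure mode concrete, here is a minimal sketch of that 
pattern (not freqmine's actual code; the loop and format strings are 
made up for illustration). Every thread writes a two-part record to the 
same FILE*, and nothing orders the calls between threads:

// build with: g++ -fopenmp race_demo.cpp
#include <cstdio>

int main() {
    FILE *fout = fopen("out.txt", "w");
    if (fout == NULL) return 1;
    // All threads share one stream. Each fprintf call is internally
    // locked, but another thread's output can land between the two
    // halves of a record, garbling the file.
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        fprintf(fout, "pattern %d:", i);
        fprintf(fout, " count %d\n", i * 2);
    }
    fclose(fout);
    return 0;
}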

I'm not entirely sure why this would cause an increase in runtime 
(passing the file lock around? cache/TLB ping-pong due to the fprintf 
buffer being moved?), but suffice it to say, you shouldn't be comparing 
runtimes when printing an output file.
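
If you do want to time runs that produce output, one workaround (my 
suggestion, not something freqmine implements) is to give each thread a 
private buffer and do a single serial write after the parallel region, 
so the measured section never touches a shared stream:

// build with: g++ -fopenmp buffered_demo.cpp
#include <cstdio>
#include <string>
#include <vector>
#include <omp.h>

int main() {
    // One private buffer per OpenMP thread.
    std::vector<std::string> bufs(omp_get_max_threads());
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        char line[64];
        snprintf(line, sizeof(line), "pattern %d: count %d\n", i, i * 2);
        bufs[omp_get_thread_num()] += line;  // no shared state touched
    }
    // Serial write after the timed region: no lock contention and no
    // interleaving, at the cost of buffering everything in memory.
    FILE *fout = fopen("out.txt", "w");
    if (fout == NULL) return 1;
    for (size_t t = 0; t < bufs.size(); t++)
        fputs(bufs[t].c_str(), fout);
    fclose(fout);
    return 0;
}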

As for your other problem, however, where you are seeing poor scaling 
even without printing to an output file:

The runtimes still make very little sense. It's strange to see runtimes 
randomly go up and down like that. It's also strange that your 
single-threaded performance is so bad. On my RHEL 5.8 box with a 
generation-older core than yours, my single-threaded FPgrowth time is 
around 11.5 seconds. On a RHEL6.3 box with a Core 2 Q6600 (GCC 4.4.6-4), 
I see single-threaded FPgrowth times of about 11.3 seconds. I wonder why 
you see 23.5 seconds?

Simple question first: are you the only one using this box right now? Is 
anyone else logged in? Are they running anything that would take up CPU 
time?

Second question: Why are you using GCC 4.7.0? The version that comes 
with RHEL6.3 is 4.4.6, AFAIK. Is this a version you compiled yourself? 
What happens to your results when you use the default RHEL version?
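
One quick way to rule out toolchain confusion is to print the compiler 
version the binary was actually built with; __VERSION__ is a GCC 
built-in macro, so a trivial test program suffices:

#include <cstdio>

int main() {
    // __VERSION__ expands at compile time to the version string of the
    // compiler that produced this binary.
    printf("Compiled with GCC %s\n", __VERSION__);
    return 0;
}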

-Joe

On 8/8/2012 4:59 PM, Raghav Mohan wrote:
> I am using RHEL 6.3 and GCC 4.7.0, and OpenMP is enabled: if I run top alongside the program, I can see the program's %CPU usage, which matches the number of threads I provide. Here is the requested output.
>
> COMMAND: ./freqmine kosarak_990k.dat 790
>
> OUTPUT:
> NUMTHREADS: 1 the data preparation cost 0.595663 seconds, the FPgrowth cost 23.579778 seconds
> NUMTHREADS: 2 the data preparation cost 0.595737 seconds, the FPgrowth cost 29.178112 seconds
> NUMTHREADS: 3 the data preparation cost 0.595698 seconds, the FPgrowth cost 24.463154 seconds
> NUMTHREADS: 4 the data preparation cost 0.595958 seconds, the FPgrowth cost 20.679875 seconds
> NUMTHREADS: 5 the data preparation cost 0.631013 seconds, the FPgrowth cost 21.178104 seconds
> NUMTHREADS: 6 the data preparation cost 0.595853 seconds, the FPgrowth cost 19.078028 seconds
> NUMTHREADS: 7 the data preparation cost 0.598170 seconds, the FPgrowth cost 17.646492 seconds
> NUMTHREADS: 8 the data preparation cost 0.597291 seconds, the FPgrowth cost 18.438906 seconds
> NUMTHREADS: 9 the data preparation cost 0.596892 seconds, the FPgrowth cost 17.640142 seconds
> NUMTHREADS: 10 the data preparation cost 0.601513 seconds, the FPgrowth cost 16.806145 seconds
> NUMTHREADS: 11 the data preparation cost 0.597656 seconds, the FPgrowth cost 17.051052 seconds
> NUMTHREADS: 12 the data preparation cost 0.600122 seconds, the FPgrowth cost 15.583760 seconds
> NUMTHREADS: 13 the data preparation cost 0.601045 seconds, the FPgrowth cost 16.162628 seconds
> NUMTHREADS: 14 the data preparation cost 0.598893 seconds, the FPgrowth cost 15.565976 seconds
> NUMTHREADS: 15 the data preparation cost 0.599190 seconds, the FPgrowth cost 15.765923 seconds
> NUMTHREADS: 16 the data preparation cost 0.600952 seconds, the FPgrowth cost 15.196432 seconds
> NUMTHREADS: 17 the data preparation cost 0.601894 seconds, the FPgrowth cost 14.385916 seconds
> NUMTHREADS: 18 the data preparation cost 0.601292 seconds, the FPgrowth cost 15.297303 seconds
> NUMTHREADS: 19 the data preparation cost 0.609123 seconds, the FPgrowth cost 15.814151 seconds
> NUMTHREADS: 20 the data preparation cost 0.599771 seconds, the FPgrowth cost 16.419628 seconds
> NUMTHREADS: 21 the data preparation cost 0.601651 seconds, the FPgrowth cost 15.231015 seconds
> NUMTHREADS: 22 the data preparation cost 0.602804 seconds, the FPgrowth cost 14.558048 seconds
>
> So running this without the output file gives some speedup, but not of the magnitude that you attached in your email or reported in the papers. (The best case above is only about 1.6x over single-threaded; I would expect a speedup of at least 4 in the best case.)
> I noticed that you do not provide an output file in your run. This drastically changes my results, as running this with the output file, I get
>
> COMMAND: ./freqmine kosarak_250k.dat 220 /scratch/mohan/out.txt
>
>
> OUTPUT:
> NUMTHREADS: 1   the data preparation cost 0.161744 seconds, the FPgrowth cost 2.922024 seconds
> NUMTHREADS: 2   the data preparation cost 0.190038 seconds, the FPgrowth cost 4.751599 seconds
> NUMTHREADS: 3   the data preparation cost 0.161909 seconds, the FPgrowth cost 6.822731 seconds
> NUMTHREADS: 4   the data preparation cost 0.161746 seconds, the FPgrowth cost 7.654892 seconds
> NUMTHREADS: 5   the data preparation cost 0.163237 seconds, the FPgrowth cost 8.025010 seconds
> NUMTHREADS: 6   the data preparation cost 0.162682 seconds, the FPgrowth cost 8.104605 seconds
> NUMTHREADS: 7   the data preparation cost 0.163109 seconds, the FPgrowth cost 7.985950 seconds
> NUMTHREADS: 8   the data preparation cost 0.162191 seconds, the FPgrowth cost 8.088410 seconds
> NUMTHREADS: 9   the data preparation cost 0.162928 seconds, the FPgrowth cost 8.148432 seconds
> NUMTHREADS: 10  the data preparation cost 0.192140 seconds, the FPgrowth cost 8.509589 seconds
> NUMTHREADS: 11  the data preparation cost 0.167107 seconds, the FPgrowth cost 8.685088 seconds
> NUMTHREADS: 12  the data preparation cost 0.162842 seconds, the FPgrowth cost 9.417641 seconds
>
> Looking at the code, I see that the only difference is that in the last routine, FP_growth, fout is NULL vs. not. However, this routine is not threaded and does not wait for any threads to complete execution, so I am curious why this computation time increases with the number of threads. Again, I apologize if I am missing something obvious here.
>
>
> Thank you for your prompt responses and help,
> Raghav
>
> On 08/08/12, Joseph Greathouse wrote:
>> On 8/8/2012 3:04 PM, Raghav Mohan wrote:
>>> Hi,
>>>
>>> I am trying to parallelize the Freqmine (PARSEC v2.1) benchmark with my own parallel library instead of OpenMP. I ran the freqmine benchmark and compared the results of the sequential and OpenMP versions. I would expect the OpenMP time to be drastically lower; instead, it keeps increasing with the number of threads (essentially reverse speedup). I am running freqmine on a hyper-threaded Intel Xeon E5620 machine, which has 8 physical cores that are hyper-threaded, giving 16 hardware threads. Here are the sample results:
>>>
>>>
>>> Command:
>>> ./freqmine kosarak_250k.dat 220 out.txt
>>>
>>>
>>>
>>> Sequential Version Result :
>>> the data preparation cost 0.163102 seconds, the FPgrowth cost 2.720993 seconds
>>>
>>>
>>> OMP Version Result (16 threads):
>>> the data preparation cost 0.191582 seconds, the FPgrowth cost 9.168250 seconds
>>>
>>>
>>>
>>>
>>> As one can see, the FPgrowth cost for the threaded version is about 4 times higher than for the sequential one. This behavior is replicated for all inputs.
>>>
>>>
>>> I apologize if I am missing something or interpreting the results incorrectly and this is the expected behavior; however, I read the manual and could not find any information on this.
>>> Any help provided is more than greatly appreciated.
>>>
>>>
>>> Thank you.
>>
>> Hi Raghav,
>>
>> I agree with Yungang; those numbers appear strange. I've attached outputs from a few freqmine runs on a Xeon E5520 (which is a Nehalem-based core rather than a Westmere-based core like yours, but otherwise also has 8 physical cores and 16 virtual cores). This is running on RHEL 5.8, compiled with GCC 4.1.2 (Red Hat patch 52).
>>
>> As you can see, adding more threads gives a steady decrease in runtime.
>>
>> What OS and compiler are you using? What environment variables are set?
>>
>> Also, you showed the FPgrowth output for the serial version and your 16-thread version. Could you show us the outputs of the 2-, 4-, and 8-threaded versions as well?
>>
>> -Joe
>>
>> -----------------------------------------------
>>
>> bash-3.2$ cd ../inst/amd64-linux.gcc-serial/bin/
>> bash-3.2$ time ./freqmine ../../../inputs/webdocs_250k.dat 11000
>> ...
>> the data preparation cost 4.136187 seconds, the FPgrowth cost 935.675228 seconds
>>
>> real 15m39.923s
>> user 15m38.940s
>> sys 0m0.729s
>>
>> bash-3.2$ cd ../../amd64-linux.gcc-openmp/bin/
>> bash-3.2$ OMP_NUM_THREADS=4
>> bash-3.2$ export OMP_NUM_THREADS
>> bash-3.2$ time ./freqmine ../../../inputs/webdocs_250k.dat 11000
>> ...
>> the data preparation cost 4.151570 seconds, the FPgrowth cost 215.969161 seconds
>>
>> real 3m40.163s
>> user 14m26.022s
>> sys 0m0.891s
>>
>> bash-3.2$ OMP_NUM_THREADS=8
>> bash-3.2$ export OMP_NUM_THREADS
>> bash-3.2$ time ./freqmine ../../../inputs/webdocs_250k.dat 11000
>> ...
>> the data preparation cost 4.094214 seconds, the FPgrowth cost 116.869059 seconds
>>
>> real 2m0.972s
>> user 15m21.003s
>> sys 0m1.030s
>>
>> bash-3.2$ OMP_NUM_THREADS=16
>> bash-3.2$ export OMP_NUM_THREADS
>> bash-3.2$ time ./freqmine ../../../inputs/webdocs_250k.dat 11000
>> ...
>> the data preparation cost 4.145387 seconds, the FPgrowth cost 92.685972 seconds
>>
>> real 1m36.841s
>> user 21m38.168s
>> sys 0m1.801s

