[parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues

Jim Dempsey jim at quickthreadprogramming.com
Sat Aug 13 09:05:07 EDT 2011


Take a look at:
http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/win/intref_cls/common/intref_bk_trig_ops.htm
 
This lists the intrinsics for trigonometric operations. The page is for the
Intel C++ compiler, but MSVC and GCC support most of the same intrinsics.
 
    __m128d == double | double                                   two doubles at a time
    __m128  == float | float | float | float                     four floats at a time
    __m256d == double | double | double | double                 four doubles at a time
    __m256  == float | float | float | float | float | float | float | float
                                                                  eight floats at a time
 
*** If you have an Intel Core i7 2600(K) and an OS that supports AVX (Windows
7 SP1, recent Linux kernels), and your compiler supports AVX, then you can use
the __m256d and __m256 variants.
 
 
extern __m128d _mm_sincos_pd(__m128d *p_cos, __m128d v1);    // 2 doubles
extern __m128 _mm_sincos_ps(__m128 *p_cos, __m128 v1);    // 4 floats
 
The iteration count nSSEvectors varies depending on the number of doubles or
floats held in the chosen small vector.
 
   __m128 has 4 floats
    __m128d has 2 doubles
    __m256d has 4 doubles (only with AVX support)
    __m256 has 8 floats (only with AVX support)
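 
To answer the float question directly, here is a minimal sketch using the
single-precision SVML intrinsic _mm_sincos_ps (provided by the Intel compiler;
other compilers would need to link against SVML). With 4 floats per __m128,
nSSEvectors would be (no_particles * 3 + 3) / 4, padding the tail. The name
doSinCosF is just for the sketch:
 
#include <immintrin.h>   // __m128; _mm_sincos_ps is an SVML intrinsic
 
// Single-precision variant: each __m128 holds 4 floats, so
// nSSEvectors = (no_particles * 3 + 3) / 4 (pad the tail with 0.0f).
void doSinCosF(__m128* p_X, __m128* p_sin, __m128* p_cos, int nSSEvectors)
{
    for (int i = 0; i < nSSEvectors; ++i)
        p_sin[i] = _mm_sincos_ps(&p_cos[i], p_X[i]);
}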
 
Most C++ compilers generate SSE code now, and if the data is laid out as a
stream, the optimizer will vectorize the loop 2-up or 4-up (or 4-up or 8-up if
AVX is supported).
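 
As a minimal illustration (the function and array names here are only for the
sketch, not from bodytrack), a loop like the following is what the vectorizer
wants to see; ICC (with SVML), or a newer GCC with a vector math library and
-ffast-math, can convert it to SSE/AVX code:
 
#include <math.h>
 
// Scalar loop over contiguous streams of angles, sines and cosines.
// With this layout the compiler can vectorize the loop 4-up (SSE) or
// 8-up (AVX), provided it has a vectorized math library to call into.
void sincos_stream(const float* angles, float* sines, float* cosines, int n)
{
    for (int i = 0; i < n; ++i) {
        sines[i]   = sinf(angles[i]);
        cosines[i] = cosf(angles[i]);
    }
}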
 
Intel has other architectures coming down the line that will support 512-bit
and perhaps 1024-bit small vectors. These might not show up in mainstream
computing.
 
As for: "my supervisor is more interested in finding a way out to correct the
sin/cos calculations on GPU"
 
While this is laudable, it may not be practical for you to do this. The
proper individuals to do this are the people providing the tools and library
kernels. For you to do this is not only a duplication of effort but a
misuse of your time.
 
1) I assume you are already a proficient x86 C++ programmer.
2) Your GPGPU does not run standalone (i.e. you are using an x86-based O/S).
3) Your application is already hybrid (parts in x86, parts in GPGPU).
4) None of 1-3 is going to change while you write your thesis.
5) Efficiently integrating the GPGPU app with SSE and parallel programming
techniques is an effective approach.
 
Jim Dempsey
 
 

  _____  

From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of aftab
hussain
Sent: Friday, August 12, 2011 7:40 AM
To: PARSEC Users
Subject: Re: [parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues


Thanks Jim for such a detailed solution.  

Before trying out the SSE implementation, my supervisor is more interested in
finding a way to correct the sin/cos calculations on the GPU. If that doesn't
work out, your solution will be the best option, and I hope we get rid of the
performance bottleneck.


Just one more question: can I use __m128 to handle floats, and if so, what
should nSSEvectors be? Would it be (no_particles * 3)/4?

2011/8/11 Jim Dempsey <jim at quickthreadprogramming.com>



Try this as a first step:
 
Use sincos(x, &sin_x, &cos_x);
 

for (i = 0; i < no_particles; i++) {
  sincos(mNewParticles[i].at(6), &cu_particles[cnt], &cu_particles[cnt+1]); cnt += 2;
  sincos(mNewParticles[i].at(7), &cu_particles[cnt], &cu_particles[cnt+1]); cnt += 2;
  sincos(mNewParticles[i].at(8), &cu_particles[cnt], &cu_particles[cnt+1]); cnt += 2;
}
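 
(One caveat: sincos() for doubles is a GNU libm extension rather than standard
C/C++. If your libm does not provide it, a minimal stand-in could be a sketch
like this; the name sincos_fallback is illustrative only:)
 
#include <math.h>
 
// Fallback for platforms whose libm lacks the GNU sincos() extension.
// It keeps the calling code identical; the real speedup comes later from
// the SSE/AVX sincos intrinsics.
static inline void sincos_fallback(double x, double* s, double* c)
{
    *s = sin(x);
    *c = cos(x);
}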
 
The second step:
 
Make the data layout friendly for SSE vectorization.
 
Note that your transport of data into cu_... is organized as a stream of the
sin/cos pairs of the particles and NOT a stream of the entire particle data
(mNewParticles). I recommend that for the above routine (which produces
sin/cos pairs from the input data) you change the CPU-to-GPU data format to a
stream of doubles holding only the at(6), at(7), at(8) of each particle. In
other words, be as cognizant about the streamability of data on the CPU side
as you are on the GPGPU side.
 
Have the CPU buffer that receives the stream of doubles from the GPGPU aligned
on a 16-byte boundary.
Rework the output from the CPU code into two streams: one stream of sines and
one stream of cosines.
These too should be 16-byte aligned.
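 
(A minimal sketch of that layout, assuming _mm_malloc/_mm_free for the 16-byte
alignment; the names n_angles, cu_angles, cu_sin and cu_cos are illustrative,
not the actual bodytrack variables:)
 
#include <xmmintrin.h>   // _mm_malloc / _mm_free with most compilers
 
// Pack only at(6), at(7), at(8) of each particle into one contiguous,
// 16-byte-aligned stream of doubles; sines and cosines get their own streams.
int     n_angles  = no_particles * 3;
double* cu_angles = (double*)_mm_malloc(n_angles * sizeof(double), 16);
double* cu_sin    = (double*)_mm_malloc(n_angles * sizeof(double), 16);
double* cu_cos    = (double*)_mm_malloc(n_angles * sizeof(double), 16);
 
for (int i = 0, cnt = 0; i < no_particles; ++i) {
    cu_angles[cnt++] = mNewParticles[i].at(6);
    cu_angles[cnt++] = mNewParticles[i].at(7);
    cu_angles[cnt++] = mNewParticles[i].at(8);
}
// ... compute the sines/cosines with doSinCos() below, then stream the
// results to the GPU; release the buffers with _mm_free() when done ...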
 
Then use:


extern __m128d _mm_sincos_pd(__m128d *p_cos, __m128d v1);      // most CPUs support this (SSE2)
extern __m256d _mm256_sincos_pd(__m256d *p_cos, __m256d v1);   // CPUs with AVX (Sandy Bridge)
 
void doSinCos(__m128d* p_X, __m128d* p_sin, __m128d* p_cos, int nSSEvectors)
{
    // nSSEvectors is (no_particles * 3) / 2    // non-AVX, 2 doubles per __m128d
    // You handle the remaining odd data value (or pad to even with 0.0)
    for (int i = 0; i < nSSEvectors; ++i)
        p_sin[i] = _mm_sincos_pd(&p_cos[i], p_X[i]);   // v1 is passed by value
}
 
The work to this point is to improve your serial CPU-side code to be as close
to optimal as possible.
I will let you rework the above in the event you have a CPU with AVX (it
handles 4-up double small vectors); a sketch of that rework follows.
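 
(A minimal sketch of that AVX rework, assuming the _mm256_sincos_pd SVML
intrinsic declared above and 32-byte-aligned buffers; the name doSinCosAVX is
just for the sketch:)
 
void doSinCosAVX(__m256d* p_X, __m256d* p_sin, __m256d* p_cos, int nSSEvectors)
{
    // nSSEvectors is (no_particles * 3) / 4    // AVX: 4 doubles per __m256d
    // Pad the tail with 0.0 if no_particles * 3 is not a multiple of 4.
    for (int i = 0; i < nSSEvectors; ++i)
        p_sin[i] = _mm256_sincos_pd(&p_cos[i], p_X[i]);
}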
 
The next step would be to see if you can multi-thread the CPU side. For this
I would suggest using OpenMP over pthreads (easier to do)
 
void doSinCos(__m128d* p_X, __m128d* p_sin, __m128d* p_cos, int nSSEvectors)
{
    // nSSEvectors is (no_particles * 3) / 2    // non-AVX
    // You handle the remaining odd data value (or pad to even with 0.0)
    #pragma omp parallel for
    for (int i = 0; i < nSSEvectors; ++i)
        p_sin[i] = _mm_sincos_pd(&p_cos[i], p_X[i]);
}
 
The principal optimization strategy above is: Vector-Inner, Parallel-Outer.
 
Jim Dempsey
 

  _____  


From: parsec-users-bounces at lists.cs.princeton.edu
[mailto:parsec-users-bounces at lists.cs.princeton.edu] On Behalf Of aftab
hussain

Sent: Thursday, August 11, 2011 12:51 AM
To: PARSEC Users
Subject: Re: [parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues


Hi Matt, 
             Thanks for the link. Sorry for the confusion about the memcpy on
the host. It is right, though, that when I do the memcpy on the host from the
CPU data structure to the GPU data structure, there is a performance
bottleneck. I think the following code will help explain it.

for (i = 0; i < no_particles; i++) {
  cu_particles[cnt] = sin(mNewParticles[i].at(6)); cnt++;
  cu_particles[cnt] = cos(mNewParticles[i].at(6)); cnt++;
  cu_particles[cnt] = sin(mNewParticles[i].at(7)); cnt++;
  cu_particles[cnt] = cos(mNewParticles[i].at(7)); cnt++;
  cu_particles[cnt] = sin(mNewParticles[i].at(8)); cnt++;
  cu_particles[cnt] = cos(mNewParticles[i].at(8)); cnt++;
}

The above code is used for copying one limb. I know that there are cache
issues with it because the 6th, 7th and 8th elements are taken from successive
mNewParticles vectors, not from the same mNewParticles vector. The same
procedure is repeated for all the limbs (10 limbs in the bodytrack code). But
I don't have any other idea for doing this except to use pthreads for the
above calculations and try to avoid the cache issues somehow. I have not tried
the pthreads idea yet, but I am planning to if I don't get anywhere with the
sin/cos precision issue on the GPU.

What I meant is that the copy from CPU to GPU (cudaMemcpy) itself is not an
issue once I have the GPU data structure ready from the host-side copy shown
above.

Regarding the optimization options to the compiler, I am not using
--use-fast-math or --unsafe-optimizations. I have tried compiling the code
with both sm_20 and the default (sm_10). Both have the precision issues I
discussed. Compiling for sm_20 does enable double precision on Fermi
(GTX 480).

Now regarding Jim's suggestions (thanks Jim): I have been thinking about them
and about overlapped communication and computation. Bodytrack is a
data-parallel application, and the code I have written for the GPU is based on
this fact. The pthreads and OpenMP versions of Bodytrack are also data
parallel.

I did not understand the following point from Jim:

---------------------------------------------------------------------------------------------------------------

1) Move the observation point from where you measure the arc angle between
particles such that the angle differences tend to be large.
2) When performing calculations containing both large numbers and small
numbers see if you can apply a bias to the large number(s) such that it
becomes small-ish. This avoids drop-off of precision bits in the binary
mantissa. Example
 
      result = func(large, small)
 
becomes
 
     result = func(large-bias, small) + bias
---------------------------------------------------------------------------------------------------------------

Actually I don't calculate arc angles between the particles (mentioned in
point 1, or at least that is what I understood from it). What actually happens
is this: suppose I have the following 31 values, all in radians except the
ones that were shown in bold (presumably the large values 134.693, 940.095 and
867.287).

0.0915466, 0.0118652, 0.957689, 134.693, 940.095, 867.287, -0.106135,
-0.1814, -0.138087, -0.0927003, -0.1666, 0.196366, 2.72613, 0.0793269,
0.361596, -0.141301, 1.84194, 0.311537, 1.38892, 0.853814, -0.331812,
-0.512472, 0.857448, 1.72155, -0.725819, 0.459579, 0.0942842, -0.0892392,
0.143076, -0.0309185, 0.383283

I have N such arrays of 31 elements. I will call one array of 31 elements a
particle.

The first six values correspond to the first limb (the torso). I calculate the
sin and cos of the first three values and form rotation matrices about the x,
y and z dimensions respectively. I do the same for the other values in all the
N particles, except the bold ones.
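
(Roughly, the per-axis rotation matrices built from those sin/cos values look
like the sketch below; this is illustrative only, not the actual bodytrack
code, and any sin/cos error lands directly in the matrix entries:)

#include <math.h>

// 3x3 rotation matrices about x, y and z, row-major in double R[9].
void rotX(double a, double R[9]) {
    double s = sin(a), c = cos(a);
    double M[9] = { 1, 0, 0,   0, c, -s,   0, s, c };
    for (int i = 0; i < 9; ++i) R[i] = M[i];
}
void rotY(double a, double R[9]) {
    double s = sin(a), c = cos(a);
    double M[9] = { c, 0, s,   0, 1, 0,   -s, 0, c };
    for (int i = 0; i < 9; ++i) R[i] = M[i];
}
void rotZ(double a, double R[9]) {
    double s = sin(a), c = cos(a);
    double M[9] = { c, -s, 0,   s, c, 0,   0, 0, 1 };
    for (int i = 0; i < 9; ++i) R[i] = M[i];
}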

Now I don't understand how I would use Jim's suggestion about the bias in the
above scenario.

Thanks and sorry for such a long email.

On Wed, Aug 10, 2011 at 12:43 PM, Matt Sinclair <msinclair at wisc.edu> wrote:


Hi Aftab,

Here's the link to the video I referenced from GTC.  Like I said, I'm
not sure whether it will be of direct help, but he also had
the same/similar problem:

http://nvidia.fullviewmedia.com/gtc2010/0922-guadalupe-2082.html

I believe that there is a way to tell the compiler not to use as many
FMAs.  Are you using the --use-fast-math compiler option (or the
--unsafe-optimizations (sp?) option)?  If so, you might try not using
it and see if that helps with performance/rounding.  It might also
remove the issues with the other exponentials.  In my GPU work, I've
never had to use those functions, and everything was ok precision-wise
by not using them.  Also, if you're using CUDA < 4.0 on a Fermi GPU, I'm
not sure what that means when you set your "sm" option in the
makefile.  I guess a better question would be: are you using your
own makefile or the common.mk structure that the SDK examples use?

In regards to the memcpy thing, in your original email you stated that
adding the extra copy (in order to do the sin/cos on the host) was
prohibitive and hurting performance, but now you're saying it isn't
hurting performance, so I'm a bit confused...

Additionally, I would suggest looking into the questions/suggestions
Jim posed in his email.

Finally, what is your plan for releasing this when you're done?

Thanks,
Matt

2011/8/10 aftab hussain <aftab.hussain at seecs.edu.pk>:

> Thanks Matt,
>             Actually I am using the non-native versions of sin/cos (sinf,
> cosf). I also have been having issues with the results of multiplication,
> division and square root calculations. Specifically the calculations of the
> following format:
> c = a*x +b*y +d   -- resulted into Fused Multiply Add operation on GPU to
> speed it up with less accurate results.
> I worked around the above mentioned problems by using the slower versions of
> the division, multiplication and square root (__fdiv_rn, __fmul_rn etc). But
> I don't have a work around for sin/cos calculations.
> I am using CUDA 3.2 on Fermi (GTX480) GPU. In my implementation the memory
> transfer from CPU to GPU and GPU to CPU is not a problem and it takes quite
> less time.
> If the work around of the talk at GTC 2010 would help me, I would definitely
> like to have a look. Can you please send me the link to the paper/Talk?
> Thanks for your answer again.
>
>
>
> On Tue, Aug 9, 2011 at 5:30 AM, Matt Sinclair <msinclair at wisc.edu> wrote:
>>
>> Hi Aftab,
>>
>> What version of sine and cosine are you using for your GPU kernels?
>> Are you using the native ones?  Because those are less precise than
>> the slower, non-native ones.  So, if you're using the native ones,
>> even though it will hurt performance, you might try them and see if
>> they solve your issue.  Also, there was a talk @ GTC 2010 that dealt
>> with the imprecision of the sin/cos functions in CUDA and how they
>> affected some astronomy calculations, and how they got around them.  I
>> can send a link to it if you think that would be helpful.
>>
>> Also, what version of CUDA are you using (I'm assuming you're using
>> CUDA?)?  If you're using 4.0+, then you might be able to look into
>> their overlapping memory transfers, which would alleviate some of the
>> performance bottlenecks you're seeing.  If you're using OpenCL, are
>> you setting the memory transferring to be blocking or non-blocking?
>>
>> I've done quite a bit of work myself on porting the PARSEC benchmarks
>> to GPUs, and I thought bodytrack was a pretty tough one to easily port
>> (just because of how it's written, and the fact that there's so much
>> code), so good for you to have made this much progress!  What are your
>> plans on releasing it eventually?
>>
>> Thanks,
>> Matt
>>
>> 2011/8/9 aftab hussain <aftab.hussain at seecs.edu.pk>:
>> > Dear All,
>> >              I am trying to port Bodytrack application to GP-GPUs as my
>> > MS
>> > thesis. I have a working code but my tracking results are screwed.
>> > When I further investigated the code I found that the difference in
>> > sin/cos
>> > calculations on CPU and GPU are messing things up.
>> > For some particles the difference (error up to the 6th-7th decimal place)
>> > in
>> > sin/cos calculations gets accumulated in later stages
>> > (Body Geometry calculations, projection calculations, Error term
>> > calculations). In the edge error term calculations I get one extra
>> > sample point due to which the error weight gets changed and the final
>> > normalized weight for that particular particle is different
>> > up to the 4th decimal place (a lot of error). And this is in the
>> > Initialization
>> > stage of the particle filter (weight calculation).
>> > This in turn produces error for the next iterations because in the
>> > particle
>> > generation stage for the next iteration, a wrong particle is
>> > selected which further introduces error and finally the estimate for a
>> > frame
>> > is very different from the CPU estimate.
>> > I have the following stages implemented on GPU because these are the
>> > most
>> > compute intensive stages of the application.
>> > 1- Body Geometry
>> > 2- Projection Calculation
>> > 3- Error Terms (Inside Error Term, Edge Error Term)
>> > When I move the sin/cos calculation to CPU, the improvement in execution
>> > time I get on the GPU stages is screwed up by the particle generation
>> > stage, because I have to arrange the data structure suitable for the GPU
>> > implementation (copy from the CPU data structure to the GPU data
>> > structure, plus the sin/cos calculation) to get a speed-up in execution.
>> > The overall application speed-up is not very interesting due to this
>> > problem.
>> > Can anyone help me with this issue, because my thesis is stuck due to this
>> > problem.
>> > --
>> > Best Regards
>> >
>> > Aftab Hussain
>> > Research Assistant,
>> > High Performance Computing Lab,
>> > NUST School of Electrical Engineering and Computer Science
>> > +923225046338
>> >
>> > _______________________________________________
>> > parsec-users mailing list
>> > parsec-users at lists.cs.princeton.edu
>> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>> >
>> >
>> _______________________________________________
>> parsec-users mailing list
>> parsec-users at lists.cs.princeton.edu
>> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>
>
>
> --
> Best Regards
>
> Aftab Hussain
> Research Assistant,
> High Performance Computing Lab,
> NUST School of Electrical Engineering and Computer Science
> +923225046338
>
> _______________________________________________
> parsec-users mailing list
> parsec-users at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>
>
_______________________________________________
parsec-users mailing list
parsec-users at lists.cs.princeton.edu
https://lists.cs.princeton.edu/mailman/listinfo/parsec-users





-- 

Best Regards

Aftab Hussain
Research Assistant,
High Performance Computing Lab,
NUST School of Electrical Engineering and Computer Science
+923225046338


_______________________________________________
parsec-users mailing list
parsec-users at lists.cs.princeton.edu
https://lists.cs.princeton.edu/mailman/listinfo/parsec-users






-- 

Best Regards

Aftab Hussain
Research Assistant,
High Performance Computing Lab,
NUST School of Electrical Engineering and Computer Science
+923225046338


