[parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues

aftab hussain aftab.hussain at seecs.edu.pk
Thu Aug 11 01:51:10 EDT 2011


Hi Matt,
             Thanks for the link, and sorry for the confusion about the memcpy
on the host. To be clear: when I copy from the CPU data structure into the
GPU data structure on the host, that copy is a performance bottleneck. I
think the following code will help explain it.

for (i = 0; i < no_particles; i++) {
    // pack sin/cos of angles 6, 7 and 8 of each particle into the flat GPU-side array
    cu_particles[cnt] = sin(mNewParticles[i].at(6)); cnt++;
    cu_particles[cnt] = cos(mNewParticles[i].at(6)); cnt++;
    cu_particles[cnt] = sin(mNewParticles[i].at(7)); cnt++;
    cu_particles[cnt] = cos(mNewParticles[i].at(7)); cnt++;
    cu_particles[cnt] = sin(mNewParticles[i].at(8)); cnt++;
    cu_particles[cnt] = cos(mNewParticles[i].at(8)); cnt++;
}

The code above copies the data for one limb. I know it has cache issues,
because elements 6, 7 and 8 are read from successive mNewParticles vectors
rather than from the same vector. The same procedure is repeated for all of
the limbs (10 limbs in the bodytrack code). The only alternative I can think
of is to use pthreads for these calculations and try to mitigate the cache
issues that way; a rough sketch of that idea follows below. I have not tried
it yet, but I plan to if I get nowhere with the sin/cos precision issue on
the GPU.
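
If I do go the threading route, the packing loop itself should parallelize
cleanly once the destination index is computed from i instead of the running
counter. Below is only a sketch of the idea (OpenMP just to keep it short; I
am assuming a plain vector-of-float-vectors for the particle container, each
particle contributing exactly six floats per limb, and limb_offset standing
for the value cnt has when this limb's copy starts -- none of these are the
actual identifiers in my code):

#include <cmath>
#include <vector>

// Sketch only: pack sin/cos of angles 6-8 of every particle for one limb.
// Computing the destination index from i removes the loop-carried cnt,
// so the iterations are independent and can be split across threads.
void pack_limb(float *cu_particles,
               const std::vector< std::vector<float> > &particles,
               int no_particles, int limb_offset)
{
    #pragma omp parallel for
    for (int i = 0; i < no_particles; i++) {
        int base = limb_offset + 6 * i;
        cu_particles[base + 0] = sin(particles[i].at(6));
        cu_particles[base + 1] = cos(particles[i].at(6));
        cu_particles[base + 2] = sin(particles[i].at(7));
        cu_particles[base + 3] = cos(particles[i].at(7));
        cu_particles[base + 4] = sin(particles[i].at(8));
        cu_particles[base + 5] = cos(particles[i].at(8));
    }
}

This does not remove the strided reads across particles, but it would at
least spread them over the cores.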

What I meant is that the copy from CPU to GPU (the cudaMemcpy itself) is not
a problem once the GPU data structure has been prepared by the code above.

Regarding the compiler optimization options: I am not using --use-fast-math
or --unsafe-optimizations. I have tried compiling with both sm_20 and the
default target (sm_10), and both show the precision issues I described.
Compiling for sm_20 does enable double precision on Fermi (GTX 480).
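
One thing I could still try with sm_20 is doing the sin/cos in double
precision on the device and narrowing the result to float afterwards. A
rough sketch (the names d_angles, d_sin, d_cos and the kernel itself are
just placeholders, not code from my port):

// Sketch only: compute sin/cos in double precision (available on sm_13 and
// above) and narrow to float, to see whether the CPU/GPU mismatch shrinks.
__global__ void sincos_double(const float *d_angles, float *d_sin,
                              float *d_cos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double a = (double)d_angles[i];
        d_sin[i] = (float)sin(a);   // double-precision sin/cos in device code
        d_cos[i] = (float)cos(a);
    }
}

// launch: sincos_double<<<(n + 255) / 256, 256>>>(d_angles, d_sin, d_cos, n);

If the mismatch really comes from single-precision sinf/cosf, this should
shrink it, at the cost of the lower double-precision throughput of the
GTX 480.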

Now regarding Jim's suggestions (thanks, Jim). I have been thinking about
them, including overlapping communication with computation. Bodytrack is a
data-parallel application, and the GPU code I have written is based on that
fact; the pthreads and OpenMP versions of bodytrack are also data parallel.

I did not understand the following points from Jim:

---------------------------------------------------------------------------------------------------------------
1) Move the observation point from where you measure the arc angle between
particles such that the angle differences tend to be large.
2) When performing calculations containing both large numbers and small
numbers see if you can apply a bias to the large number(s) such that it
becomes small-ish. This avoids drop-off of precision bits in the binary
mantissa. Example

      result = func(large, small)

becomes

     result = func(large-bias, small) + bias
-----------------------------------------------------------------------------------------------------------------

Actually, I don't calculate arc angles between the particles (point 1, as I
understood it). What actually happens is this: suppose I have the following
31 values, all in radians except the ones in bold.

0.0915466, 0.0118652, 0.957689, *134.693, 940.095, 867.287*, -0.106135,
-0.1814, -0.138087, -0.0927003, -0.1666, 0.196366, 2.72613, 0.0793269,
0.361596, -0.141301, 1.84194, 0.311537, 1.38892, 0.853814, -0.331812,
-0.512472, 0.857448, 1.72155, -0.725819, 0.459579, 0.0942842, -0.0892392,
0.143076, -0.0309185, 0.383283

I have *N* such arrays of 31 elements; I will call each of these arrays one
particle.

The first six values correspond to the first limb (the torso). I calculate
*sin* and *cos* of the first three values and form rotation matrices about
the x, y and z axes respectively (a rough sketch follows below). I do the
same for the corresponding values in all the *N* particles, except for the
bold ones.
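
Just to make the rotation matrices concrete, this is roughly what gets built
from the first three values of a particle (the function and variable names
are only for illustration, not the actual bodytrack identifiers):

#include <cmath>

// Illustration only: standard rotation matrices about x, y and z built from
// the sin/cos of the first three particle values (angles in radians).
static void build_rotations(const float a[3],
                            float Rx[3][3], float Ry[3][3], float Rz[3][3])
{
    const float sx = sinf(a[0]), cx = cosf(a[0]);
    const float sy = sinf(a[1]), cy = cosf(a[1]);
    const float sz = sinf(a[2]), cz = cosf(a[2]);

    // rotation about x
    Rx[0][0] = 1;   Rx[0][1] = 0;    Rx[0][2] = 0;
    Rx[1][0] = 0;   Rx[1][1] = cx;   Rx[1][2] = -sx;
    Rx[2][0] = 0;   Rx[2][1] = sx;   Rx[2][2] = cx;

    // rotation about y
    Ry[0][0] = cy;  Ry[0][1] = 0;    Ry[0][2] = sy;
    Ry[1][0] = 0;   Ry[1][1] = 1;    Ry[1][2] = 0;
    Ry[2][0] = -sy; Ry[2][1] = 0;    Ry[2][2] = cy;

    // rotation about z
    Rz[0][0] = cz;  Rz[0][1] = -sz;  Rz[0][2] = 0;
    Rz[1][0] = sz;  Rz[1][1] = cz;   Rz[1][2] = 0;
    Rz[2][0] = 0;   Rz[2][1] = 0;    Rz[2][2] = 1;
}

Any error in the sin/cos values feeds straight into these matrices and from
there into the body geometry and projection stages, which is why differences
in the 6th-7th decimal place end up mattering.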

What I don't understand is how I would apply Jim's bias suggestion in this
scenario.
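
The closest reading I can make of it is the periodic-function case sketched
below, but I am not sure it fits my scenario, since the large (bold) values
never go through sin/cos. This is only my guess at what Jim means, not
something I have tried:

#include <cmath>

// Tentative reading of the bias idea for a periodic function: choose the
// bias as a whole multiple of 2*pi (computed in double), so the argument
// handed to sinf/cosf is small and keeps its low mantissa bits.
static float reduce_angle(float a)
{
    const double two_pi = 6.283185307179586;
    double k = floor((double)a / two_pi);    // bias = k * 2*pi
    return (float)((double)a - k * two_pi);  // remainder in [0, 2*pi)
}

// Mathematically sin(a - k*2*pi) == sin(a), but the reduction is done in
// double, so the small remainder keeps essentially all the information that
// was in a, instead of relying on sinf's own internal argument reduction.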

Thanks and sorry for such a long email.

On Wed, Aug 10, 2011 at 12:43 PM, Matt Sinclair <msinclair at wisc.edu> wrote:

> Hi Aftab,
>
> Here's the link to the video I referenced from GTC.  Like I said, I'm
> not sure if it will be of direct help or not though, but he also had
> the same/similar problem:
>
> http://nvidia.fullviewmedia.com/gtc2010/0922-guadalupe-2082.html
>
> I believe that there is a way to tell the compiler not to use as many
> FMAs.  Are you using the --use-fast-math compiler option (or the
> --unsafe-optimizations (sp?) option)?  If so, you might try not using
> it and see if that helps with performance/rounding.  It might also
> remove the issues with the other exponentials.  In my GPU work, I've
> never had to use those functions, and everything was ok precision-wise
> by not using them.  Also, if you're using CUDA < 4.0 on a Fermi GPU, I'm
> not sure what that means when you set your "sm" option in the
> makefile.  I guess a better question would be -- are you using your
> own makefile or the common.mk structure that the SDK examples use?
>
> In regards to the memcpy thing, in your original email you stated that
> adding the extra copy (in order to do the sin/cos on the host) was
> prohibitive and hurting performance, but now you're saying it isn't
> hurting performance, so I'm a bit confused...
>
> Additionally, I would suggest looking into the questions/suggestions
> Jim posed in his email.
>
> Finally, what is your plan for releasing this when you're done?
>
> Thanks,
> Matt
>
> 2011/8/10 aftab hussain <aftab.hussain at seecs.edu.pk>:
> > Thanks Matt,
> >             Actually I am using the non-native versions of sin/cos (sinf,
> > cosf). I also have been having issues with the results of multiplication,
> > division and square root calculations. Specifically the calculations of
> the
> > following format:
> > c = a*x +b*y +d   -- resulted into Fused Multiply Add operation on GPU to
> > speed it up with less accurate results.
> > I worked around the above mentioned problems by using the slower versions
> of
> > the division, multiplication and square root (__fdiv_rn, __fmul_rn etc).
> But
> > I don't have a work around for sin/cos calculations.
> > I am using CUDA 3.2 on Fermi (GTX480) GPU. In my implementation the
> memory
> > transfer from CPU to GPU and GPU to CPU is not a problem and it takes
> very little time.
> > If the work around of the talk at GTC 2010 would help me, I would
> definitely
> > like to have a look. Can you please send me the link to the paper/Talk?
> > Thanks for your answer again.
> >
> >
> >
> > On Tue, Aug 9, 2011 at 5:30 AM, Matt Sinclair <msinclair at wisc.edu>
> wrote:
> >>
> >> Hi Aftab,
> >>
> >> What version of sine and cosine are you using for your GPU kernels?
> >> Are you using the native ones?  Because those are less precise than
> >> the slower, non-native ones.  So, if you're using the native ones,
> >> even though it will hurt performance, you might try them and see if
> >> they solve your issue.  Also, there was a talk @ GTC 2010 that dealt
> >> with the imprecision of the sin/cos functions in CUDA and how they
> >> affected some astronomy calculations, and how they got around them.  I
> >> can send a link to it if you think that would be helpful.
> >>
> >> Also, what version of CUDA are you using (I'm assuming you're using
> >> CUDA?)?  If you're using 4.0+, then you might be able to look into
> >> their overlapping memory transfers, which would alleviate some of the
> >> performance bottlenecks you're seeing.  If you're using OpenCL, are
> >> you setting the memory transferring to be blocking or non-blocking?
> >>
> >> I've done quite a bit of work myself on porting the PARSEC benchmarks
> >> to GPUs, and I thought bodytrack was a pretty tough one to easily port
> >> (just because of how it's written, and the fact that there's so much
> >> code), so good for you to have made this much progress!  What are your
> >> plans on releasing it eventually?
> >>
> >> Thanks,
> >> Matt
> >>
> >> 2011/8/9 aftab hussain <aftab.hussain at seecs.edu.pk>:
> >> > Dear All,
> >> >              I am trying to port Bodytrack application to GP-GPUs as
> my
> >> > MS
> >> > thesis. I have a working code but my tracking results are screwed.
> >> > When I further investigated the code I found that the difference in
> >> > sin/cos
> >> > calculations on CPU and GPU are messing things up.
> >> > For some particles the difference (error up to the 6th-7th decimal place)
> >> > in
> >> > sin/cos calculations gets accumulated in later stages
> >> > (Body Geometry calculations, projection calculations, Error term
> >> > calculations). In the edge error term calculations I get one extra
> >> > sample point due to which the error weight gets changed and the final
> >> > normalized weight for that particular particle is different
> >> > upto 4th decimal place (a lot of error). And this is in the
> >> > Initialization
> >> > stage of the particle filter (weight calculation).
> >> > This in turn produces error for the next iterations because in the
> >> > particle
> >> > generation stage for the next iteration, a wrong particle is
> >> > selected which further introduces error and finally the estimate for a
> >> > frame
> >> > is very different from the CPU estimate.
> >> > I have the following stages implemented on GPU because these are the
> >> > most
> >> > compute intensive stages of the application.
> >> > 1- Body Geometry
> >> > 2- Projection Calculation
> >> > 3- Error Terms (Inside Error Term, Edge Error Term)
> >> > When I move the sin/cos calculation to CPU, the improvement in
> execution
> >> > time I get on the GPU stages is screwed up by the particle generation
> >> > stage because I have to arrange (copy from CPU data structure to GPU
> >> > data
> >> > structure plus sin/cos calculation) the data structure suitable for
> GPU
> >> > implementation that gives speed up in the execution. The overall
> >> > application
> >> > speed up is not very interesting due to this problem.
> >> > Can anyone help me with this issue, because my thesis is stuck due to this
> >> > problem.
> >> > --
> >> > Best Regards
> >> >
> >> > Aftab Hussain
> >> > Research Assistant,
> >> > High Performance Computing Lab,
> >> > NUST School of Electrical Engineering and Computer Science
> >> > +923225046338
> >> >
> >> > _______________________________________________
> >> > parsec-users mailing list
> >> > parsec-users at lists.cs.princeton.edu
> >> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
> >> >
> >> >
> >> _______________________________________________
> >> parsec-users mailing list
> >> parsec-users at lists.cs.princeton.edu
> >> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
> >
> >
> >
> > --
> > Best Regards
> >
> > Aftab Hussain
> > Research Assistant,
> > High Performance Computing Lab,
> > NUST School of Electrical Engineering and Computer Science
> > +923225046338
> >
> > _______________________________________________
> > parsec-users mailing list
> > parsec-users at lists.cs.princeton.edu
> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
> >
> >
> _______________________________________________
> parsec-users mailing list
> parsec-users at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>



-- 
Best Regards

Aftab Hussain
Research Assistant,
High Performance Computing Lab,
NUST School of Electrical Engineering and Computer Science
+923225046338