[parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues

aftab hussain aftab.hussain at seecs.edu.pk
Wed Aug 10 02:25:43 EDT 2011

Thanks Matt,

I am actually using the non-native versions of sin/cos (sinf, cosf). I have
also been having issues with the results of multiplication, division and
square root calculations, specifically with calculations of the following
form:

c = a*x + b*y + d   -- compiled into fused multiply-add (FMA) operations on
the GPU, which is faster but gives results that do not match the CPU's.
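
A minimal host-side sketch of why a fused multiply-add can disagree with a separate multiply and add. It is illustrative only and is not code from the Bodytrack port: the function names and the input values are assumptions chosen so that the two paths differ, and it uses a single product a*x + d rather than the full expression above. A fused operation rounds once, while the separate path rounds the product first. In CUDA, `__fmul_rn` and `__fadd_rn` are documented as never being merged into an FMA, which corresponds to the separately rounded path here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Separately rounded: the product is rounded to float before the add.
// The volatile store keeps the compiler from contracting this into an FMA.
float mul_then_add(float a, float x, float d) {
    volatile float product = a * x;  // rounding #1
    return product + d;              // rounding #2
}

// Fused: a*x + d is computed exactly and rounded only once.
float fused_mul_add(float a, float x, float d) {
    return std::fma(a, x, d);
}

// Hypothetical demo inputs (not from the thread) where the paths disagree.
void demo_fma_difference() {
    const float a = 1.0f + std::ldexp(1.0f, -12);     // 1 + 2^-12
    const float d = -(1.0f + std::ldexp(1.0f, -11));  // -(1 + 2^-11)
    std::printf("separate: %.9g\n", mul_then_add(a, a, d));
    std::printf("fused:    %.9g\n", fused_mul_add(a, a, d));
}
```

With these inputs the exact product is 1 + 2^-11 + 2^-24. The separate path rounds it to 1 + 2^-11 and returns exactly 0, while the fused path keeps the low bit and returns 2^-24. Neither result is "wrong"; they simply differ, which is enough to make a CPU and a GPU run diverge.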

I worked around these problems by using the slower, explicitly rounded
versions of division, multiplication and square root (__fdiv_rn, __fmul_rn,
etc.). But I don't have a workaround for the sin/cos calculations.
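
One way to quantify the sin/cos discrepancy before looking for a workaround is to measure it in units in the last place (ULPs) rather than in decimal places. The sketch below is a hypothetical helper, not part of Bodytrack or CUDA: `ulp_distance` and `max_ulp_distance` are names introduced here, and the arrays are assumed to hold the same sin/cos results computed once on the CPU and once copied back from the GPU.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Map a float's bit pattern onto a monotonically ordered integer line,
// so that adjacent floats differ by exactly 1.
static std::int64_t ordered_bits(float f) {
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);
    // Negative floats are stored sign-magnitude; mirror them below zero.
    return (i < 0) ? -static_cast<std::int64_t>(i & 0x7fffffff)
                   : static_cast<std::int64_t>(i);
}

// Number of float steps between a and b. 0 means identical values
// (+0 and -0 count as equal). NaN inputs are not handled.
std::int64_t ulp_distance(float a, float b) {
    return std::llabs(ordered_bits(a) - ordered_bits(b));
}

// Largest ULP distance between two result arrays of length n,
// e.g. sin values from the CPU and from the GPU.
std::int64_t max_ulp_distance(const float* cpu, const float* gpu,
                              std::size_t n) {
    std::int64_t worst = 0;
    for (std::size_t k = 0; k < n; ++k) {
        const std::int64_t d = ulp_distance(cpu[k], gpu[k]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

For values near 1, one ULP in single precision is about 1.2e-7, so a difference in the 7th decimal place corresponds to a distance of only a few ULPs. Comparing that number against the maximum ULP error the CUDA documentation lists for sinf/cosf shows whether the GPU results are within the documented tolerance or whether something else is wrong.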

I am using CUDA 3.2 on a Fermi (GTX 480) GPU. In my implementation the memory
transfers between CPU and GPU are not a problem; they take very little
time.

If the workaround from the GTC 2010 talk would help me, I would definitely
like to have a look. Can you please send me the link to the paper/talk?

Thanks for your answer again.

On Tue, Aug 9, 2011 at 5:30 AM, Matt Sinclair <msinclair at wisc.edu> wrote:

> Hi Aftab,
> What version of sine and cosine are you using for your GPU kernels?
> Are you using the native ones?  Because those are less precise than
> the slower, non-native ones.  So, if you're using the native ones,
> you might try the non-native ones (even though it will hurt performance)
> and see if they solve your issue.  Also, there was a talk @ GTC 2010 that dealt
> with the imprecision of the sin/cos functions in CUDA and how they
> affected some astronomy calculations, and how they got around them.  I
> can send a link to it if you think that would be helpful.
> Also, what version of CUDA are you using (I'm assuming you're using
> CUDA?)?  If you're using 4.0+, then you might be able to look into
> their overlapping memory transfers, which would alleviate some of the
> performance bottlenecks you're seeing.  If you're using OpenCL, are
> you setting the memory transferring to be blocking or non-blocking?
> I've done quite a bit of work myself on porting the PARSEC benchmarks
> to GPUs, and I thought bodytrack was a pretty tough one to easily port
> (just because of how it's written, and the fact that there's so much
> code), so good for you to have made this much progress!  What are your
> plans on releasing it eventually?
> Thanks,
> Matt
> 2011/8/9 aftab hussain <aftab.hussain at seecs.edu.pk>:
> > Dear All,
> > I am trying to port the Bodytrack application to GP-GPUs for my MS
> > thesis. I have working code, but my tracking results are wrong.
> > When I investigated further, I found that differences between the
> > sin/cos calculations on the CPU and the GPU are causing the problem.
> > For some particles the difference in the sin/cos results (an error at
> > the 6th-7th decimal place) accumulates in later stages
> > (Body Geometry calculations, projection calculations, Error Term
> > calculations). In the edge error term calculations I get one extra
> > sample point, which changes the error weight, and the final
> > normalized weight for that particular particle differs
> > at the 4th decimal place (a large error). And this is in the
> > initialization stage of the particle filter (weight calculation).
> > This in turn produces errors in the following iterations, because in the
> > particle generation stage for the next iteration a wrong particle is
> > selected, which introduces further error, and finally the estimate for a
> > frame is very different from the CPU estimate.
> > I have the following stages implemented on the GPU because these are the
> > most compute-intensive stages of the application:
> > 1- Body Geometry
> > 2- Projection Calculation
> > 3- Error Terms (Inside Error Term, Edge Error Term)
> > When I move the sin/cos calculation to the CPU, the improvement in
> > execution time I get on the GPU stages is cancelled out by the particle
> > generation stage, because there I have to rearrange the data (copy from
> > the CPU data structure to the GPU data structure, plus the sin/cos
> > calculation) into the layout that gives the GPU implementation its
> > speedup. The overall application speedup is not very interesting due to
> > this problem.
> > Can anyone help me with this issue? My thesis is stuck because of this
> > problem.
> > --
> > Best Regards
> >
> > Aftab Hussain
> > Research Assistant,
> > High Performance Computing Lab,
> > NUST School of Electrical Engineering and Computer Science
> > +923225046338
> >
> > _______________________________________________
> > parsec-users mailing list
> > parsec-users at lists.cs.princeton.edu
> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
> >
> >

Best Regards

Aftab Hussain
Research Assistant,
High Performance Computing Lab,
NUST School of Electrical Engineering and Computer Science