[parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues

Matt Sinclair msinclair at wisc.edu
Wed Aug 10 15:43:10 EDT 2011

Hi Aftab,

Here's the link to the video I referenced from GTC.  As I said, I'm
not sure whether it will be of direct help, but the speaker had the
same or a similar problem:


I believe that there is a way to tell the compiler not to use as many
FMAs.  Are you using the --use_fast_math compiler option (or the
--unsafe-optimizations (sp?) option)?  If so, you might try not using
it and see if that helps with performance/rounding.  It might also
remove the issues with the other exponentials.  In my GPU work, I've
never had to use those functions, and everything was ok precision-wise
by not using them.  Also, if you're using CUDA < 4.0 on a Fermi GPU, I'm
not sure what that means when you set your "sm" option in the
makefile.  I guess a better question would be -- are you using your
own makefile or the common.mk structure that the SDK examples use?
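
To make the FMA point concrete, here is a minimal host-side sketch in
plain C++ (not CUDA, and not code from Bodytrack) of why a fused
multiply-add can give a different answer than separate multiplies and
adds for an expression of the form c = a*x + b*y + d.  The function
names (axby_unfused, axby_fused, fma_gap_example) and the input values
are made up for illustration.  It assumes IEEE 754 single precision with
round-to-nearest and no fast-math style host compiler flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// c = a*x + b*y + d with every intermediate result rounded to float.
// The volatile temporaries are only there to stop the host compiler
// from contracting the expression into FMAs by itself.
float axby_unfused(float a, float x, float b, float y, float d) {
    volatile float ax = a * x;     // rounded product
    volatile float by = b * y;     // rounded product
    volatile float sum = ax + by;  // rounded sum
    return sum + d;
}

// One possible fused form of the same expression: each std::fma
// computes its product exactly and rounds only once, after the add.
float axby_fused(float a, float x, float b, float y, float d) {
    return std::fma(a, x, std::fma(b, y, d));
}

// Hand-picked input (an assumption for this demo) where the two forms
// disagree.  a*a is exactly 1 + 2^-11 + 2^-24, which does not fit in a
// float: rounding the product drops the 2^-24 term, the fused form
// keeps it.  Returns fused - unfused, which is 2^-24 here.
float fma_gap_example() {
    const float a = 1.0f + std::ldexp(1.0f, -12);
    const float d = -(1.0f + std::ldexp(1.0f, -11));
    const float unfused = axby_unfused(a, a, 0.0f, 0.0f, d);
    const float fused = axby_fused(a, a, 0.0f, 0.0f, d);
    assert(fused != unfused);
    std::printf("unfused = %.9g, fused = %.9g\n", unfused, fused);
    return fused - unfused;
}
```

Neither answer is "wrong"; they are just rounded differently, so a GPU
build that fuses will not match a CPU build that does not bit for bit.
On the CUDA side, the __fmul_rn and __fadd_rn intrinsics are documented
as never being merged into an FMA, which is why they work as a
workaround.  nvcc also has an --fmad=false option that turns contraction
off for the whole compilation, but I'm not sure it is available in every
toolkit version, so check the nvcc documentation for yours.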

Regarding the memcpy issue: in your original email you stated that
adding the extra copy (in order to do the sin/cos on the host) was
prohibitive and hurting performance, but now you're saying the
transfers aren't a problem, so I'm a bit confused...

Additionally, I would suggest looking into the questions/suggestions
Jim posed in his email.

Finally, what is your plan for releasing this when you're done?


2011/8/10 aftab hussain <aftab.hussain at seecs.edu.pk>:
> Thanks Matt,
> Actually, I am using the non-native versions of sin/cos (sinf,
> cosf). I have also been having issues with the results of multiplication,
> division and square root calculations, specifically calculations of the
> following form:
> c = a*x + b*y + d
> This was compiled into fused multiply-add (FMA) operations on the GPU,
> which speeds it up but gives less accurate results.
> I worked around these problems by using the slower versions of
> division, multiplication and square root (__fdiv_rn, __fmul_rn etc.). But
> I don't have a workaround for the sin/cos calculations.
> I am using CUDA 3.2 on a Fermi (GTX480) GPU. In my implementation the
> memory transfers from CPU to GPU and GPU to CPU are not a problem; they
> take very little time.
> If the workaround from the GTC 2010 talk might help me, I would definitely
> like to have a look. Can you please send me the link to the paper/talk?
> Thanks again for your answer.
> On Tue, Aug 9, 2011 at 5:30 AM, Matt Sinclair <msinclair at wisc.edu> wrote:
>> Hi Aftab,
>> What version of sine and cosine are you using for your GPU kernels?
>> Are you using the native ones?  Because those are less precise than
>> the slower, non-native ones.  So, if you're using the native ones,
>> you might try the non-native ones, even though it will hurt
>> performance, and see if they solve your issue.  Also, there was a
>> talk @ GTC 2010 that dealt
>> with the imprecision of the sin/cos functions in CUDA and how they
>> affected some astronomy calculations, and how they got around them.  I
>> can send a link to it if you think that would be helpful.
>> Also, what version of CUDA are you using (I'm assuming you're using
>> CUDA?)?  If you're using 4.0+, then you might be able to look into
>> their overlapping memory transfers, which would alleviate some of the
>> performance bottlenecks you're seeing.  If you're using OpenCL, are
>> you setting the memory transfers to be blocking or non-blocking?
>> I've done quite a bit of work myself on porting the PARSEC benchmarks
>> to GPUs, and I thought bodytrack was a pretty tough one to easily port
>> (just because of how it's written, and the fact that there's so much
>> code), so good for you to have made this much progress!  What are your
>> plans on releasing it eventually?
>> Thanks,
>> Matt
>> 2011/8/9 aftab hussain <aftab.hussain at seecs.edu.pk>:
>> > Dear All,
>> > I am trying to port the Bodytrack application to GP-GPUs for my MS
>> > thesis. I have working code, but my tracking results are wrong.
>> > When I investigated further, I found that differences between the
>> > sin/cos results on the CPU and the GPU are causing the problem.
>> > For some particles, the difference in the sin/cos results (an error in
>> > the 6th-7th decimal place) accumulates through the later stages
>> > (Body Geometry, projection, and error term calculations). In the edge
>> > error term calculation I get one extra sample point, which changes the
>> > error weight, so the final normalized weight for that particle differs
>> > in the 4th decimal place (a large error). This happens in the
>> > initialization stage of the particle filter (weight calculation).
>> > This in turn corrupts the following iterations: in the particle
>> > generation stage for the next iteration, the wrong particle is
>> > selected, which introduces further error, and the final estimate for a
>> > frame is very different from the CPU estimate.
>> > I have implemented the following stages on the GPU because they are
>> > the most compute-intensive stages of the application:
>> > 1- Body Geometry
>> > 2- Projection Calculation
>> > 3- Error Terms (Inside Error Term, Edge Error Term)
>> > When I move the sin/cos calculation to the CPU, the execution-time
>> > improvement from the GPU stages is cancelled out by the particle
>> > generation stage, because I have to rearrange the data into a
>> > structure suitable for the GPU implementation (copying from the CPU
>> > data structure to the GPU data structure, plus the sin/cos
>> > calculation). Because of this, the overall application speedup is not
>> > very interesting.
>> > Can anyone help me with this issue? My thesis is stuck because of this
>> > problem.
>> > --
>> > Best Regards
>> >
>> > Aftab Hussain
>> > Research Assistant,
>> > High Performance Computing Lab,
>> > NUST School of Electrical Engineering and Computer Science
>> > +923225046338
>> >
>> > _______________________________________________
>> > parsec-users mailing list
>> > parsec-users at lists.cs.princeton.edu
>> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
