[parsec-users] Porting Bodytrack on GP-GPUs -- Problems andIssues
aftab hussain
aftab.hussain at seecs.edu.pk
Fri Aug 12 08:40:16 EDT 2011
Thanks Jim for such a detailed solution.
Before trying out the SSE implementation, my supervisor is more interested
in finding a way out to correct the sin/cos calculations on GPU, if no luck
your solution will be the best option and I hope we get rid of the
performance bottle neck.
Just one more question, Can I use* __m128* to handle floats and if I can
what should be *nSSEvectors*. Would it be (no_particles * 3)/4?
2011/8/11 Jim Dempsey <jim at quickthreadprogramming.com>
> **
> Try this as a first step:
>
> Use sincos(x, &sin_x, &cos_x);
>
> for (i=0; i < no_particles; i++){
> sincos(mNewParticles[i].at(6), &cu_particles[cnt], &cu_particles[cnt+1]);
> cnt += 2;
> sincos(mNewParticles[i].at(7), &cu_particles[cnt],
> &cu_particles[cnt+1]); cnt += 2;
> sincos(mNewParticles[i].at(8), &cu_particles[cnt], &cu_particles[cnt+1]);
> cnt += 2;
> }
>
> The second step:
>
> Make data layout friendly for SSE vecroization
>
> Note that your transport of data into cu_... is organized as a stream of
> data of the sin/cos pairs of particles and NOT a stream of the entire
> particle data (mNewPrticles). I recommend that for the above routine (to
> produce sin/cos pairs for input data) you change the cu to CPU data format
> to a stream of doubles of only the at(6), at(7), at(8) of each particle. IOW
> be as cognizant (fair play) about streamability of data on the CPU side as
> well as in the GPGPU side.
>
> Have the CPU buffer that receives the stream of doubles from the GPGPU be
> aligned on 16 byte address
> Rework the output from the CPU code into two streams: one for a stream of
> sines and one for a stream of cosines.
> These too should be 16 byte aligned.
>
> Then use:
> extern __m128d _mm_sincos_pd(__m128d *p_cos, __m128d v1); // most
> CPUs support this
> extern __m256d _mm256_sincos_pd(__m256d *p_cos, __m256d v1); // CPU
> with AVX (SandyBridge)
>
> void doSinCos(__m128d* p_X, __m128d* p_sin, __m128d* p_cos, int
> nSSEvectors)
> {
> // nSSEvectors is (no_particles * 3) / 2 // non-AVX
> // You handle remaining odd data value (or padd to even as 0.0)
> for(int i = 0; i < nSSEvectors; ++i)
> p_sin[i] = _mm_sincos_pd(&p_cos[i], &p_X[i]);
> }
>
> The work to this point is to inprove your serial CPU side code as (close
> to) best as possible.
> I will let you rework the above in the event you have a CPU with AVX (it
> handles 4-up doubles small vectors)
>
> The next step would be to see if you can multi-thread the CPU side. For
> this I would suggest using OpenMP over pthreads (easier to do)
>
> void doSinCos(__m128d* p_X, __m128d* p_sin, __m128d* p_cos, int
> nSSEvectors)
> {
> // nSSEvectors is (no_particles * 3) / 2 // non-AVX
> // You handle remaining odd data value (or padd to even as 0.0)
> #pragma omp parallel for
> for(int i = 0; i < nSSEvectors; ++i) {
> p_sin[i] = _mm_sincos_pd(&p_cos[i], &p_X[i]); }
> }
>
> The principle optimization strategy above is: Vector-Inner, Parallel-Outer
>
> Jim Dempsey
>
>
> ------------------------------
> *From:* parsec-users-bounces at lists.cs.princeton.edu [mailto:
> parsec-users-bounces at lists.cs.princeton.edu] *On Behalf Of *aftab hussain
> *Sent:* Thursday, August 11, 2011 12:51 AM
> *To:* PARSEC Users
> *Subject:* Re: [parsec-users] Porting Bodytrack on GP-GPUs -- Problems
> andIssues
>
> Hi Matt,
> Thanks for the link. Sorry for the confusion about the memcpy
> on Host. But its right though that when I do the memcpy on Host from CPU
> data structure to GPU data structure, this has a performance bottle neck. I
> think the following code would help understand it.
>
> for (i=0; i < no_particles; i++){
> cu_particles[cnt] = sin(mNewParticles[i].at(6)); cnt++;
> cu_particles[cnt] = cos(mNewParticles[i].at(6)); cnt++;
> cu_particles[cnt] = sin(mNewParticles[i].at(7)); cnt++;
> cu_particles[cnt] = cos(mNewParticles[i].at(7)); cnt++;
> cu_particles[cnt] = sin(mNewParticles[i].at(8)); cnt++;
> cu_particles[cnt] = cos(mNewParticles[i].at(8)); cnt++;
> }
>
> The above mentioned code is used for copying one limb. I know that there
> are cache issues with the above code because the 6th, 7th and 8th elements
> are taken from successive mNewParticles vectors, not the same mNewParticles
> vector. The same procedure is repeated for all the limbs (10 limbs in the
> bodytrack code). But I don't have any other idea of doing this except use
> pthreads for the above calculations and try to avoid the cache issues
> somehow. I have not tried pthreads idea though but I am planning on doing it
> if I don't get any success with the sin/cos precision issue on GPU.
>
> What I meant about the performance of copying from CPU to GPU(cudaMemcpy)
> does not have any issue once I have the GPU data structure ready from the
> mem copying from the above code.
>
> Regarding the optimization options to the compiler, I am not using
> any --use-fast-math/--unsafe-optimizations. I have tried compiling the code
> with both sm_20 and the default one (sm_10). Both have the precision issues
> I discussed. Compilation option with sm_20 does support double precision on
> Fermi (GTX 480).
>
> Now regarding Jim's suggestion (thanks Jim for your suggestions). I have
> been thinking about his suggestions and the overlapped communication and
> computations, body track is a data parallel application and the code I have
> written on GPU is based on this fact. The Pthread and OpenMP versions of
> Bodytrack are also data parallel.
>
> I did not understand the following point from Jim:
>
>
> ---------------------------------------------------------------------------------------------------------------
> 1) Move the observation point from where you measure the arc angle
> between particles such that the angle differences tend to be large.
> 2) When performing calculations containing both large numbers and small
> numbers see if you can apply a bias to the large number(s) such that it
> becomes small-ish. This avoids drop-off of precision bits in the binary
> mantissa. Example
>
> result = func(large, small)
>
> becomes
>
> result = func(large-bias, small) + bias
>
> -----------------------------------------------------------------------------------------------------------------
>
> Actually I don't calculate the arc angles between the particles (mentioned
> in point 1, what I understood from it). What actually happens is that
> suppose I have the following 31 values in radians except the ones which are
> bold.
>
> *0.0915466, 0.0118652, 0.957689, 134.693, 940.095, 867.287*, -0.106135,
> -0.1814, -0.138087, -0.0927003, -0.1666, 0.196366, 2.72613, 0.0793269,
> 0.361596, -0.141301, 1.84194, 0.311537, 1.38892, 0.853814, -0.331812,
> -0.512472, 0.857448, 1.72155, -0.725819, 0.459579, 0.0942842, -0.0892392,
> 0.143076, -0.0309185, 0.383283
>
> I have *N* such arrays of 31 elements. I will call one array of 31
> elements as one particle.
>
> The first six values correspond to the first limb (Torso). I calculate *sin
> *and *cos *of the first three values and form rotation matrices about x, y
> and z dimentions respectively. I do the same for the other values in all the
> *N* particles except the bold ones.
>
> Now I don't understand how would I use Jim's suggestion about using the
> bias in the above scenario.
>
> Thanks and sorry for such a long email.
>
> On Wed, Aug 10, 2011 at 12:43 PM, Matt Sinclair <msinclair at wisc.edu>wrote:
>
>> Hi Aftab,
>>
>> Here's the link to the video I referenced from GTC. Like I said, I'm
>> not sure if it will be of direct help or not though, but he also had
>> the same/similar problem:
>>
>> http://nvidia.fullviewmedia.com/gtc2010/0922-guadalupe-2082.html
>>
>> I believe that there is a way to tell the compiler not to use as many
>> FMAs. Are you using the --use-fast-math compiler option (or the
>> --unsafe-optimizations (sp?) option)? If so, you might try not using
>> it and see if that helps with performance/rounding. It might also
>> remove the issues with the other exponentials. In my GPU work, I've
>> never had to use those functions, and everything was ok precision-wise
>> by not using the Also, if you're using CUDA < 4.0 on a Fermi GPU, I'm
>> not sure what that means when you set your "sm" option in the
>> makefile. I guess a better question would be -- are you using your
>> own makefile or the common.mk structure that the SDK examples use?
>>
>> In regards to the memcpy thing, in your original email you stated that
>> adding the extra copy (in order to do the sin/cos on the host) was
>> prohibitive and hurting performance, but now you're saying it isn't
>> hurting performance, so I'm a bit confused...
>>
>> Additionally, I would suggest looking into the questions/suggestions
>> Jim posed in his email.
>>
>> Finally, what is your plan for releasing this when you're done?
>>
>> Thanks,
>> Matt
>>
>> 2011/8/10 aftab hussain <aftab.hussain at seecs.edu.pk>:
>> > Thanks Matt,
>> > Actually I am using the non-native versions of sin/cos
>> (sinf,
>> > cosf). I also have been having issues with the results of
>> multiplication,
>> > division and square root calculations. Specifically the calculations of
>> the
>> > following format:
>> > c = a*x +b*y +d -- resulted into Fused Multiply Add operation on GPU
>> to
>> > speed it up with less accurate results.
>> > I worked around the above mentioned problems by using the slower
>> versions of
>> > the division, multiplication and square root (__fdiv_rn, __fmul_rn etc).
>> But
>> > I don't have a work around for sin/cos calculations.
>> > I am using CUDA 3.2 on Fermi (GTX480) GPU. In my implementation the
>> memory
>> > transfer from CPU to GPU and GPU to CPU is not a problem and it takes
>> quite
>> > less time.
>> > If the work around of the talk at GTC 2010 would help me, I would
>> definitely
>> > like to have a look. Can you please send me the link to the paper/Talk?
>> > Thanks for your answer again.
>> >
>> >
>> >
>> > On Tue, Aug 9, 2011 at 5:30 AM, Matt Sinclair <msinclair at wisc.edu>
>> wrote:
>> >>
>> >> Hi Aftab,
>> >>
>> >> What version of sine and cosine are you using for your GPU kernels?
>> >> Are you using the native ones? Because those are less precise than
>> >> the slower, non-native ones. So, if you're using the native ones,
>> >> even though it will hurt performance, you might try them and see if
>> >> they solve your issue. Also, there was a talk @ GTC 2010 that dealt
>> >> with the imprecision of the sin/cos functions in CUDA and how they
>> >> affected some astronomy calculations, and how they got around them. I
>> >> can send a link to it if you think that would be helpful.
>> >>
>> >> Also, what version of CUDA are you using (I'm assuming you're using
>> >> CUDA?)? If you're using 4.0+, then you might be able to look into
>> >> their overlapping memory transfers, which would alleviate some of the
>> >> performance bottlenecks you're seeing. If you're using OpenCL, are
>> >> you setting the memory transferring to be blocking or non-blocking?
>> >>
>> >> I've done quite a bit of work myself on porting the PARSEC benchmarks
>> >> to GPUs, and I thought bodytrack was a pretty tough one to easily port
>> >> (just because of how it's written, and the fact that there's so much
>> >> code), so good for you to have made this much progress! What are your
>> >> plans on releasing it eventually?
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>> >> 2011/8/9 aftab hussain <aftab.hussain at seecs.edu.pk>:
>> >> > Dear All,
>> >> > I am trying to port Bodytrack application to GP-GPUs as
>> my
>> >> > MS
>> >> > thesis. I have a working code but my tracking results are screwed.
>> >> > When I further investigated the code I found that the difference in
>> >> > sin/cos
>> >> > calculations on CPU and GPU are messing things up.
>> >> > For some particles the difference (error uptill 6th-7th decimal
>> place)
>> >> > in
>> >> > sin/cos calculations gets accumulated in later stages
>> >> > (Body Geometry calculations, projection calculations, Error term
>> >> > calculations). In the edge error term calculations I get one extra
>> >> > sample point due to which the error weight gets changed and the final
>> >> > normalized weight for that particular particle is different
>> >> > upto 4th decimal place (a lot of error). And this is in the
>> >> > Initialization
>> >> > stage of the particle filter (weight calculation).
>> >> > This in turn produces error for the next iterations because in the
>> >> > particle
>> >> > generation stage for the next iteration, a wrong particle is
>> >> > selected which further introduces error and finally the estimate for
>> a
>> >> > frame
>> >> > is very different from the CPU estimate.
>> >> > I have the following stages implemented on GPU because these are the
>> >> > most
>> >> > compute intensive stages of the application.
>> >> > 1- Body Geometry
>> >> > 2- Projection Calculation
>> >> > 3- Error Terms (Inside Error Term, Edge Error Term)
>> >> > When I move the sin/cos calculation to CPU, the improvement in
>> execution
>> >> > time I get on the GPU stages in screwed up by the particle generation
>> >> > stage because I have to arrange (copy from CPU data structure to GPU
>> >> > data
>> >> > structure plus sin/cos calculation) the data structure suitable for
>> GPU
>> >> > implementation that gives speed up in the execution. The overall
>> >> > application
>> >> > speed up is not very interesting due to this problem.
>> >> > Can any help me in this issue because my Thesis is stuck due to this
>> >> > problem.
>> >> > --
>> >> > Best Regards
>> >> >
>> >> > Aftab Hussain
>> >> > Research Assistant,
>> >> > High Performance Computing Lab,
>> >> > NUST School of Electrical Engineering and Computer Science
>> >> > +923225046338
>> >> >
>> >> > _______________________________________________
>> >> > parsec-users mailing list
>> >> > parsec-users at lists.cs.princeton.edu
>> >> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>> >> >
>> >> >
>> >> _______________________________________________
>> >> parsec-users mailing list
>> >> parsec-users at lists.cs.princeton.edu
>> >> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>> >
>> >
>> >
>> > --
>> > Best Regards
>> >
>> > Aftab Hussain
>> > Research Assistant,
>> > High Performance Computing Lab,
>> > NUST School of Electrical Engineering and Computer Science
>> > +923225046338
>> >
>> > _______________________________________________
>> > parsec-users mailing list
>> > parsec-users at lists.cs.princeton.edu
>> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>> >
>> >
>> _______________________________________________
>> parsec-users mailing list
>> parsec-users at lists.cs.princeton.edu
>> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>>
>
>
>
> --
> Best Regards
>
> Aftab Hussain
> Research Assistant,
> High Performance Computing Lab,
> NUST School of Electrical Engineering and Computer Science
> +923225046338
>
>
> _______________________________________________
> parsec-users mailing list
> parsec-users at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>
>
--
Best Regards
Aftab Hussain
Research Assistant,
High Performance Computing Lab,
NUST School of Electrical Engineering and Computer Science
+923225046338
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.cs.princeton.edu/pipermail/parsec-users/attachments/20110812/cf499311/attachment.html>
More information about the parsec-users
mailing list