[parsec-users] Porting Bodytrack on GP-GPUs -- Problems and Issues

aftab hussain aftab.hussain at seecs.edu.pk
Wed Aug 17 05:32:00 EDT 2011


Dear Jim,

Thanks for your help and detailed explanation. I am in the process of
writing the discussed code using SSE intrinsics, though I am wrestling
with the data structures because of certain dependencies in the code.
After that I plan to use OpenMP instead of pthreads, and I hope to get
an acceptable speedup.

Thanks once again.

2011/8/13 Jim Dempsey <jim at quickthreadprogramming.com>

>
>    Take a look at:
> http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/win/intref_cls/common/intref_bk_trig_ops.htm
>
> This lists the intrinsics for trigonometric operations. This is for the
> Intel C++ compiler, but MSVC and GCC provide most of the same intrinsics.
>
>     __m128d == double | double                    two doubles at a time
>     __m128  == float | float | float | float      four floats at a time
>     __m256d == double | double | double | double  four doubles at a time
>     __m256  == float | float | float | float | float | float | float | float  eight floats at a time
>
> *** If you have an Intel Core i7 2600(K) and an OS that supports AVX (Windows
> 7 SP1; Linux, depending on kernel version), and your compiler supports the
> AVX intrinsics, then you can use __m256d and __m256.
>
>
> extern __m128d _mm_sincos_pd(__m128d *p_cos, __m128d v1);    // 2 doubles
>  extern __m128 _mm_sincos_ps(__m128 *p_cos, __m128 v1);    // 4 floats
>
> The iteration count nSSEvectors would vary depending on the number of doubles
> or floats held in the chosen small vector.
>
>    __m128 has 4 floats
>     __m128d has 2 doubles
>     __m256d has 4 doubles (only with AVX support)
>     __m256 has 8 floats (only with AVX support)
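>
> (A minimal sketch of picking the double-precision vector width at compile
> time; this assumes a GCC/ICC-style toolchain where __AVX__ is predefined
> when AVX code generation is enabled:)
>
>     // #include <immintrin.h>
>     #ifdef __AVX__
>       typedef __m256d vec_pd;          // 4 doubles per vector (AVX)
>       const int VEC_PD_WIDTH = 4;
>     #else
>       typedef __m128d vec_pd;          // 2 doubles per vector (SSE2)
>       const int VEC_PD_WIDTH = 2;
>     #endif
>     // e.g. nSSEvectors = (no_particles * 3) / VEC_PD_WIDTH (pad the stream accordingly)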
>
> Most C++ compilers generate SSE now, and if the data is laid out as a
> stream, then the optimizer will convert the loop to 2-up or 4-up (or 4-up or
> 8-up if AVX is supported).
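>
> (For illustration, a loop of roughly this shape over contiguous float
> arrays is the kind the optimizer can auto-vectorize, assuming a compiler
> and math library that vectorize sinf/cosf, e.g. the Intel compiler with
> SVML; the array names are placeholders:)
>
>     // x, s, c point to contiguous, 16-byte aligned float arrays of length n
>     for (int i = 0; i < n; ++i) {
>         s[i] = sinf(x[i]);   // candidate for 4-up SSE / 8-up AVX code generation
>         c[i] = cosf(x[i]);
>     }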
>
> Intel has other architectures coming down the line that support 512-bit and
> perhaps 1024-bit small vectors. These might not show up in the mainstream of
> computing.
>
> As for "my supervisor is more interested in finding a way to correct
> the sin/cos calculations on the GPU":
>
> While this is laudable, it may not be practical for you to do this. The
> proper individuals to do this are the people providing the tools and library
> kernels. For you to do this is not only a duplication of effort but a
> misuse of your time.
>
> 1) I assume you are already a proficient x86 C++ programmer
> 2) Your GPGPU does not run standalone (i.e. you are using an x86-based O/S)
> 3) Your application is already hybrid (parts on x86, parts on the GPGPU)
> 4) None of 1-3 are going to change while you write your thesis
> 5) Efficiently integrating the GPGPU app with SSE and parallel programming
> techniques is an effective approach
>
> Jim Dempsey
>
>
>
>  ------------------------------
> *From:* parsec-users-bounces at lists.cs.princeton.edu [mailto:
> parsec-users-bounces at lists.cs.princeton.edu] *On Behalf Of *aftab hussain
> *Sent:* Friday, August 12, 2011 7:40 AM
>
> *To:* PARSEC Users
> *Subject:* Re: [parsec-users] Porting Bodytrack on GP-GPUs --
> Problems and Issues
>
> Thanks Jim for such a detailed solution.
>
> Before trying out the SSE implementation, my supervisor is more interested
> in finding a way to correct the sin/cos calculations on the GPU; if that
> does not work out, your solution will be the best option and I hope it gets
> rid of the performance bottleneck.
>
> Just one more question: can I use *__m128* to handle floats, and if so,
> what should *nSSEvectors* be? Would it be (no_particles * 3)/4?
>
> 2011/8/11 Jim Dempsey <jim at quickthreadprogramming.com>
>
>>
>> Try this as a first step:
>>
>> Use sincos(x, &sin_x, &cos_x);
>>
>>  for (i = 0; i < no_particles; i++) {
>>    sincos(mNewParticles[i].at(6), &cu_particles[cnt], &cu_particles[cnt+1]); cnt += 2;
>>    sincos(mNewParticles[i].at(7), &cu_particles[cnt], &cu_particles[cnt+1]); cnt += 2;
>>    sincos(mNewParticles[i].at(8), &cu_particles[cnt], &cu_particles[cnt+1]); cnt += 2;
>>  }
>>
>> The second step:
>>
>> Make the data layout friendly for SSE vectorization.
>>
>> Note that your transport of data into cu_... is organized as a stream of
>> the sin/cos pairs of the particles and NOT as a stream of the entire
>> particle data (mNewParticles). I recommend that, for the above routine (which
>> produces sin/cos pairs from the input data), you change the CPU-side data
>> format to a stream of doubles containing only the at(6), at(7), at(8) of each
>> particle. In other words, be as cognizant (fair play) about the streamability
>> of the data on the CPU side as you are on the GPGPU side.
>>
>> Have the CPU buffer that receives the stream of doubles from the GPGPU be
>> aligned on a 16-byte address.
>> Rework the output from the CPU code into two streams: one for a stream of
>> sines and one for a stream of cosines.
>> These too should be 16-byte aligned.
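>>
>> (As a rough sketch of that layout, assuming _mm_malloc/_mm_free from the
>> intrinsics headers are acceptable for the aligned allocations; the buffer
>> names are placeholders, and the streams then feed the doSinCos routine
>> below:)
>>
>>     #include <xmmintrin.h>   // _mm_malloc / _mm_free
>>
>>     // 3 angles per particle, packed contiguously, 16-byte aligned
>>     size_t nAngles = (size_t)no_particles * 3;
>>     double* angles  = (double*)_mm_malloc(nAngles * sizeof(double), 16);
>>     double* sines   = (double*)_mm_malloc(nAngles * sizeof(double), 16);
>>     double* cosines = (double*)_mm_malloc(nAngles * sizeof(double), 16);
>>
>>     size_t k = 0;
>>     for (int i = 0; i < no_particles; i++) {
>>         angles[k++] = mNewParticles[i].at(6);
>>         angles[k++] = mNewParticles[i].at(7);
>>         angles[k++] = mNewParticles[i].at(8);
>>     }
>>     // ... later: _mm_free(angles); _mm_free(sines); _mm_free(cosines);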
>>
>> Then use:
>>    extern __m128d _mm_sincos_pd(__m128d *p_cos, __m128d v1);    // most
>> CPUs support this
>>   extern __m256d _mm256_sincos_pd(__m256d *p_cos, __m256d v1);  // CPU
>> with AVX (SandyBridge)
>>
>> void doSinCos(__m128d* p_X, __m128d* p_sin, __m128d* p_cos, int nSSEvectors)
>> {
>>     // nSSEvectors is (no_particles * 3) / 2    // non-AVX
>>     // You handle the remaining odd data value (or pad to an even count with 0.0)
>>     for(int i = 0; i < nSSEvectors; ++i)
>>         p_sin[i] = _mm_sincos_pd(&p_cos[i], p_X[i]);   // v1 is passed by value
>> }
>>
>> The work to this point is to improve your serial CPU-side code to be as
>> (close to) good as possible.
>> I will let you rework the above in the event you have a CPU with AVX (it
>> handles 4-up doubles in small vectors).
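>>
>> (For reference, the AVX rework might look something like this, assuming
>> the compiler provides _mm256_sincos_pd as declared above and the buffers
>> are 32-byte aligned:)
>>
>> void doSinCosAVX(__m256d* p_X, __m256d* p_sin, __m256d* p_cos, int nAVXvectors)
>> {
>>     // nAVXvectors is (no_particles * 3) / 4    // AVX: 4 doubles per vector
>>     // pad the stream to a multiple of 4 with 0.0, or handle the tail separately
>>     for(int i = 0; i < nAVXvectors; ++i)
>>         p_sin[i] = _mm256_sincos_pd(&p_cos[i], p_X[i]);
>> }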
>>
>> The next step would be to see if you can multi-thread the CPU side. For
>> this I would suggest using OpenMP over pthreads (easier to do)
>>
>> void doSinCos(__m128d* p_X, __m128d* p_sin, __m128d* p_cos, int nSSEvectors)
>> {
>>     // nSSEvectors is (no_particles * 3) / 2    // non-AVX
>>     // You handle the remaining odd data value (or pad to an even count with 0.0)
>>     #pragma omp parallel for
>>     for(int i = 0; i < nSSEvectors; ++i)
>>         p_sin[i] = _mm_sincos_pd(&p_cos[i], p_X[i]);
>> }
>>
>> The principal optimization strategy above is: Vector-Inner, Parallel-Outer.
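>>
>> (A possible call site, assuming the 16-byte aligned angle/sine/cosine
>> streams sketched earlier and the doSinCos above; any odd tail element,
>> if the stream is not padded, can be handled with plain sincos:)
>>
>>     int nSSEvectors = (no_particles * 3) / 2;    // 2 doubles per __m128d
>>     doSinCos((__m128d*)angles, (__m128d*)sines, (__m128d*)cosines, nSSEvectors);
>>     if ((no_particles * 3) & 1) {                // odd element count
>>         int last = no_particles * 3 - 1;
>>         sincos(angles[last], &sines[last], &cosines[last]);
>>     }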
>>
>> Jim Dempsey
>>
>>
>>  ------------------------------
>>  *From:* parsec-users-bounces at lists.cs.princeton.edu [mailto:
>> parsec-users-bounces at lists.cs.princeton.edu] *On Behalf Of *aftab hussain
>> *Sent:* Thursday, August 11, 2011 12:51 AM
>> *To:* PARSEC Users
>> *Subject:* Re: [parsec-users] Porting Bodytrack on GP-GPUs -- Problems
>> and Issues
>>
>>   Hi Matt,
>>              Thanks for the link. Sorry for the confusion about the memcpy
>> on the host. It is true, though, that when I do the memcpy on the host from
>> the CPU data structure to the GPU data structure, there is a performance
>> bottleneck. I think the following code will help explain it.
>>
>> for (i = 0; i < no_particles; i++) {
>>   cu_particles[cnt] = sin(mNewParticles[i].at(6)); cnt++;
>>   cu_particles[cnt] = cos(mNewParticles[i].at(6)); cnt++;
>>   cu_particles[cnt] = sin(mNewParticles[i].at(7)); cnt++;
>>   cu_particles[cnt] = cos(mNewParticles[i].at(7)); cnt++;
>>   cu_particles[cnt] = sin(mNewParticles[i].at(8)); cnt++;
>>   cu_particles[cnt] = cos(mNewParticles[i].at(8)); cnt++;
>> }
>>
>> The above code is used for copying one limb. I know that there are cache
>> issues with it because the 6th, 7th and 8th elements are taken from
>> successive mNewParticles vectors, not the same mNewParticles vector. The
>> same procedure is repeated for all the limbs (10 limbs in the bodytrack
>> code). But I don't have any other idea for doing this except to use
>> pthreads for the above calculations and somehow try to avoid the cache
>> issues. I have not tried the pthreads idea yet, but I am planning to do so
>> if I don't have any success with the sin/cos precision issue on the GPU.
>>
>> What I meant is that the copy from CPU to GPU (cudaMemcpy) does not have
>> any performance issue once I have the GPU data structure ready from the
>> memory copying in the above code.
>>
>> Regarding the optimization options to the compiler, I am not using
>> --use-fast-math or --unsafe-optimizations. I have tried compiling the code
>> with both sm_20 and the default (sm_10). Both have the precision issues I
>> discussed. Compiling with sm_20 does support double precision on Fermi
>> (GTX 480).
>>
>> Now regarding Jim's suggestions (thanks, Jim): I have been thinking about
>> them and about overlapping communication and computation. Bodytrack is a
>> data-parallel application, and the code I have written on the GPU is based
>> on this fact. The pthreads and OpenMP versions of Bodytrack are also data
>> parallel.
>>
>> I did not understand the following point from Jim:
>>
>>
>> ---------------------------------------------------------------------------------------------------------------
>>  1) Move the observation point from where you measure the arc angle
>> between particles such that the angle differences tend to be large.
>> 2) When performing calculations containing both large numbers and small
>> numbers see if you can apply a bias to the large number(s) such that it
>> becomes small-ish. This avoids drop-off of precision bits in the binary
>> mantissa. Example
>>
>>       result = func(large, small)
>>
>> becomes
>>
>>      result = func(large-bias, small) + bias
>>
>> -----------------------------------------------------------------------------------------------------------------
>>
>> Actually, I don't calculate arc angles between the particles (as mentioned
>> in point 1, or at least that is what I understood from it). What actually
>> happens is this: suppose I have the following 31 values, all in radians
>> except the ones in bold.
>>
>> *0.0915466, 0.0118652, 0.957689, 134.693, 940.095, 867.287*, -0.106135,
>> -0.1814, -0.138087, -0.0927003, -0.1666, 0.196366, 2.72613, 0.0793269,
>> 0.361596, -0.141301, 1.84194, 0.311537, 1.38892, 0.853814, -0.331812,
>> -0.512472, 0.857448, 1.72155, -0.725819, 0.459579, 0.0942842, -0.0892392,
>> 0.143076, -0.0309185, 0.383283
>>
>> I have *N* such arrays of 31 elements. I will call one array of 31
>> elements one particle.
>>
>> The first six values correspond to the first limb (Torso). I calculate the
>> *sin* and *cos* of the first three values and form rotation matrices about
>> the x, y and z dimensions respectively. I do the same for the other values
>> in all the *N* particles except the bold ones.
>>
>> Now I don't understand how I would use Jim's suggestion about the bias in
>> the above scenario.
>>
>> Thanks and sorry for such a long email.
>>
>> On Wed, Aug 10, 2011 at 12:43 PM, Matt Sinclair <msinclair at wisc.edu> wrote:
>>
>>> Hi Aftab,
>>>
>>> Here's the link to the video I referenced from GTC.  Like I said, I'm
>>> not sure if it will be of direct help or not though, but he also had
>>> the same/similar problem:
>>>
>>> http://nvidia.fullviewmedia.com/gtc2010/0922-guadalupe-2082.html
>>>
>>> I believe that there is a way to tell the compiler not to use as many
>>> FMAs.  Are you using the --use-fast-math compiler option (or the
>>> --unsafe-optimizations (sp?) option)?  If so, you might try not using
>>> it and see if that helps with performance/rounding.  It might also
>>> remove the issues with the other exponentials.  In my GPU work, I've
>>> never had to use those functions, and everything was ok precision-wise
>>> by not using them.  Also, if you're using CUDA < 4.0 on a Fermi GPU, I'm
>>> not sure what that means when you set your "sm" option in the
>>> makefile.  I guess a better question would be -- are you using your
>>> own makefile or the common.mk structure that the SDK examples use?
>>>
>>> In regards to the memcpy thing, in your original email you stated that
>>> adding the extra copy (in order to do the sin/cos on the host) was
>>> prohibitive and hurting performance, but now you're saying it isn't
>>> hurting performance, so I'm a bit confused...
>>>
>>> Additionally, I would suggest looking into the questions/suggestions
>>> Jim posed in his email.
>>>
>>> Finally, what is your plan for releasing this when you're done?
>>>
>>> Thanks,
>>> Matt
>>>
>>> 2011/8/10 aftab hussain <aftab.hussain at seecs.edu.pk>:
>>> > Thanks Matt,
>>> >             Actually I am using the non-native versions of sin/cos
>>> > (sinf, cosf). I have also been having issues with the results of
>>> > multiplication, division and square root calculations, specifically
>>> > calculations of the following format:
>>> > c = a*x + b*y + d   -- which resulted in a fused multiply-add operation
>>> > on the GPU to speed it up, with less accurate results.
>>> > I worked around the above problems by using the slower versions of
>>> > division, multiplication and square root (__fdiv_rn, __fmul_rn, etc.).
>>> > But I don't have a workaround for the sin/cos calculations.
>>> > I am using CUDA 3.2 on a Fermi (GTX 480) GPU. In my implementation the
>>> > memory transfer from CPU to GPU and GPU to CPU is not a problem and it
>>> > takes very little time.
>>> > If the workaround from the talk at GTC 2010 would help me, I would
>>> > definitely like to have a look. Can you please send me the link to the
>>> > paper/talk?
>>> > Thanks for your answer again.
>>> >
>>> >
>>> >
>>> > On Tue, Aug 9, 2011 at 5:30 AM, Matt Sinclair <msinclair at wisc.edu> wrote:
>>> >>
>>> >> Hi Aftab,
>>> >>
>>> >> What version of sine and cosine are you using for your GPU kernels?
>>> >> Are you using the native ones?  Those are less precise than the
>>> >> slower, non-native ones.  So, if you're using the native ones, you
>>> >> might try the non-native ones, even though it will hurt performance,
>>> >> and see if they solve your issue.  Also, there was a talk @ GTC 2010 that dealt
>>> >> with the imprecision of the sin/cos functions in CUDA and how they
>>> >> affected some astronomy calculations, and how they got around them.  I
>>> >> can send a link to it if you think that would be helpful.
>>> >>
>>> >> Also, what version of CUDA are you using (I'm assuming you're using
>>> >> CUDA?)?  If you're using 4.0+, then you might be able to look into
>>> >> their overlapping memory transfers, which would alleviate some of the
>>> >> performance bottlenecks you're seeing.  If you're using OpenCL, are
>>> >> you setting the memory transferring to be blocking or non-blocking?
>>> >>
>>> >> I've done quite a bit of work myself on porting the PARSEC benchmarks
>>> >> to GPUs, and I thought bodytrack was a pretty tough one to easily port
>>> >> (just because of how it's written, and the fact that there's so much
>>> >> code), so good for you to have made this much progress!  What are your
>>> >> plans on releasing it eventually?
>>> >>
>>> >> Thanks,
>>> >> Matt
>>> >>
>>> >> 2011/8/9 aftab hussain <aftab.hussain at seecs.edu.pk>:
>>> >> > Dear All,
>>> >> >              I am trying to port the Bodytrack application to GP-GPUs
>>> >> > as my MS thesis. I have working code, but my tracking results are
>>> >> > screwed up.
>>> >> > When I investigated the code further, I found that differences in the
>>> >> > sin/cos calculations on the CPU and GPU are messing things up.
>>> >> > For some particles, the difference (error up to the 6th-7th decimal
>>> >> > place) in the sin/cos calculations gets accumulated in later stages
>>> >> > (body geometry calculations, projection calculations, error term
>>> >> > calculations). In the edge error term calculation I get one extra
>>> >> > sample point, due to which the error weight changes and the final
>>> >> > normalized weight for that particular particle differs up to the 4th
>>> >> > decimal place (a lot of error). And this is in the initialization
>>> >> > stage of the particle filter (weight calculation).
>>> >> > This in turn produces error in the next iterations, because in the
>>> >> > particle generation stage for the next iteration a wrong particle is
>>> >> > selected, which introduces further error, and finally the estimate
>>> >> > for a frame is very different from the CPU estimate.
>>> >> > I have the following stages implemented on the GPU because they are
>>> >> > the most compute-intensive stages of the application:
>>> >> > 1- Body Geometry
>>> >> > 2- Projection Calculation
>>> >> > 3- Error Terms (Inside Error Term, Edge Error Term)
>>> >> > When I move the sin/cos calculation to the CPU, the improvement in
>>> >> > execution time I get in the GPU stages is screwed up by the particle
>>> >> > generation stage, because I have to arrange the data structure
>>> >> > (copying from the CPU data structure to the GPU data structure, plus
>>> >> > the sin/cos calculation) into a form suitable for the GPU
>>> >> > implementation that gives a speedup in execution. The overall
>>> >> > application speedup is not very interesting due to this problem.
>>> >> > Can anyone help me with this issue? My thesis is stuck because of
>>> >> > this problem.
>>> >> > --
>>> >> > Best Regards
>>> >> >
>>> >> > Aftab Hussain
>>> >> > Research Assistant,
>>> >> > High Performance Computing Lab,
>>> >> > NUST School of Electrical Engineering and Computer Science
>>> >> > +923225046338
>>> >> >
>>> >> > _______________________________________________
>>> >> > parsec-users mailing list
>>> >> > parsec-users at lists.cs.princeton.edu
>>> >> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>>> >> >
>>> >> >
>>> >> _______________________________________________
>>> >> parsec-users mailing list
>>> >> parsec-users at lists.cs.princeton.edu
>>> >> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards
>>> >
>>> > Aftab Hussain
>>> > Research Assistant,
>>> > High Performance Computing Lab,
>>> > NUST School of Electrical Engineering and Computer Science
>>> > +923225046338
>>> >
>>> > _______________________________________________
>>> > parsec-users mailing list
>>> > parsec-users at lists.cs.princeton.edu
>>> > https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>>> >
>>> >
>>> _______________________________________________
>>> parsec-users mailing list
>>> parsec-users at lists.cs.princeton.edu
>>> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Aftab Hussain
>> Research Assistant,
>> High Performance Computing Lab,
>> NUST School of Electrical Engineering and Computer Science
>> +923225046338
>>
>>
>> _______________________________________________
>> parsec-users mailing list
>> parsec-users at lists.cs.princeton.edu
>> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>>
>>
>
>
> --
> Best Regards
>
> Aftab Hussain
> Research Assistant,
> High Performance Computing Lab,
> NUST School of Electrical Engineering and Computer Science
> +923225046338
>
>
> _______________________________________________
> parsec-users mailing list
> parsec-users at lists.cs.princeton.edu
> https://lists.cs.princeton.edu/mailman/listinfo/parsec-users
>
>


-- 
Best Regards

Aftab Hussain
Research Assistant,
High Performance Computing Lab,
NUST School of Electrical Engineering and Computer Science
+923225046338

