Optimizing RSP performance

Optimizing RSP performance

Post by MarathonMan » Mon Dec 29, 2014 9:07 am

One of the things I did with the new RSP core is to statically cache the accumulator and the RSP flags in eight xmm registers, so the vector state never has to be reloaded from memory between instructions.
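
Roughly like this, using GCC's global register variables (a minimal sketch with made-up names and register assignments, not the real source):

Code: Select all

/* Sketch only: assumes GCC accepts xmm registers for global register
 * variables, and that every RSP vector function lives in one translation
 * unit so nothing else clobbers them. */
#include <emmintrin.h>

register __m128i acc_lo asm ("xmm8");   /* accumulator, low 16 bits/lane    */
register __m128i acc_md asm ("xmm9");   /* accumulator, middle 16 bits/lane */
register __m128i acc_hi asm ("xmm10");  /* accumulator, high 16 bits/lane   */
register __m128i vcc_lo asm ("xmm11");  /* VCC, low half  */
register __m128i vcc_hi asm ("xmm12");  /* VCC, high half */
register __m128i vco_lo asm ("xmm13");  /* VCO, low half  */
register __m128i vco_hi asm ("xmm14");  /* VCO, high half */
register __m128i vce    asm ("xmm15");  /* VCE            */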

Unfortunately, the compiler doesn't seem to do a good job of optimizing this: note the stack spills (the vmovaps/movaps stores to the stack) and the pile of movdqa copies before the ret in its output below. So I've started writing these functions in assembly by hand, and I've noticed a rise in IPC and VI/s with each function I implement this way. I'm expecting some big gains once I get to the multiplies.

With "just" intrinsics, GCC generates this (AVX) code for VCH:

Code: Select all

000000000041ccf0 <RSP_VCH>:
  41ccf0:       c5 f1 ef e0             vpxor  %xmm0,%xmm1,%xmm4
  41ccf4:       c5 e9 65 e4             vpcmpgtw %xmm4,%xmm2,%xmm4
  41ccf8:       c5 79 ef c4             vpxor  %xmm4,%xmm0,%xmm8
  41ccfc:       c5 b9 f9 dc             vpsubw %xmm4,%xmm8,%xmm3
  41cd00:       c5 e9 65 c0             vpcmpgtw %xmm0,%xmm2,%xmm0
  41cd04:       c5 f8 29 5c 24 e8       vmovaps %xmm3,-0x18(%rsp)
  41cd0a:       c5 f1 f9 db             vpsubw %xmm3,%xmm1,%xmm3
  41cd0e:       c5 61 75 ca             vpcmpeqw %xmm2,%xmm3,%xmm9
  41cd12:       c5 61 65 d2             vpcmpgtw %xmm2,%xmm3,%xmm10
  41cd16:       c5 e1 75 dc             vpcmpeqw %xmm4,%xmm3,%xmm3
  41cd1a:       c5 78 29 4c 24 d8       vmovaps %xmm9,-0x28(%rsp)
  41cd20:       c5 29 eb 4c 24 d8       vpor   -0x28(%rsp),%xmm10,%xmm9
  41cd26:       c4 63 31 4c c8 40       vpblendvb %xmm4,%xmm0,%xmm9,%xmm9
  41cd2c:       c4 41 69 75 d2          vpcmpeqw %xmm10,%xmm2,%xmm10
  41cd31:       c5 e1 db dc             vpand  %xmm4,%xmm3,%xmm3
  41cd35:       c4 43 79 4c d2 40       vpblendvb %xmm4,%xmm10,%xmm0,%xmm10
  41cd3b:       c5 e1 eb 44 24 d8       vpor   -0x28(%rsp),%xmm3,%xmm0
  41cd41:       c5 f9 75 d2             vpcmpeqw %xmm2,%xmm0,%xmm2
  41cd45:       c4 c3 31 4c c2 40       vpblendvb %xmm4,%xmm10,%xmm9,%xmm0
  41cd4b:       c4 e3 71 4c 44 24 e8    vpblendvb %xmm0,-0x18(%rsp),%xmm1,%xmm0
  41cd52:       00 
  41cd53:       66 45 0f 6f e1          movdqa %xmm9,%xmm12
  41cd58:       66 45 0f 6f da          movdqa %xmm10,%xmm11
  41cd5d:       66 44 0f 6f f2          movdqa %xmm2,%xmm14
  41cd62:       66 44 0f 6f ec          movdqa %xmm4,%xmm13
  41cd67:       66 44 0f 6f fb          movdqa %xmm3,%xmm15
  41cd6c:       66 0f 6f e8             movdqa %xmm0,%xmm5
  41cd70:       c3                      retq
Hand-crafted AVX:

Code: Select all

0000000000421034 <RSP_VCH>:
  421034:       c5 71 ef e8             vpxor  %xmm0,%xmm1,%xmm13
  421038:       66 41 0f 71 e5 0f       psraw  $0xf,%xmm13
  42103e:       c5 91 ef d8             vpxor  %xmm0,%xmm13,%xmm3
  421042:       66 41 0f f9 dd          psubw  %xmm13,%xmm3
  421047:       c5 f1 f9 e3             vpsubw %xmm3,%xmm1,%xmm4
  42104b:       66 0f 71 e0 0f          psraw  $0xf,%xmm0
  421050:       c5 e9 75 ec             vpcmpeqw %xmm4,%xmm2,%xmm5
  421054:       c4 41 59 75 fd          vpcmpeqw %xmm13,%xmm4,%xmm15
  421059:       66 45 0f db fd          pand   %xmm13,%xmm15
  42105e:       c5 01 eb f5             vpor   %xmm5,%xmm15,%xmm14
  421062:       66 44 0f 75 f2          pcmpeqw %xmm2,%xmm14
  421067:       66 0f 65 e2             pcmpgtw %xmm2,%xmm4
  42106b:       66 0f eb ec             por    %xmm4,%xmm5
  42106f:       c4 63 51 4c e0 d0       vpblendvb %xmm13,%xmm0,%xmm5,%xmm12
  421075:       66 0f 75 e2             pcmpeqw %xmm2,%xmm4
  421079:       c4 63 79 4c dc d0       vpblendvb %xmm13,%xmm4,%xmm0,%xmm11
  42107f:       c4 c3 19 4c d3 d0       vpblendvb %xmm13,%xmm11,%xmm12,%xmm2
  421085:       c4 e3 71 4c c3 20       vpblendvb %xmm2,%xmm3,%xmm1,%xmm0
  42108b:       66 0f 6f e8             movdqa %xmm0,%xmm5
  42108f:       c3                      retq
Compiler-generated SSE4.1:

Code: Select all

000000000041c110 <RSP_VCH>:
  41c110:       66 0f 6f d9             movdqa %xmm1,%xmm3
  41c114:       66 44 0f 6f c2          movdqa %xmm2,%xmm8
  41c119:       66 44 0f 6f d2          movdqa %xmm2,%xmm10
  41c11e:       66 0f ef d8             pxor   %xmm0,%xmm3
  41c122:       66 44 0f 65 d0          pcmpgtw %xmm0,%xmm10
  41c127:       66 44 0f 65 c3          pcmpgtw %xmm3,%xmm8
  41c12c:       66 0f 6f d8             movdqa %xmm0,%xmm3
  41c130:       66 41 0f ef d8          pxor   %xmm8,%xmm3
  41c135:       66 0f 6f e3             movdqa %xmm3,%xmm4
  41c139:       66 0f 6f d9             movdqa %xmm1,%xmm3
  41c13d:       66 41 0f f9 e0          psubw  %xmm8,%xmm4
  41c142:       0f 29 64 24 d8          movaps %xmm4,-0x28(%rsp)
  41c147:       66 0f f9 dc             psubw  %xmm4,%xmm3
  41c14b:       66 44 0f 6f cb          movdqa %xmm3,%xmm9
  41c150:       66 41 0f 6f e2          movdqa %xmm10,%xmm4
  41c155:       66 44 0f 6f d3          movdqa %xmm3,%xmm10
  41c15a:       66 44 0f 75 ca          pcmpeqw %xmm2,%xmm9
  41c15f:       66 44 0f 65 d2          pcmpgtw %xmm2,%xmm10
  41c164:       66 41 0f 6f c1          movdqa %xmm9,%xmm0
  41c169:       44 0f 29 4c 24 e8       movaps %xmm9,-0x18(%rsp)
  41c16f:       66 41 0f eb c2          por    %xmm10,%xmm0
  41c174:       66 41 0f 75 d8          pcmpeqw %xmm8,%xmm3
  41c179:       66 44 0f 75 d2          pcmpeqw %xmm2,%xmm10
  41c17e:       66 41 0f db d8          pand   %xmm8,%xmm3
  41c183:       66 44 0f 6f c8          movdqa %xmm0,%xmm9
  41c188:       66 41 0f 6f c0          movdqa %xmm8,%xmm0
  41c18d:       66 44 0f 38 10 cc       pblendvb %xmm0,%xmm4,%xmm9
  41c193:       66 41 0f 38 10 e2       pblendvb %xmm0,%xmm10,%xmm4
  41c199:       66 44 0f 6f 54 24 e8    movdqa -0x18(%rsp),%xmm10
  41c1a0:       66 44 0f eb d3          por    %xmm3,%xmm10
  41c1a5:       66 41 0f 75 d2          pcmpeqw %xmm10,%xmm2
  41c1aa:       66 45 0f 6f d1          movdqa %xmm9,%xmm10
  41c1af:       66 44 0f 38 10 d4       pblendvb %xmm0,%xmm4,%xmm10
  41c1b5:       66 41 0f 6f c2          movdqa %xmm10,%xmm0
  41c1ba:       66 0f 38 10 4c 24 d8    pblendvb %xmm0,-0x28(%rsp),%xmm1
  41c1c1:       66 0f 6f c1             movdqa %xmm1,%xmm0
  41c1c5:       66 45 0f 6f e1          movdqa %xmm9,%xmm12
  41c1ca:       66 44 0f 6f dc          movdqa %xmm4,%xmm11
  41c1cf:       66 44 0f 6f f2          movdqa %xmm2,%xmm14
  41c1d4:       66 45 0f 6f e8          movdqa %xmm8,%xmm13
  41c1d9:       66 44 0f 6f fb          movdqa %xmm3,%xmm15
  41c1de:       66 0f 6f e9             movdqa %xmm1,%xmm5
  41c1e2:       c3                      retq

Hand-crafted SSE4.1:

Code: Select all

00000000004206f4 <RSP_VCH>:
  4206f4:       66 0f 6f e9             movdqa %xmm1,%xmm5
  4206f8:       66 44 0f 6f d8          movdqa %xmm0,%xmm11
  4206fd:       66 0f 6f d8             movdqa %xmm0,%xmm3
  420701:       66 0f ef c1             pxor   %xmm1,%xmm0
  420705:       66 0f 71 e0 0f          psraw  $0xf,%xmm0
  42070a:       66 0f ef d8             pxor   %xmm0,%xmm3
  42070e:       66 0f f9 d8             psubw  %xmm0,%xmm3
  420712:       66 0f f9 cb             psubw  %xmm3,%xmm1
  420716:       66 45 0f ef e4          pxor   %xmm12,%xmm12
  42071b:       66 41 0f 71 e3 0f       psraw  $0xf,%xmm11
  420721:       66 44 0f 75 e1          pcmpeqw %xmm1,%xmm12
  420726:       66 44 0f 6f f8          movdqa %xmm0,%xmm15
  42072b:       66 44 0f 75 f9          pcmpeqw %xmm1,%xmm15
  420730:       66 44 0f db f9          pand   %xmm1,%xmm15
  420735:       66 45 0f 6f f7          movdqa %xmm15,%xmm14
  42073a:       66 45 0f eb f4          por    %xmm12,%xmm14
  42073f:       66 44 0f 75 f2          pcmpeqw %xmm2,%xmm14
  420744:       66 0f 65 ca             pcmpgtw %xmm2,%xmm1
  420748:       66 44 0f eb e1          por    %xmm1,%xmm12
  42074d:       66 45 0f 38 10 e3       pblendvb %xmm0,%xmm11,%xmm12
  420753:       66 0f 75 ca             pcmpeqw %xmm2,%xmm1
  420757:       66 41 0f 6f d4          movdqa %xmm12,%xmm2
  42075c:       66 44 0f 38 10 d9       pblendvb %xmm0,%xmm1,%xmm11
  420762:       66 44 0f 6f e8          movdqa %xmm0,%xmm13
  420767:       66 41 0f 38 10 d3       pblendvb %xmm0,%xmm11,%xmm2
  42076d:       66 0f 6f c2             movdqa %xmm2,%xmm0
  420771:       66 0f 38 10 eb          pblendvb %xmm0,%xmm3,%xmm5
  420776:       66 0f 6f c5             movdqa %xmm5,%xmm0
  42077a:       c3                      retq
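
One difference you can see right at the top of the listings: the compiler tests the sign of (vs ^ vt) with a pcmpgtw against the zero register, while the hand-written code smears the sign bit across each lane with psraw $15 instead. Both produce the same mask, and the shift form doesn't need the zero register for the test. In intrinsics (a minimal sketch, not the actual source):

Code: Select all

#include <emmintrin.h>

/* Both forms build a mask that is 0xFFFF in lanes where (vs ^ vt) is
 * negative, i.e. where vs and vt have opposite signs, and 0x0000 elsewhere. */
static inline __m128i sign_mask_cmp(__m128i vs, __m128i vt, __m128i zero) {
  /* What the compiler emits: pcmpgtw, zero > (vs ^ vt). */
  return _mm_cmpgt_epi16(zero, _mm_xor_si128(vs, vt));
}

static inline __m128i sign_mask_shift(__m128i vs, __m128i vt) {
  /* The hand-written form: psraw $15 replicates the sign bit. */
  return _mm_srai_epi16(_mm_xor_si128(vs, vt), 15);
}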

Re: Optimizing RSP performance

Post by OldGnashburg » Mon Dec 29, 2014 5:09 pm

Very nice! What kind of performance boost are we talking about here? Can this be done elsewhere?
Gnash, Gnash, Gnash...

Re: Optimizing RSP performance

Post by MarathonMan » Tue Dec 30, 2014 12:40 pm

For now I'm just using a prelude to load the operands and sidestep the calling convention; not sure why I didn't think of that before:

Code: Select all

movdqa (%r8), %xmm0
movdqa (%r9), %xmm1
pxor %xmm2, %xmm2
I wrapped it in a __VECTORCALL__ macro so the loads can be dropped once __vectorcall support makes it into mainstream GCC/Clang.
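
Something like this (a rough sketch of the guard; HAVE_VECTORCALL and the exact layout are made up, and it assumes the .S files are run through the C preprocessor):

Code: Select all

#ifdef HAVE_VECTORCALL
/* __vectorcall delivers the operands in registers; no prelude needed. */
#define __VECTORCALL__
#else
/* Until then, load the operands and materialize a zero by hand. */
#define __VECTORCALL__ \
  movdqa (%r8), %xmm0; \
  movdqa (%r9), %xmm1; \
  pxor %xmm2, %xmm2;
#endif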

These changes yield big performance improvements. Many VI/s. No, it can't be done elsewhere.

Re: Optimizing RSP performance

Post by Nacho » Tue Dec 30, 2014 3:00 pm

Ohh, that's why you were so angry in the other thread. Now I get it :mrgreen:
Testing CEN64 on: Intel Core i5 520M 2.4 GHz. SSE2 SSE3 SSE4.1 SSE4.2 SSSE3, but no AVX. Ubuntu Linux
