Finally: SSE2-only builds

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Post by MarathonMan » Mon Nov 10, 2014 3:35 pm

Just committed something that's bound to make a few people rejoice:
http://git.cen64.com/?p=cen64.git;a=com ... 46298c6a1f

If you want full optimizations (SSSE3, SSE4, AVX, etc.), then be sure to select "NATIVE_BUILD" in CMake. Note that this option is missing entirely when configuring with MSVC at the moment.

I will also add options to build portable binaries with SSSE3, SSE4, and AVX support soon. For now, if you need a portable binary, don't build with native optimizations (or build with MSVC)... the SSE2 codepaths will be used and thus it will be compatible with all x86_64 processors.

Haven't tested this out on Windows yet either, but if it doesn't build it'll just be a few turns of the wrench away from working.

EDIT: Also, full-on RSP support still isn't present, so be surprised that most commercial ROMs still don't boot.

iwasaperson
Posts: 49
Joined: Tue Apr 22, 2014 12:50 am

Re: Finally: SSE2-only builds

Post by iwasaperson » Mon Nov 10, 2014 5:43 pm

MarathonMan wrote:EDIT: Also, full-on RSP support still isn't present, so be surprised that most commercial ROMs still don't boot.
/s I am very surprised that commercial ROMs don't boot without a fully functioning RSP.

Anyway, great news for those with only SSE2 support, although I don't think CEN64 will come close to full speed on those CPUs.
Last edited by iwasaperson on Tue Nov 11, 2014 2:25 pm, edited 1 time in total.

ShadowFX
Posts: 86
Joined: Sat Oct 05, 2013 2:08 am
Location: The Netherlands

Re: Finally: SSE2-only builds

Post by ShadowFX » Tue Nov 11, 2014 4:28 am

Great news indeed!
I think I'll wait until the RSP is emulated enough so it can boot quite a lot of commercial games and start testing it. Of course, portions of the RDP would be a bonus if added.

MarathonMan, is it correct that you've been mostly working on the RCP and optimizations?
If so, that leaves the RSP, RDRAM, MemPak/save support and (real) controller support, correct?
"Change is inevitable; progress is optional"

OS: Windows 10 Pro x64
Specs: Intel Core i7-7700K @ 4.2GHz, 16GB DDR4-RAM, NVIDIA GeForce GTX 1080 Ti
Main build: AVX (official)

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Tue Nov 11, 2014 10:33 am

ShadowFX wrote:MarathonMan, is it correct that you've been mostly working on the RCP and optimizations?
Mostly a combination of VR4300 and RSP, yes. Those two and the RDP will always be the 3 biggest time sinks as far as emulation is concerned.

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Wed Nov 12, 2014 12:16 am

Updated the CMake generator file to allow users to select SSE2/SSE3/SSSE3/SSE4.1/AVX (and Native with GCC/Clang).

As long as you don't select Native, you get a portable binary. Again, you must use SSE2 to be compatible with all x86_64 CPUs.
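
To make the portable-versus-native distinction a bit more concrete, here is a small C sketch that reports the highest SSE/AVX level a binary was compiled for. It relies on the macros that GCC and Clang predefine (`__SSE2__`, `__SSE3__`, `__SSSE3__`, `__SSE4_1__`, `__AVX__`); the function names are made up for this example and are not part of CEN64, and MSVC does not define most of these macros, so the result there is not meaningful.

```c
#include <stdio.h>

/* Hypothetical helper (not from CEN64): names the highest vector ISA
 * level enabled at compile time, based on GCC/Clang predefined macros. */
const char *compiled_isa_level(void) {
#if defined(__AVX__)
  return "AVX";
#elif defined(__SSE4_1__)
  return "SSE4.1";
#elif defined(__SSSE3__)
  return "SSSE3";
#elif defined(__SSE3__)
  return "SSE3";
#elif defined(__SSE2__)
  return "SSE2";
#else
  return "unknown";
#endif
}

/* Prints the level; call this from main() to see what a build targets. */
void print_isa_level(void) {
  printf("compiled for: %s\n", compiled_isa_level());
}
```

With GCC or Clang on x86_64, a default build typically reports SSE2, while adding `-march=native` reports whatever the build machine supports, which is why such a binary may not run elsewhere.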

Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Finally: SSE2-only builds

Post by Snowstorm64 » Wed Nov 12, 2014 8:33 am

MarathonMan wrote:Updated the CMake generator file to allow users to select SSE2/SSE3/SSSE3/SSE4.1/AVX (and Native with GCC/Clang).

As long as you don't select Native, you get a portable binary. Again, you must use SSE2 to be compatible with all x86_64 CPUs.
It seems that if I choose Native, CEN64 will be much faster than with the other options. Actually, I can get to 60 VI/s and more with Namco Museum! Good job MarathonMan!
Image
(Can you fix that character in the window title?)
OS: Debian GNU/Linux Jessie (8.0)
CPU: Intel i7 4770K @ 3.5 GHz
Build: AVX (compiled from git)

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Wed Nov 12, 2014 8:39 am

Snowstorm64 wrote:It seems that if I choose Native, CEN64 will be much faster than with the other options. Actually, I can get to 60 VI/s and more with Namco Museum!
Oops... not sure how that got there. Try now?

Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Finally: SSE2-only builds

Post by Snowstorm64 » Wed Nov 12, 2014 8:42 am

Now it's much more beautiful! :)

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Wed Nov 12, 2014 11:40 pm

Bump.

Another major performance boost. 7.5-10% across the systems I tried.

Nacho
Posts: 66
Joined: Thu Nov 07, 2013 9:25 am

Re: Finally: SSE2-only builds

Post by Nacho » Thu Nov 13, 2014 10:30 pm

Performance boost? Does that apply to all CEN64 builds, or just the SSE2 one?
Testing CEN64 on: Intel Core i5 520M 2.4 GHz. SSE2 SSE3 SSE4.1 SSE4.2 SSSE3, but no AVX. Ubuntu Linux

Alegend45
Posts: 11
Joined: Mon Oct 07, 2013 11:24 am

Re: Finally: SSE2-only builds

Post by Alegend45 » Thu Nov 13, 2014 10:33 pm

Nacho wrote:Performance boost? Does that apply to all CEN64 builds, or just the SSE2 one?
I think he's referring to this commit.

Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Finally: SSE2-only builds

Post by Snowstorm64 » Fri Nov 14, 2014 7:26 am

No, this commit. Since it affects the VR4300 pipeline, it has an effect on all CEN64 builds, so also the SSE2 one.

Nintendo Maniac 64
Posts: 185
Joined: Fri Oct 04, 2013 11:37 pm

Re: Finally: SSE2-only builds

Post by Nintendo Maniac 64 » Sat Nov 15, 2014 2:18 am

Wouldn't it make more sense to make SSE3-only builds? (not to be confused with SSSE3)

The reason I say this is because all dual-core x86_64 CPUs support SSE3; in fact the only x86_64 CPUs that don't support SSE3 are the first 3 steppings of single-core Athlon 64 CPUs (C0, CG, D0).
CEN64 Forum's resident straight-male kuutsundere
(just "tsundere" makes people think of "Shana clones" *shivers*)

CPU+iGPU: Pentium G3258 @ 4.6GHz/1.281v
dGPU: Radeon HD5870 1GB
RAM: Vengeance 1600 4x4GB
OS: Windows 7

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Sat Nov 15, 2014 4:32 am

Nintendo Maniac 64 wrote:Wouldn't it make more sense to make SSE3-only builds? (not to be confused with SSSE3)
Already possible. Just build with CEN64_ARCH_SUPPORT=SSE3.
Nintendo Maniac 64 wrote:The reason I say this is because all dual-core x86_64 CPUs support SSE3;
False. As you said, only some Athlon 64 support SSE3. Besides, SSSE3 is the actual point at which SSE becomes "truly useful" with regards to shuffling.

Nintendo Maniac 64
Posts: 185
Joined: Fri Oct 04, 2013 11:37 pm

Re: Finally: SSE2-only builds

Post by Nintendo Maniac 64 » Sat Nov 15, 2014 4:42 am

MarathonMan wrote:False. As you said, only some Athlon 64 support SSE3.
But all dual-core Athlon 64's support SSE3. Note my quote again:

Nintendo Maniac 64 wrote:The reason I say this is because all dual-core x86_64 CPUs support SSE3;
----------------------------------------------------------------
MarathonMan wrote:Besides, SSSE3 is the actual point at which SSE becomes "truly useful" with regards to shuffling.
Ah, this seems similar to ffvp9 where SSE3 isn't really of any use - it's SSSE3 that's particularly of benefit.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Finally: SSE2-only builds

Post by AIO » Sat Nov 15, 2014 6:01 pm

Wow I must have been out of the loop! For a while, I've been looking at the RSP source on github :D .

What needs to be done to have full SSE2 support for the RSP?

Great job so far!

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Sat Nov 15, 2014 7:40 pm

AIO wrote:What needs to be done to have full SSE2 support for the RSP?
Right now, SSE2 is in the same state as everything else; some of the (vector) instructions are just incomplete. Once those are done, the RSP should hum along nicely.

Bump - another big performance optimization.

Added another CMake option that lets users choose between building a binary with simulation-, debugging-, and development-related features, or just a vanilla version for high performance. Trying to selectively "pick" one mode dynamically and keep everything in the same binary was reducing performance enough that it was irritating, so I effectively split the two. At a later point in time, I'll be able to clean up the CMake script and make less of a mess of this.

With this, all public domain ROMs and Namco Museum 64 run at a silky smooth 60 VI/s on an i7 4770.
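
The idea of splitting the two modes at build time rather than at run time can be sketched in C roughly as follows. Everything here is hypothetical and only illustrates the pattern: `SKETCH_DEBUG_BUILD`, `SKETCH_TRACE` and `sketch_run_cycles` are invented names, not CEN64's actual option or code.

```c
#include <stdint.h>

/* Hypothetical compile-time switch: a build system would pass
 * -DSKETCH_DEBUG_BUILD for the debugging flavour of the binary. */
#ifdef SKETCH_DEBUG_BUILD
static uint64_t sketch_trace_count;
/* Debug build: count every executed cycle. */
#define SKETCH_TRACE(pc) (sketch_trace_count++)
#else
/* Vanilla build: the hook compiles to nothing, so the hot loop
 * carries no per-cycle branch or bookkeeping. */
#define SKETCH_TRACE(pc) ((void) 0)
#endif

/* Toy "pipeline": advances a program counter by 4 per cycle. */
uint32_t sketch_run_cycles(uint32_t pc, uint32_t cycles) {
  for (uint32_t i = 0; i < cycles; i++) {
    SKETCH_TRACE(pc);
    pc += 4;
  }
  return pc;
}
```

The trade-off is two binaries instead of one, in exchange for a hot loop that never has to test a "debugging enabled?" flag.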

Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Finally: SSE2-only builds

Post by Snowstorm64 » Sat Nov 15, 2014 8:06 pm

Awesome! I gained a lot of VI/s with a lot of ROMs! Excellent job, MarathonMan! :D

(I have just found a bug in Namco Museum 64, check on the bug tracker for more info ;) )

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Finally: SSE2-only builds

Post by AIO » Sun Nov 30, 2014 7:40 pm

Where can I find the latest RSP code for the SSE2 version? I'd like to take a closer look :D .

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Mon Dec 01, 2014 11:40 am

AIO wrote:Where can I find the latest RSP code for the SSE2 version? I'd like to take a closer look :D .
It's scattered about in arch/x86_64/rsp.

rsp.h/c contain some vector load/store shuffling/shifting/muxing algorithms (one taking advantage of pshufb/SSSE3, the other being SSE2).
The rest of the intrinsics are used to emulate RSP instructions; one header file per instruction. They're all SSE2, except where things get conditionally macro'd in small blocks to take advantage of SSSE3/SSE4.1.
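
As an illustration of why a pshufb path and an SSE2 path might both exist, here is a small sketch of rotating eight 16-bit lanes by a run-time amount. It is not the code in rsp.h/c; `rotate_lanes` is a hypothetical name, and the memory-based fallback is just one possible way to do this with SSE2 only.

```c
#include <emmintrin.h>   /* SSE2 */
#ifdef __SSSE3__
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */
#endif
#include <stdint.h>

/* Hypothetical example: out[i] = in[(i + n) % 8] for eight 16-bit lanes. */
void rotate_lanes(const int16_t in[8], unsigned n, int16_t out[8]) {
  __m128i v = _mm_loadu_si128((const __m128i *) in);
  __m128i r;
  n &= 7;
#ifdef __SSSE3__
  /* pshufb takes its byte indices from a register, so the control
   * vector can be computed at run time: byte i reads byte (i + 2n) % 16. */
  __m128i iota = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                               8, 9, 10, 11, 12, 13, 14, 15);
  __m128i ctrl = _mm_add_epi8(iota, _mm_set1_epi8((char) (2 * n)));
  ctrl = _mm_and_si128(ctrl, _mm_set1_epi8(0x0F));
  r = _mm_shuffle_epi8(v, ctrl);
#else
  /* SSE2 shuffles need compile-time immediates, so this fallback goes
   * through memory: store the vector twice back to back, then reload
   * at an offset of n lanes. */
  int16_t buf[16];
  _mm_storeu_si128((__m128i *) &buf[0], v);
  _mm_storeu_si128((__m128i *) &buf[8], v);
  r = _mm_loadu_si128((const __m128i *) &buf[n]);
#endif
  _mm_storeu_si128((__m128i *) out, r);
}
```

Both paths should produce the same result; the SSSE3 one simply stays in registers.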

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Finally: SSE2-only builds

Post by AIO » Wed Dec 03, 2014 1:40 am

MarathonMan wrote:
AIO wrote:Where can I find the latest RSP code for the SSE2 version? I'd like to take a closer look :D .
It's scattered about in arch/x86_64/rsp.

rsp.h/c contain some vector load/store shuffling/shifting/muxing algorithms (one taking advantage of pshufb/SSSE3, the other being SSE2).
The rest of the intrinsics are used to emulate RSP instructions; one header file per instruction. They're all SSE2, except where things get conditionally macro'd in small blocks to take advantage of SSSE3/SSE4.1.
Alright, I took a look at it. The multiplies look good :D . I couldn't find instructions like VMULF or VABS.

Anyway, I recently figured out a good SSE2 algorithm for VSUB and VABS. Let me know if you're interested.

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Wed Dec 03, 2014 10:30 am

Yes, not all of the functions are done...

RE: VABS and VSUB, as long as you're willing to license under 3-clause BSD... sure. ;)

Here are my former attempts:
https://github.com/tj90241/cen64-rsp/bl ... CP2.c#L251
https://github.com/tj90241/cen64-rsp/bl ... P2.c#L1778

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Finally: SSE2-only builds

Post by AIO » Wed Dec 03, 2014 5:10 pm

MarathonMan wrote:Yes, not all of the functions are done...
Lol just making sure I didn't miss anything :D .
MarathonMan wrote:RE: VABS and VSUB, as long as you're willing to license under 3-clause BSD... sure. ;)

Here are my former attempts:
https://github.com/tj90241/cen64-rsp/bl ... CP2.c#L251
https://github.com/tj90241/cen64-rsp/bl ... P2.c#L1778
I'm fine with that, as long as I get credit :D .

For VABS and VSUB, I made sure the algorithms were good. I ran a giant loop to compare the output with the official algorithms. However, since I goof a lot, I may have made a typo while converting my algorithm to your format xD. Anyway, I don't know how to use git, so I'll just post the code here if you don't mind.

Code:

__m128i
RSPVSUB(struct RSPCP2 *cp2, int16_t *unused(vd),
  __m128i vsReg, __m128i unused(vtReg), __m128i vtShuf, __m128i zero) {
  int16_t *accLow = cp2->accumulatorLow.slices;

#ifdef USE_SSE
  __m128i vaccLow, vdReg, unsatDiff, satDiff, overflow, carryOut;

  carryOut = _mm_load_si128((__m128i*) (cp2->vcolo.slices));
  /* VACC uses unsaturated arithmetic. */
  unsatDiff = _mm_sub_epi16(vtShuf, carryOut);
  satDiff = _mm_subs_epi16(vtShuf, carryOut);

  vaccLow = _mm_sub_epi16(vsReg, unsatDiff);
  vdReg = _mm_subs_epi16(vsReg, satDiff);

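  /* True only when vtShuf == INT16_MAX and the carry is set, i.e. when
     vtShuf - carryOut wrapped around in the unsaturated difference. */
  overflow = _mm_cmpgt_epi16(satDiff, unsatDiff);
  /* overflow is -1 in that case, so this saturating add subtracts the
     1 that satDiff could not represent. */
  vdReg = _mm_adds_epi16(vdReg, overflow);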

  _mm_store_si128((__m128i*) accLow, vaccLow);
  _mm_store_si128((__m128i*) (cp2->vcolo.slices), zero);
  _mm_store_si128((__m128i*) (cp2->vcohi.slices), zero);
  return vdReg;
#else
#warning "Unimplemented function: RSPVSUB (No SSE)."
#endif
}

__m128i
RSPVABS(struct RSPCP2 *cp2, int16_t *unused(vd),
  __m128i vsReg, __m128i unused(vtReg), __m128i vtShuf, __m128i zero) {
  int16_t *accLow = cp2->accumulatorLow.slices;

#ifdef USE_SSE
  __m128i signLessThan, vdReg, equalZero;

  signLessThan = _mm_srai_epi16(vsReg, 15);
  equalZero = _mm_cmpeq_epi16(vsReg, zero);

  vdReg = _mm_xor_si128(vtShuf, signLessThan);
  vdReg = _mm_subs_epi16(vdReg, signLessThan);
  vdReg = _mm_andnot_si128(equalZero, vdReg);

  _mm_store_si128((__m128i*) accLow, vdReg);
  return vdReg;
#else
#warning "Unimplemented function: RSPVABS (No SSE)."
#endif
}
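
The "giant loop" comparison mentioned above can be sketched for VSUB as follows. This is a reduced version, not the original test: it assumes the destination lane is clamp(vs - vt - carry) and that the carry flag is stored as 0 or -1 per lane, and it checks every vt against a handful of edge-case vs values rather than all 2^32 pairs. The names `vsub_ref`, `vsub_sse2` and `vsub_count_mismatches` are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference (assumed semantics): vd = clamp(vs - vt - carry). */
static int16_t vsub_ref(int16_t vs, int16_t vt, int carry) {
  int32_t d = (int32_t) vs - (int32_t) vt - carry;
  if (d > INT16_MAX) d = INT16_MAX;
  if (d < INT16_MIN) d = INT16_MIN;
  return (int16_t) d;
}

/* SSE2 version following the posted approach; carry lanes are 0 or -1. */
static __m128i vsub_sse2(__m128i vs, __m128i vt, __m128i carry) {
  __m128i unsat = _mm_sub_epi16(vt, carry);        /* vt + carry, wrapping   */
  __m128i sat = _mm_subs_epi16(vt, carry);         /* vt + carry, saturating */
  __m128i vd = _mm_subs_epi16(vs, sat);
  __m128i overflow = _mm_cmpgt_epi16(sat, unsat);  /* -1 where sat clipped   */
  return _mm_adds_epi16(vd, overflow);
}

/* Returns the number of lanes where the two versions disagree. */
long vsub_count_mismatches(void) {
  static const int16_t edge[8] = {INT16_MIN, INT16_MIN + 1, -1, 0,
                                  1, 2, INT16_MAX - 1, INT16_MAX};
  __m128i vs = _mm_loadu_si128((const __m128i *) edge);
  long bad = 0;
  for (int carry = 0; carry <= 1; carry++) {
    __m128i vc = _mm_set1_epi16((short) -carry);
    for (int32_t t = INT16_MIN; t <= INT16_MAX; t++) {
      int16_t out[8];
      __m128i vd = vsub_sse2(vs, _mm_set1_epi16((short) t), vc);
      _mm_storeu_si128((__m128i *) out, vd);
      for (int i = 0; i < 8; i++)
        bad += out[i] != vsub_ref(edge[i], (int16_t) t, carry);
    }
  }
  return bad;
}
```

Widening the `edge` table, or looping over all vs values as well, turns this into the exhaustive check at the cost of run time.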
Thanks to your brilliant work with instructions like VSUBC and VADDC, I was able to do a good job with some of the multiply instructions a while back. I don't like to submit code that could possibly be inaccurate, though, so I'd have to do some more testing. I'd love to contribute back, since I have benefited a great deal from reading your RSP source.

Also, adopting your method of using -1 instead of 1 for the flags really helped with all the flag instructions! I will have to try compiling your source sometime to see what else could be optimized further.
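
A short sketch of why -1 flags are convenient with SSE2: a lane that is all ones can be used directly as a select mask, whereas a 0/1 flag needs converting first. The function names below are hypothetical and only illustrate the point.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-lane select: mask lanes must be all ones (-1) or all zeros. */
static __m128i select_epi16(__m128i mask, __m128i a, __m128i b) {
  return _mm_or_si128(_mm_and_si128(mask, a), _mm_andnot_si128(mask, b));
}

/* Flags stored as -1/0: the flag vector is already a usable mask. */
void pick_by_flag_m1(const int16_t flag[8], const int16_t a[8],
                     const int16_t b[8], int16_t out[8]) {
  __m128i m = _mm_loadu_si128((const __m128i *) flag);
  __m128i va = _mm_loadu_si128((const __m128i *) a);
  __m128i vb = _mm_loadu_si128((const __m128i *) b);
  _mm_storeu_si128((__m128i *) out, select_epi16(m, va, vb));
}

/* Flags stored as 1/0: one extra instruction turns 1 into -1 first. */
void pick_by_flag_01(const int16_t flag[8], const int16_t a[8],
                     const int16_t b[8], int16_t out[8]) {
  __m128i f = _mm_loadu_si128((const __m128i *) flag);
  __m128i m = _mm_sub_epi16(_mm_setzero_si128(), f);  /* 0 - 1 = -1 */
  __m128i va = _mm_loadu_si128((const __m128i *) a);
  __m128i vb = _mm_loadu_si128((const __m128i *) b);
  _mm_storeu_si128((__m128i *) out, select_epi16(m, va, vb));
}
```

Since SSE2 comparisons such as `_mm_cmpgt_epi16` already produce -1/0 lanes, keeping flags in that form also avoids converting in the other direction.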

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Wed Dec 03, 2014 9:34 pm

Ah wonderful! I am a bit tied up at the moment, but I will try to get those in soon.

I must say I am a fan of that VABS implementation due to the fact that it doesn't rely on _mm_sign_epi16. :D
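
For context, `_mm_sign_epi16` is an SSSE3 intrinsic, and it negates without saturating, so a -32768 input lane stays -32768. The SSE2-only approach from the post can be checked against a scalar model along these lines. The scalar semantics here (negate vt when vs is negative, zero when vs is zero, clamp to 32767) are an assumption for the sketch, and the function names are invented.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference (assumed semantics for the destination lane). */
static int16_t vabs_ref(int16_t vs, int16_t vt) {
  int32_t r;
  if (vs < 0) r = -(int32_t) vt;
  else if (vs == 0) r = 0;
  else r = vt;
  if (r > INT16_MAX) r = INT16_MAX;   /* -(-32768) clamps to 32767 */
  return (int16_t) r;
}

/* SSE2-only version following the posted approach. */
static __m128i vabs_sse2(__m128i vs, __m128i vt) {
  __m128i sign = _mm_srai_epi16(vs, 15);                    /* -1 where vs < 0  */
  __m128i isZero = _mm_cmpeq_epi16(vs, _mm_setzero_si128());
  __m128i vd = _mm_xor_si128(vt, sign);                     /* ~vt where vs < 0 */
  vd = _mm_subs_epi16(vd, sign);                            /* +1, saturating   */
  return _mm_andnot_si128(isZero, vd);                      /* 0 where vs == 0  */
}

/* Checks every vt against a few edge-case vs values; returns the
 * number of lanes where the two versions disagree. */
long vabs_count_mismatches(void) {
  static const int16_t edge[8] = {INT16_MIN, INT16_MIN + 1, -2, -1,
                                  0, 1, INT16_MAX - 1, INT16_MAX};
  __m128i vs = _mm_loadu_si128((const __m128i *) edge);
  long bad = 0;
  for (int32_t t = INT16_MIN; t <= INT16_MAX; t++) {
    int16_t out[8];
    _mm_storeu_si128((__m128i *) out,
                     vabs_sse2(vs, _mm_set1_epi16((short) t)));
    for (int i = 0; i < 8; i++)
      bad += out[i] != vabs_ref(edge[i], (int16_t) t);
  }
  return bad;
}
```

This only compares the destination register; what should land in the accumulator for the -32768 case is a separate question.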

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Finally: SSE2-only builds

Post by AIO » Thu Dec 04, 2014 12:33 am

I'm glad that I'm able to help :D .

So I reviewed more of your code. For VGE, I did:

Code:

/* equal = (~vco | ~vne) && (vs == vt) */
  temp = _mm_and_si128(vne, vco);
  equal = _mm_cmpeq_epi16(vsReg, vtShuf);
  equal = _mm_andnot_si128(temp, equal);

  /* ge = vs > vt | equal */
  greaterEqual = _mm_cmpgt_epi16(vsReg, vtShuf);
  greaterEqual = _mm_or_si128(greaterEqual, equal);

  /* vd = ge ? vs : vt; */
#ifdef SSSE3_ONLY
  vdReg = _mm_max_epi16(vsReg, vtShuf);
#else
instead of

Code:

/* equal = (~vco | ~vne) && (vs == vt) */
  temp = _mm_and_si128(vne, vco);
  temp = _mm_cmpeq_epi16(temp, zero);
  equal = _mm_cmpeq_epi16(vsReg, vtShuf);
  equal = _mm_and_si128(temp, equal);

  /* ge = vs > vt | equal */
  greaterEqual = _mm_cmpgt_epi16(vsReg, vtShuf);
  greaterEqual = _mm_or_si128(greaterEqual, equal);

  /* vd = ge ? vs : vt; */
#ifdef SSSE3_ONLY
  vsReg = _mm_and_si128(greaterEqual, vsReg);
  vtShuf = _mm_andnot_si128(greaterEqual, vtShuf);
  vdReg = _mm_or_si128(vsReg, vtShuf);
#else
Are there any instructions you need help with in particular for an SSE2 implementation?
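
The two VGE variants above can be compared mechanically. The sketch below wraps each fragment in a function and counts disagreements, assuming that vne and vco lanes are always 0 or -1 (the andnot shortcut depends on that). The wrappers and names (`vge_long`, `vge_short`, `vge_count_mismatches`) are invented for the comparison and are not the actual CEN64 functions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Longer form: compare against zero, then and/andnot/or select. */
static __m128i vge_long(__m128i vs, __m128i vt, __m128i vne, __m128i vco,
                        __m128i *ge) {
  __m128i temp = _mm_and_si128(vne, vco);
  temp = _mm_cmpeq_epi16(temp, _mm_setzero_si128());
  __m128i equal = _mm_and_si128(temp, _mm_cmpeq_epi16(vs, vt));
  *ge = _mm_or_si128(_mm_cmpgt_epi16(vs, vt), equal);
  return _mm_or_si128(_mm_and_si128(*ge, vs), _mm_andnot_si128(*ge, vt));
}

/* Shorter form: andnot replaces the compare, max replaces the select. */
static __m128i vge_short(__m128i vs, __m128i vt, __m128i vne, __m128i vco,
                         __m128i *ge) {
  __m128i temp = _mm_and_si128(vne, vco);
  __m128i equal = _mm_andnot_si128(temp, _mm_cmpeq_epi16(vs, vt));
  *ge = _mm_or_si128(_mm_cmpgt_epi16(vs, vt), equal);
  return _mm_max_epi16(vs, vt);
}

/* 1 if all 128 bits match, 0 otherwise. */
static int same128(__m128i a, __m128i b) {
  return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

/* Tries all four flag combinations and every vt against a few
 * edge-case vs values; returns the number of disagreements. */
long vge_count_mismatches(void) {
  static const int16_t edge[8] = {INT16_MIN, INT16_MIN + 1, -1, 0,
                                  1, 2, INT16_MAX - 1, INT16_MAX};
  __m128i vs = _mm_loadu_si128((const __m128i *) edge);
  long bad = 0;
  for (int f = 0; f < 4; f++) {
    __m128i vne = _mm_set1_epi16((short) -(f & 1));
    __m128i vco = _mm_set1_epi16((short) -((f >> 1) & 1));
    for (int32_t t = INT16_MIN; t <= INT16_MAX; t++) {
      __m128i vt = _mm_set1_epi16((short) t), geA, geB;
      __m128i vdA = vge_long(vs, vt, vne, vco, &geA);
      __m128i vdB = vge_short(vs, vt, vne, vco, &geB);
      bad += !same128(vdA, vdB) + !same128(geA, geB);
    }
  }
  return bad;
}
```

The max shortcut works because the only case where the ge mask and a plain vs > vt comparison differ is vs == vt, where either operand gives the same result.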

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Finally: SSE2-only builds

Post by MarathonMan » Thu Dec 04, 2014 9:37 am

Some of the vector clips, square roots, and others were actually incorrectly vectorized, I think. They're close, but there are a few edge cases that I think I missed. I've been trying to go through very, very carefully and find those at the moment. If there are errors in some of these instructions and bad triangle commands are generated, it can cause crashes elsewhere I think, so... :)

Other than that, optimizations are always a good thing!

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Finally: SSE2-only builds

Post by AIO » Thu Dec 04, 2014 4:21 pm

MarathonMan wrote:Some of the vector clips, square roots, and others were actually incorrectly vectorized, I think. They're close, but there are a few edge cases that I think I missed. I've been trying to go through very, very carefully and find those at the moment. If there are errors in some of these instructions and bad triangle commands are generated, it can cause crashes elsewhere I think, so... :)

Other than that, optimizations are always a good thing!
Alright, I will definitely check out the vector clips. Then after that, I can help with some of the multiplies like vmulf, vmacf, vmulu, and vmacu, if you want.

I'll get started on double-checking some code as soon as I figure out the most reliable source code to use as a reference for the vector clips. I don't think I could help much with the div instructions though. I'm a bit confused atm. When looking at the docs I have, I saw that they don't always seem to match how everyone else implemented certain instructions. For example, for VABS, the docs write to acc_low before checking whether the result is > 0x7FFF or < -32768.

Anyway, if I figure out that my code is good to use, where would you like me to post it?
