A lot of asm? What about ZSNES situation? No const?

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm

Post by Narann » Tue Jul 08, 2014 1:59 pm

Hello guys!

I've dug into the code of krom and noticed something that made me curious: the whole code is in asm.

I just want to know: ZSNES is now unmaintainable because everything in it was written in asm. Don't you think the same could happen to Cen64 in the future? Couldn't the main (easy to read) code be in C, with the SSE version in the same place but in a separate ifdef section?

I'm just asking. I wouldn't want to see Cen64 unmaintained in the future because of this. :(

Another point: is there any reason why you don't use the const qualifier? In many places it would make the code easier to read and would apply the const-correctness principle. Example:

Code: Select all

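#include <stdint.h>

/* Adds the single-precision values at *fs and *ft on the x87 FPU,
 * stores the result in *fd, and returns the x87 status word. */
static inline uint16_t fpu_add_32(
  uint32_t *fs, uint32_t *ft, uint32_t *fd) {
  uint32_t res;
  uint16_t sw;

  __asm__ volatile(
    "fclex\n\t"        /* clear pending exception flags */
    "flds %2\n\t"      /* push *fs */
    "flds %3\n\t"      /* push *ft */
    "faddp\n\t"        /* st(1) += st(0), then pop */
    "fstps %1\n\t"     /* store the single-precision result, then pop */
    "fstsw %%ax\n\t"   /* copy the status word into ax */

    : "=a" (sw),
      "=m" (res)
    : "m" (*fs),
      "m" (*ft)
    : "st"
  );

  *fd = res;
  return sw;
}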
It could be:

Code: Select all

#include <stdint.h>

/* Same function, but the two inputs are now pointers to const:
 * only *fd may be written. */
static inline uint16_t fpu_add_32(
  const uint32_t *fs, const uint32_t *ft, uint32_t *fd) {
  uint32_t res;
  uint16_t sw;

  __asm__ volatile(
    "fclex\n\t"
    "flds %2\n\t"
    "flds %3\n\t"
    "faddp\n\t"
    "fstps %1\n\t"
    "fstsw %%ax\n\t"

    : "=a" (sw),
      "=m" (res)
    : "m" (*fs),
      "m" (*ft)
    : "st"
  );

  *fd = res;
  return sw;
}
This way you can be sure the compiler will reject any attempt to modify the values that fs and ft point to.

Am I right about this or completely wrong? Is there any good reason not to use const?

Sorry for my silly questions; I'm trying to understand this (well-written but quite complex) code.

Thanks in advance!

Sintendo
Posts: 25
Joined: Thu Oct 31, 2013 9:11 am

Re: A lot of asm? What about ZSNES situation? No const?

Post by Sintendo » Tue Jul 08, 2014 3:33 pm

Have you even seen the ZSNES source code? It has nearly everything in x86 assembly. Thousands upon thousands of lines, all of which would have to be rewritten manually in order to make ZSNES run on other architectures. Compare that to CEN64: almost entirely in C, except for a few files containing very short (each file seems to be barely 10 lines) and straightforward assembly code, each providing a simple function that can be called from C. Porting those files to something other than AMD64 would take someone with experience a few hours. Porting ZSNES would take months, a task no sane individual is willing to take on, because it has since been surpassed by at least two emulators that are portable and more accurate.

As for why this particular code was written in assembly, this is code intended for FPU emulation making use of x87 instructions. I believe it was done that way in the 'old' CEN64 version as well. I asked about it last year and I'm guessing the same reasons still apply in this case. I agree it wouldn't hurt to have some straight-up C code as fallback, though. IIRC, it wasn't as neatly abstracted into separate files and functions in the old CEN64, so if anything, it seems like this would be even easier to do now.
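
One way a plain C fallback could sit next to the x87 path is sketched below. This is only an illustration: the FPU_HAVE_X87 macro and the decision to return 0 from the fallback (no exception flags collected) are assumptions made here, not something taken from the CEN64 source. The x87 branch is the function from the first post with const applied.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature macro: take the x87 path only on GCC-compatible
 * compilers targeting x86 or x86-64. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define FPU_HAVE_X87 1
#else
#define FPU_HAVE_X87 0
#endif

/* Adds two single-precision values passed as raw bit patterns.
 * x87 path: returns the raw x87 status word.
 * Fallback: returns 0, because this sketch collects no flags there. */
static inline uint16_t fpu_add_32(
  const uint32_t *fs, const uint32_t *ft, uint32_t *fd) {
#if FPU_HAVE_X87
  uint32_t res;
  uint16_t sw;

  __asm__ volatile(
    "fclex\n\t"
    "flds %2\n\t"
    "flds %3\n\t"
    "faddp\n\t"
    "fstps %1\n\t"
    "fstsw %%ax\n\t"
    : "=a" (sw), "=m" (res)
    : "m" (*fs), "m" (*ft)
    : "st"
  );

  *fd = res;
  return sw;
#else
  float a, b, sum;

  /* The fallback assumes the host float is 32 bits wide. */
  assert(sizeof(float) == sizeof(uint32_t));
  memcpy(&a, fs, sizeof(a));
  memcpy(&b, ft, sizeof(b));
  sum = a + b;
  memcpy(fd, &sum, sizeof(sum));
  return 0;
#endif
}
```

Callers see the same signature on every host; only the meaning of the return value differs, which a real backend would have to unify.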

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: A lot of asm? What about ZSNES situation? No const?

Post by MarathonMan » Tue Jul 08, 2014 3:35 pm

I actually thought about const correctness when I was writing that code and was too lazy to apply it at the time (already had a handful of files done), heh. This is something that definitely can/could/should be done. Yes, it absolutely makes the code more readable and safer, and can even result in more optimized code.

For the specific case you presented, the same could be said for volatile -- the constraints are set up properly so there's no reason that the asm block needs to be marked volatile. The compiler can rearrange the block however it wants, so long as it doesn't violate the constraints supplied.

Those issues aside, there is indeed a reason for the inline assembly for the FPU. The reason is horrible compiler support (and a lack of the required specification in C99) for complete, precise floating point emulation:
  • GCC doesn't support the STDC FENV_ACCESS #pragma that CEN64 needs for correctness. This bug tracker link consists of six years of bickering over the issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678
  • Microsoft didn't include fenv until C++11 became mainstream as they openly don't support C99. Even when they did include it (in VS2013), they recently admitted in a VS2014 CTP1 blog post [***] that there are a handful of bugs in it that won't be fixed until CTP1 (if that!?). Regardless, using fenv would result in a VS2013/VS2014 dependency for Windows users, which is not desirable.
  • To the best of my knowledge, C99 says nothing about supporting the IEEE 754 standard that the VR4300 is based around (just floating-point arithmetic in general). For performance, hosts which provide an IEEE 754-based FPU can use an accelerated FPU backend, but those that do not will have to have a pure software implementation of IEEE 754. Of course, most hosts support IEEE 754, but not all!
Nearly all of the emulator code I've seen in the wild has terrible FPU support outside of the functionality that is required for emulation 99.9% of the time. Once the x87 backend in CEN64 is complete, it'll be considerably more accurate than most other emulators out there for all the really obtuse corner cases.

And, TBH, I still have yet to determine if the x87 FPU is even capable of 100% accuracy due to its 80-bit registers... (though I think it is). I might have to use the SSE FPU, and who knows what demons lie within that contraption?!

EDIT:
***
In Visual Studio 2013, these functions were incorrectly implemented in the CRT for x86. There were two bugs: [1] a call to fegetenv would cause any pending, unmasked x87 floating point exceptions to be raised, and [2] the fegetenv function would mask all x87 floating point exceptions before returning and would thus return incorrect state.

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm

Re: A lot of asm? What about ZSNES situation? No const?

Post by Narann » Tue Jul 08, 2014 5:06 pm

You are right, only specific "queue" parts of Cen64 are written in asm; the whole "logical code" remains in C. I wasn't aware the whole ZSNES emulator was written in asm.

As MarathonMan said (thanks for the in-depth explanations), the problem seems to be on the compiler/language side, which doesn't handle/specify how floating-point values should be set up on the processor (I guess the standard leaves this area open to give compilers some room for portability).

So yes, asm makes sense. Thanks for these very descriptive answers. :)

PS: The code is neat! Good job! Keep it like this!

EDIT: About const correctness: now that you are not on GitHub anymore (that's sad), do you have a bug tracker or shared "todo list" to track what is done and what needs to be done?

max_power
Posts: 6
Joined: Sat Oct 05, 2013 6:01 am

Re: A lot of asm? What about ZSNES situation? No const?

Post by max_power » Wed Jul 09, 2014 5:19 am

I'm a bit confused, I thought the SSE instruction set provided exactly the IEEE 754 floats except for a few rounding modes.
So where is the advantage of using the x87?
And where is the difference from e.g. an x87 float add to one on the VR4300?

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: A lot of asm? What about ZSNES situation? No const?

Post by MarathonMan » Wed Jul 09, 2014 9:08 am

I wasn't aware that SSE2 was IEEE-754 compliant; just started pulling stuff in from the old core.

I was actually looking at switching to SSE2 last night, but the x87 opcodes are massive compared to SSE2. With x87, you can just set the control word for the precision that you require and it's still a-okay with IEEE-754. However, then it becomes a question of how slow/deprecated x87 is in modern x86 implementations.

tl;dr: You can't win at that game.

Not sure what you meant about "how the floats differ". All I was saying is that a C float need not be in IEEE-754 format. C doesn't specify anything other than a handful of constraints for floats and how operations are performed on them.
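
Since the C standard does not promise IEEE-754 floats, a backend that relies on them can at least check the host's float parameters. The sketch below is an illustration only; the function names are hypothetical. It compares the <float.h> limits against the values that IEEE 754 single precision implies, and separately reports whether the implementation defines __STDC_IEC_559__ (its claim of Annex F conformance).

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Returns 1 if `float` has the radix, precision, exponent range, and
 * width of IEEE 754 single precision, else 0. */
static int float_looks_like_binary32(void) {
  return FLT_RADIX == 2
      && FLT_MANT_DIG == 24
      && FLT_MIN_EXP == -125
      && FLT_MAX_EXP == 128
      && sizeof(float) * CHAR_BIT == 32;
}

/* Returns 1 if the implementation claims IEC 60559 (Annex F) support. */
static int claims_iec_559(void) {
#ifdef __STDC_IEC_559__
  return 1;
#else
  return 0;
#endif
}

/* Something an accelerated backend could call once at startup. */
static void require_binary32(void) {
  assert(float_looks_like_binary32());
}
```

Matching parameters do not prove that every operation rounds as IEEE 754 requires, so this is a sanity check rather than a guarantee.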

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm

Re: A lot of asm? What about ZSNES situation? No const?

Post by Narann » Wed Jul 09, 2014 9:48 am

max_power wrote:So where is the advantage of using the x87?
It seems x87 has more instructions than SSE.
max_power wrote:And where is the difference from e.g. an x87 float add to one on the VR4300?
A bit-for-bit comparison of the results of x87/SSE/VR4300 operations would be valuable information. Does anyone know if this has already been done somewhere?
MarathonMan wrote:I was actually looking at switching to SSE2 last night, but the x87 opcodes are massive compared to SSE2.
I've read the rest of what you said but didn't get it. Are you saying you can set a precision word with x87 but not with SSE2?

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: A lot of asm? What about ZSNES situation? No const?

Post by MarathonMan » Wed Jul 09, 2014 10:01 am

Narann wrote:A bit-for-bit comparison of the results of x87/SSE/VR4300 operations would be valuable information. Does anyone know if this has already been done somewhere?
VR4300 is IEEE-754 format. See the images, here:
http://en.wikipedia.org/wiki/IEEE_754-1985

x87 (in default precision mode) uses this format:
http://en.wikipedia.org/wiki/Extended_precision

Though it can be turned down to 32 or 64-bit using the PC bits in the x87 control word.

SSE2 (xmm register) is either 32 or 64-bit IEEE-754 depending on the instructions used (here, data is what you define it to be, really).
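
To make the 32-bit layout concrete, and to show what a bit-for-bit comparison of results would operate on, here is a small sketch. The helper names are made up for illustration, and it assumes the host float is itself IEEE-754 single precision.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The three fields of an IEEE-754 single-precision value. */
struct f32_fields {
  uint32_t sign;      /* 1 bit */
  uint32_t exponent;  /* 8 bits, biased by 127 */
  uint32_t fraction;  /* 23 bits, implicit leading 1 not stored */
};

/* Reinterprets a float as its raw 32-bit pattern without aliasing issues. */
static uint32_t f32_to_bits(float value) {
  uint32_t bits;

  assert(sizeof(bits) == sizeof(value));
  memcpy(&bits, &value, sizeof(bits));
  return bits;
}

/* Splits a raw 32-bit pattern into sign, exponent, and fraction. */
static struct f32_fields f32_decode(uint32_t bits) {
  struct f32_fields f;

  f.sign = bits >> 31;
  f.exponent = (bits >> 23) & 0xFFu;
  f.fraction = bits & 0x7FFFFFu;
  return f;
}
```

Two backends agree bit for bit on an operation exactly when the uint32_t patterns of their results are equal, which is a stricter test than comparing the float values (it distinguishes +0 from -0 and different NaN payloads).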
Narann wrote:
MarathonMan wrote:I was actually looking at switching to SSE2 last night, but the x87 opcodes are massive compared to SSE2.
I've read the rest of what you said but didn't get it. Are you saying you can set a precision word with x87 but not with SSE2?
When I said this, I meant the instruction sizes:

Code: Select all

  73:	f3 0f 10 44 24 18    	movss  0x18(%rsp),%xmm0
  79:	f3 0f 10 4c 24 1c    	movss  0x1c(%rsp),%xmm1
  7f:	0f 58 c8             	addps  %xmm0,%xmm1
...
  8c:	f3 0f 11 4c 24 28    	movss  %xmm1,0x28(%rsp)
vs

Code: Select all

  64:   d9 44 24 10             flds   0x10(%rsp)
  68:   d9 44 24 14             flds   0x14(%rsp)
  6c:   de c1                   faddp  %st,%st(1)
  6e:   d9 5c 24 18             fstps  0x18(%rsp)

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm

Re: A lot of asm? What about ZSNES situation? No const?

Post by Narann » Wed Jul 09, 2014 1:06 pm

MarathonMan wrote: VR4300 is IEEE-754 format. See the images, here:
http://en.wikipedia.org/wiki/IEEE_754-1985

x87 (in default precision mode) uses this format:
http://en.wikipedia.org/wiki/Extended_precision

Though it can be turned down to 32 or 64-bit using the PC bits in the x87 control word.
Owh! So this is the famous 80-bit floating point! I learned something today, thank you! :)
MarathonMan wrote:SSE2 (xmm register) is either 32 or 64-bit IEEE-754 depending on the instructions used (here, data is what you define it to be, really).
So I guess it's also faster, as there are fewer bits to compute? But maybe modern architectures have no problem with 80-bit values and compute them in the same number of cycles as 32/64-bit values.
MarathonMan wrote: When I said this, I meant the instruction sizes:

Code: Select all

  73:	f3 0f 10 44 24 18    	movss  0x18(%rsp),%xmm0
  79:	f3 0f 10 4c 24 1c    	movss  0x1c(%rsp),%xmm1
  7f:	0f 58 c8             	addps  %xmm0,%xmm1
...
  8c:	f3 0f 11 4c 24 28    	movss  %xmm1,0x28(%rsp)
vs

Code: Select all

  64:   d9 44 24 10             flds   0x10(%rsp)
  68:   d9 44 24 14             flds   0x14(%rsp)
  6c:   de c1                   faddp  %st,%st(1)
  6e:   d9 5c 24 18             fstps  0x18(%rsp)
OMG yes! I've already had to deal with SSE code and I remember all those incomprehensible moves. Yes, x87 instructions are definitely easier to read (which is an important point in a "hardware code" context, IMHO).

So you actually mean:
but the SSE2 opcodes are massive compared to x87.
Anyway, thanks for this interesting info (as usual).
