Future state of the project

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Future state of the project

Post by MarathonMan » Wed May 27, 2015 2:48 pm

As I've said before, I would always publicly announce any decision to stop developing CEN64 ... and I'm not doing that. :D

However: I've been having a harder and harder time extracting performance from CEN64. At the current rate, it's simply not possible to expect real-time performance from anything in the near future. And this has led me to think about taking a radically different approach from how I'm designing CEN64 today. Naturally, this implies that I'll go into a rabbit hole and disappear without results for an unknown amount of time, but the project is still alive and kicking, albeit in a different direction entirely.

A little summary:
I've been exploring viable alternatives to how CEN64 emulates things, however radical. The one I've been most optimistic about (until now) is multi-threading.

From the 'feasibility' study of sorts here, I found that putting the RCP in one thread and the VR4300 in another in a non-consistent manner (read: a "hacky", non-cycle-accurate upper bound on realizable performance) yielded far less than a 25% performance boost, let alone anything close to the 100% boost you'd idealistically hope for.
http://forums.cen64.com/viewtopic.php?f ... t=20#p1867

With these and some other micro-benchmarks I've done behind the scenes, I'm more or less convinced that the cost of coherency logic in modern processors makes multi-threading a cycle-accurate emulator simply infeasible. It doesn't matter how much black magic you pour into things (transactional model/rollback of state, etc.): it's just not worth the overhead (unless you're emulating a lower-frequency or older-generation console).

The closest I got to 'feasible' was the transactional approach, and that implies you need to log or synchronize across cache-state changes, register writes, memory writes, TLB entries, floating-point state, etc. In the end, it just doesn't pay off any way you look at it, IMO.

So, the natural alternative is dynamic recompilation. I can't think of anything else that's radical enough to make a difference. But how do you use dynamic recompilation in a cycle-accurate simulator?! The size of each block will be massive due to all the checks you need to perform every cycle...

Hmm.

Well, a close friend of mine actually suggested a really good idea: a marriage between interpreters and compilers. CEN64's interpreter(s) are actually not intolerably slow, so if you find a block of code that isn't getting recycled enough, or would be too costly to lower in terms of emitted code size or some other factor, why not have the dynamic compiler insert a re-entry point into the interpreter for that block (effectively falling back to a 'generic interpreter' block for that segment of instructions)?

And thus, the idea was born.
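To make that concrete, here's a rough sketch of the dispatch I have in mind (not real CEN64 code; the types and names are made up purely for illustration):

Code: Select all

#include <stdint.h>
#include <stdio.h>

/* Hypothetical structures (not CEN64's real ones), just to show the shape. */
typedef struct cpu { uint64_t pc; uint64_t regs[32]; } cpu_t;
typedef void (*compiled_fn)(cpu_t *cpu);

typedef struct block {
  uint64_t start_pc;     /* guest PC where the block begins */
  unsigned num_insns;    /* guest instructions the block covers */
  compiled_fn code;      /* NULL => too costly to lower, use the interpreter */
} block_t;

/* Generic fallback: step the existing interpreter over just this segment. */
static void interp_run_block(cpu_t *cpu, const block_t *b) {
  for (unsigned i = 0; i < b->num_insns; i++) {
    /* ...decode and execute one instruction, cycle checks and all... */
    cpu->pc += 4;
  }
}

/* Dispatch the recompiler would emit: native code when we have it,
 * otherwise a re-entry point into the interpreter for this block. */
static void run_block(cpu_t *cpu, block_t *b) {
  if (b->code)
    b->code(cpu);
  else
    interp_run_block(cpu, b);
}

int main(void) {
  cpu_t cpu = { .pc = 0x80000000ull };
  block_t cold = { 0x80000000ull, 16, NULL };  /* falls back to the interpreter */
  run_block(&cpu, &cold);
  printf("pc after block: 0x%llx\n", (unsigned long long)cpu.pc);
  return 0;
}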

I wish I had a way to perform a feasibility study like the one I did for the multi-threading approach, but I can't really think of anything that works. So, at this point, I'm a few micro-benchmarks away from making myself comfortable enough to bite the bullet, grab the reins, and start developing a monster. :)

If anyone has experience with dynamic recompilers used in other (HLE or otherwise) emulators and can speak to their overhead in terms of time spent compiling vs. time spent executing, generated code size for various ROMs, overhead across various ROMs, etc... I'd love to hear your two cents before I make the plunge.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Wed May 27, 2015 3:09 pm

The primary concerns are:
- Indirect branches (though there's already one-per-cycle-per-component w/ the interpreter, so...).
- Instruction cache pressure from too much dynamically generated code.

The latter could be alleviated by using a set of generic blocks and branching to the most specific generic block that applies.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Wed May 27, 2015 3:51 pm

Well, I know that emulators like PJ64 stutter in intros/cutscenes while using the CPU recompiler, especially when advanced block linking is enabled. So there are times where it has to compile a lot for the CPU. It doesn't seem to be as big of a problem in 1964 (I think their recompiler is more optimized). For the RSP, you don't really have to worry as much about how much time is spent compiling, because it's significantly less than for the VR4300. In fact, some games like WDC don't have that much microcode.

I know that for HLE emulators, a very basic recompiler is still noticeably faster than an interpreter, for both the RSP and the CPU. I don't know how different that will be for cycle-accurate emulators, though. For starters, you could maybe try recompiling the small instructions and just generating a call for the larger ones. I don't know too much about Mupen64, but it has a cached interpreter.

Although even with a fast RSP & VR4300, a lot of games still don't run fast enough due to the RDP being such a huge bottleneck.

User avatar
123_qwe
Posts: 3
Joined: Thu May 28, 2015 9:17 am

Re: Future state of the project

Post by 123_qwe » Thu May 28, 2015 1:22 pm

Is libco from bsnes an alternative?

User avatar
Alegend45
Posts: 11
Joined: Mon Oct 07, 2013 11:24 am

Re: Future state of the project

Post by Alegend45 » Thu May 28, 2015 2:46 pm

Dude, just optimize the RDP. It'll get everything to fullspeed. Star Fox is already quite playable on my Ivy Bridge i3-3210 at 40 VI/s or so. :)

User avatar
Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Future state of the project

Post by Snowstorm64 » Thu May 28, 2015 4:18 pm

Alegend45 wrote:Dude, just optimize the RDP. It'll get everything to fullspeed. Star Fox is already quite playable on my Ivy Bridge i3-3210 at 40 VI/s or so. :)
The RDP is emulated with software rendering, which is costly in terms of performance. It doesn't matter what you do, it will always be slow. Although it may not be pixel perfect, hardware rendering (OpenGL) is the only way to see a significant performance gain.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Future state of the project

Post by Narann » Thu May 28, 2015 4:21 pm

AIO wrote:Although even with a fast RSP & VR4300, a lot of games still don't run fast enough due to the RDP being such a huge bottleneck.
From my perspective, the current CEN64 RDP implementation (angrylion) has a lot of possible optimizations (the first would be refactoring it to benefit from CPU vectorization).

MarathonMan: is the RDP the bottleneck? An easy way to test is to remove its logic and always write black pixels. How many VI/s would you gain from this?

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Thu May 28, 2015 5:25 pm

123_qwe wrote:Is libco from bsnes an alternative?
While it's a nice/elegant interface, it has considerably more overhead than what I have in mind.
Narann wrote:MarathonMan: is the RDP the bottleneck? An easy way to test is to remove its logic and always write black pixels. How many VI/s would you gain from this?
The RDP is a bottleneck for many reasons:

I have actually removed its logic, as you suggested, in the past, and found that not only does it have room for optimization (SIMD vectorization all over the place), but it pollutes the caches very badly due to its large size and branch-y code. Without the RDP, CEN64 is easily under 100KB. I've noticed that while running VR4300/RSP-intensive ROMs like mpeg ucode tasks, you can see 60 VI/s and almost 3 IPC... but when you start running the RDP, it drops to 2.25-2.5 IPC...

The performance regression depends heavily on the ROM. Some (SSB, Pokemon Snap! Intro) are heavily bottlenecked by RDP, while others are CPU/RSP (SM64) limited. See:
http://forums.cen64.com/viewtopic.php?f=8&t=184
Snowstorm64 wrote:It doesn't matter what you do, it will always be slow.
And cycle-accurate emulation of N64 will never be possible, right? ;)

Seriously, though, I think it is possible to bring the runtime down quite considerably with the right approach. I wish more people agreed with Narann re: SIMD optimizations. This not only has the benefit of increasing the work done per instruction, but also of decreasing binary size. Heck, if someone would just step up and stop refactoring the same plugin that is shared by all N64 emulators, N64 graphics emulation would be in a much better place.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Thu May 28, 2015 6:18 pm

Narann wrote: From my perspective, the current CEN64 RDP implementation (angrylion) has a lot of possible optimizations (the first would be refactoring it to benefit from CPU vectorization).
I think you're right, but it would require a ton of work to see results. From what I've seen, quite a bit of the code is already vectorized.
MarathonMan wrote: The performance regression depends heavily on the ROM. Some (SSB, Pokemon Snap! Intro) are heavily bottlenecked by RDP, while others are CPU/RSP (SM64) limited.
You should see Vigilante 8 - 2nd Offense :D . That game had the worst performance I've ever seen!
MarathonMan wrote: Seriously, though, I think it is possible to bring the runtime down quite considerably with the right approach. I wish more people agreed with Narann re: SIMD optimizations. This not only has the benefit of increasing the work done per instruction, but also of decreasing binary size.
Agreed. I wish people didn't underestimate how much of a bottleneck the RDP is ;/. Aside from SIMD optimizations, I think that code could use more LUTs! At least on my hardware, I saw improvements from using more LUTs.

One example off the top of my head was replacing

Code: Select all

c = ((s & 1)) ? (byteval & 0xf) : (byteval >> 4);
c |= (c << 4);
with

Code: Select all

c = LUT[((s & 1) << 8) | byteval];
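For reference, the 512-entry table for that indexing can be filled once at start-up straight from the original expression; a quick sketch (the function name is made up):

Code: Select all

#include <stdint.h>

static uint8_t LUT[512];

/* Fill the 512-entry table once at start-up, straight from the original
 * expression, using the ((s & 1) << 8) | byteval indexing. */
static void init_texel_i4_lut(void)
{
    unsigned s, byteval;
    for (s = 0; s < 2; s++)
        for (byteval = 0; byteval < 256; byteval++) {
            unsigned c = (s & 1) ? (byteval & 0xf) : (byteval >> 4);
            LUT[((s & 1) << 8) | byteval] = (uint8_t)(c | (c << 4));
        }
}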
MarathonMan wrote: Heck, if someone would just step up and stop refactoring the same plugin that is shared by all N64 emulators, N64 graphics emulation would be in a much better place.
Agreed, although I think it applies to other parts of N64 emulation too :D .
Alegend45 wrote:Dude, just optimize the RDP. It'll get everything to fullspeed. Star Fox is already quite playable on my Ivy Bridge i3-3210 at 40 VI/s or so. :)
Some games are slow due to the RSP and CPU, so just optimizing the RDP won't necessarily be enough for every game.

User avatar
123_qwe
Posts: 3
Joined: Thu May 28, 2015 9:17 am

Re: Future state of the project

Post by 123_qwe » Fri May 29, 2015 6:44 am

MAME devs have been optimizing the RDP for a while, and looking at the changelog they're getting some nice speedups: http://git.redump.net/mame/log/?qt=grep&q=n64 .
I for one would optimize the heck out of the RDP and then go for multithreading (maybe just the RDP would be easier and more effective).

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Future state of the project

Post by Narann » Fri May 29, 2015 8:47 am

123_qwe wrote:I for one would optimize the heck out of the RDP and then go for multithreading (maybe just the RDP would be easier and more effective).
I agree; trying to multithread the RDP would bring an interesting performance gain. However, keep in mind the RDP is not only used for display but also to write data to memory that is read back by the CPU/RSP. Because of this, the RDP is not technically the last part of the pipeline, and multithreading it would imply a lot of locking to ensure cycle accuracy... :(

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Fri May 29, 2015 9:48 am

Narann wrote:Because of this, the RDP is not technically the last part of the pipeline, and multithreading it would imply a lot of locking to ensure cycle accuracy... :(
QFT

Great approach for HLE RDP/pixel-accurate plugins, though.

User avatar
Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Future state of the project

Post by Snowstorm64 » Fri May 29, 2015 2:09 pm

MarathonMan wrote:
And cycle-accurate emulation of N64 will never be possible, right? ;)
Well, it's true that a software renderer is never that fast, especially compared to a hardware renderer. It will always take a fair share of the CPU power, so it will negatively affect the other components. There's not much we can do to improve the performance; we could add a lot of optimizations, but I don't believe we will be able to reach 60 VI/s in games that are bottlenecked by the RDP, especially with the current generation of CPUs.

User avatar
iwasaperson
Posts: 49
Joined: Tue Apr 22, 2014 12:50 am

Re: Future state of the project

Post by iwasaperson » Fri May 29, 2015 7:04 pm

Snowstorm64 wrote:
MarathonMan wrote:And cycle-accurate emulation of N64 will never be possible, right? ;)
Well, it's true that a software renderer is never that fast, especially compared to a hardware renderer. It will always take a fair share of the CPU power, so it will negatively affect the other components. There's not much we can do to improve the performance; we could add a lot of optimizations, but I don't believe we will be able to reach 60 VI/s in games that are bottlenecked by the RDP, especially with the current generation of CPUs.
The PCSX2 software renderer is actually quite fast; it runs at full speed with multithreading on my 3570K, with no slowdown.
Maybe you can look at some of the GSdx code and see if you can get any ideas from that.

Keep in mind that the software renderer is nowhere near 100% accurate, nor is it cycle-accurate, or trying to be by any means. There is also the obvious fact that the PS2 is extremely different from the N64.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Fri May 29, 2015 7:46 pm

iwasaperson wrote:The PCSX2 software renderer is actually quite fast; it runs at full speed with multithreading on my 3570K, with no slowdown.
Maybe you can look at some of the GSdx code and see if you can get any ideas from that.

Keep in mind that the software renderer is nowhere near 100% accurate, nor is it cycle-accurate, or trying to be by any means. There is also the obvious fact that the PS2 is extremely different from the N64.
This is the 'problem' with cycle accuracy. The accuracy comes along with a *lot* of hardware synchronization that needs to be simulated very precisely, and very consistently.

It's definitely possible to multi-thread things in an accurate manner, but not in a *cycle*-accurate manner. JD/MooglyGuy from MAME/MESS is looking at multi-threading the RDP renderer by doing all the span work in a separate thread. Great idea, still pixel accurate... but not cycle-accurate!

User avatar
123_qwe
Posts: 3
Joined: Thu May 28, 2015 9:17 am

Re: Future state of the project

Post by 123_qwe » Sat May 30, 2015 8:24 am

Do you mean giving a thread to the RDP, or breaking the RDP into threads (1)? The latter would still maintain cycle accuracy, but it must be much more difficult to implement, I guess.
Something like


(1)
treatSpans
->task1
->task2
->task3

Where task[1,2,3] are function pointers that could each be run in a separate thread, and treatSpans would just have to wait for its child threads to finish.
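To make (1) concrete, a minimal fork/join sketch with POSIX threads (treatSpans and the task names are just placeholders; a real implementation would reuse a thread pool instead of creating threads per span):

Code: Select all

#include <pthread.h>
#include <stdio.h>

/* Placeholder span-work functions; in a real RDP each would handle part
 * of the current span (e.g. texturing, combining, blending). */
static void *task1(void *span) { (void)span; /* ... */ return NULL; }
static void *task2(void *span) { (void)span; /* ... */ return NULL; }
static void *task3(void *span) { (void)span; /* ... */ return NULL; }

/* treatSpans forks one thread per task and joins them all before returning,
 * so the rest of the pipeline still sees a fully finished span. */
static void treatSpans(void *span)
{
    void *(*tasks[])(void *) = { task1, task2, task3 };
    pthread_t threads[3];
    int i;

    for (i = 0; i < 3; i++)
        pthread_create(&threads[i], NULL, tasks[i], span);
    for (i = 0; i < 3; i++)
        pthread_join(threads[i], NULL);
}

int main(void)
{
    treatSpans(NULL);
    puts("span done");
    return 0;
}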

User avatar
p4plus2
Posts: 2
Joined: Sun May 31, 2015 10:24 pm

Re: Future state of the project

Post by p4plus2 » Mon Jun 01, 2015 9:03 pm

Have you considered trying an approach where opcodes "chain" to each other? It would be difficult to maintain the pipeline correctly, but I don't think it would be impossibly hard either. I took a crack at a quick and dirty prototype that has a main CPU and a co-CPU; my demo works best for running the two in lockstep, but I'm more than certain it can be reworked to have the faster CPU lead the slower CPU easily enough. The code below (terrifying as it is) works with both clang (3.6 tested) and GCC (4.8/4.9 tested) (options used on both were: -fomit-frame-pointer -fno-asynchronous-unwind-tables -O3).
MSVC is untested; I don't have access to a Windows machine.

Linked because the code formatted poorly in code tags.

Flaws with this approach:
1) Timing CPUs that are not nice multiples of each other will be more difficult
2) Pipelining the main CPU will be harder, but the same approach should be usable nonetheless
3) There is some dead prelude and epilogue code for the main CPU. So there is some I-Cache pollution. This could be avoided if GCC supported naked functions on x86, but it doesn't.
4) It's not the sanest thing to code for

Advantages:
1) Should be faster than libco approaches (still need to write a benchmark to see how much faster, assuming that is the case)
2) Opcode chaining eliminates function prelude and epilogue execution, as well as any execution loop.
3) The stack is shared from the parent function (start); hopefully this will help keep that memory in cache better. Totally a guess though, needs more testing.
4) I speculate this coupling will make a unified memory model slightly easier to implement, but only barely if so.

I don't think the idea is purely usable as is, but perhaps somebody can improve it from here. I have a couple of ideas I want to toy with still.

Thoughts? Alternatively, the code was fun to write and you may commence laughing at such a weird method (I can't say I'd blame you at all :P).
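For anyone skimming, here's a stripped-down illustration of the "chaining" mechanism itself (not the linked prototype; just a single-component toy using GCC/Clang labels-as-values, so there is no dispatch loop and no per-opcode call/return):

Code: Select all

#include <stdint.h>
#include <stdio.h>

/* Toy direct-threaded core: each handler "chains" straight to the next
 * opcode's handler with a computed goto, so control never returns to a
 * central loop and no per-opcode prelude/epilogue is executed. */

enum { OP_INC, OP_DEC, OP_HALT };

static int run(const uint8_t *pc)
{
    static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;

    #define CHAIN() goto *handlers[*pc++]
    CHAIN();                      /* enter the chain */

op_inc:  acc++; CHAIN();
op_dec:  acc--; CHAIN();
op_halt: return acc;
    #undef CHAIN
}

int main(void)
{
    const uint8_t program[] = { OP_INC, OP_INC, OP_DEC, OP_INC, OP_HALT };
    printf("acc = %d\n", run(program));   /* prints acc = 2 */
    return 0;
}

Real opcode handlers would obviously do far more than bump an accumulator; the point is only the shape of the control flow.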

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Tue Jun 02, 2015 12:20 am

p4plus2 wrote:Have you considered trying an approach where opcodes "chain" to each other?
Why yes, I have... more or less along the lines of what I was thinking of. :)
p4plus2 wrote:Flaws with this approach:
1) Timing CPUs that are not nice multiples of each other will be more difficult
2) Pipelining the main CPU will be harder, but the same approach should be usable nonetheless
3) There is some dead prelude and epilogue code for the main CPU. So there is some I-Cache pollution. This could be avoided if GCC supported naked functions on x86, but it doesn't.
4) It's not the sanest thing to code for
1) Yep. Fortunately we're only working with 1x/1.5x.
2) No harder than pipelining the RSP or RDP/etc., though?
3) I'm hoping to avoid all that by generating code on-the-fly and using my own calling convention. Only do a full prelude/epilogue on entry/exit to the "VM".
4) Yes... for sure!
p4plus2 wrote:Thoughts? Alternatively, the code was fun to write and you may commence laughing at such a weird method (I can't say I'd blame you at all :P).
I'm just glad someone came up with a similar idea to what I had... makes me feel more sane. :D

User avatar
p4plus2
Posts: 2
Joined: Sun May 31, 2015 10:24 pm

Re: Future state of the project

Post by p4plus2 » Tue Jun 02, 2015 4:25 am

MarathonMan wrote:2) No harder than pipelining the RSP or RDP/etc., though?
In general, no, but the code as I posted it wasn't quite fit for that setup, I think.
MarathonMan wrote:3) I'm hoping to avoid all that by generating code on-the-fly and using my own calling convention. Only do a full prelude/epilogue on entry/exit to the "VM".
That's pretty darn clever; I hadn't considered that. This would probably lead to a whole slew of other future optimizations I'd imagine.

MarathonMan wrote: I'm just glad someone came up with a similar idea to what I had... makes me feel more sane. :D
I must admit I am pretty happy myself that the idea is anywhere near viable. It was one of those random late-night ideas after reading the first post in this thread. Hopefully as the summer comes around I'll find some time to contribute to this project before class starts again :). I've been wanting to see a high-accuracy N64 emulator for a long time, and the more I watch this project the more interesting it gets! Best of luck.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Tue Jun 02, 2015 10:59 am

p4plus2 wrote:This would probably lead to a whole slew of other future optimizations I'd imagine.
I hope so!

For starters, my thinking is that half of the xmm registers (in 64-bit mode, anyway; fewer, or not the case at all, in 32-bit mode) can be statically allocated to the RSP accumulator and flags. Most of the vectorized functions in CEN64 need a working set of < 8 xmm registers for most tasks, so those functions won't have to spill anything, either...

I tried doing this in the current version of CEN64, but it was so easy to overlook things, and as a result, this would happen when I forgot to save context around an external call:
http://forums.cen64.com/viewtopic.php?f ... 1543#p1543

Lots of other optimizations I have in mind, too... as long as the overhead for compilation isn't too much (or can be leveraged sufficiently well), I'm quite confident that CEN64 will run on commodity processors without an issue.

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Future state of the project

Post by izy » Tue Jun 02, 2015 12:08 pm

p4plus2 wrote: 3) There is some dead prelude and epilogue code for the main CPU. So there is some I-Cache pollution. This could be avoided if GCC supported naked functions on x86, but it doesn't.
Hello p4plus2, I looked at your code and it may be possible to write it better. Let me help you.

1) The GCC compiler supports naked functions everywhere.
The most interesting thing is that the GCC compiler does not need naked functions at all.
However, if you need naked functions in GCC you can get them. You must write everything by hand, including the final "ret", manually manage the stack pointer if you touched it, and handle the input and output parameters yourself...
I'll write a complete example here for you.

Code: Select all

#include <stdio.h>
#include <stdlib.h>

/* C prototypes for the hand-written assembly functions below. */
size_t asmtest(size_t a, size_t b);
void asminterrupt(void);
void crashtest(void);
int myputs(const char *s);

asm(
".global asmtest;"
".type asmtest, @function;"
"asmtest:"
"call asminterrupt;" // call another function
"lea (%rdi, %rsi), %rax;"  // sum and move
"ret;" // return
".global asminterrupt;"
".type asminterrupt, @function;"
"asminterrupt:"
"int $3;"  // trap to the debugger
"ret;" // return
".global crashtest;"
".type crashtest, @function;"
"crashtest:"
"ud2;"  // illegal instruction
".global myputs;"
".type myputs, @function;"
"myputs:"
"jmp puts;" // redirect to a C function
// no ret required after a jmp
);
int main()
{
    size_t sum;
    sum = asmtest((size_t)1, (size_t)5);
    printf("Result is %zu\n", sum);
    myputs("Crash test!\n");
    fflush(stdout);
    crashtest();
    return 0;
}
You must respect the calling convention used by your operating system to pass input parameters. The standard System V AMD64 convention (also used by Linux) passes the first two parameters in the registers %rdi and %rsi. On Windows the registers are different. Note that floats and doubles aren't passed in general-purpose registers, but in the registers of the SSE unit. On 32-bit operating systems, the calling convention is usually cdecl (the arguments are passed on the stack in reverse order and the caller cleans up the stack). The accumulator register is used for the return value.
It's possible to use #ifdefs, like in any C program, to conditionally generate code for a particular operating system and hardware platform at compile time.
The code above will only work on 64-bit Linux/BSDs.

Tip: GCC is a powerful compiler, and when you write in assembly you can also access the power of GAS (the GNU Assembler)...

If you dislike managing many calling conventions, you can create your own calling convention easily... just write the assembly code you wish (even in an external .s file for GAS). You will be able to call it from any C function:

Code: Select all

size_t function(size_t a, size_t b)
{
    size_t res;
    asm volatile(
        "call asmtest;" // call the naked assembly function
        : "=&a"(res) // read the output from R/E/AX
        : "D" (a), "S" (b) // Put inputs always in R/E/DI and R/E/SI
// these must not change, since I put them as constants
        : "cc" // always a good idea
        );
    return res;
}
In your naked functions you can do whatever you want. For example, you don't need to keep the stack aligned, but on AMD64 if you call a C function you should align the stack back to 16 bytes. In my code I forgot to do so (for puts). Also, always keep the DF (direction flag) unset, as expected.

2) You make "GCC shut up" in the wrong way.
Using exit(-1) will silence GCC's warnings, but it creates real code in order to do so. Replace your exit(-1) call with the builtin: __builtin_unreachable();
Using __builtin_unreachable() will enable extra optimizations in certain cases and will avoid useless code.
I suggest you look up __builtin_unreachable() at https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
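A minimal before/after, assuming the exit(-1) was only there to silence the "control reaches end of non-void function" warning:

Code: Select all

#include <stdio.h>

static int decode(int op)
{
    switch (op & 3) {
    case 0: return 10;
    case 1: return 11;
    case 2: return 12;
    case 3: return 13;
    }
    /* exit(-1);              <- emits a real call just to silence the warning */
    __builtin_unreachable();  /* emits nothing, and unlocks extra optimization */
}

int main(void)
{
    printf("%d\n", decode(6));  /* prints 12 */
    return 0;
}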

3) Your code is all based on an old programming style; please upgrade.
A goto equals a jmp (jump), automatically. A list of functions, by contrast, is more complex: when you call a function via list[index](), you get a "call", which is slower.
The GCC compiler supports computed gotos (labels as values), whose purpose is exactly to replace switches and create dynamic switches; look also at the examples here: https://gcc.gnu.org/onlinedocs/gcc/Labe ... alues.html
(in switches, all values are constants and cannot be modified in any way.)
You can therefore create fast dynamic switches, similar to a list of function addresses, in the following way:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
void execute(size_t opcode_index)
{
    static void *opcodes[] = { &&invalid, &&opcode1, &&opcode2, &&opcode3 };
    /* do something generic */
    goto *opcodes[opcode_index & 3];
invalid:
    puts("executed opcode 0 or two times another opcode");
    abort();
opcode1:
    /* do something */
    puts("executed opcode 1");
    opcodes[1] = &&invalid;
    return;
opcode2:
    /* do something */
    puts("executed opcode 2");
    opcodes[2] = &&invalid;
    return;
opcode3:
    /* do something */
    puts("executed opcode 3");
    opcodes[3] = &&invalid;
    return;
}
int main()
{
    execute(2);
    execute(3);
    execute(2);
    return 0;
}
4) Never forget to write "volatile" after "asm" when you use inline assembly without *used* output values. Otherwise, the compiler may remove your asm block. The compiler acts smarter than most people would like.

5) Writing "rax", "rbx" in the clobbered registers can be avoided. I am sure you knew this already. When you want to force the usage of a particular register you can just write:
"a" (op.step), "b" ((rom[cpu.pc] & 0xFF) << 3)
and you'll get op.step in the accumulator ("a") and the other value in the base register ("b"). Of course these are set as constants. You should use the registers modifiers if your code will alter those registers. In order to do so, you should assign them to local variables and put the values in the input/output registers list.

Code: Select all

register size_t myrax, myebx;
myrax = op.step;
myebx = (rom[cpu.pc] & 0xFF) << 3;
asm volatile(
"add %[numberInAnAutomaticRegister],  %[anyName]  ;"
"add %k[numberInAnAutomaticRegister], %k[anyWord] ;"
: [anyName] "+&a" (myrax), [anyWord] "+&b" (myebx)
: [numberInAnAutomaticRegister] "r" (any_variable)
: "cc"
);
/* myrax and myebx have been incremented */
There is a lot to explain here:
named registers, to avoid %0, %1, etc.
forced registers: "a" means accumulator, "b" is base, "c" counter, "d" data, "S" is source index and "D" destination index.
automatic register: "r" means any register (including r8-r15 on AMD64); use "Q" if you want to restrict it to the legacy a/b/c/d registers.
the modifier "+", to mark an input/output value.
the modifier "&", useless in this case but always a good idea to have, to tell GCC not to alias registers.
clobbered "cc", because it tells GCC that the FLAGS register gets altered ("add" changes the flags). Of course, on x86 and AMD64 you can sum two numbers without using "add" ("add" is an instruction of the CPU's ALU). You can use the AGU instead, which is often faster on superscalar CPUs and doesn't touch the CPU flags... look at the code above, where I used LEA to sum two registers and move the result into another register with it (a single but very powerful instruction).
modifiers for named registers: writing "%k[" will access the lower 32 bits of a register. It is useful when the same code is compiled on 64-bit CPUs, for example if you want to control how overflow behaves. Writing %w[anyWord] will access BX, %b[anyWord] will access BL and %h[anyWord] will access BH.
However, the GCC compiler will refuse to let you directly overwrite the RBX/EBX register in PIC applications.

Tip: the GCC compiler supports "asm goto" blocks, if you want to mix assembly code with gotos. Note that there is a limitation: asm gotos don't support output parameters; but at least labels can be named in the assembly code and reached easily by writing "jmp %l[thename]" (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html), and the GCC compiler supports local labels for macros (https://gcc.gnu.org/onlinedocs/gcc/Local-Labels.html)...

User avatar
asiga
Posts: 24
Joined: Fri May 30, 2014 5:35 pm

Re: Future state of the project

Post by asiga » Thu Jun 18, 2015 3:39 am

Agreed, accelerating the RDP with OpenGL won't be pixel exact, but there's another option: GPGPU. You should be able to get pixel-accurate results with OpenCL (which, by the way, allows dynamic recompilation, so you could compile on-the-fly OpenCL kernels highly optimized for the current RDP status).

Also, OpenCL kernels are written in C. OpenCL C has some special keywords, but it's easy to write the same code for both OpenCL and any other standard C compiler (GCC or any other). By using preprocessor defines, you can compile your OpenCL kernel with GCC, or send it to OpenCL, which is nice because you can reuse and test your code in very different scenarios.
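A sketch of that pattern (the names are made up; __OPENCL_VERSION__ is the macro that OpenCL C compilers predefine, so the same source builds either as a kernel or as plain, testable host C):

Code: Select all

/* One source file, two compilers: built as OpenCL C it becomes a kernel,
 * built as plain C it becomes an ordinary function you can unit-test. */
#ifdef __OPENCL_VERSION__          /* defined by OpenCL C compilers */
  #define KERNEL __kernel
  #define GLOBAL __global
#else                              /* plain C build (GCC/Clang) */
  #define KERNEL
  #define GLOBAL
  #define get_global_id(dim) 0     /* stub: test a single work item */
#endif

KERNEL void combine(GLOBAL const int *a, GLOBAL const int *b, GLOBAL int *out)
{
    int i = get_global_id(0);
    out[i] = ((a[i] - b[i]) * 3 + (1 << 7)) >> 8;   /* same math in both builds */
}

#ifndef __OPENCL_VERSION__
#include <stdio.h>
int main(void)
{
    int a[1] = { 300 }, b[1] = { 44 }, out[1];
    combine(a, b, out);
    printf("%d\n", out[0]);   /* prints 3 */
    return 0;
}
#endif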

On the cons side, getting maximum performance out of OpenCL is sometimes frustrating. It's a massively parallel architecture, so it's not always straightforward to optimize code for it. Also, access to global memory has penalties. And intense memory reads/writes between the GPU and the host can quickly ruin performance if not studied carefully.

OpenCL could even allow all your GPUs and CPU cores to run in parallel, and the nice thing is that this doesn't change your code: you just choose whether to use one or several devices, choosing from all your CPUs and GPUs.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Future state of the project

Post by Narann » Thu Jun 18, 2015 10:15 am

The problem with RDP is not the computational part. The maths are quite basic: (a-b)*c+d

The problem is in all the branching you need to reproduce the exact same fixed-point result. I have no concrete example (Battle of Naboo?) but I think some games used the rendered RDP buffer to do some faster math and then read the pixel values back (using the GPU for massive math is a typical advanced trick on consoles).

OpenCL is an API that has been designed to be portable, so you're out of luck there. Does HatCat's pixel-perfect plugin use the GPU? I would be very curious to know how he has dealt with bit masks (something you need to be pixel perfect).

The options we have are:
1) Pure cycle accurate RDP renderer (impossible to get proper performance IMHO).
2) Full CPU RDP. Pixel perfect, average performance but not cycle accurate (angrylion RDP).
3) Same as 2) but in a multithreaded tile based approach. Pixel perfect, good performance, not cycle accurate.
4) OpenCL/OpenGL, not pixel perfect, good performance, certainly not cycle accurate.

I'm in favor of 3), as it seems to be a good trade-off between results and performance, and it would allow us to play at 60 VI/s.

Maybe Vulkan (SPIR-V, actually) would allow bit masks, but it's not for tomorrow.

User avatar
asiga
Posts: 24
Joined: Fri May 30, 2014 5:35 pm

Re: Future state of the project

Post by asiga » Thu Jun 18, 2015 3:12 pm

Narann wrote:The problem with RDP is not the computational part. The maths are quite basic: (a-b)*c+d

The problem is in all the branching you need to reproduce the exact same fixed-point result. I have no concrete example (Battle of Naboo?) but I think some games used the rendered RDP buffer to do some faster math and then read the pixel values back (using the GPU for massive math is a typical advanced trick on consoles).

OpenCL is an API that has been designed to be portable, so you're out of luck there. Does HatCat's pixel-perfect plugin use the GPU? I would be very curious to know how he has dealt with bit masks (something you need to be pixel perfect).

The options we have are:
1) Pure cycle accurate RDP renderer (impossible to get proper performance IMHO).
2) Full CPU RDP. Pixel perfect, average performance but not cycle accurate (angrylion RDP).
3) Same as 2) but in a multithreaded tile based approach. Pixel perfect, good performance, not cycle accurate.
4) OpenCL/OpenGL, not pixel perfect, good performance, certainly not cycle accurate.

I'm in favor of 3), as it seems to be a good trade-off between results and performance, and it would allow us to play at 60 VI/s.

Maybe Vulkan (SPIR-V, actually) would allow bit masks, but it's not for tomorrow.
I think you're mixing OpenCL and OpenGL, because in OpenCL you have all the C operators. This is also true for bitwise operators, so yes, bit masks are supported. Every C expression can be compiled into an OpenCL kernel. Also, regarding math, OpenCL follows the IEEE specs. Only when you enable some aggressive optimizations can you get lower precision.

However, GPUs behave as SIMD machines, and it's typical to have hundreds or thousands of work units sharing the same instruction. This means that branching usually has a significant penalty: if some work units pass an "if" clause and others fail it, some of them get paused until all come back to the same instruction again.

Also, although OpenCL can access images, you don't need to do so: you can work with data arrays just like you'd do in C. I'm saying this because you can achieve in OpenCL the same accuracy as in C.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Future state of the project

Post by Narann » Thu Jun 18, 2015 4:21 pm

asiga wrote:I think you're mixing OpenCL and OpenGL
No I don't. :)
asiga wrote:because in OpenCL you have all the C operators
Why does the URL I linked above state the opposite? There's other info here.

Maybe this has changed with newer versions. Any spec to point to?

About the precision: the N64 RDP is not IEEE at all. It's all fixed point (of multiple types). The problem is not "loss of precision"; it's reproducing the RDP's fixed-point arithmetic exactly. A lot of things depend heavily on this. The depth mess in HLE in general is a perfect example.
asiga wrote:if some work units pass an "if" clause and others fail it, some of them get paused until all come back to the same instruction again.
This is why it's quite hard to emulate on massively parallel architectures like GPUs. Read the angrylion code: it's full of if statements, because reproducing RDP behavior is not simply applying (a-b)*c+d. It's also writing the rasterizer with all its subtle options.
asiga wrote:Also, although OpenCL can access images, you don't need to do so
This is not about me. :) Some games use the N64's RDP as a math processing unit (like using PC+OpenGL to do math computations before the OpenCL days). Those games read the image buffer as the result of the computation and handle the result on the N64 CPU. That's why your result has to be "perfect".

User avatar
asiga
Posts: 24
Joined: Fri May 30, 2014 5:35 pm

Re: Future state of the project

Post by asiga » Fri Jun 19, 2015 4:10 am

Narann wrote:
asiga wrote:I think you're mixing OpenCL and OpenGL
No I don't. :)
asiga wrote:because in OpenCL you have all the C operators
Why does the URL I linked above state the opposite? There's other info here.

Maybe this has changed with newer versions. Any spec to point to?
Sorry, didn't know this. However, this doesn't imply you cannot do bit-masks, or even bit-fields. In fact, a lot of OpenGL functionality depends on masks, so the GPU knows how to mask bits and work with them. In this respect, OpenCL supports bitwise operators, and also right/left bit shifts, so, yes, bitfields are supported. Given that C bitfields are just a syntax convenience (compilers translate bitfields into bitwise operators and bit shifts), the fact that the C bitfield syntax isn't supported doesn't mean you'll get less performance than if it was supported, because the operators are there, available for your use.
Narann wrote:About the precision: the N64 RDP is not IEEE at all. It's all fixed point (of multiple types). The problem is not "loss of precision"; it's reproducing the RDP's fixed-point arithmetic exactly. A lot of things depend heavily on this. The depth mess in HLE in general is a perfect example.
OpenCL has a native 32-bit integer type, as well as all the operators needed to work with it, so it should be possible to write a fixed-point implementation with it.
Narann wrote:
asiga wrote:if some work units pass an "if" clause and others fail it, some of them get paused until all come back to the same instruction again.
This is why it's quite hard to emulate on massively parallel architectures like GPUs. Read the angrylion code: it's full of if statements, because reproducing RDP behavior is not simply applying (a-b)*c+d. It's also writing the rasterizer with all its subtle options.
Yes. Using OpenCL isn't as easy as sending your code to the OpenCL compiler. Well, actually, with very few modifications the compiler will accept it, but getting maximum performance is another story. And it's sometimes a bit frustrating to find the cause of low performance, because low-level profilers aren't available on all platforms, nor for all GPU vendors.

However, you can still get impressive performance by cleverly coding your algorithm. For example, GPU-based raytracers need branching. You cannot write a raytracer without branching. And GPU raytracers still outperform CPU raytracers by a huge margin.

Another fact is that you can compile your OpenCL code at runtime. This allows you to introduce a lot of optimizations (you can even reduce the use of branching by doing this). However, compiling OpenCL kernels has a significant performance cost, so this should be done only in a few selected places (for example, just after loading a game).
Narann wrote:
asiga wrote:Also, although OpenCL can access images, you don't need to do so
This is not about me. :) Some games use the N64's RDP as a math processing unit (like using PC+OpenGL to do math computations before the OpenCL days). Those games read the image buffer as the result of the computation and handle the result on the N64 CPU. That's why your result has to be "perfect".
Yes, but the image buffer doesn't need to be an image buffer. In fact, the original OpenCL data buffers aren't images, but arrays of any data type. OpenCL supports image buffers as a convenience, but you can work with an array of 32-bit ints if you prefer.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Fri Jun 19, 2015 1:03 pm

You can probably optimize a lot of the branch-y RDP code into bitmasks and use bit-level tricks to avoid them.
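For example (a generic bit trick, not code lifted from angrylion), a data-dependent if/else over integers can often be folded into a mask and a blend:

Code: Select all

#include <stdint.h>
#include <stdio.h>

/* Branchy form (from the earlier example in this thread):
 *   if (s & 1) c = byteval & 0xf; else c = byteval >> 4;
 * Branchless form: turn the condition into an all-ones/all-zeros mask
 * and blend the two candidate results with it. */
static uint32_t select_nibble(uint32_t s, uint32_t byteval)
{
    uint32_t mask = (uint32_t)0 - (s & 1);   /* 0xFFFFFFFF if LSB set, else 0 */
    return ((byteval & 0xf) & mask) | ((byteval >> 4) & ~mask);
}

int main(void)
{
    printf("%x %x\n", select_nibble(1, 0xA7), select_nibble(0, 0xA7));
    /* prints: 7 a */
    return 0;
}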

I've never worked with OpenCL before, but I'm very supportive of anyone who wants to try a wild and crazy idea and see what comes of it. :)

User avatar
Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Future state of the project

Post by Snowstorm64 » Fri Jun 19, 2015 8:38 pm

What about Vulkan, then? :) As a single API it covers both graphics AND compute operations, but I don't know if Vulkan offers benefits that are far superior to separate OpenGL/OpenCL...

User avatar
wareya
Posts: 16
Joined: Tue May 19, 2015 5:44 pm

Re: Future state of the project

Post by wareya » Fri Jun 19, 2015 9:06 pm

>but I don't know if Vulkan offers benefits that are far superior to separate OpenGL/OpenCL...

Supposedly compatibility, for one.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Sat Jun 20, 2015 1:46 pm

Snowstorm64 wrote:What about Vulkan, then? :) As a single API it covers both graphics AND compute operations, but I don't know if Vulkan offers benefits that are far superior to separate OpenGL/OpenCL...
While the API is undoubtedly better, not a single vendor (to my knowledge) has released a Vulkan driver publicly yet, so...

CEN64 can technically even work with OpenGL 1.x... so it might be a good idea to stick with OpenGL/CL, since compatibility isn't really a problem (the GL features that CEN64 requires are incredibly simple - it'd be an enormous driver bug if they weren't implemented correctly - and it does all of the GL calls from a separate thread, so the overhead that's often mentioned isn't an issue, either).

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Sat Jun 20, 2015 5:48 pm

I think for now it's best to just use OpenGL and heavily optimize the current code base. I wouldn't bother with multithreading either, until the RSP and RDP have been significantly sped up.

One good way to reduce the amount of branching is to use LUTs, like in my previous example.

Fortunately, Angrylion has recently pushed some commits :D http://code.google.com/p/angrylions-stuff/source/list . I'm looking forward to the progress. I may actually spend some time on optimizing it. RDP is my main bottleneck.

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Future state of the project

Post by izy » Mon Jun 22, 2015 10:54 am

Hello AIO.

Code: Select all

    c = ((s & 1)) ? (byteval & 0xf) : (byteval >> 4);
    c |= (c << 4);
I don't know if replacing 8 instructions with a LUT version of 4 instructions, one of which involves a memory access, is advantageous.
However, it's possible to use "ADD" in place of "OR".

Code: Select all

    c = LUT[((s & 1) << 8) | byteval];
the code could be easily compiled in this way:

Code: Select all

and $1, s
shl $8, s
or  byteval, s
movzxb (LUT, s), %eax     # eax=0, al = LUT[s]
mov (LUT, s), %al         # al = LUT[s]
(using either movzx or mov)
It works as long as you declared s and byteval as size_t (there is no need at all to use uint8_t if you don't need it for specific tricks, for example to play with the overflow 255+1 = 0, but the compiler can also figure this out on its own). The same code using a sum will end up shorter.
I suggest you replace the OR (or XOR) with "+" and use CPU-word-wide variables. The generated code will be shorter because MOV and LEA can be used to execute the former "OR"-sum at the same time as the next instruction. Also, the instruction "OR" won't give any advantage over "+" on x86/AMD64.

If you create the table in another way, it can be accessed even more easily:

Code: Select all

    c = LUT[(byteval << 1) + (s & 1)];

Code: Select all

and $1, s
lea (s, byteval, 2), s      # s = s+byteval*2
movzxb (LUT, s), %eax       # eax=0, al = LUT[s]
mov (LUT, s), %al           # al = LUT[s]
I don't know if it'll change anything, because of the different way the values would have to be stored in the table (for ordered cache access, I mean).

The ternary operator can often be replaced with a conditional move, as long as both possible results can be computed in advance. As a common rule, you'll notice that code based on the ternary operator where both "branches" use the same variable can often be rewritten without the ternary operator.
Think about the original code. The result of "s & 1" is used as a selector. If it is set to 1, the low nibble of byteval is selected. Otherwise, the high nibble is selected. The selected nibble is then replicated into the byte-sized variable named "c". It is possible to write the code using CMOV, but it is also possible to rewrite the whole thing another way. I would write the same code in this way:

Code: Select all

s <<= 2;         # prepare for the shift
s ^= 4;          # toggle the condition
s &= 4;          # select the first bit (now 4th)
byteval >>= s;   # byteval will keep its value or move its high nibble replacing the lowest nibble
byteval &= 0x0f; # select the low nibble
c =  byteval | (byteval << 4); # propagate
In real C:

Code: Select all

unsigned b(unsigned s, unsigned byteval)
{
    s <<= 2;
    s ^= 4;
    s &= 4;
    byteval >>= s;
    byteval &= 0x0f;
    return byteval | (byteval << 4);
}
/*
00000000004005b0 <b>:
  4005b0:       8d 0c bd 00 00 00 00    lea    0x0(,%rdi,4),%ecx
  4005b7:       83 f1 04                xor    $0x4,%ecx
  4005ba:       83 e1 04                and    $0x4,%ecx
  4005bd:       d3 ee                   shr    %cl,%esi
  4005bf:       83 e6 0f                and    $0xf,%esi
  4005c2:       89 f0                   mov    %esi,%eax
  4005c4:       c1 e0 04                shl    $0x4,%eax
  4005c7:       09 f0                   or     %esi,%eax
  4005c9:       c3                      retq   

*/
It's possible to optimize the code in C, in order to save 1 instruction in assembly:

Code: Select all

unsigned b(unsigned s, unsigned byteval)
{
    s <<= 2;
    s += 4; // yes: use the "+" when possible!
    s &= 4;
    byteval >>= s;
    byteval &= 0x0f;
    return byteval | (byteval << 4);
}
/*
00000000004005b0 <b>:
  4005b0:       8d 0c bd 04 00 00 00    lea    0x4(,%rdi,4),%ecx
  4005b7:       83 e1 04                and    $0x4,%ecx
  4005ba:       d3 ee                   shr    %cl,%esi
  4005bc:       83 e6 0f                and    $0xf,%esi
  4005bf:       89 f0                   mov    %esi,%eax
  4005c1:       c1 e0 04                shl    $0x4,%eax
  4005c4:       09 f0                   or     %esi,%eax
  4005c6:       c3                      retq   
  4005c7:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  4005ce:       00 00 
*/
It's made of 7 instructions, or 8 instructions when XOR is used in place of the sum (again: always use "+" in place of OR/XOR... as you can see, the compiler knows about these optimizations, but it cannot always apply them on its own; you should help the compiler).
Written by hand in assembly, with one less instruction:

Code: Select all

unsigned c(unsigned s, unsigned byteval)
{
    unsigned result;
    unsigned counter;
    asm volatile(
        "lea            4(, %q[s], 4),          %k[c]                    ;"
        "and            $4,                     %k[c]                    ;"
        "shr            %b[c],                  %k[byteval]              ;"
        "and            $0x0f,                  %k[byteval]              ;"
        "lea            (, %q[byteval], 4),     %k[c]                    ;"
        "lea            (%k[byteval],%k[c],4),  %k[result]               ;"
        : [result] "=r"(result), [c] "=&c" (counter), [byteval] "+&r" (byteval)
        : [s] "r" (s)
        : "cc"
    );
    return result;
}
/*
00000000004005d0 <c>:
  4005d0:       8d 0c bd 04 00 00 00    lea    0x4(,%rdi,4),%ecx
  4005d7:       83 e1 04                and    $0x4,%ecx
  4005da:       d3 ee                   shr    %cl,%esi
  4005dc:       83 e6 0f                and    $0xf,%esi
  4005df:       8d 0c b5 00 00 00 00    lea    0x0(,%rsi,4),%ecx
  4005e6:       67 8d 04 8e             lea    (%esi,%ecx,4),%eax
  4005ea:       c3                      retq   
*/
// alternative:
unsigned d(unsigned s, unsigned byteval)
{
    unsigned result;
    unsigned counter;
    asm volatile(
        "lea            4(%k[z], %k[s], 4),     %k[c]                     ;"
        "and            $4,                     %k[c]                     ;"
        "shr            %b[c],                  %k[byteval]               ;"
        "and            $0x0f,                  %k[byteval]               ;"
        "lea            (%k[z],%k[byteval],4),  %k[c]                     ;"
        "lea            (%k[byteval],%k[c],4),  %k[result]               ;"
        : [result] "=r"(result), [c] "=&c" (counter), [byteval] "+&r" (byteval)
        : [s] "r" (s), [z] "r" ((unsigned)0) /* add an xor to make the lea shorter */
        : "cc"
    );
    return result;
}
/*
00000000004005f0 <d>:
  4005f0:       31 c0                   xor    %eax,%eax
  4005f2:       67 8d 4c b8 04          lea    0x4(%eax,%edi,4),%ecx
  4005f7:       83 e1 04                and    $0x4,%ecx
  4005fa:       d3 ee                   shr    %cl,%esi
  4005fc:       83 e6 0f                and    $0xf,%esi
  4005ff:       67 8d 0c b0             lea    (%eax,%esi,4),%ecx
  400603:       67 8d 04 8e             lea    (%esi,%ecx,4),%eax
  400607:       c3                      retq   
*/
It's nice to notice that in the second version, with the extra XOR, GCC selected "eax" to hold the value 0 without wasting another register (if the function is inlined it will change register, and could also reuse another register that was pre-initialized to 0 for whatever other reason). When the code is inlined, the final lea to eax (67 8d 04 8e) will automatically change to whichever register the result should be placed in.

Which is better: 6 or 7 native instructions, or 3 to 4 instructions where one of them accesses memory?
The LUT is faster with fast memory access (a cache hit), when the loaded value won't be used in the very next instruction, and when the other version is inlined and finds the ecx register busy and has to back it up. Otherwise, the computed-value version is also fine.

Code: Select all


#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
unsigned a(unsigned s, unsigned byteval)
{
    unsigned c;
    c = ((s & 1)) ? (byteval & 0xf) : (byteval >> 4);
    c |= (c << 4);
    return c;
}
unsigned b(unsigned s, unsigned byteval)
{
    s <<= 2;
    s += 4;
    s &= 4;
    byteval >>= s;
    byteval &= 0x0f;
    return byteval | (byteval << 4);
}
unsigned c(unsigned s, unsigned byteval)
{
    unsigned result;
    unsigned counter;
    asm volatile(
        "lea            4(, %q[s], 4),          %k[c]                    ;"
        "and            $4,                     %k[c]                    ;"
        "shr            %b[c],                  %k[byteval]              ;"
        "and            $0x0f,                  %k[byteval]              ;"
        "lea            (, %q[byteval], 4),     %k[c]                    ;"
        "lea            (%k[byteval],%k[c],4),  %k[result]               ;"
        : [result] "=r"(result), [c] "=&c" (counter), [byteval] "+&r" (byteval)
        : [s] "r" (s)
        : "cc"
    );
    return result;
}
unsigned d(unsigned s, unsigned byteval)
{
    unsigned result;
    unsigned counter;
    asm volatile(
        "lea            4(%k[z], %k[s], 4),     %k[c]                     ;"
        "and            $4,                     %k[c]                     ;"
        "shr            %b[c],                  %k[byteval]               ;"
        "and            $0x0f,                  %k[byteval]               ;"
        "lea            (%k[z],%k[byteval],4),  %k[c]                     ;"
        "lea            (%k[byteval],%k[c],4),  %k[result]               ;"
        : [result] "=r"(result), [c] "=&c" (counter), [byteval] "+&r" (byteval)
        : [s] "r" (s), [z] "r" ((unsigned)0) /* add an xor to make the lea shorter */
        : "cc"
    );
    return result;
}

uint8_t LUT[512];
unsigned l(unsigned s, unsigned byteval)
{
    return LUT[(byteval << 1) + (s & 1)];
}
/*
No inline assembly is required. If you know the compiler,
the compiler knows more-or-less what you want.
0000000000400650 <l>:
  400650:       83 e7 01                and    $0x1,%edi
  400653:       8d 04 77                lea    (%rdi,%rsi,2),%eax
  400656:       0f b6 80 20 0b 60 00    movzbl 0x600b20(%rax),%eax
  40065d:       c3                      retq   
*/

int main()
{
    unsigned s, byteval;
    for(s = 0; s < 2; s++)
        for(byteval = 0; byteval < 256; byteval++)
            LUT[(byteval << 1) + (s & 1)] = a(s, byteval);
    printf("A 0x%02x\n", a(6, 23));
    printf("B 0x%02x\n", b(6, 23));
    printf("C 0x%02x\n", c(6, 23));
    printf("D 0x%02x\n", d(6, 23));
    printf("L 0x%02x\n", l(6, 23));
    return 0;
}

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Wed Jun 24, 2015 12:52 am

Hello izy :D !

I did some benchmarking and tested your LUT example vs. your optimized non-LUT C version. I profiled Mario Party 1 and just played the intro (~121 seconds), since that game actually uses "case TEXEL_I4:" in fetch_texel. The LUT took ~142ms while the non-LUT took ~172ms. I think I'll trust that the LUT is faster, since I tend to see an improvement from using them in these kinds of scenarios, at least on my hardware.

I'm impressed, though, because the examples you present sure are creative. I'm using Clang because I use HatCat's fork of Angrylion's plugin as a base. The reason I chose his fork is that I prefer C over C++, though the performance isn't any better, contrary to what some end users automatically assume. With HatCat's fork, Clang has the best performance, while Angrylion's seems to perform best on MSVC. Anyway, with Clang, unfortunately it doesn't seem to take advantage of lea for the piece of code in the switch statement, regardless of the algorithms I tried. Stuff like this makes me just want to write more code in assembly :D . As for your LUT algorithm, the only difference (in Clang's output) is that it replaces the shift instruction with an add. I'm sure adding is no worse than shifting, so I will use yours; plus, on other compilers it may take advantage of lea. Maybe I should try upgrading / downgrading Clang, since the version seems to have a significant impact sometimes.

I declared s and byteval as UINT32, just to make sure the compiler uses movzx. I like to write code that's more likely to have better assembly output. I have to admit, experimenting with micro-optimizations for the RDP is more fun than I anticipated :D . Unfortunately, I have a habit of focusing on code that hardly gets executed ;/ .

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Thu Jun 25, 2015 10:33 am

Let's really think about what that statement is saying:

Code: Select all

    c = ((s & 1)) ? (byteval & 0xf) : (byteval >> 4);
    c |= (c << 4);
If the LSB of s is set, then we take the low-order nibble of the byteval. Else, we take the high order nibble of the byteval.
Whichever nibble we take, we replicate it such that the high and low order nibble have the same value.

Here's my attempt:

Code: Select all

#include <stdint.h>

uint8_t nolut(uint8_t s, unsigned byteval) {
  unsigned sa = (s & 1) * 4;
  uint8_t c = byteval << sa; 

  return (c >> 4) | (c & 0xF0);
}

Code: Select all

0000000000000000 <nolut>:
   0:	83 e7 01             	and    $0x1,%edi
   3:	8d 0c bd 00 00 00 00 	lea    0x0(,%rdi,4),%ecx
   a:	d3 e6                	shl    %cl,%esi
   c:	89 f0                	mov    %esi,%eax
   e:	83 e6 f0             	and    $0xfffffff0,%esi
  11:	c0 e8 04             	shr    $0x4,%al
  14:	09 f0                	or     %esi,%eax
  16:	c3                   	retq

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Fri Jun 26, 2015 7:48 am

MarathonMan wrote: Here's my attempt:

Code: Select all

#include <stdint.h>

uint8_t nolut(uint8_t s, unsigned byteval) {
  unsigned sa = (s & 1) * 4;
  uint8_t c = byteval << sa; 

  return (c >> 4) | (c & 0xF0);
}
Nice :D . I realized my example wasn't great ;/ .

I spent quite some time this week trying to see what I can do to speed up the RDP. So far, I've just done simple stuff like using more LUTs and splitting functions such as fbread_16.

Here are some better examples of using LUTs to optimize the code.

Code: Select all

d = special_9bit_exttable[d];
a = (((a - b) * c) + (d << 8) + 0x80) >> 8;
You could make a separate LUT here just for computing d, since it's already using a LUT.
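Something like this (just a sketch; exttable_shifted is a made-up name, and I'm assuming the table keeps its 9-bit index and int32 element type):

Code: Select all

#include <stdint.h>

extern int32_t special_9bit_exttable[];   /* the plugin's table; exact type assumed */

/* Built once at init: fold the extension lookup, the << 8 and the + 0x80
 * into a single table (assuming the 9-bit index stays the same). */
static int32_t exttable_shifted[512];

static void init_exttable_shifted(void)
{
    int i;
    for (i = 0; i < 512; i++)
        exttable_shifted[i] = (special_9bit_exttable[i] << 8) + 0x80;
}

/* The hot path then becomes:
 *   a = (((a - b) * c) + exttable_shifted[d]) >> 8;
 */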

Code: Select all

pixel_color.a = special_9bit_clamptable[combined_color.a];
if (pixel_color.a == 0xff)
   pixel_color.a = 0x100;
Since it's already using a LUT, might as well make another one to get rid of this branching :D .
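Sketch of that one too (clamptable_adj is a made-up name; again assuming the 9-bit index and element type of the existing table):

Code: Select all

#include <stdint.h>

extern uint16_t special_9bit_clamptable[];   /* the plugin's table; exact type assumed */

/* Same clamp table, but with the 0xff -> 0x100 fix-up folded in, so the
 * branch disappears from the per-pixel path. */
static uint16_t clamptable_adj[512];

static void init_clamptable_adj(void)
{
    int i;
    for (i = 0; i < 512; i++)
        clamptable_adj[i] = (special_9bit_clamptable[i] == 0xff)
                          ? 0x100 : special_9bit_clamptable[i];
}

/* The hot path then becomes:
 *   pixel_color.a = clamptable_adj[combined_color.a];
 */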

Anyway, I'm considering taking the risk and making drastic changes. One thing I have in mind is the COLOR structure. I haven't checked the code enough, but I'm wondering whether the fields need to be ints. If I changed them to shorts, I think there would be more potential for vectorization.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Fri Jun 26, 2015 10:03 am

AIO wrote:Anyway, I'm considering taking the risk and making drastic changes. One thing I have in mind is the COLOR structure. I haven't checked the code enough, but I'm wondering whether the fields need to be ints. If I changed them to shorts, I think there would be more potential for vectorization.
Any reason why 128-bit (4xint) isn't sufficient? I don't think you'll find yourself working on more than one COLOR at a time, right?

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Future state of the project

Post by izy » Fri Jun 26, 2015 10:31 am

MarathonMan, your code works exactly like mine, just in reverse order. However, working in my direction, shifting left (multiply) is generally a better idea than shifting right (divide) because it offers more chances for optimization.
That is why your code generates 7 assembly instructions while mine is only 6 hand-written instructions (function "c"). My "alternative version" (function "d") generates 6 or 7 instructions when inlined and compiled, depending on whether the compiler can reuse a register already initialized to 0 or has to XOR one on purpose for my code. So in the worst case this alternative version is like yours, but it can also come out shorter, at 6 instructions.
However, even if my code (function "d") generates 7 instructions, it's still going to be better than yours. Both pieces of code consist of 7 instructions, but there is a difference: something I already told AIO, though perhaps I should have explained it better.
Do not use uint8 when there is no reason to! You noticed yourself that the compiler promotes it to 32 bits anyway.

Code: Select all

uint8_t nolut(uint8_t s, unsigned byteval) {
   0:   83 e7 01                and    $0x1,%edi  # s promoted to 32 bits
As you can see "uint8_t s" is accessed as a word. However, you have wrote later:

Code: Select all

  uint8_t c = byteval << sa; 
Why must "c" be an uint8? There is no reason to. You had to write: "unsigned c =.."
The consequence of your evil mind is clear ;)

Code: Select all

  11:   c0 e8 04                shr    $0x4,%al  # Write 8 bits
  14:   09 f0                   or     %esi,%eax # Access later 32 bits (!)
This time the compiler was unable to see that "c" could have been promoted to a uint32 just by exchanging the following "mov" and "and" instructions. Demo:

Code: Select all

uint8_t nolut(uint8_t s, unsigned byteval) {
  unsigned sa = (s & 1) * 4;
  unsigned c = byteval << sa;
  c &= 0xF0; // mask first
  return (c >> 4) | c;
}
ASM code edited by hand.....
    0000000000000000 <nolut>:
       0:   83 e7 01                and    $0x1,%edi
       3:   8d 0c bd 00 00 00 00    lea    0x0(,%rdi,4),%ecx
       a:   d3 e6                   shl    %cl,%esi
       -:   83 e6 f0                and    $0xf0,%esi #moved up
       -:   89 f0                   mov    %esi,%eax # moved down
      11:   -- -- -                 shr    $0x4,%eax # changed to eax
      14:   09 f0                   or     %esi,%eax
      16:   c3                      retq
You should always try to use variables of the same size. Do not mix uint32 with uint8; use all uint8 or all uint32 if you can. What I told AIO before was to use size_t, because in an algorithm with a LUT you are sooner or later going to use the generated index as an offset from a pointer. On 32-bit CPUs a pointer is 32 bits wide, and on 64-bit CPUs it is 64 bits wide; size_t matches the pointer width on common platforms, which makes it the natural type for an offset.
When you load or assign a uint8 to a larger variable type, the compiler can usually use movzx, which causes no penalty. Sooner or later the code "movzxb (baseAddr, index), dest" is going to be executed, and it is better if "index" was written as a pointer-sized variable. On AMD64, writing to the low 32-bit half of a register zeroes the high half, so the CPU can break the dependency in that case. The problem therefore may not occur when you mix 32-bit and 64-bit register parts, but it certainly occurs when you mix 8-bit registers with 16/32/64-bit accesses to the same full register: the full register (eax here) still has to keep its upper 24 bits, because writing to AL or AH does not change the other bytes, and regenerating the full value (joining the 24 bits back with the 8 bits) is an extra operation for the CPU. That is the stall.
This code is definitely one to avoid:

Code: Select all

  11:   c0 e8 04                shr    $0x4,%al
  14:   09 f0                   or     %esi,%eax # Stall
That will cause a partial register stall. For a full explanation, I found this file for you (first result from Google): http://www.agner.org/optimize/microarchitecture.pdf (page 57; my summary above might not be exact).
Your attempt (without my "fix") is basically the worst version seen so far, and now you know why. Next time will be better; this time, make sure to read carefully so you understand what a partial register stall is, why it happens and how to avoid it.

Too little time. Another day maybe i'll answer also to you AIO :geek:

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Fri Jun 26, 2015 4:31 pm

izy - excellent analysis; I hadn't seen the 8-instruction fragment you posted earlier. I fully agree with all your points, and it's great to see someone optimizing at the micro-architectural level! Please stick around when the dynamic code generation starts coming around! :D

EDIT: I was trying to remove the LEAs since IIRC only more-recent Intel micro-architectures are good at munching through LEAs (see: http://www.realworldtech.com/includes/i ... png?71da3d). I think the 8-instruction version may be faster on some CPUs and the 7-instruction version on others... it depends. I'd have to research it.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Fri Jun 26, 2015 5:59 pm

MarathonMan wrote:Any reason why 128-bit (4xint) isn't sufficient? I don't think you'll find yourself working on more than one COLOR at a time, right?
Well, the primary reason would be for code like

Code: Select all

TEX->r = t3.r + ((((invsf * (t2.r - t3.r)) + (invtf * (t1.r - t3.r))) + 0x10) >> 5);
I know it's possible to do packed 32-bit multiplies with SSE4, but I think it's best to keep system requirements as low as reasonably possible, so I prefer using SSE2 over SSE4 when there's not much potential benefit to SSE4. Unfortunately, Clang isn't vectorizing the code I pasted above, and it happens to be bottleneck code. I honestly didn't even think of working on more than one COLOR at a time; it would be interesting if that were possible. I also have some ideas to try, like using a giant vectorized LUT for code like

Code: Select all

color0->r = GET_HI_RGBA16_TMEM(c0);
color0->g = GET_MED_RGBA16_TMEM(c0);
color0->b = GET_LOW_RGBA16_TMEM(c0);
color0->a = (c0 & 1) ? 0xff : 0;
I know you can do this with ints. I even tried it and saw some benefit, but I imagine it might be faster if the elements were half the size.
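One way to realize the giant-LUT idea would be to precompute the unpacked channels for every possible 16-bit texel at startup, reusing the macros above; whether the 256 KB table pays off against the extra cache pressure would have to be measured. A sketch (the struct and table names are made up):

Code: Select all

#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } rgba16_entry;   /* invented name */

static rgba16_entry rgba16_lut[65536];

static void build_rgba16_lut(void)
{
    for (uint32_t c0 = 0; c0 < 65536; c0++) {
        rgba16_lut[c0].r = GET_HI_RGBA16_TMEM(c0);
        rgba16_lut[c0].g = GET_MED_RGBA16_TMEM(c0);
        rgba16_lut[c0].b = GET_LOW_RGBA16_TMEM(c0);
        rgba16_lut[c0].a = (c0 & 1) ? 0xff : 0;
    }
}

/* hot path: copy rgba16_lut[c0] into color0 instead of re-deriving it */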

I don't even know if my idea of using short instead of int is feasible / actually better, but it would save me time to receive input from people who have analyzed the code more than I have :) . I feel that sometimes, you have to try all sorts of ideas in order to improve the code. I like to try and see what works and what doesn't.
izy wrote: Do not use uint8 when there is no reason to!

Why must "c" be an uint8? There is no reason to. You had to write: "unsigned c =.."

When you load or assign a uint8 to a larger variable type, the compiler can usually use movzx, which causes no penalty. Sooner or later the code "movzxb (baseAddr, index), dest" is going to be executed, and it is better if "index" was written as a pointer-sized variable. On AMD64, writing to the low 32-bit half of a register zeroes the high half, so the CPU can break the dependency in that case. The problem therefore may not occur when you mix 32-bit and 64-bit register parts, but it certainly occurs when you mix 8-bit registers with 16/32/64-bit accesses to the same full register: the full register (eax here) still has to keep its upper 24 bits, because writing to AL or AH does not change the other bytes, and regenerating the full value (joining the 24 bits back with the 8 bits) is an extra operation for the CPU. That is the stall.
This code is definitely one to avoid:

Code: Select all

11:   c0 e8 04                shr    $0x4,%al
14:   09 f0                   or     %esi,%eax # Stall
That will cause a partial register stall. For a full explanation, I found this file for you (first result from Google): http://www.agner.org/optimize/microarchitecture.pdf (page 57; my summary above might not be exact).
Your attempt (without my "fix") is basically the worst version seen so far, and now you know why. Next time will be better; this time, make sure to read carefully so you understand what a partial register stall is, why it happens and how to avoid it.
I'm not sure why it was uint8 to begin with. I too prefer to take advantage of movzx, so I declare the variables as 32-bit. Although last time I checked, it didn't make a difference with Clang in the code I looked at, iirc. I still think it's a good idea to optimize code for as many compilers as you can. Some of these ternary algorithms need to be adjusted.

Good point about register stalling.

Edit: did more optimizing and profiling today. I started using Intel's compiler for the RDP, since I got slightly better results than with Clang :) . I was shocked that the compiler output for some ternary algorithms was awful. In read_tmem_copy, Intel used branching for this code:

Code: Select all

hibits[0] = (tidx_a & 0x1000) ? 1 : 0;
hibits[1] = (tidx_blow & 0x1000) ? 1 : 0; 
hibits[2] =     (tidx_bhi & 0x1000) ? 1 : 0;
hibits[3] =     (tidx_c & 0x1000) ? 1 : 0;
hibits[4] =     (tidx_dlow & 0x1000) ? 1 : 0;
hibits[5] = (tidx_dhi & 0x1000) ? 1 : 0;
Another odd thing is that hibits[2] and hibits[5] are never used, so it's a waste of CPU cycles ;/ .

Since the value of hibits[] is fixed, you can also staticize the following code in fetch_qword_copy

Code: Select all

shorta = hibits[0] ? sortshort[4] : sortshort[0];
shortb = hibits[1] ? sortshort[5] : sortshort[1];
shortc = hibits[3] ? sortshort[6] : sortshort[2];
shortd = hibits[4] ? sortshort[7] : sortshort[3];
This code

Code: Select all

c0 = (ands) ? (c0 & 0xf) : (c0 >> 4);
also came out pretty bad.
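For what it's worth, that selection can be written branch-free; this assumes c0 holds an 8-bit value and ands is already 0 or 1, which I'd have to double check against the actual types in read_tmem_copy:

Code: Select all

/* picks the low nibble when ands == 1, the high nibble when ands == 0 */
c0 = (c0 >> ((ands ^ 1) << 2)) & 0xf;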

After splitting fbread_16, I realized I could do the same to other functions like rgb_dither_complete. Since other_modes.rgb_dither_sel is only assigned in set_other_modes, I figure that's a good place to assign the value of rgb_dither_ptr; something like the sketch below. That's a good way to reduce branching in some of these bottleneck functions, imo.
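A minimal sketch of what I mean (the variant functions here are just placeholders with dummy bodies, not the plugin's real dither code):

Code: Select all

/* rgb_dither_ptr, set_other_modes and other_modes.rgb_dither_sel are the
 * plugin's names; the four variants below are invented stand-ins. */
typedef void (*rgb_dither_func)(int *r, int *g, int *b, int dith);

static void dither_magic(int *r, int *g, int *b, int dith) { *r += dith; *g += dith; *b += dith; }
static void dither_bayer(int *r, int *g, int *b, int dith) { *r ^= dith; *g ^= dith; *b ^= dith; }
static void dither_noise(int *r, int *g, int *b, int dith) { *r += dith & 7; *g += dith & 7; *b += dith & 7; }
static void dither_none (int *r, int *g, int *b, int dith) { (void)r; (void)g; (void)b; (void)dith; }

static const rgb_dither_func rgb_dither_funcs[4] = {
    dither_magic, dither_bayer, dither_noise, dither_none
};

static rgb_dither_func rgb_dither_ptr = dither_none;

/* In set_other_modes(), once per command instead of branching per pixel:
 *     rgb_dither_ptr = rgb_dither_funcs[other_modes.rgb_dither_sel & 3];
 * The hot path then just calls rgb_dither_ptr(&r, &g, &b, dith). */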

Hopefully someone can track down why Angrylion's recent commits made it slower. I'm starting to question how much faster it can get :| . I'm losing hope.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Sat Jun 27, 2015 11:25 am

It can get a lot faster, but (IMO):

Incremental changes/refactorings won't make a big difference. You really need to fundamentally change the plugin and do something quite radical (vectorize it with SSE, make a JIT/dynarec RDP, etc.). If you are wondering why dynarec RDP could help, consider that a lot of the drawing commands are based upon the mode settings (a lot of those branches). If you dynamically generate code for this kind of thing, you can eliminate a large portion of those branches.
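To make the idea concrete, here's a toy illustration (not CEN64 or angrylion code): specialize the span routine on a couple of mode bits at compile time and the per-pixel branches on those bits disappear; a dynarec does the same thing at runtime for the whole mode word.

Code: Select all

#include <stdint.h>

/* Toy only: the "blender" and "dither" here are stand-ins, not real RDP
 * operations.  Because BLEND and DITHER are compile-time constants, the
 * compiler folds the branches away in each instantiation. */
#define DEFINE_SPAN(name, BLEND, DITHER)                               \
    static void name(uint8_t *px, int n)                               \
    {                                                                  \
        for (int i = 0; i < n; i++) {                                  \
            unsigned c = px[i];                                        \
            if (BLEND)  c = (c + 0xff) >> 1;    /* stand-in blender */ \
            if (DITHER) c ^= (unsigned)(i & 7); /* stand-in dither  */ \
            px[i] = (uint8_t)c;                                        \
        }                                                              \
    }

DEFINE_SPAN(span_plain,        0, 0)
DEFINE_SPAN(span_dither,       0, 1)
DEFINE_SPAN(span_blend,        1, 0)
DEFINE_SPAN(span_blend_dither, 1, 1)

static void (*const span_table[4])(uint8_t *, int) = {
    span_plain, span_dither, span_blend, span_blend_dither
};

/* chosen once per command: span_table[mode_key & 3](pixels, count); */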

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Future state of the project

Post by Narann » Sat Jun 27, 2015 1:32 pm

If I may add: tile rendering using multithreading?

I have no idea if it's a viable option, but since the RDP works on buffers, I have the impression there should be a way to render the framebuffer in "tiles" and send each tile to a separate thread.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Sat Jun 27, 2015 3:30 pm

MarathonMan wrote:It can get a lot faster, but (IMO):

Incremental changes/refactorings won't make a big difference. You really need to fundamentally change the plugin and do something quite radical (vectorize it with SSE, make a JIT/dynarec RDP, etc.). If you are wondering why dynarec RDP could help, consider that a lot of the drawing commands are based upon the mode settings (a lot of those branches). If you dynamically generate code for this kind of thing, you can eliminate a large portion of those branches.
You're absolutely right that you'll need to do something radical. I actually realized yesterday that a JIT has good potential for the RDP; I just don't know how I'd feasibly implement it at this point. I don't know enough about the RDP yet, so I'm just reading through, optimizing along the way, and profiling. I was intrigued by how simple changes made an impact on bottleneck functions. One example is

Code: Select all

PAIRWRITE8(tempdword, tempbyte, (tempbyte & 1) ? 3 : 0);
Edit: this is a case where the compiler can do an awful job with ternaries. If the compiler you're using doesn't do a good job with this ternary, then you may want to consider a different formulation, since it's bottleneck code in some cases.

My issue is that some games are going to need a huge speed-up! I just profiled Vigilante 8: 2nd Offense in 1964 1.1, even with VI filters off, and I get ~13 VI/s on a 2.667 GHz quad-core i5. Someone with a very high-end machine would get somewhere in the 20s for VI/s. Here are the results for 2 minutes of gameplay. [profiler screenshot]

Perhaps it's time for me to move on to 64-bit. I did notice that part of the problem is that there aren't enough registers in 32-bit mode, so there was quite a bit of overhead. Unfortunately, even doubling the speed of the RDP won't be enough for some games. My best hope is to figure out how to vectorize the RDP a lot better. I still need to figure out whether changing the COLOR structure from int to short is viable or not.

I think the tweaking I've done so far was worthwhile. I can certainly see a difference in profiler results, at the very least.

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Future state of the project

Post by izy » Mon Jun 29, 2015 11:45 am

I tracked down the code you were talking about in a fork of mupen64plus.
It's hard to say whether the LUT is advantageous, because results differ a lot across scenarios and CPUs. It's entirely possible that the code with the LUT performs better in a real-world application. Measuring CPU time by hand is difficult when the code accesses memory. A compromise is to benchmark with LFENCE+RDTSCP (note that the final P is important: without the P it is a different instruction and not really usable for benchmarking; also, if your code only writes to memory use SFENCE, and if it does both reads and writes use MFENCE). This will give you different numbers, because in a real-world application there is no load or store fence in similar circumstances and the CPU gets to use every trick it has (which RDTSCP suppresses). RDTSC (without the P) and no fence could give results closer to real-world usage, but it won't tell you about the worst case, so it is useless for comparisons.
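Something along these lines, as a rough sketch (GCC/Clang intrinsics; the harness name is made up, and cycle counts only compare fairly on the same machine with frequency scaling pinned):

Code: Select all

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */
#include <emmintrin.h>   /* _mm_lfence */

/* Serialize with LFENCE and read the TSC with RDTSCP around the code
 * under test, roughly as described above. */
static uint64_t bench_once(void (*code_under_test)(void))
{
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtscp(&aux);

    code_under_test();

    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}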

You're right. Writing everything in assembly would save a lot of instructions and allow you to apply tricks that the compiler may not see. On the other hand, if you need to change something... it's going to be a big pain.

I've found this code:

Code: Select all

    case TEXEL_I4:
    {
        uint8_t byteval, c;
        taddr = ((tbase << 4) + s) >> 1;
        taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);
        byteval = g_gdp.tmem[taddr & 0xfff];
        c = (s & 1) ? (byteval & 0xf) : (byteval >> 4);
        c |= (c << 4);
        color->r = c;
        color->g = c;
        color->b = c;
        color->a = c;
    }
    break;
  16beeb:       c1 e1 04                shl    $0x4,%ecx # tbase << 4
  16beee:       83 e2 01                and    $0x1,%edx
  16bef1:       01 f1                   add    %esi,%ecx # tbase += s
  16bef3:       d1 e9                   shr    %ecx      # tbase >>= 1
....
Therefore I also saw this:

Code: Select all

taddr = ((tbase << 4) + s) >> 1;
taddr = ((tbase * 16) + s) / 2;
tbase can be overwritten because it isn't used later, but it could also have been kept untouched this way:

Code: Select all

shl $4, tbase
lea (tbase, s), taddr # sum and move to destination
shr $1, taddr
or it can possibly be inlined in assembly this way:

Code: Select all

shr $1, s
lea (s, tbase, 8), taddr # sum, shift and move to destination
Next, I see this code...
#define BYTE_XOR_DWORD_SWAP 4
#define WORD_XOR_DWORD_SWAP 2
taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

Code: Select all

size_t a0(size_t t)
{
    size_t taddr = ((t & 1) ? 4 : 2);
    return taddr;
}
{without the final XOR}

00000000004005c0 <a0>:
  4005c0:       83 e7 01                and    $0x1,%edi
  4005c3:       48 83 ff 01             cmp    $0x1,%rdi
  4005c7:       48 19 c0                sbb    %rax,%rax
  4005ca:       48 83 e0 fe             and    $0xfffffffffffffffe,%rax
  4005ce:       48 83 c0 04             add    $0x4,%rax
  4005d2:       c3                      retq   
{without the final XOR}
That is horrible code. The returned value can be XORed with taddr at a later time.
There's a nice coincidence: 2 in binary is 010 and 4 is 100, and as always "t & 1" is the selector.

Code: Select all

mov $2, temp
and $1, t               # t is either 0 or 1
lea (temp, t, 2), temp  # temp+=t*2  (temp=2+t*2)
xor temp, taddr         # xor with either 2 or 4
Code:

Code: Select all

size_t b0(size_t t)
{
    size_t temp;
    asm volatile(
        "and            $1,           %[t]                 ;"
        "lea            2(, %[t], 2), %[temp]                 ;"
        : [temp] "=r" (temp), [t] "+r" (t)
        : 
        : "cc"
    );
    return temp;
}
{without the final XOR}

00000000004005e0 <b0>:
  4005e0:       48 83 e7 01             and    $0x1,%rdi
  4005e4:       48 8d 04 7d 02 00 00    lea    0x2(,%rdi,2),%rax
  4005eb:       00 
  4005ec:       c3                      retq   
{without the final XOR}
The other version, where the defines have different values:
#define BYTE_XOR_DWORD_SWAP 7
#define WORD_XOR_DWORD_SWAP 3
taddr ^= ((t & 1) ? BYTE_XOR_DWORD_SWAP : BYTE_ADDR_XOR);

Code: Select all

size_t a1(size_t t)
{
    size_t taddr = ((t & 1) ? 7 : 3);
    return taddr;
}
{without the final XOR}

00000000004005f0 <a1>:
  4005f0:       83 e7 01                and    $0x1,%edi
  4005f3:       48 83 ff 01             cmp    $0x1,%rdi
  4005f7:       48 19 c0                sbb    %rax,%rax
  4005fa:       48 83 e0 fc             and    $0xfffffffffffffffc,%rax
  4005fe:       48 83 c0 07             add    $0x7,%rax
  400602:       c3                      retq   
{without the final XOR}
There is another nice coincidence: 3 in binary is 011 and 7 is 111; as always, "t & 1" is the selector.

Code: Select all

mov $3, temp
and $1, t               # t is either 0 or 1
lea (temp, t, 4), temp  # temp+=t*4  (temp=3+t*4)
xor temp, taddr         # xor with either 3 or 7
In this case it's even possible to avoid clobbering "t". This will produce much better code when the function is inlined.

Code: Select all

size_t b1(size_t t)
{
    size_t temp;
    asm volatile(
        "lea            3(, %[t], 4), %[temp]                 ;"
        "and            $7,           %[temp]                 ;"
        : [temp] "=r" (temp)
        : [t] "r" (t)
        : "cc"
    );
    return temp;
}
{without the final XOR}

0000000000400610 <b1>:
  400610:       48 8d 04 bd 03 00 00    lea    0x3(,%rdi,4),%rax
  400617:       00 
  400618:       48 83 e0 07             and    $0x7,%rax
  40061c:       c3                      retq 
{without the final XOR}
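For reference, the same selections can also be written in branch-free portable C; whether a given compiler lowers them to a single LEA is not guaranteed, so treat this as a sketch rather than a measured result:

Code: Select all

#include <stddef.h>

/* 4-or-2 variant (BYTE_XOR_DWORD_SWAP = 4): t&1 == 0 -> 2, t&1 == 1 -> 4 */
static inline size_t xor_sel_2_4(size_t t)
{
    return (size_t)2 << (t & 1);
}

/* 7-or-3 variant (BYTE_XOR_DWORD_SWAP = 7): t&1 == 0 -> 3, t&1 == 1 -> 7 */
static inline size_t xor_sel_3_7(size_t t)
{
    return 3 + ((t & 1) << 2);
}

/* usage in fetch_texel: taddr ^= xor_sel_2_4(t);   (or xor_sel_3_7) */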
The more assembly code you write, the shorter the resulting code will be; I've noticed that you usually save 20%-40% over the number of instructions a compiler emits. In this case (function "fetch_texel"), I notice that taddr is initialized to 0. It can be used as input for these LEAs, and also for the first of the LEAs used in my earlier hand-coded version that generates the color "c" without the LUT.

Code: Select all

static STRICTINLINE int32_t alpha_combiner_equation(int32_t a, int32_t b, int32_t c, int32_t d)
{
   a = special_9bit_exttable[a];
   b = special_9bit_exttable[b];
   c = SIGNF(c, 9);
   d = special_9bit_exttable[d];
   a = (((a - b) * c) + (d << 8) + 0x80) >> 8;
   return (a & 0x1ff);
}
Look also at this macro:

Code: Select all

    #define SIGNF(x, numb)    ((x) | -((x) & (1 << (numb - 1))))

....
  40041c:	89 c2                	mov    %eax,%edx        # always required
  40041e:	81 e2 00 01 00 00    	and    $0x100,%edx
  400424:	f7 da                	neg    %edx
  400426:	09 c2                	or     %eax,%edx
That macro expands to 4 instructions. It's possible to save one or two of them with inline assembly, for example with a GCC statement-expression macro. The macro below produces 3 instructions, or 2 when the value is replaced in place.

Code: Select all

    #define SIGNF(x, numb) \
        ({ \
            typeof(x) _x = x; \
            asm("shl            %[bits],        %[temp]                 ;" \
                "sar            %[bits],        %[temp]                 ;" \
                : [temp] "+r" (_x) \
                : [bits] "n" (sizeof(_x)*8-numb) \
                : "cc" \
            ); \
            _x; \
        })

.....
  400410:	89 c1                	mov    %eax,%ecx        # not always required
  400412:	c1 e1 17             	shl    $0x17,%ecx
  400415:	c1 f9 17             	sar    $0x17,%ecx
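The same shl/sar trick can also be written in plain C. Note that an arithmetic right shift of a negative signed value is implementation-defined by the standard, although GCC, Clang and MSVC all do the expected thing on x86; a sketch:

Code: Select all

#include <stdint.h>

/* Sign-extend the low `numb` bits of x using the same shift-left /
 * arithmetic-shift-right idea as the asm macro above.  Relies on
 * implementation-defined (but de facto universal on x86 compilers)
 * arithmetic right shift of signed values. */
#define SIGNF_SHIFT(x, numb) \
    ((int32_t)((uint32_t)(x) << (32 - (numb))) >> (32 - (numb)))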
I see that you want to turn a, b, c, d into shorts, but from the code above I see that a, b, c, d are used as indexes into a LUT. It would be a better idea to use "(unsigned) int/long a, b, c, d...". Making everything 32 bits, or even CPU-word wide, and unsigned when it does not need to be signed, will result in better code. Also, naturally aligned, pointer-sized reads and writes are atomic on x86, for what that's worth. Using shorts is advantageous only if you pack 8 of them and load them all with movdqa; if you read or write values one by one, unsigned int and unsigned long will perform better than shorts.

Another case from your following message:

Code: Select all

PAIRWRITE8(tempdword, tempbyte, (tempbyte & 1) ? 3 : 0);
Again, this is a case where the compiler does an awful job with ternary algorithms. Simply replacing that with -(tempbyte & 1) & 3 made a difference!
You can use inline assembly to reduce the instruction count even more:

Code: Select all

and $1, c
lea (c, c, 2), c
that basically means:
c &= 1; // c=0 or c=1
c = c+c+c; // 0+0+0=0 1+1+1=3

Inline macro for GCC:

Code: Select all

#define GET(tempbyte) \
    ({ \
        typeof(tempbyte) temp; \
        asm( \
            "lea        (%[mask], %[mask], 2),  %[temp]     ;" \
            : [temp] "=r" (temp) \
            : [mask] "r" (tempbyte & 1) \
        ); \
        temp; \
    })

....
PAIRWRITE8(tempdword, tempbyte, GET(tempbyte));
Look at how they translate for a standard function (input argument in RDI):

Code: Select all

return (tempbyte & 1) ? 3 : 0
0000000000400630 <tA>:
  400630:       48 89 f8                mov    %rdi,%rax
  400633:       83 e0 01                and    $0x1,%eax
  400636:       48 f7 d8                neg    %rax
  400639:       83 e0 03                and    $0x3,%eax
  40063c:       c3                      retq   

return -(tempbyte & 1) & 3
0000000000400640 <tB>:
  400640:       48 89 f8                mov    %rdi,%rax
  400643:       83 e0 01                and    $0x1,%eax
  400646:       48 f7 d8                neg    %rax
  400649:       83 e0 03                and    $0x3,%eax
  40064c:       c3                      retq   

return GET(tempbyte);
0000000000400650 <tC>:
  400650:       83 e7 01                and    $0x1,%edi
  400653:       48 8d 04 7f             lea    (%rdi,%rdi,2),%rax
  400657:       c3                      retq   

MarathonMan wrote:EDIT: I was trying to remove the LEAs since IIRC only more-recent Intel micro-architectures are good at munching through LEAs (see: http://www.realworldtech.com/includes/i ... png?71da3d).
No, always use LEA if possible. Also, that picture is wrong.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Tue Jun 30, 2015 3:48 am

izy wrote:I tracked down the code you were talking about in a fork of mupen64plus.
It's hard to say whether the LUT is advantageous, because results differ a lot across scenarios and CPUs. It's entirely possible that the code with the LUT performs better in a real-world application. Measuring CPU time by hand is difficult when the code accesses memory. A compromise is to benchmark with LFENCE+RDTSCP (note that the final P is important: without the P it is a different instruction and not really usable for benchmarking; also, if your code only writes to memory use SFENCE, and if it does both reads and writes use MFENCE). This will give you different numbers, because in a real-world application there is no load or store fence in similar circumstances and the CPU gets to use every trick it has (which RDTSCP suppresses). RDTSC (without the P) and no fence could give results closer to real-world usage, but it won't tell you about the worst case, so it is useless for comparisons.
When the differences are small, yeah, it is hard to say whether the LUT is advantageous. One reason I'm inclined to use LUTs sometimes is that, since I'm stuck with 32-bit development, register allocation is very limited. So sometimes I even have to use a less efficient but simpler algorithm because of that. Earlier I mentioned staticizing

Code: Select all

shorta = hibits[0] ? sortshort[4] : sortshort[0];
shortb = hibits[1] ? sortshort[5] : sortshort[1];
shortc = hibits[3] ? sortshort[6] : sortshort[2];
shortd = hibits[4] ? sortshort[7] : sortshort[3];
It turned out that it was slower after staticizing it, probably due to poor register allocation, so I reverted it back to ternary and just changed

Code: Select all

hibits[0] = (tidx_a & 0x1000) ? 1 : 0;
hibits[1] = (tidx_blow & 0x1000) ? 1 : 0; 
hibits[2] = (tidx_bhi & 0x1000) ? 1 : 0;
hibits[3] = (tidx_c & 0x1000) ? 1 : 0;
hibits[4] = (tidx_dlow & 0x1000) ? 1 : 0;
hibits[5] = (tidx_dhi & 0x1000) ? 1 : 0;
to

Code: Select all

hibits[0] = tidx_a    & 0x1000;
hibits[1] = tidx_blow & 0x1000; 
hibits[3] = tidx_c    & 0x1000;
hibits[4] = tidx_dlow & 0x1000;
since the only time hibits[] is used is to check whether the value is 0 or not. In the other LUT examples I gave, the benefit was much greater :) .

As for benchmarking, I've honestly never tried any of the fence instructions. I think I'll look into that.
izy wrote: You're right. Writing everything in assembly would save a lot of instructions and allow you to apply tricks that the compiler may not see. On the other hand, if you need to change something... it's going to be a big pain.

The more assembly code you write, the shorter the resulting code will be; I've noticed that you usually save 20%-40% over the number of instructions a compiler emits. In this case (function "fetch_texel"), I notice that taddr is initialized to 0. It can be used as input for these LEAs, and also for the first of the LEAs used in my earlier hand-coded version that generates the color "c" without the LUT.
Indeed. It's especially useful in scenarios where register allocation is important. I hate seeing significant overhead! Hopefully 64bit compiler output will look better.
izy wrote: I see that you want to turn a, b, c, d into shorts, but from the code above I see that a, b, c, d are used as indexes into a LUT. It would be a better idea to use "(unsigned) int/long a, b, c, d...". Making everything 32 bits, or even CPU-word wide, and unsigned when it does not need to be signed, will result in better code. Also, naturally aligned, pointer-sized reads and writes are atomic on x86, for what that's worth. Using shorts is advantageous only if you pack 8 of them and load them all with movdqa; if you read or write values one by one, unsigned int and unsigned long will perform better than shorts.
Perhaps it might not be beneficial for scalar code, but it's potentially advantageous for vectorization. texture_pipeline_cycle() is a function that takes up a significant amount of time. I'm still not even sure if it's feasible to change the COLOR structure to shorts, but if it is, then code like

Code: Select all

TEX->r = t3.r + ((((invsf * (t2.r - t3.r)) + (invtf * (t1.r - t3.r))) + 0x10) >> 5);	
TEX->g = t3.g + ((((invsf * (t2.g - t3.g)) + (invtf * (t1.g - t3.g))) + 0x10) >> 5);																		
TEX->b = t3.b + ((((invsf * (t2.b - t3.b)) + (invtf * (t1.b - t3.b))) + 0x10) >> 5);																
TEX->a = t3.a + ((((invsf * (t2.a - t3.a)) + (invtf * (t1.a - t3.a))) + 0x10) >> 5);
would be more easily vectorized, since packed 32-bit multiplication requires SSE4.1. I'm just curious, really. If it turns out that int is better or is actually needed, I'll just have to use SSE4 for that part.
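If the channels did fit in 16 bits, all four could be blended at once with SSE2's 16-bit multiply (pmullw), which is exactly the instruction-set level I'd like to stay at. A sketch, assuming a hypothetical packed int16_t layout and that the intermediate sums stay inside signed 16-bit range (that would have to be verified against the real ranges of invsf/invtf and the colors):

Code: Select all

#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */

/* Hypothetical layout: the 16-bit r,g,b,a channels of each COLOR packed
 * into the low lanes of an __m128i.  Sketch of the idea only. */
static __m128i tex_lerp_sse2(__m128i t1, __m128i t2, __m128i t3,
                             int16_t invsf, int16_t invtf)
{
    __m128i vsf = _mm_set1_epi16(invsf);
    __m128i vtf = _mm_set1_epi16(invtf);

    __m128i d2  = _mm_sub_epi16(t2, t3);              /* t2 - t3       */
    __m128i d1  = _mm_sub_epi16(t1, t3);              /* t1 - t3       */
    __m128i acc = _mm_add_epi16(_mm_mullo_epi16(vsf, d2),
                                _mm_mullo_epi16(vtf, d1));
    acc = _mm_add_epi16(acc, _mm_set1_epi16(0x10));   /* rounding bias */
    acc = _mm_srai_epi16(acc, 5);                     /* >> 5          */
    return _mm_add_epi16(t3, acc);                    /* + t3          */
}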
izy wrote:

Code: Select all

400410:   89 c1                   mov    %eax,%ecx        # not always required
400412:   c1 e1 17                shl    $0x17,%ecx
400415:   c1 f9 17                sar    $0x17,%ecx
I must say, pretty neat.

User avatar
Sintendo
Posts: 25
Joined: Thu Oct 31, 2013 9:11 am

Re: Future state of the project

Post by Sintendo » Tue Jun 30, 2015 4:07 am

How come you're stuck with 32-bit? Since you have SSE4, I'm assuming your CPU is capable of 64-bit.

I'd hate to see you spend all this time optimizing things specifically for 32-bit.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Tue Jun 30, 2015 10:51 pm

Sintendo wrote:How come you're stuck with 32-bit? Since you have SSE4, I'm assuming your CPU is capable of 64-bit.

I'd hate to see you spend all this time optimizing things specifically for 32-bit.
My hardware and OS are 64-bit, but at the moment I rely too much on 32-bit software. 32-bit zilmar-spec emulators are a great environment for me to develop and test on. Also, this is really the first time I've felt I absolutely need 64-bit, since the RDP is so slow and needs every speed-up it can get. You don't have to worry about me focusing on 32-bit-only optimizations, because I'm already skeptical enough about how much faster the pixel-accurate RDP can get. Anyway, none of the important optimizations I've done so far are 32-bit only. As for these smaller optimizations, it's more about curiosity, learning, and practice.

I don't have enough time to port everything I use to 64-bit. Not sure what I will even do at this point. I have too many goals and too many limitations.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Wed Jul 01, 2015 3:28 pm

izy wrote:
MarathonMan wrote:EDIT: I was trying to remove the LEAs since IIRC only more-recent Intel micro-architectures are good at munching through LEAs (see: http://www.realworldtech.com/includes/i ... png?71da3d).
No, always use LEA if possible. Also, that picture is wrong.
What's wrong with it? Link?

RWT is run by David Kanter... not saying he's perfect, but he knows his stuff. :p

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Thu Jul 02, 2015 11:42 am

Bump.

MooglyGuy and some of the other MESS guys have been multi-threading and vectorizing the RDP. So I'll probably pull from them and replace angrylion's SVN with that in the future.

ShadowFX
Posts: 86
Joined: Sat Oct 05, 2013 2:08 am
Location: The Netherlands

Re: Future state of the project

Post by ShadowFX » Thu Jul 02, 2015 1:41 pm

MarathonMan wrote:Bump.

MooglyGuy and some of the other MESS guys have been multi-threading and vectorizing the RDP. So I'll probably pull from them and replace angrylion's SVN with that in the future.
Does that RDP also include angrylion's recent changes? See: https://code.google.com/p/angrylions-stuff/source/list
"Change is inevitable; progress is optional"

OS: Windows 10 Pro x64
Specs: Intel Core i7-7700K @ 4.2GHz, 16GB DDR4-RAM, NVIDIA GeForce GTX 1080 Ti
Main build: AVX (official)

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Future state of the project

Post by izy » Thu Jul 02, 2015 3:44 pm

MarathonMan wrote:
izy wrote:
MarathonMan wrote:EDIT: I was trying to remove the LEAs since IIRC only more-recent Intel micro-architectures are good at munching through LEAs (see: http://www.realworldtech.com/includes/i ... png?71da3d).
No, always use LEA if possible. Also, that picture is wrong.
What's wrong with it? Link?

RWT is run by David Kanter... not saying he's perfect, but he knows his stuff. :p
That picture says that Sandy Bridge has two LEAs.
This article on Sandy Bridge begins on page 41 but look at the table on page 52 http://www.intel.com/content/dam/doc/ma ... manual.pdf (also found an extract: http://cs.colby.edu/courses/S15/cs232/I ... ecture.pdf )
In Intel's table, LEA means the slow LEA. Sandy Bridge has only one (slow) LEA but has two fast LEAs. A "fast LEA" is Intel's (marketing) name for a partial, incomplete LEA unit. As you can see there, the LEAs are placed on the integer ports for performance, and moving between integer ports has no penalty (table on page 53: moving to and from integer ports has penalty 0).
a. Port0 has no LEA, but your image says it has a complete one
b. Port1 has two LEAs (a complete and a fast lea not shown in your diagram)
c. Port5 has a fast LEA (not shown in the diagram)
I feel free to suspect that image is utterly wrong, trusting Intel's own documents over blog-style articles and counting three errors with regard to LEA alone. I wonder how anyone can believe that source is accurate. I probed that website further (http://www.realworldtech.com/wp-content ... well-3.png) and noticed that a fast LEA does appear there for Sandy Bridge (it is still placed on the wrong port, and the other fast LEA is still missing).
MarathonMan wrote:Link?
http://lmgtfy.com/?q=cpu+intel+website
