Code review thread

Discuss topics related to development here.
User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Code review thread

Post by Narann » Thu Jul 31, 2014 5:55 pm

This thread is intended to provide "code review" or code discussion of various commits.

You can copy-paste this template:

Code: Select all

[url=<url>]<commit message>[/url]
Your comment.
You are also encouraged to use the commit message as the subject of your post.
Last edited by Narann on Thu Jul 31, 2014 6:00 pm, edited 1 time in total.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Regression: -O2 is better with caches.

Post by Narann » Thu Jul 31, 2014 5:59 pm

Regression: -O2 is better with caches.

I just saw you change -O3 to -O2 "because it works better with caches".

-O3 adds a lot of optimization flags. May I suggest finding which one is responsible for the performance issue and adding the others manually?

Do you know about this command:

Code: Select all

$ gcc -c -Q -O2 --help=optimizers
It lists every flag enabled for the given gcc command line.

I generated two files (-O2 and -O3) and compared them to show you the difference (gcc version 4.4.7):

Code: Select all

$ gcc -c -Q -O2 --help=optimizers >> O2_flags
$ gcc -c -Q -O3 --help=optimizers >> O3_flags
$ meld O2_flags O3_flags
(Yes, I like meld)

All these flags are enabled at -O3 but not at -O2:

Code: Select all

-fgcse-after-reload
-finline-functions << It's sad to lose this one, no? ;)
-fipa-cp-clone
-fpredictive-commoning << I guess this one is responsible for the performance loss.
-ftree-vectorize
-funswitch-loops
Hope this helps!

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Regression: -O2 is better with caches.

Post by MarathonMan » Thu Jul 31, 2014 9:05 pm

Narann wrote:Do you know about this command:

Code: Select all

$ gcc -c -Q -O2 --help=optimizers
It lists every flag enabled for the given gcc command line.

...

All these flags are enabled at -O3 but not at -O2:

Code: Select all

-fgcse-after-reload
-finline-functions << It's sad to lose this one, no? ;)
-fipa-cp-clone
-fpredictive-commoning << I guess this one is responsible for the performance loss.
-ftree-vectorize
-funswitch-loops
Hope this helps!
Hmm... thanks, I had never heard of that specific argument before. I'll experiment with those flags, as I've never heard of a few of them. I might also have to track which compiler version some of them originated in, so as not to set them on older compilers (it would be unfortunate if someone on gcc-4.8 could not build CEN64!)

Believe it or not, I think -finline-functions is the dangerous one. Right now, a lot of the new core's success is due to the uop caches present in SNB, IVB, and Haswell (from what I can tell). These "uop caches" really serve as an L0 instruction cache. As such, they provide the CPU with (roughly?) twice the decoding bandwidth. However, they're absolutely tiny -- something on the order of effectively 1.5KiB.

Really, I shouldn't be touching the optimization flags too much yet. :oops: ... but I got a VI/s craving!

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Regression: -O2 is better with caches.

Post by Narann » Fri Aug 01, 2014 9:45 am

Thanks for the information! Rereading the -finline-functions doc and considering the uop cache stuff, you are certainly right.
MarathonMan wrote:Really, I shouldn't be touching the optimization flags too much yet.
I think so too. Seeing the VI/s fury starting on the main thread, I suggest waiting. As krom said, Windows has its own behavior too. The time for the "best compiler flag set" will come soon. :P

And to bounce off your answer: compiler/version-specific optimization flags seem valid to me. (It could also be interesting to have a unit/regression test script that benchmarks multiple compiler flag sets. This way we could gather feedback from many devs and choose the best set.)

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

uint32_t iw = rfex_latch->iw;

Post by Narann » Thu Aug 21, 2014 7:45 am

It seems that every function in vr4300/functions.c always allocates the iw variable.

I was wondering if there would be any benefit to allocating iw once and for all (static?) and just assigning rfex_latch->iw at the beginning of each function:

Code: Select all

static uint32_t reg_iw;
And in:

Code: Select all

int VR4300_ANDI_ORI_XORI(struct vr4300 *vr4300, uint64_t rs, uint64_t rt) {
         struct vr4300_rfex_latch *rfex_latch = &vr4300->pipeline.rfex_latch;
         struct vr4300_exdc_latch *exdc_latch = &vr4300->pipeline.exdc_latch;

         reg_iw = rfex_latch->iw;
And in general, would there be any interest in using rfex_latch->iw directly and never allocating or assigning anything?

Same for the various masks. Most of them are 64 bits wide. Would there be any performance increase if we declared all of this up front?

For example, with a structure, like this:

Code: Select all

struct emu_reg {
  uint32_t iw;
  uint64_t offset;
  uint64_t addr;
  union {
    uint64_t mask;
    uint64_t or_mask;
  };
  uint64_t and_mask;
  uint64_t xor_mask;
};

static struct emu_reg reg;
And in:

Code: Select all

int VR4300_ANDI_ORI_XORI(struct vr4300 *vr4300, uint64_t rs, uint64_t rt) {
         struct vr4300_rfex_latch *rfex_latch = &vr4300->pipeline.rfex_latch;
         struct vr4300_exdc_latch *exdc_latch = &vr4300->pipeline.exdc_latch;

         reg.iw = rfex_latch->iw;
         reg.and_mask = vr4300_bitwise_lut[reg.iw >> 26 & 0x3][0];
         reg.xor_mask = vr4300_bitwise_lut[reg.iw >> 26 & 0x3][1];
I'm just asking; maybe the compiler already performs some sort of optimization on these allocations.

User avatar
Sintendo
Posts: 25
Joined: Thu Oct 31, 2013 9:11 am

Re: Code review thread

Post by Sintendo » Thu Aug 21, 2014 8:31 am

I don't see how any of that would help? If anything, adding static seems like it would make things less efficient.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Code review thread

Post by Narann » Thu Aug 21, 2014 9:02 am

Sintendo wrote:adding static seems like it would make things less efficient.
Why? Because they are not allocated on the stack?

User avatar
Sintendo
Posts: 25
Joined: Thu Oct 31, 2013 9:11 am

Re: Code review thread

Post by Sintendo » Thu Aug 21, 2014 9:54 am

By making reg_iw global, every function that writes to reg_iw implicitly copies those changes back to where reg_iw is kept in memory at the end, so the changes are visible to other functions reading reg_iw. This is pointless, because every function starts by overwriting it with rfex_latch->iw anyway. You're doing extra work for nothing.
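
To illustrate with a minimal, hypothetical sketch (stripped-down types, not the real CEN64 structures): a write to the file-scope variable generally has to be committed to memory so other functions in the file could observe it, while a plain local can live entirely in a register:

Code: Select all

#include <stdint.h>

/* Hypothetical stand-ins for the real CEN64 types, for illustration only. */
struct rfex_latch { uint32_t iw; };

static uint32_t reg_iw;  /* file-scope: other functions may read it later */

uint32_t op_with_static(const struct rfex_latch *latch) {
  reg_iw = latch->iw;       /* this store usually has to reach memory */
  return reg_iw & 0x3f;
}

uint32_t op_with_local(const struct rfex_latch *latch) {
  uint32_t iw = latch->iw;  /* lives in a register; nothing written back */
  return iw & 0x3f;
}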

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Thu Aug 21, 2014 12:13 pm

Sintendo wrote:By making reg_iw global, every function that writes to reg_iw implicitly copies those changes back to where reg_iw is kept in memory at the end, so the changes are visible to other functions reading reg_iw. This is pointless, because every function starts by overwriting it with rfex_latch->iw anyway. You're doing extra work for nothing.
This is accurate. Static objects also tend to inhibit optimizations, as far as I understand.

Though, Narann, you may be on to something. It may be beneficial to pass IW as a parameter to each function. On x86_64, you're basically guaranteed that the first 4 arguments are passed via registers regardless of M$/SysV ABI. CEN64 currently only passes 3 parameters, so passing IW as a 4th would result in IW being passed to all the functions through a register instead of having to, inevitably, load it off the stack through the vr4300 pointer. This would have the benefit of both reducing the code size and hoisting the load further up in the dependency chain. I will definitely be experimenting with this and expect positive results.
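
A rough sketch of the signature change I'm describing, with hypothetical, stripped-down types rather than the actual CEN64 declarations:

Code: Select all

#include <stdint.h>

/* Hypothetical stand-ins for the real CEN64 structures. */
struct rfex_latch { uint32_t iw; };
struct cpu { struct rfex_latch rfex; };

/* Before: each handler reloads iw through the pointer (a memory load per call). */
int handler_loads_iw(struct cpu *cpu, uint64_t rs, uint64_t rt) {
  uint32_t iw = cpu->rfex.iw;
  return (int)((rs & rt) ^ iw);
}

/* After: the dispatcher loads iw once and passes it as a 4th argument, which
 * arrives in a register under both the SysV and Microsoft x86-64 ABIs. */
int handler_takes_iw(struct cpu *cpu, uint64_t rs, uint64_t rt, uint32_t iw) {
  (void)cpu;
  return (int)((rs & rt) ^ iw);
}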

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Code review thread

Post by Narann » Thu Aug 21, 2014 1:42 pm

MarathonMan wrote:Static objects also tend to inhibit optimizations
I guess it's related to stack optimizations.
MarathonMan wrote:On x86_64, you're basically guaranteed that the first 4 arguments are passed via registers regardless of M$/SysV ABI.
Wow! Interesting! I just looked this up and found this. Is that a good source for what you are talking about? It seems that with System V, up to the first six args are passed in registers.
MarathonMan wrote:CEN64 currently only passes 3 parameters, so passing IW as a 4th would result in IW being passed to all the functions through a register instead of having to, inevitably, load it off the stack through the vr4300 pointer. This would have the benefit of both reducing the code size and hoisting the load further up in the dependency chain. I will definitely be experimenting with this and expect positive results.
It would be very interesting to have a percentage gain for such a register optimization. Don't hesitate to share.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Thu Aug 21, 2014 2:25 pm

Narann wrote:
MarathonMan wrote:On x86_64, you're basically guaranteed that the first 4 arguments are passed via registers regardless of M$/SysV ABI.
Wow! Interesting! I just looked this up and found this. Is that a good source for what you are talking about? It seems that with System V, up to the first six args are passed in registers.
Yes, that contains some information. You might even be interested in reading into the common ABIs further. x86_64 SysV provides things like the red zone that can be used to your advantage if you're aware of them. You basically get 128 bytes on the top of any frame for free, without adjusting the stack pointer, as that section of space is guaranteed not to be destroyed by interrupt handlers and whatnot.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Code review thread

Post by Narann » Thu Aug 21, 2014 3:07 pm

MarathonMan wrote:You basically get 128 bytes on the top of any frame for free, without adjusting the stack pointer, as that section of space is guaranteed not to be destroyed by interrupt handlers and whatnot.
Diving a little into stack stuff, I realize I was not aware of all this low-level detail. I was thinking of the stack as a more high-level function-handling mechanism. Now I realize it's far closer to the hardware than I thought. I will have to read up on it.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Fri Aug 22, 2014 7:59 am

Reduced binary size by 384 bytes for me. :)

Also seemed to slightly boost VI/s of some ROMs that are not running at a full 60VI/s on my system.

Needless to say, committed to master and I'll use the technique in the RSP rewrite as well!

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Code review thread

Post by Narann » Fri Aug 22, 2014 10:30 am

MarathonMan wrote:Reduced binary size by 384 bytes for me. :)
lol how much now? 40kB? :D
MarathonMan wrote:Also seemed to slightly boost VI/s of some ROMs that are not running at a full 60VI/s on my system.
I'm not surprised you didn't gain much. After all, there is still an in
MarathonMan wrote:Needless to say, committed to master and I'll use the technique in the RSP rewrite as well!
That's great! :D

Now that iw is retrieved once, why not initialize it as const and propagate the const to each function? From what I see, iw is never modified, right?

Code: Select all

const uint32_t iw = rfex_latch->iw;
Also, I see you use a (big) inline function. May I suggest using the gcc -Winline flag to ensure everything declared as inline is actually inlined? Seeing how big vr4300_ex_stage is, I'm not sure it's properly inlined.

And last: what do you expect from a static inline outside a header? Here we have static inline int vr4300_rf_stage(struct vr4300 *vr4300) inside vr4300/pipeline.c. Do you want to inline the compiled object at link time?

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Sat Aug 23, 2014 10:14 am

Narann wrote:
MarathonMan wrote:Reduced binary size by 384 bytes for me. :)
lol how much now? 40kB? :D
48kb for me!
Narann wrote:Now that iw is retrieved once, why not initialize it as const and propagate the const to each function? From what I see, iw is never modified, right?

Code: Select all

const uint32_t iw = rfex_latch->iw;
*shrugs* I don't see a benefit to doing it. Marking primitive types as const is something I generally refrain from doing as I don't think there are any benefits.
Narann wrote:Also, I see you use a (big) inline function. May I suggest using the gcc -Winline flag to ensure everything declared as inline is actually inlined? Seeing how big vr4300_ex_stage is, I'm not sure it's properly inlined.

And last: what do you expect from a static inline outside a header? Here we have static inline int vr4300_rf_stage(struct vr4300 *vr4300) inside vr4300/pipeline.c. Do you want to inline the compiled object at link time?
Another useful GNU flag that I wasn't aware of. :D I will have to experiment with this one and add it.

The static inline is probably overkill here, and static would probably suffice just fine. static inline is just a bad habit I guess. :D

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Code review thread

Post by Narann » Fri Jan 16, 2015 7:48 pm

Reading the (very clear) RDP code, I realized color_combiner_equation() could be vectorized. It's always used for r, g, and b (color_combiner_equation() is never called alone, always in threes).

I have no idea whether this function is a performance bottleneck, but it's a ready-made spot for optimization.
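
To sketch the idea (using the generic (a - b) * c + d combiner shape for illustration, not necessarily the exact math in the RDP code, and omitting lookup-table and clamping details), the three channels could go through a single SSE pass:

Code: Select all

#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1, for _mm_mullo_epi32 */

/* Hypothetical vectorized combiner: r, g and b computed in one shot. */
static void combine_rgb_sse(int32_t out[3],
                            const int32_t a[3], const int32_t b[3],
                            const int32_t c[3], const int32_t d[3]) {
  __m128i va = _mm_set_epi32(0, a[2], a[1], a[0]);
  __m128i vb = _mm_set_epi32(0, b[2], b[1], b[0]);
  __m128i vc = _mm_set_epi32(0, c[2], c[1], c[0]);
  __m128i vd = _mm_set_epi32(0, d[2], d[1], d[0]);

  __m128i t = _mm_sub_epi32(va, vb);            /* a - b         */
  t = _mm_mullo_epi32(t, vc);                   /* (a - b) * c   */
  t = _mm_add_epi32(t, _mm_set1_epi32(0x80));   /* rounding bias */
  t = _mm_srai_epi32(t, 8);                     /* >> 8          */
  t = _mm_add_epi32(t, vd);                     /* + d           */

  int32_t lanes[4];
  _mm_storeu_si128((__m128i *)lanes, t);
  out[0] = lanes[0]; out[1] = lanes[1]; out[2] = lanes[2];
}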

Also, a secondary question: are inline functions in .c files actually inlined? I thought only functions defined in headers (or preprocessor macros) could be inlined.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Fri Jan 16, 2015 10:16 pm

I agree that there are certainly ways in which SSE intrinsics could be used to improve sections of this "plugin" -- possibly at the expense of readability.

I have done some very mild profiling of the RDP code (by disabling the RDP commands, sans the RDP SYNCs). For some carts, this doesn't result in any 'breakage' and allows the system to emulate the ROM as normal without running any RDP emulation. The only potential source of 'contamination' is the fact that the compiler's DCE probably determines that most of the code in the RDP plugin is not reachable and omits it from the final binary, improving the locality of some code when most of the RDP plugin isn't present. This is advantageous as I can run a ROM with and without the RDP and compare the VI/s to see how much of the runtime is being spent in the RSP/VR4300 compared to the RDP.

The results showed me that the RDP is incredibly efficient (compared to the rest of the emulator). In Super Mario 64, when Mario's face is spinning around in the intro, it appeared that the RDP only cost 10% of the total runtime. Once I realized that it was that efficient, I put it out of consideration temporarily as the RSP and VR4300 are definitely the elephants in the room as far as performance goes.

Regarding inline, you can certainly have inline functions in both source and header files. A static function need not be inlined. All static really says is that the function is local to the translation unit ("file") in which it is defined. If you mark a small function in a C file as static, a good compiler will almost certainly inline it. If it's a big function, it may not inline it, because the code required to carry out the function is so large. Even if it doesn't inline it, however, marking a function as static gives the compiler the ability to do things like ignore the calling convention and perform other optimizations across the function barrier - things as simple as CSE.

BTW, inline is just a hint to the compiler. A standard C99 compiler can totally ignore your hint and not inline the function if it chooses. For that reason, many compilers provide attributes or pragmas that force inlining to occur. For GNU, the attribute is __attribute__((always_inline)) and for MSVC it is __forceinline.
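
For reference, one common portable pattern for wrapping those two (just a sketch, not what CEN64 actually uses):

Code: Select all

#include <stdint.h>

/* Portable force-inline wrapper: a common pattern, not CEN64's actual macro. */
#if defined(_MSC_VER)
#  define ALWAYS_INLINE __forceinline
#elif defined(__GNUC__)
#  define ALWAYS_INLINE inline __attribute__((always_inline))
#else
#  define ALWAYS_INLINE inline
#endif

static ALWAYS_INLINE uint32_t extract_opcode(uint32_t iw) {
  return iw >> 26;  /* trivial helper; inlining is forced where supported */
}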

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Code review thread

Post by Narann » Fri Jan 16, 2015 10:53 pm

Thanks for the explanation. If it's only 10%, it's pointless to break the MESS code (we would lose easy upstream merges, I guess). :)

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Sat Jan 17, 2015 9:42 am

angrylion doesn't seem to make sweeping changes to the project (mostly just correctness fixes and performance improvements), so any changes wouldn't be hard to merge from upstream. I just haven't put any time into it at the moment because I'd be optimizing a part of the program that only consumes 10% of the runtime, so the gains are much less.

TL;DR: definitely someday, just not now. :)

beannaich
Posts: 149
Joined: Mon Oct 21, 2013 2:43 pm

Re: Code review thread

Post by beannaich » Sat Jan 17, 2015 2:42 pm

Narann wrote:Thanks for the explanation. If it's only 10%, it's pointless to break the MESS code (we would lose easy upstream merges, I guess). :)
The better way to optimize the RDP would be to open a pull request against the actual RDP repo, then merge those changes into cen64 after the pull request is accepted. That way angrylion gets to benefit from the optimizations as well.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Code review thread

Post by MarathonMan » Sat Jan 17, 2015 3:24 pm

beannaich wrote:The better way to optimize the RDP would be to open a pull request against the actual RDP repo, then merge those changes into cen64 after the pull request is accepted. That way angrylion gets to benefit from the optimizations as well.
I could be wrong, but IIRC angrylion is against things that mangle readability like intrinsics tend to do.

beannaich
Posts: 149
Joined: Mon Oct 21, 2013 2:43 pm

Re: Code review thread

Post by beannaich » Sat Jan 17, 2015 5:20 pm

MarathonMan wrote:I could be wrong, but IIRC angrylion is against things that mangle readability like intrinsics tend to do.
If that's the case then you should make your own RDP and use angrylion's as a reference :p

User avatar
Breadwinka
Posts: 54
Joined: Fri Oct 04, 2013 11:35 pm

Re: Code review thread

Post by Breadwinka » Sat Jan 17, 2015 5:33 pm

beannaich wrote:
MarathonMan wrote:I could be wrong, but IIRC angrylion is against things that mangle readability like intrinsics tend to do.
If that's the case then you should make your own RDP and use angrylion's as a reference :p
I think that was the plan if I recall.
