Future state of the project

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Fri Jul 03, 2015 10:44 am

izy wrote:...
Ah thanks for the info - I didn't know Intel published that depth of micro-architectural detail. I always thought everything had to be reverse-engineered.

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Sat Aug 01, 2015 4:18 am

I started optimizing the RDP some more and have reached the point where simple optimizations are making a noticeable difference :D . Using more arrays really helped. For some reason, none of the compilers I looked at vectorized

Code:

TEX->r = t3.r + ((((invsf * (t2.r - t3.r)) + (invtf * (t1.r - t3.r))) + 0x10) >> 5);
TEX->g = t3.g + ((((invsf * (t2.g - t3.g)) + (invtf * (t1.g - t3.g))) + 0x10) >> 5);
TEX->b = t3.b + ((((invsf * (t2.b - t3.b)) + (invtf * (t1.b - t3.b))) + 0x10) >> 5);
TEX->a = t3.a + ((((invsf * (t2.a - t3.a)) + (invtf * (t1.a - t3.a))) + 0x10) >> 5);

yet they did once I used arrays instead of structures.
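
For illustration, here is a minimal sketch of what the array-based version might look like. The names `color_t`, `CH_COUNT` and `tex_filter` are made up for this example and are not taken from the actual RDP source. The arithmetic is the same as in the four statements above, just written as a loop over a fixed-size array, which is the shape auto-vectorizers tend to recognize more readily.

```c
#include <stdint.h>

/* Hypothetical layout: channels stored in an array instead of named fields. */
enum { CH_R, CH_G, CH_B, CH_A, CH_COUNT };

typedef struct {
    int32_t c[CH_COUNT];
} color_t;

/*
 * Same filter as the per-field version, one iteration per channel.
 * Note: >> on a negative int32_t is implementation-defined in C;
 * mainstream compilers treat it as an arithmetic shift.
 */
static void tex_filter(color_t *tex,
                       const color_t *t1, const color_t *t2, const color_t *t3,
                       int32_t invsf, int32_t invtf)
{
    for (int i = 0; i < CH_COUNT; i++) {
        int32_t d2 = t2->c[i] - t3->c[i];
        int32_t d1 = t1->c[i] - t3->c[i];
        tex->c[i] = t3->c[i] + (((invsf * d2) + (invtf * d1) + 0x10) >> 5);
    }
}
```

Whether a given compiler actually vectorizes the loop depends on the compiler and flags, so it is worth checking the generated assembly (or a vectorization report) rather than assuming.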

AIO
Posts: 51
Joined: Wed Nov 05, 2014 4:56 pm

Re: Future state of the project

Post by AIO » Fri Jun 10, 2016 12:57 am

The good news is that I realized pmaddwd is quite useful for certain algorithms :D . The gap between the SSE2 and SSE4 paths is now smaller, since pmulld isn't necessary most of the time. I'm going to start using pmaddwd in rgbaz_correct_clip and texture_pipeline_cycle. It should not only narrow the gap, but also be faster than the previous algorithm.

So I profiled and saw a decent boost in rgbaz_correct_clip after replacing 2 pmulld's with 1 pmaddwd :D . I think coding in assembly, along with using SMC and lots of function splitting, is a good way to gain a significant performance boost. I expect SMC to reduce memory reads, free up more registers, cut down on branching, and reduce the amount of function splitting needed. Renderspans() is very inefficient for rectangles, so I plan on writing a function optimized for rectangles.
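
To show why one pmaddwd can stand in for two pmulld's, here is a rough sketch using SSE2 intrinsics. It is not the actual rgbaz_correct_clip code: the function names and the sum-of-two-products shape are assumptions made for the example. It also assumes that every value and weight fits in a signed 16-bit lane. pmaddwd (`_mm_madd_epi16`) multiplies eight 16-bit pairs and adds adjacent products into four 32-bit sums, so interleaving the two inputs gives `sf * d2[i] + tf * d1[i]` for four channels in a single instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: out[i] = sf * d2[i] + tf * d1[i]. */
static void blend_terms_scalar(int32_t out[4],
                               const int16_t d2[4], const int16_t d1[4],
                               int16_t sf, int16_t tf)
{
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t)sf * d2[i] + (int32_t)tf * d1[i];
}

/* Hypothetical SSE2 version: one pmaddwd yields all four sums. */
static void blend_terms_madd(int32_t out[4],
                             const int16_t d2[4], const int16_t d1[4],
                             int16_t sf, int16_t tf)
{
    /* Load four 16-bit values from each input into the low 64 bits. */
    __m128i v2 = _mm_loadl_epi64((const __m128i *)d2);
    __m128i v1 = _mm_loadl_epi64((const __m128i *)d1);

    /* Interleave to d2[0], d1[0], d2[1], d1[1], ... */
    __m128i pairs = _mm_unpacklo_epi16(v2, v1);

    /* Weights in matching order: sf, tf, sf, tf, ... (lowest lane is last). */
    __m128i coef = _mm_set_epi16(tf, sf, tf, sf, tf, sf, tf, sf);

    /* pmaddwd: each 32-bit lane = pairs[2i] * sf + pairs[2i+1] * tf. */
    __m128i sums = _mm_madd_epi16(pairs, coef);

    _mm_storeu_si128((__m128i *)out, sums);
}
```

A side benefit of this approach is that pmaddwd is part of SSE2, whereas pmulld needs SSE4.1, which is why the two code paths end up closer together.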

MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Future state of the project

Post by MarathonMan » Wed Jun 15, 2016 12:15 am

I implemented some of the easy pickings in texture_pipeline_cycle in CEN64 and saw about a 1 VI/s improvement in OoT. The binary size has also shrunk quite a bit (> 7% for Linux SSE4.1). EDIT: oops, never mind about the size part!

I stand corrected - perhaps vectorizing the RDP will result in more performance than I had otherwise expected. It is quite a bother to do, though.
Attachments
angrylion-rdp.png (41.13 KiB)
