I have an experimental branch where:
-multithread will use RSP/VI for one core, RDP for a second, and VR4300/AI/PI/SI for the third.
A fourth thread is used for rendering the framebuffer to the window, but that's already merged into everything.

Snowstorm64 wrote: Nice! How much does it impact on the emulator's accuracy (also performance)?
Accuracy: well, it boots like... nothing. So quite bad right now, lol.
Snowstorm64 wrote: Also Super Mario 64, although its graphics are a bit unstable, but it's playable. I have managed to play it overclocked (up to almost double the speed!) and get two stars before the game crashed.
Nice find!! I am at 60 VI/s on this game as well!
iwasaperson wrote: Getting 90 VI/s on my 6600K. We may need a frame limiter now.
First world problems... :p
MarathonMan wrote: First world problems... :p
Using a CRT at 93Hz with Linux (OSS Intel drivers), so that's not an option.
Turn on v-sync in the meantime? My system will lock at 60 VI/s because my display configuration is set up for 60 Hz.
I can't really debug a frame limiter, because I literally cannot get it to go over 60 VI/s anyway...
iwasaperson wrote: Using a CRT at 93Hz with Linux (OSS Intel drivers), so that's not an option.
Ah, I see.
MarathonMan wrote: Ah, I see.
And I rescind my comment - I can debug it; I'll just have to debug it in a headless mode.
tl;dr: on the TODO list it goes.
Sounds good.
iwasaperson wrote: Sounds good. -cut-
Really impressed with the spike in progress lately.
Thanks!
Snowstorm64 wrote: I wonder if at this point we could afford a tighter sync on the various components, in order to achieve better accuracy, or not... Maybe an option to set the looseness of the sync?
Yes, it is absolutely possible to tighten the sync on higher-end systems.
Narann wrote: Unfortunately, this would not really help CEN64 (nor the angrylion plugin), as it only relies on OpenGL for window rendering. Vulkan would not improve performance.
OpenGL and Vulkan are different concepts. GPUs are no longer graphics accelerators, but massively parallel SIMD machines designed for general-purpose computing, provided the workload fits a massively parallel flow. Yes, Vulkan has a graphics/gaming flavor, but it's much more low-level than any other GPU API.
Snowstorm64 wrote: Well, it's true that Angrylion's RDP is software rendering, and CEN64's VI is the only place here that uses OpenGL (and it's quite simple and minimal), along with the backends in the os directory. Still, I wonder how well a Vulkan-based RDP (different from Angrylion's) would do against the software rendering, though...
The problem is which accuracy you want to achieve using the GPU (I'm not talking about cycle accuracy, more pixel/depth accuracy). If you rely on the GPU's internal rasterizer (the one exposed by Vulkan/OpenGL), you are clearly doing HLE and will fight against it to get consistent results.
Narann wrote: -cut-
What about LLE? Isn't z64gl supposed to be low-level hardware rendering, so it can emulate the graphics without the problems that HLE encounters, like you said before? Or am I wrong? If someone made a Vulkan-based LLE RDP, could it be more accurate than anything else except the software rendering?
Snowstorm64 wrote: What about LLE? Isn't z64gl supposed to be low-level hardware rendering -cut-
I should check again, but from what I remember, z64gl is not an LLE RDP per se but an HLE RDP relying on an LLE RSP.
From the z64gl description: "it is mainly an RDP emulator implemented in OpenGL. Contrary to usual graphics plugins, it doesn't emulate the RSP part, so it requires a functional RSP emulator plugin to give any results."
What does it change? Once again, the rasterizer is not "accurate" (in the sense of "properly emulated") because it relies on the local GPU's one. I would be interested to know if the red dot works in Pokemon Snap with z64gl, as this game needs a properly emulated depth buffer.
wareya wrote: Just emulate the entire system on GPU /s
Indeed (well, maybe not the whole system, but the RDP could be 100% implemented on the GPU in a 100% pixel-exact way). The fact that the rest of the N64 "emulators" use fixed-function OpenGL and completely trash the N64 experience has created a generalized bad opinion about using GPUs for the emulation. However, GPUs can be programmed nowadays just like you write C, and you can even get IEEE floating-point compliance if you ask for it. The RDP is designed for per-pixel and per-vertex operations, so the GPU's massive parallelism fits the scenario. Yes, and with 100% pixel accuracy. But of course not using OpenGL.
asiga wrote: well, maybe not the whole system, but the RDP could be 100% implemented in GPU in a 100% pixel exact way.
Even if it's possible, it will be quite hard. For example, depth values take multiple fixed-point formats during the RDP pipeline. Simulating this can be tricky even on a CPU. Why would you bother to hack this onto the GPU side? What would you win compared to the CPU? Performance? I already explained above why that's not relevant.
asiga wrote: The fact that the rest of N64 "emulators" use fixed-operation OpenGL and they completely trash the N64 experience, has created a generalized bad opinion about using GPUs for doing the emulation.
No, the fact that the N64 architecture is very different from any modern console does.
asiga wrote: However, GPUs can be programmed nowadays just like you write C
Writing a kernel (or a shader) in C is not complicated, yes, but it doesn't mean you have no more complexity: you still have to handle CPU<=>GPU memory, something you don't have to do in a CPU-only solution.
asiga wrote: you can even get IEEE fp accuracy compliance if you ask to.
For proper RDP emulation you actually never want that.
asiga wrote: The RDP is designed for by-pixel operations
True.
asiga wrote: and by-vertex operations
False.
asiga wrote: so the GPU massive parallelism fits in the scenario. Yes, and with 100% pixel accuracy. But of course not using OpenGL.
But the original question remains: what do you expect from using the GPU? Why is the GPU so good? Performance? Once again, if you want accurate, hack-free results you need native resolution, so the number of pixels to compute is not that high. Plus, doing so, you will have to sync GPU and CPU memory. While the amount of data is not big, simply accessing GPU memory on a non-unified memory architecture can decrease performance a lot.
Narann wrote: But the original question remains: what do you expect from using the GPU? -cut- TLDR: While possible, it's complex for virtually no benefit.
Well, I'm not going to try to convince anybody about the benefits of general-purpose computing on GPUs. Even Intel is jumping onto this wagon by applying GPU concepts and design to their Xeon Phi (Knights Landing/Knights Hill/etc.) HPC products.
Narann wrote: The only situation where you "could" actually write an LLE RDP plugin using the local GPU is on unified memory architectures, writing bare-metal (architecture-specific) GPU code (the Raspberry Pi with its VC4 is a good example).
...so basically HSA on a modern AMD APU?
Nintendo Maniac 64 wrote: ...so basically HSA on a modern AMD APU?
It could be a good candidate, yes!
Narann wrote: It could be a good candidate, yes!
The only thing is that MarathonMan has previously stated that CEN64 is so latency-sensitive that SMT is actually faster than two separate CPU cores, so the likes of an on-die GPU (even with shared memory and all) would be slower still due to even worse latency.
Nintendo Maniac 64 wrote: The only thing is that MarathonMan has previously stated that CEN64 is so latency-sensitive that SMT is actually faster than two separate CPU cores
In which situation?
MarathonMan wrote: Multithreading as it's implemented now only works due to the fact that I found a way to sacrifice a very small amount of accuracy for a disproportionately large amount of parallelism.
So basically what bsnes's "Balanced" core does with regards to performance vs accuracy.
wareya wrote: bsnes is single-core
Which is why I said "I do realize bsnes is single-threaded though".
Narann wrote: I'm surprised SSMB lags so much; it was supposed to be a graphically cheap game, so it should actually be fast.
I think I was only recording at 30 FPS.
MarathonMan wrote: I nearly stabilized this commit. There is the occasional total freeze after a few minutes of play with some games, but I'm confident that I'll be able to figure it out. I'll toss up a YouTube video in a bit.
Getting 60 VI/s on these titles:
Goldeneye 007
Vigilante 8
Mario Tennis (on court, not in menus)
Banjo Kazooie
Super Smash Bros.
etc...
EDIT: Enjoy! https://www.youtube.com/watch?v=Jy8IOxcj8r4
Wow. Single-threaded runs about the same for me on my 6600K (OCed 200MHz with turbo boost), so a multithreaded RDP would probably run full speed all the time.
iwasaperson wrote: Also, I've noticed that CEN64 wants to use the JACK audio server. What advantages does this have over ALSA for emulation? I already use JACK for audio production, so I just have it on whenever I'm running CEN64 anyway.
CEN64 uses OpenAL, not JACK. I had some issues with that a while ago; it turned out that OpenAL Soft wasn't configured properly. I fixed it using the tool "alsoft-conf" (which you can find in the repository under the same name) and pointed it to the backend I use (PulseAudio, but it can also be JACK or ALSA).
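For what it's worth, the same backend selection can be done without the GUI tool by editing OpenAL Soft's config file (on Linux, typically ~/.alsoftrc); the `drivers` option sets the backend priority order:

```
[general]
# Prefer PulseAudio, then fall back to ALSA.
# Put "jack" first instead to force the JACK backend.
drivers = pulse,alsa
```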
iwasaperson wrote: EDIT: https://ipfs.pics/ipfs/QmbZkhSpC3NDSKeo ... qXfikp7CNY
What's going on here? I know my ROM is fine since I got it from the GoodSet and it works perfectly on my EverDrive. I also instantly died when the intro finished.
This is because the FlashRAM save isn't loaded into CEN64; you need to load it to make Majora's Mask work properly. To do it, launch from a shell something like this:
Code:
cen64 -flash tlozmajorasmask.fla pifdata.bin tlozmajorasmask.z64
Snowstorm64 wrote: CEN64 uses OpenAL, not JACK. -cut-
It already works just fine without JACK running. I guess it falls back to ALSA in that case. Also, alsoft-conf doesn't give JACK as an option; it's just using ALSA ATM.
Snowstorm64 wrote: This is because the FlashRAM save isn't loaded into CEN64 -cut-
That worked. Thanks.
MarathonMan wrote: Getting 60 VI/s on these titles:
Goldeneye 007
Super Smash Bros.
Just how much CPU headroom is left over (if any) for whatever your particular CPU model is? I'm particularly interested in whether there's enough headroom available via overclocking for 60 VI/s to still be possible even if CEN64 is compiled to run at overclocked N64 speeds (125MHz).
CluelessGuy wrote: Can someone break this down for me? I'm not super tech savvy. Does this mean CEN64 will take advantage of systems with multiple cores now, and before it was only using one? I've got an 18-core machine, with each core clocked around 2.3GHz. Should I expect very good performance moving forward with this update?
I'm also curious about how many cores can be used by CEN64, because I'm going to buy a new machine next month and, although it could seem exaggerated to some, CEN64 could play a role in deciding the number of cores of my machine.
CluelessGuy wrote: -cut- I've got an 18-core machine, with each core clocked around 2.3GHz. Should I expect very good performance moving forward with this update?
Yes, it can take advantage of up to ~3-4 cores. However, CEN64 prefers fast cores rather than low clocks and lots of cores. A 2.3GHz 18-core CPU is not ideal for high performance.
asiga wrote: I'm also curious about how many cores can be used by CEN64
I don't see it really scaling beyond a quad core. Only 3 of the cores are really "busy" (the fourth just handles the GUI, renders the screen, etc.). For the best performance you want a true quad core, though -- there's a notable difference between a dual-core hyperthreaded CPU and one that actually has 4 cores.
MarathonMan wrote: Yes, it can take advantage of up to ~3-4 cores. -cut-
In the future, I hope to use fewer cores once I finish the rewrite. The rule of thumb is probably this: get the highest-clocked quad core you can find. If you're on a budget, get the highest-clocked dual core with hyperthreading you can find.