---------------- Introduction ----------------

So, at this time, I've effectively written two CEN64 interpreter cores. Unfortunately, they're simply not fast enough, even with crazy amounts of high-end hardware. To address this issue with the third (and hopefully final) rewrite of CEN64, I've researched two possible solutions that should allow CEN64 to run at full speed on modest hardware:

* Run multiple interpreters in parallel (one for each pipeline, processor, etc.) and design them such that each one can "commit" and "rollback" a simulated cycle - effectively, a transactional memory-like approach. Early analysis of this kind of approach demonstrated a prohibitively high transactional "abort" rate. I found that, due to the N64's architecture, there was a lot of aborting and stalling, as components access RDRAM very frequently, raise interrupts often, etc. And, since every piece of the system is cycle-accurate, commit logs are incredibly expensive to maintain due to the state of all the latches and everything else that must be tracked. So, I'm more or less convinced that a transactional-style system capable of leveraging multicore processing isn't the answer.

* Leverage dynamically-recompiled blocks. On a Haswell system, it's not uncommon to see 3-3.5 IPC when the RDP isn't running. Unfortunately, many of the instructions being executed are very, very predictable conditional branches (which cannot be omitted, as the cycle-accurate pipelines still need to catch the uncommon case). So, what one could do is emit and somehow link together small, optimized cycle-accurate pipeline models that have many of these conditional checks omitted. Accuracy is still maintained because, oftentimes, it can be determined that many of the checks are unnecessary given the program counter or some other state of the system.
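To make the check-elision idea concrete, here's a minimal sketch in C (invented names, not CEN64's actual code) of a generic per-cycle stage model next to a specialized one that an analysis pass has proven can skip the conditional entirely:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical latch pair for one pipeline stage. */
struct dc_latch { uint32_t in, out; };

/* Generic model: tests every simulated cycle whether the prior op
 * touched memory, even though the answer is usually known in advance. */
static void dc_cycle_generic(struct dc_latch *l, bool prior_was_mem) {
  if (prior_was_mem)
    l->out = l->in ^ 1u;  /* stand-in for a real cache access */
  else
    l->out = l->in;       /* just forward the latch contents */
}

/* Specialized model emitted when analysis proves no memory op precedes
 * this cycle: the branch disappears from the hot path entirely. */
static void dc_cycle_forward(struct dc_latch *l) {
  l->out = l->in;
}
```

The recompiler would emit the specialized variant for blocks where the condition is statically known, and fall back to the generic one elsewhere.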
Other common examples of checks that can be omitted in the VR4300 pipeline alone: the data cache (DC) stage can simply forward the contents of its latches when the prior cycle was not a memory instruction. The execute (EX) stage does not have to check whether a COP (coprocessor unusable) exception should be raised, or whether FPU registers need to be accessed, when the instruction is an ordinary integer instruction. Virtual address regions (uncached, cached and mapped, etc.) can be determined ONCE, based on the virtual address of the block. These are just a handful of examples where checks can be optimized out simply by using an initial pass that analyzes the state of the system. And so, this is the route that I've decided to take with the third rewrite of the CEN64 core.

---------- Design ----------

The heart and brains of the emulator run within a virtual machine-like context. The goal of the system is to call a thunk and remain inside the context as much as possible, exiting only to compile or to perform some activity that cannot be performed within the context itself. Working inside a context gives us full rein over the hardware registers and enables us to effectively ignore the host's calling convention. This means that, for example, on x86_64, we can keep the entirety of the RSP's accumulator registers and flags in native hardware registers, and still have half of our vector hardware registers to spare!

Dynamically recompiled blocks of code can quickly be allocated and deallocated using a custom slab-based allocator. Although the allocator has a fixed-size memory pool and probably has higher overhead than conventional memory allocation algorithms, it's significantly faster than all of the libc malloc/free implementations I've tried (10x faster than GNU libc's).
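As a rough illustration of the allocator idea (a sketch only - slot size, pool size, and names are my assumptions, not the real allocator's), a fixed-pool slab allocator with an intrusive free list can serve allocations and frees in O(1) with no syscalls:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a fixed-pool slab allocator: one size class,
 * free slots threaded onto an intrusive singly-linked list. */

#define SLAB_SLOT_SIZE  256          /* bytes per dynarec block slot */
#define SLAB_SLOT_COUNT 1024

static uint8_t slab_pool[SLAB_SLOT_SIZE * SLAB_SLOT_COUNT];
static void *slab_free_head;

static void slab_init(void) {
  /* Thread every slot onto the free list; the first word of a free
   * slot stores the pointer to the next free slot. */
  slab_free_head = NULL;
  for (size_t i = 0; i < SLAB_SLOT_COUNT; i++) {
    void *slot = slab_pool + i * SLAB_SLOT_SIZE;
    *(void **) slot = slab_free_head;
    slab_free_head = slot;
  }
}

static void *slab_alloc(void) {
  void *slot = slab_free_head;
  if (slot != NULL)
    slab_free_head = *(void **) slot;   /* pop the head: O(1) */
  return slot;
}

static void slab_free(void *slot) {
  *(void **) slot = slab_free_head;     /* push back onto the list */
  slab_free_head = slot;
}
```

Because the pool is one contiguous region, it can be mapped executable once up front, which is what makes the next point possible.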
Moreover, there is no need to worry about marking pages as executable on every allocation, since the allocator reuses the same set of pages (which remain executable throughout the execution of the emulator). Lastly, it coerces the system into using large pages so as to reduce the number of page faults, even for large amounts of dynamically recompiled code.

With an execution environment and allocator in place, the design questions that remain are really the most interesting of the bunch: how does one efficiently dynamically compile optimized cycle-accurate models (and link them together)? To compile blocks efficiently, my current approach relies on several "templated" cycle-accurate models, each with a hole in the middle for the execution logic being emulated (an FPU add, an integer multiply, etc.). In this way, optimization of the templates can be done outside of runtime, and assembly of a model at runtime is very efficient, since it really only involves selecting some templates and moving data around. The selection is done simply by leaving the virtualized context for a brief period, taking the current state of the system, and feeding it to a selection algorithm.

With most questions of compilation itself sorted out (if you can even call it that!), the only real question remaining is this: how does one link these models together at runtime? Since each model has several cycles of 'assumptions' baked into it, branching backward and forward really throws a wrench into the mix, because we need something to do while the pipeline is re-primed, before we can start executing our models again. One option is to add additional checks to each model (to make it more generic), but that effectively cancels out the potential gain of the system, so I'd rather avoid that if possible. Another option is to compile and store paths for both branch directions along with the simulated model for that cycle.
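Going back to the templates for a moment: a minimal sketch of the template-plus-hole idea, with the caveat that the real thing patches machine code into the hole, whereas here a function pointer stands in for it (all names are hypothetical):

```c
#include <stdint.h>

/* The "hole": per-instruction execute logic (an add, an AND, etc.). */
typedef uint32_t (*exec_hole)(uint32_t a, uint32_t b);

/* A pre-optimized cycle model; everything around the hole is fixed
 * ahead of time, so no optimization work happens at runtime. */
struct cycle_template {
  /* ... fixed stage logic, latch movement, etc. ... */
  exec_hole execute;   /* filled in per-instruction at compile time */
};

static uint32_t op_add(uint32_t a, uint32_t b) { return a + b; }
static uint32_t op_and(uint32_t a, uint32_t b) { return a & b; }

/* "Compiling" a cycle reduces to selecting a template and filling the
 * hole - just selection plus data movement. */
static struct cycle_template compile_cycle(int opcode) {
  struct cycle_template t;
  t.execute = (opcode == 0) ? op_add : op_and;
  return t;
}
```

In the real emitter, `compile_cycle` would instead memcpy a pre-assembled code template into the slab and splice the operation's instructions into the hole.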
Indirect branches will be a little more cumbersome, but will still work as long as there are only a few potential candidates for branching. In the event that this doesn't end up being the case, or we have to start emulating a hardware trap/exception, we can run a generic interpreter for a few cycles/instructions, and then jump back into the compiled code.
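The fallback strategy can be sketched as a dispatch loop (purely illustrative - the table shape, hashing, and names are my assumptions): run a compiled model whenever one exists for the current PC, and otherwise interpret generically until we can re-enter compiled code.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_TABLE_SIZE 64

/* A compiled block consumes some cycles and returns the next PC. */
typedef uint64_t (*compiled_block)(uint64_t pc);

struct machine {
  uint64_t pc;
  compiled_block blocks[BLOCK_TABLE_SIZE]; /* toy hash: pc mod size */
  unsigned interpreted_cycles;
};

/* Generic, check-heavy interpreter step used as the fallback. */
static uint64_t interpret_one(struct machine *m) {
  m->interpreted_cycles++;
  return m->pc + 4;    /* pretend: sequential fall-through */
}

static void run(struct machine *m, unsigned steps) {
  while (steps--) {
    compiled_block b = m->blocks[m->pc % BLOCK_TABLE_SIZE];
    if (b != NULL)
      m->pc = b(m->pc);            /* stay in fast compiled models */
    else
      m->pc = interpret_one(m);    /* fall back, then retry the table */
  }
}

/* A hypothetical compiled block covering the code at PC 0. */
static uint64_t block_at_0(uint64_t pc) { return pc + 16; }
```

The same escape hatch handles traps/exceptions: invalidate the assumption-laden path, interpret a few cycles, and look for a compiled block again.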