----------------
Introduction
----------------
So, at this time, I've effectively written two CEN64 interpreter cores.
Unfortunately, they're simply not fast enough, even with crazy amounts of
high-end hardware.
To address this issue with the third (and hopefully, final) write of CEN64,
I've researched two possible solutions that should allow CEN64 to run at full
speed on modest hardware:
* Run multiple interpreters in parallel (one for each pipeline, processor,
etc.) and design them such that each one can "commit" and "rollback" a
simulated cycle - effectively, a transactional memory-like approach.
Early analysis of this kind of approach demonstrated a prohibitively high
transactional "abort" rate. I found that, due to the N64's architecture,
there was a lot of aborting and stalling: components access RDRAM very
frequently, raise interrupts often, and so on. And, since every piece of
the system is cycle-accurate, commit logs are incredibly expensive to
maintain because of all the latch state and other machine state that has
to be tracked.
So, I'm more or less convinced that a transactional-style system that is
capable of leveraging multicore processing isn't the answer.
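For concreteness, the commit/rollback idea that was rejected can be sketched roughly as follows. This is a hypothetical illustration, not CEN64 code: the struct names, the shadow-copy scheme, and the conflict flag are all assumptions, and a real conflict detector (watching RDRAM accesses, interrupt lines, etc.) is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the rejected transactional approach: each
 * component steps one simulated cycle against a shadow copy of its
 * state, then either commits the shadow or rolls back when a conflict
 * (e.g. a simultaneous RDRAM access from another component) was found. */
struct component_state {
  uint64_t latches[8];   /* pipeline latches that must be logged */
  uint32_t cycle;
};

struct transaction {
  struct component_state shadow;  /* speculative copy of the state */
  bool conflict;                  /* set by a (not shown) conflict detector */
};

/* Begin a simulated cycle: snapshot the state we may need to discard. */
static void tx_begin(struct transaction *tx, const struct component_state *live) {
  tx->shadow = *live;
  tx->conflict = false;
}

/* Commit: publish the shadow state. Returns false on abort, in which
 * case the live state is untouched and the cycle must be re-simulated. */
static bool tx_commit(struct transaction *tx, struct component_state *live) {
  if (tx->conflict)
    return false;
  *live = tx->shadow;
  return true;
}
```

The expensive part is exactly what the text describes: every latch in `component_state` has to be copied each simulated cycle, whether or not the transaction aborts.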
* Leverage dynamically-recompiled blocks. On a Haswell system, it's not
uncommon to see 3-3.5 IPC when the RDP isn't running. Unfortunately, many
of the instructions being executed are very, very predictable
conditional branches (which cannot be omitted, as the cycle-accurate
pipelines still need to catch the uncommon case)!
So, what one could do is to emit and somehow link together small, optimized
cycle-accurate pipeline models that have many conditional checks omitted.
Accuracy is still maintained because oftentimes, it can be determined that
many of the conditional checks are not necessary, depending on the program
counter or some other state of the system. Other common examples of checks
that can be omitted in the VR4300 pipeline alone: the data cache (DC) stage
can simply be made to forward the contents of the latches when the prior
cycle was not a memory instruction. The execute (EX) stage does not have to
check whether a COP (coprocessor unusable) exception should be raised, or
if FPU registers need to be accessed, when the instruction is an ordinary
integer instruction. Virtual address regions (uncached, cached and mapped,
etc.) can be determined ONCE depending on the virtual address of the block.
These are just a handful of examples where checks can be optimized out
simply by using an initial pass that analyzes the state of the system.
And so, this is the route that I've decided to take with the third write
of the CEN64 core.
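The DC-stage example above can be sketched in miniature. This is an illustrative sketch, not CEN64's actual pipeline code: the struct and function names are invented, and the "cache lookup" is reduced to a value passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: at block-compile time, an analysis pass can see
 * whether the instruction flowing into the DC stage is a memory op. */
struct dc_latch { uint64_t value; };

/* Generic model: the memory-op check runs every simulated cycle. */
static uint64_t dc_stage_generic(struct dc_latch *l, bool is_mem_op,
                                 uint64_t loaded) {
  if (is_mem_op)
    l->value = loaded;   /* result of the (elided) data cache access */
  return l->value;
}

/* Specialized model, selected when the analysis pass proves the prior
 * cycle held no memory instruction: the check is simply gone, and the
 * stage just forwards the contents of its latch. */
static uint64_t dc_stage_forward_only(struct dc_latch *l) {
  return l->value;
}
```

Both models produce identical results whenever the specialized one is legal to use; the win is that the emitted block never pays for the branch.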
----------
Design
----------
The heart and brains of the emulator run within a virtual machine-like context.
The goal of the system is to call a thunk and remain inside the context as much
as possible, exiting only to compile or perform some activity that cannot be
performed within the context itself.
Working inside a context gives us full rein over the hardware registers and
enables us to effectively ignore the host's calling convention. This means
that, for example, on x86_64, we can keep the entirety of the RSP's accumulator
registers and flags in native hardware registers, and still have half of our
vector hardware registers to spare!
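The context boundary might look something like the sketch below. This is a hypothetical illustration, not CEN64's actual thunk: the names (`vm_context`, `enter_context`, the exit reasons) are invented, and the register save/restore that a real assembly thunk would perform is only indicated in comments.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the context boundary: the host enters the
 * virtualized context through a single thunk, runs recompiled code
 * until something forces an exit (compiling a new block, servicing an
 * event, ...), and only at that boundary does the host's calling
 * convention matter at all. */
enum exit_reason { EXIT_COMPILE_BLOCK, EXIT_DEVICE_EVENT };

struct vm_context {
  uint64_t guest_regs[32];   /* inside the context, these would be pinned
                                to host hardware registers instead */
  enum exit_reason reason;
};

/* Stand-in for a recompiled block; a real build would jump to emitted
 * machine code rather than call through a C function pointer. */
typedef enum exit_reason (*compiled_block)(struct vm_context *);

static enum exit_reason enter_context(struct vm_context *ctx,
                                      compiled_block entry) {
  /* ...save callee-saved host registers (asm in a real thunk)... */
  ctx->reason = entry(ctx);
  /* ...restore host registers before returning to the host... */
  return ctx->reason;
}

/* Example "block", purely for illustration: touches a guest register
 * and requests an exit so a successor block can be compiled. */
static enum exit_reason demo_block(struct vm_context *ctx) {
  ctx->guest_regs[1] += 1;
  return EXIT_COMPILE_BLOCK;
}
```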
Dynamically recompiled blocks of code can quickly be allocated and deallocated
using a custom slab-based allocator. Although the allocator has a fixed-size
memory pool and probably has higher overhead than conventional memory allocation
algorithms, it's significantly faster than all libc malloc/free implementations
I've tried (10x faster than GNU libc). Moreover, there is no need to worry about
marking pages as executable for every allocation, since the allocator reuses the
same set of pages (which remain executable throughout the execution of the
emulator). Lastly, it coerces the system into using large pages so as to reduce
the number of page faults, even for large amounts of dynamically recompiled
code.
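The core of such an allocator is small. The sketch below is a minimal illustration under stated assumptions, not CEN64's allocator: it uses a plain static buffer, whereas the real pool would be mapped once with execute permission (and, ideally, backed by large pages), and the slab size and count are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-pool slab allocator for recompiled blocks: the
 * pool is carved into equal-size slabs threaded onto a free list, so
 * both alloc and free are a couple of pointer moves with no syscalls. */
#define SLAB_SIZE  4096u
#define SLAB_COUNT 16u

struct slab { struct slab *next; };

static uint8_t pool[SLAB_SIZE * SLAB_COUNT];  /* mmap'd PROT_EXEC in reality */
static struct slab *free_list;

static void slab_init(void) {
  free_list = NULL;
  for (size_t i = 0; i < SLAB_COUNT; i++) {
    struct slab *s = (struct slab *)&pool[i * SLAB_SIZE];
    s->next = free_list;          /* thread every slab onto the free list */
    free_list = s;
  }
}

static void *slab_alloc(void) {
  struct slab *s = free_list;
  if (s)
    free_list = s->next;          /* pop: O(1) */
  return s;
}

static void slab_free(void *p) {
  struct slab *s = p;
  s->next = free_list;            /* push: O(1) */
  free_list = s;
}
```

Because freed slabs go back onto the same list, a freshly freed (still executable) page is the first one handed out again, which is exactly the page-reuse property described above.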
With an execution environment and allocator in place, the design questions that
remain are really the most interesting of the bunch: how does one efficiently
dynamically compile optimized cycle-accurate models (and link them together) by
some means?
To efficiently compile blocks, my current approach relies on several
"templated" cycle-accurate models, each with a hole in the middle for
emulating the execution logic (an FPU add, an integer multiply, etc.). In this
way, the templates can be optimized ahead of time, and assembling a model at
runtime is very efficient, since it really only involves selecting some
templates and moving data around. The selection is done simply by leaving the
virtualized context for a brief period, taking the current state of the system,
and feeding it to a selection algorithm.
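Mechanically, "template plus hole" amounts to two copies. The sketch below is illustrative only: the struct layout is invented, and the byte values stand in for real pre-assembled x86 machine code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical template-based block assembly: each cycle-accurate model
 * is a pre-optimized byte sequence with a "hole" at a known offset where
 * the per-instruction execute logic is dropped in. Runtime compilation
 * is then just template selection plus data movement. */
struct template_model {
  const uint8_t *bytes;  /* pre-assembled, pre-optimized model code */
  size_t len;
  size_t hole_off;       /* where the execute logic is spliced in */
  size_t hole_len;
};

static size_t emit_model(uint8_t *out, const struct template_model *t,
                         const uint8_t *exec_logic) {
  memcpy(out, t->bytes, t->len);                       /* copy template */
  memcpy(out + t->hole_off, exec_logic, t->hole_len);  /* fill the hole */
  return t->len;
}
```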
With most questions of compilation itself sorted out (if you can even call it
that!), the only real question remaining is this: how does one link together
these models at runtime? Since each model has several cycles of "assumptions"
baked into it, branching backward and forward really throws a wrench into the
mix, because we need to do something until the pipeline is primed again and we
can resume executing our models.
One option is to add additional checks to each model (making them more
generic), but that effectively cancels out the potential gain of the system,
so I'd rather avoid that if possible.
Another option is to compile and store paths for both branch directions along
with the simulated model for that cycle. Indirect branches will be a little
more cumbersome, but will still work as long as there are only a few potential
candidates for branching. In the event that this doesn't end up being the case,
or we have to start emulating a hardware trap/exception, we can run a generic
interpreter for a few cycles/instructions, and then jump back into the compiled
code.
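The linking scheme described here can be sketched as a dispatch step. This is a hypothetical illustration of the idea, not CEN64 code: the struct layout, the size of the indirect-target table, and the NULL-means-interpret convention are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block linking: each compiled model records successors
 * for both branch directions, plus a small table of expected targets
 * for indirect branches. Anything else (an unexpected target, a
 * trap/exception) falls back to a generic interpreter for a few
 * cycles, after which compiled execution can resume. */
#define MAX_INDIRECT 4

struct block {
  struct block *taken;       /* successor when a direct branch is taken */
  struct block *not_taken;   /* fall-through successor */
  uint32_t indirect_pc[MAX_INDIRECT];      /* candidate target PCs */
  struct block *indirect_tgt[MAX_INDIRECT];
  size_t n_indirect;
};

/* Returns the next compiled block, or NULL to signal "drop to the
 * interpreter until the pipeline is primed again". */
static struct block *next_block(struct block *b, bool taken,
                                uint32_t target_pc) {
  if (!taken)
    return b->not_taken;
  if (b->taken)
    return b->taken;                       /* direct branch: precompiled */
  for (size_t i = 0; i < b->n_indirect; i++)
    if (b->indirect_pc[i] == target_pc)    /* known indirect candidate */
      return b->indirect_tgt[i];
  return NULL;                             /* unknown target or trap */
}
```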