Notes on development of the next CEN64 core.

Discuss topics related to development here.
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Notes on development of the next CEN64 core.

Post by MarathonMan » Sun Sep 06, 2015 10:52 am

So, as I've mentioned in the past, I went down the rabbit hole for a little bit to fundamentally reconsider the design of CEN64. I'm getting closer and closer to being able to pop my head out - here's a design document/notes that I've made for myself so far that some developers may appreciate! I've designed just about all of the system that I discuss so far... it's mostly just a matter of answering the remaining questions before really going to town on this thing.

Code: Select all

----------------
  Introduction
----------------

So, at this time, I've effectively written two CEN64 interpreter cores.
Unfortunately, they're simply not fast enough, even with crazy amounts of
high-end hardware.

To address this issue with the third (and hopefully, final) write of CEN64,
I've researched two possible solutions that should allow CEN64 to run at full
speed on modest hardware:

  * Run multiple interpreters in parallel (one for each pipeline, processor,
    etc.) and design them such that each one can "commit" and "rollback" a
    simulated cycle - effectively, a transactional memory-like approach.

    Early analysis of this kind of approach demonstrated a prohibitively high
    transactional "abort" rate. I found that, due to the N64's architecture,
    there was a lot of aborting and stalling: components access RDRAM very
    frequently, raise interrupts often, etc. And, since every piece of
    the system is cycle-accurate, commit logs are incredibly expensive to
    maintain due to all the latch and pipeline state that has to be
    tracked.

    So, I'm more or less convinced that a transactional-style system that is
    capable of leveraging multicore processing isn't the answer.

  * Leverage dynamically-recompiled blocks. On a Haswell system, it's not
    uncommon to see 3-3.5 IPC when the RDP isn't running. Unfortunately, a lot
    of the instructions that are being executed are very, very predictable
    conditional branches (but cannot be omitted as the cycle-accurate pipelines
    still need to catch the uncommon case)!

    So, what one could do is to emit and somehow link together small, optimized
    cycle-accurate pipeline models that have many conditional checks omitted.
    Accuracy is still maintained because oftentimes, it can be determined that
    many of the conditional checks are not necessary, depending on the program
    counter or some other state of the system. Other common examples of checks
    that can be omitted in the VR4300 pipeline alone: the data cache (DC) stage
    can simply be made to forward the contents of the latches when the prior
    cycle was not a memory instruction. The execute (EX) stage does not have to
    check whether a COP (coprocessor unusable) exception should be raised, or
    if FPU registers need to be accessed, when the instruction is an ordinary
    integer instruction. Virtual address regions (uncached, cached and mapped,
    etc.) can be determined ONCE depending on the virtual address of the block.
    These are just a handful of examples where checks can be optimized out
    simply by using an initial pass that analyzes the state of the system.

    And so, this is the route that I've decided to take with the third write
    of the CEN64 core.
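To make the check-omission idea concrete, here's a minimal C sketch (all names are invented for illustration, not CEN64 code) of the DC-stage example above: the generic stage must test for a memory access every cycle, while the specialized variant that an analysis pass would select simply forwards the latch contents.

```c
#include <stdint.h>

/* Hypothetical EX/DC latch: holds the value flowing down the pipeline. */
struct dc_latch {
    uint64_t result;    /* value produced by EX (or loaded from memory) */
    int is_mem_op;      /* was the instruction in DC a load/store? */
};

/* Generic stage: must test for a memory access every single cycle. */
static uint64_t dc_stage_generic(const struct dc_latch *l,
                                 const uint64_t *ram, uint32_t addr) {
    if (l->is_mem_op)
        return ram[addr / 8];   /* simulate the data-cache access */
    return l->result;           /* common case: just forward */
}

/* Specialized stage: the analysis pass proved the prior instruction is
 * not a memory op, so the check (and its branch) disappears entirely. */
static uint64_t dc_stage_no_mem(const struct dc_latch *l) {
    return l->result;
}
```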

----------
  Design
----------

The heart and brains of the emulator run within a virtual-machine-like context.
The goal of the system is to call a thunk and remain inside the context as much
as possible, exiting only to compile or perform some activity that cannot be
performed within the context itself.

Working inside a context gives us free rein over the hardware registers and
enables us to effectively ignore the host's calling convention. This means
that, for example, on x86_64, we can keep the entirety of the RSP's accumulator
registers and flags in native hardware registers, and still have half of our
vector hardware registers to spare!

Dynamically recompiled blocks of code can quickly be allocated and deallocated
using a custom slab-based allocator. Although the allocator has a fixed-size
memory pool and probably has higher overhead than conventional memory allocation
algorithms, it's significantly faster than all libc malloc/free implementations
I've tried (10x faster than GNU libc). Moreover, there is no need to worry about
marking pages as executable for every malloc or allocation, since the allocator
reuses the same set of pages (which remain executable through the execution of
the emulator). Lastly, it coerces the system into using large pages so as to
reduce the number of page faults, even for large amounts of dynamically recompiled
code.
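As a rough illustration of the allocator described above (a toy sketch with invented sizes, not the actual CEN64 allocator), a fixed-pool slab can be as simple as a free list threaded through the blocks themselves. A real version would mmap() the arena once with execute permission so blocks never need re-marking.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy fixed-pool slab: carves BLOCK_SIZE chunks out of one arena and
 * recycles them through a free list.  Sizes are made up for the sketch. */
#define BLOCK_SIZE  256
#define NUM_BLOCKS  64

static uint8_t arena[BLOCK_SIZE * NUM_BLOCKS];
static void *free_list;
static int slab_ready;

static void slab_init(void) {
    /* Thread each block onto the free list: first word = next pointer. */
    for (int i = 0; i < NUM_BLOCKS; i++) {
        void **block = (void **) (arena + i * BLOCK_SIZE);
        *block = free_list;
        free_list = block;
    }
    slab_ready = 1;
}

static void *slab_alloc(void) {
    if (!slab_ready) slab_init();
    void **block = free_list;
    if (!block) return NULL;        /* pool exhausted */
    free_list = *block;
    return block;
}

static void slab_free(void *p) {
    *(void **) p = free_list;       /* push the block back on the list */
    free_list = p;
}
```

Allocation and deallocation are both a couple of pointer moves, which is where the speedup over a general-purpose malloc comes from.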

With an execution environment and allocator in place, the design questions that
remain are really the most interesting of the bunch: how does one efficiently
dynamically compile optimized cycle-accurate models (and link them together) by
some means?

To efficiently compile blocks, my current approach relies on the use of several
"templated" cycle accurate models with a hole in the middle of the model for
emulating the execution logic (an FPU add, an integer multiply, etc.). In this
way, optimizations of the templates can be focused on outside of runtime, and
assembly of a model at runtime is very efficient since it really only involves
selection of some templates and data movement. The selection is done simply
by leaving the virtualized context for a brief period, taking the current state of
the system and feeding it to a selection algorithm.
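A minimal sketch of the template-with-a-hole approach (struct layout and names are hypothetical): assembling a model at runtime is just a template copy plus patching the hole with the execute logic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pre-built cycle model: a blob of host code with a "hole"
 * at a known offset where the execute logic (FPU add, integer multiply,
 * ...) gets spliced in at runtime. */
struct cycle_template {
    const uint8_t *code;   /* pre-assembled, pre-optimized bytes */
    size_t size;
    size_t hole_offset;    /* where the execute logic goes */
    size_t hole_size;
};

/* Assembling a model is just template selection plus data movement. */
static size_t emit_model(uint8_t *out,
                         const struct cycle_template *t,
                         const uint8_t *exec_logic) {
    memcpy(out, t->code, t->size);                       /* the template */
    memcpy(out + t->hole_offset, exec_logic, t->hole_size); /* the hole */
    return t->size;
}
```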

With most questions of compilation itself sorted out (if you can even call it
that!), the only real question remaining is this: how does one link together
these models at runtime? Since each model has several cycles of 'assumptions'
baked into it, branching backward and forward really throws a wrench into the
mix, because we need to do something until the pipeline is "primed" and we can
start executing our models again.

One option is to add additional checks to each model (to make them more
generic), but that effectively cancels out the potential gain of the system,
so I'd rather avoid that if possible.

Another option is to compile and store paths for both branch directions along
with the simulated model for that cycle. Indirect branches will be a little
more cumbersome, but will still work as long as there are only a few potential
candidates for branching. In the event that this doesn't end up being the case,
or we have to start emulating a hardware trap/exception, we can run a generic
interpreter for a few cycles/instructions, and then jump back into the compiled
code.
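The linking scheme described above might look something like this in C (a sketch with invented names): each compiled block carries pre-resolved successors for both branch directions, with NULL standing in for "drop to the generic interpreter for a few cycles, then re-enter".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compiled-block descriptor: successors for both branch
 * directions are stored alongside the model for that cycle. */
struct block {
    uint32_t pc;
    struct block *taken;      /* successor when the branch is taken */
    struct block *not_taken;  /* fall-through successor */
};

enum next { NEXT_TAKEN, NEXT_NOT_TAKEN, NEXT_INTERPRETER };

/* Resolve the next block; NULL means "run the generic interpreter"
 * (indirect-branch miss, trap/exception, etc.), then jump back in. */
static struct block *link_next(struct block *b, enum next outcome) {
    switch (outcome) {
    case NEXT_TAKEN:     return b->taken;
    case NEXT_NOT_TAKEN: return b->not_taken;
    default:             return NULL;
    }
}
```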

tony971
Posts: 12
Joined: Sun Feb 01, 2015 1:02 pm

Re: Notes on development of the next CEN64 core.

Post by tony971 » Wed Sep 09, 2015 11:53 am

Not sure if this is useful but Fiora is pretty amazing at pipeline optimization. https://www.reddit.com/r/emulation/comm ... re/cuvceyp

siggie
Posts: 5
Joined: Mon Nov 04, 2013 2:23 pm

Re: Notes on development of the next CEN64 core.

Post by siggie » Wed Sep 09, 2015 12:20 pm

Yeah, she gave Dolphin's JIT compiler a massive speed boost last year.

MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Wed Sep 09, 2015 1:46 pm

tony971 wrote:Not sure if this is useful but Fiora is pretty amazing at pipeline optimization. https://www.reddit.com/r/emulation/comm ... re/cuvceyp
Thanks, saw it - I'm already sub'd to /r/emulation. :-)

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Thu Sep 17, 2015 10:03 am

My humble feedback on this:
* Run multiple interpreters in parallel (one for each pipeline, processor,
etc.) and design them such that each one can "commit" and "rollback" a
simulated cycle - effectively, a transactional memory-like approach.
Mmmmh, very interesting. Tell me if I'm wrong, but transactional memory in a multi-threaded context implies synchronization. How would you handle this while keeping performance?
So, I'm more or less convinced that a transactional-style system that is
capable of leveraging multicore processing isn't the answer.
Ok, so you say you DON'T want to do that. lol
So, what one could do is to emit and somehow link together small, optimized
cycle-accurate pipeline models that have many conditional checks omitted.
Seems to be the end of the <90KB build. :D

I like the idea though. Do you have any examples of expensive conditions we could skip, and in which cases? Is it impossible to start from the current CEN64 core to implement this?
Accuracy is still maintained because oftentimes, it can be determined that
many of the conditional checks are not necessary, depending on the program
counter or some other state of the system.
Any approach in mind to handle the "expensive conditions" array? Will you use a map?

Like a bit field: 0100111, each bit representing a particular condition that needs to be checked (1) or not (0), and you generate your pipeline "queue" from it?
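For what it's worth, that bit-field idea might look like this in C (the check names and which checks each model keeps are invented for illustration):

```c
#include <stdint.h>

/* Each bit marks one conditional check a given cycle model still has
 * to perform; a zero mask means the model has no checks left at all. */
enum check_bits {
    CHECK_DCACHE_ACCESS = 1u << 0,
    CHECK_COP_USABLE    = 1u << 1,
    CHECK_FPU_REGS      = 1u << 2,
    CHECK_TLB_MAPPED    = 1u << 3,
};

/* A model for an ordinary cached integer instruction keeps no checks. */
static uint32_t checks_for_integer_op(void) {
    return 0;
}

/* A COP1 load in mapped space keeps all four of them. */
static uint32_t checks_for_cop1_load(void) {
    return CHECK_DCACHE_ACCESS | CHECK_COP_USABLE |
           CHECK_FPU_REGS | CHECK_TLB_MAPPED;
}

static int needs_check(uint32_t mask, enum check_bits bit) {
    return (mask & bit) != 0;
}
```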
Other common examples of checks
that can be omitted in the VR4300 pipeline alone: the data cache (DC) stage
can simply be made to forward the contents of the latches when the prior
cycle was not a memory instruction. The execute (EX) stage does not have to
check whether a COP (coprocessor unusable) exception should be raised, or
if FPU registers need to be accessed, when the instruction is an ordinary
integer instruction.
In this particular example, you would disable EX and COP emulation and, because the behavior is predictable, manually update the EX and COP registers?
And so, this is the route that I've decided to take with the third write
of the CEN64 core.
Look like a very good one.

About linking (I'm not sure I totally understand): if your pipeline contains a limited number of stages (DC, EX, COP), you would have something like this:

Code: Select all

|id|  DataCache (DC)   |  |id|         EX        |  |id|        COP        |
| 0|<full execution>   |  | 0|<full execution>   |  | 0|<full execution>   |
| 1| compiled block 1  |  | 1| compiled block 1  |  | 1| compiled block 1  |
| 2| compiled block 2  |  | 2| compiled block 2  |  | 2| compiled block 2  |
| 3| compiled block... |  | 3| compiled block... |  | 3| compiled block... |
This way, a "default" pipeline model (the current one) is 0,0,0 and you would have an "optimized" one that could be 3,2,3, each "compiled block" assuming a certain number of things.

Each of those "paths" (3,2,3) can be mapped to a certain number of initial states:
If the instruction queue is blah, register X is blah, register Y is blah, go for 1,3,2. I have no idea how registers are stored in CEN64, but if registers are aligned, it can be as simple as a checksum:

Code: Select all

hash;
ptr = INSTRUCTION_POSITION;
hash.digest(ptr);
ptr += REGISTER_INSTRUCTION_TO_X;
hash.digest(ptr);
ptr += REGISTER_X_TO_Y;
hash.digest(ptr);

map[hash.to_int()] = tuple(1,3,2);
Something like this.
One option is to add additional checks to make each model (to make them more
generic), but that effectively cancels out the potential gain of the system,
so I'd rather avoid that if possible.
I suggest not doing that. Build models that assume things and, at execution time, just ensure those assumptions hold before running the model.

Not sure if everything I said makes sense. :)

MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Thu Sep 17, 2015 10:44 am

Narann wrote:I like the idea though. Have you any example of some expensive conditions we could skip and in which case? Is it impossible to start from the current Cen64 core to implement this?
Well, the current core has to check everything, every cycle, without exception. See the "other example of checks" that I wrote about for some that are very uncommon. In addition to that list, another "uncommon" one is determining which register file to read the source registers from (when forwarding is not used) - does RS/RT reference the integer RF, or is it VS/VT referencing the COP1 RF? Most of the time, we are executing an integer instruction!
Narann wrote:Any approach to handle "expensive conditions array"? Will you use a map?
Basically arrays of pre-generated assembly code, one for each "kind" of cycle. Size shouldn't be a problem because most of the models or "kinds of cycles" will have assumptions removed and be shorter as a result. If one case is too large, then I can just use a more-generic one in place of it.

As for selecting which array: when an (uncompiled) cycle is first encountered, an "interpreter"-like cycle will run that also notes all of the assumptions and selects the correct code path at the end of the cycle (using a regular index, I suppose). No map or anything like that should be necessary.
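That selection step might look like this in C (a sketch; the observed assumptions and their encoding are invented): the interpreter's notes pack directly into an array index, so no map is needed.

```c
#include <stdint.h>

/* What the "first-encounter" interpreted cycle records as it runs. */
struct cycle_notes {
    int was_mem_op;
    int used_cop1;
    int addr_mapped;
};

#define NUM_MODELS 8

/* Pack the observations into a plain array index into the
 * pre-generated model array - a regular index, not a map. */
static unsigned select_model(const struct cycle_notes *n) {
    unsigned idx = (unsigned) n->was_mem_op
                 | (unsigned) n->used_cop1   << 1
                 | (unsigned) n->addr_mapped << 2;
    return idx;   /* 0..NUM_MODELS-1 */
}
```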
Narann wrote:In this particular example, you would disable EX and COP emulation and, because the behavior is predictive, manually update EX and COP registers?

Most of the time, COP registers don't need to be updated at all -- the only real exception being COP0::Count. EX logic would work as normal, but the fact that an exception doesn't need to be checked for saves the CPU a good number of instructions of potential work (otherwise, it has to check whether COP0::Status has the usable bit set for the relevant coprocessor, which mode the CPU is in according to COP0::Status, etc.).
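For reference, the check being skipped here is roughly the following (a sketch using the MIPS Status CU0..CU3 bit layout, bits 28..31; the kernel-mode override that makes COP0 always usable is ignored for brevity):

```c
#include <stdint.h>

/* CU0..CU3 coprocessor-usable bits live at Status bits 28..31. */
#define STATUS_CU_SHIFT 28

/* Returns nonzero when a Coprocessor Unusable exception should be
 * raised for the given coprocessor.  This is the per-cycle test that
 * specialized integer-instruction models get to omit entirely. */
static int cop_unusable(uint32_t cop0_status, unsigned coprocessor) {
    uint32_t usable = (cop0_status >> (STATUS_CU_SHIFT + coprocessor)) & 1;
    return !usable;
}
```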
Narann wrote:About linking (I'm not very sure to totally understand), if your pipeline contain a limited number of stage (DC, EX, COP) you would have something like this:

Code: Select all

|id|  DataCache (DC)   |  |id|         EX        |  |id|        COP        |
| 0|<full execution>   |  | 0|<full execution>   |  | 0|<full execution>   |
| 1| compiled block 1  |  | 1| compiled block 1  |  | 1| compiled block 1  |
| 2| compiled block 2  |  | 2| compiled block 2  |  | 2| compiled block 2  |
| 3| compiled block... |  | 3| compiled block... |  | 3| compiled block... |
This way, a "default" pipeline model (the current one) is 0,0,0 and you would have and an "optimized" one that could be 3,2,3, each "compiled block" assuming a certain amount of things.

Each of those "path" (3,2,3) can be mapped to a certain amount of initial states:
If instruction queue is blah, register X is blah, register Y is blah, go for 1,3,2. I have no idea how registers are stored in Cen64 but if registers are aligned, it can be as simple as a checksum:

Code: Select all

hash;
ptr = INSTRUCTION_POSITION;
hash.digest(ptr);
ptr += REGISTER_INSTRUCTION_TO_X;
hash.digest(ptr);
ptr += REGISTER_X_TO_Y;
hash.digest(ptr);

map[hash.to_int()] = tuple(1,3,2);
Something like this.
Unfortunately, the pipeline will always have 5 stages! And most of the time, they do useful work.

If I were to separate out a function for each part of each stage (I think this is what you mean), there would be a lot of branching (unless the branches are predictable... but I don't think they will be). That would add a lot of performance overhead to stages like WB, which normally just copies the result from the latch into the RF using the index given by the latch. Only very infrequently does it "slip" because of a COP0 restriction.

If you think about it, though, this is what I'm doing at the component level: instead of a tuple being used to express the kind of cycle within a component, the tuple is used to express the state of the current component (RSP, VR4300, etc.).

I hope that makes sense, because I definitely didn't word that very well. :)

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Thu Sep 17, 2015 1:30 pm

MarathonMan wrote:where instead of a tuple being used to express the kind of cycle within a component, the tuple is used to express the state of the current component (RSP, VR4300, etc.).
This sentence is meaningful! :)

Thanks!

asiga
Posts: 24
Joined: Fri May 30, 2014 5:35 pm

Re: Notes on development of the next CEN64 core.

Post by asiga » Sat Oct 10, 2015 11:19 am

It looks like a good approach for a good performance increase, MarathonMan. But it looks quite complicated to get it working. I'm just thinking this: maybe it would be easier to implement full recompilation of the whole ROM before running a game, doing this recompilation in a way that the code returns to the idle loop each time a cycle ends (or every N cycles, depending on the cycle granularity that you're using, which I don't know). The only times when the recompiler would be invoked would be when loading a new ROM and when a game loads microcode into the RDP.

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Sat Oct 10, 2015 11:41 am

asiga wrote:maybe it would be easier to implement full recompilation of the whole ROM before running a game, doing this recompilation in a way that the code returns to the idle loop each time a cycle ends
I wonder how much time it would take the compiler to recompile the whole ROM.

But I like the idea. It would simplify the whole design: One code to recompile, one code to run recompiled code.

MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Sat Oct 10, 2015 12:34 pm

The problem with a static compiler is that it would only work for certain games. A lot of carts can, and do, move code around, decompress code, etc. ... this is why it has to be done dynamically. If it weren't for that, I'd totally agree that ahead-of-time compilation would be the way to go.

MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Sun Jun 26, 2016 11:26 am

I've told a few people about where I'm planning to go with the project, but I still get someone who asks me what's up every once in a while. So, I'll just leave this here so I don't have to repeat myself. :D

Most of my time lately has gone into a new concept (more on this later). In the meantime, I've been spending some time on and off optimizing the current core to see how much I could squeeze out of it. In doing so, I've completely made up my mind: an interpreter-based CEN64 is simply intractable as far as achieving realtime performance goes -- even on very high-end hardware. And mind you, this is with an instruction-level accurate RDP! Please, someone prove me wrong! ;)

I've tried quite a few things to address this:
  • I had some (~5% VI/s increase) success in vectorizing the RDP by taking advantage of locations where RGBA or STWZ pixels and spans are processed in parallel, etc.
  • I split the RCP and VR4300 into two threads, while sacrificing some accuracy in the process. Visually, this appears to have been a pretty big win and not a whole lot of accuracy is sacrificed. Still too slow, though.
  • I have tried splitting the RCP further, into an RSP and RDP thread and had limited success. Lots of titles, including Super Smash Bros., will run at 60VI/s -- but overall, there is a huge hit to accuracy and things in general just don't work.
  • I've profiled things to death. It's very evident that many ROMs will target one component of the system. So, if you want good coverage across the whole ROM set, everything (VR4300, RSP, RDP) needs to be really optimized.
Now going off that last bullet: in speaking to developers in the community, it seems that everyone else has come up with the same findings. Lots of exciting work is going on to address it. My personal approach will be to take the design I originally posted about in this thread and give it a twist.

I was kind of uncertain about some of the parts of the design as I began to move forward. At one point, it hit me: all of the CPUs (VR4300, RSP, RDP) inside the N64 use an in-order architecture. I realized that I could leverage this inherent property to my advantage as far as performance is concerned. The idea is this: when a CPU stalls, the pipeline freezes until a condition is met (whether it be waiting on an RDRAM access, or for that multicycle FPU operation to complete). When this occurs, I can simply use coroutines to eject from the current location and branch to the next JIT block. On the next cycle for the stalled component, instead of re-entering a quasi-generic pipeline model like I do now, I can branch directly back into the spot (rather, into the coroutine) where the stall occurred and perform an extremely lightweight check (and continue if needed).
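The eject-and-resume idea can be sketched as a hand-rolled coroutine in C (everything here - names, latency, structure - is invented for illustration): the component records its stall site and re-enters there with a single lightweight check, instead of re-running a generic pipeline model.

```c
#include <stdint.h>

/* A component remembers where it stalled and resumes there. */
struct component {
    int resume_point;       /* 0 = start of block, 1 = the stall site */
    uint32_t wait_cycles;   /* e.g. remaining simulated RDRAM latency */
    uint32_t work_done;     /* cycles of useful work completed */
};

/* Returns 1 when the block ran to completion, 0 when it ejected. */
static int run_block(struct component *c) {
    switch (c->resume_point) {
    case 0:
        c->work_done++;              /* cycles before the memory access */
        c->wait_cycles = 3;          /* the access stalls the pipeline */
        /* fall through */
    case 1:
        if (c->wait_cycles) {        /* the extremely lightweight check */
            c->wait_cycles--;
            c->resume_point = 1;     /* eject; re-enter right here */
            return 0;
        }
        c->work_done++;              /* cycles after the stall resolves */
        c->resume_point = 0;
        return 1;
    }
    return 1;
}
```

A real implementation would eject into the next component's JIT block rather than returning, but the control flow is the same shape.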

All this gets better, though. Since CEN64 will eventually have its own VM, I effectively control the calling convention and can play all sorts of games with the register allocator to "teach" it about these eject points. I can also do things like statically reserve some registers (say, 8 XMM registers on x86_64 for RSP accumulators and clips), so that there is a very minimal amount of spilling and reloading incurred at many of these points. I've also thought about statically reserving one or two registers per component, so that each CPU core has one or two registers devoted to it, to further reduce spills and reloads. The opportunities are endless.

I think with all these ideas mashed together, a cycle-accurate dynamic recompiler should be tractable (however ugly).

I have begun writing a compiler to move forward with all these ideas. The compiler will compile the initial interpreter, and its guts will double as a JIT infrastructure. The interpreter will profile for hot sections of code and flag them for compilation. The JIT compilers (running in separate threads, so as not to impact emulation) will pick up these hints and compile code for the hot spots.
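The profiling side of that might be sketched like this in C (threshold, table size, and names are all invented): bump a counter per interpreted block, and flag the block for the JIT threads once it crosses a threshold.

```c
#include <stdint.h>

#define HOT_THRESHOLD 50
#define TABLE_SIZE    1024

static uint32_t exec_count[TABLE_SIZE];
static uint8_t  flagged[TABLE_SIZE];

/* Called once per interpreted block; returns 1 the moment it flags the
 * block as hot (a JIT worker thread would then pick it up and compile
 * it, while the interpreter keeps running everything else). */
static int profile_block(uint32_t pc) {
    uint32_t slot = (pc >> 2) % TABLE_SIZE;
    if (flagged[slot])
        return 0;
    if (++exec_count[slot] >= HOT_THRESHOLD) {
        flagged[slot] = 1;   /* enqueue for a compiler thread here */
        return 1;
    }
    return 0;
}
```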

Hopefully, I'll get it right this time. I'm very excited and optimistic that this will finally give me the headroom I need to add all the checks cycle-accuracy requires while delivering 60VI/s.

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Sun Jun 26, 2016 9:42 pm

Thanks for keeping us posted. This looks quite ugly, as you say, but it seems very promising for efficiency. I think this has never been tried in the N64 scene.
MarathonMan wrote:The interpreter will profile for hot sections of code and flag them for compilation.
I wonder about the cost of this during gameplay. I wonder if it wouldn't be even better to allow the emulator to save (as a cache file) the already-known hot sections/recompiled sections, so it doesn't have to do that work again each time you restart a game.

I'm very excited by what you say. I suggest keeping all this information about the design around: the code will maybe be hard to grasp, so any hint on how the whole thing works is valuable for newcomers (or future contributors).

Keep up the good work! :D

MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Mon Jun 27, 2016 9:41 am

A JIT cache that persists between executions is something I've thought about eventually adding as an incremental improvement. In the meantime, small blobs of code that get run over and over - the RSP ucodes, main engine and libultra code, etc. - will immediately benefit even without a JIT cache that lives between runs.

Snowstorm64
Posts: 302
Joined: Sun Oct 20, 2013 8:22 pm

Re: Notes on development of the next CEN64 core.

Post by Snowstorm64 » Sun Jul 03, 2016 5:57 am

This sounds promising! But I wonder... what about the current core? Can it be recycled, or at least some parts of it? It would be sad to see all this work on the current core wasted...
OS: Debian GNU/Linux Jessie (8.0)
CPU: Intel i7 4770K @ 3.5 GHz
Build: AVX (compiled from git)

MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Sun Jul 03, 2016 10:35 am

Snowstorm64 wrote:This sounds promising! But I wonder... what about the actual core? Can it be recycled or at least are some parts of it? It would be sad to see all this work on the actual core wasted...
I'll be able to reuse most of it... I just have to transcribe it into the new language. I think the idea of what I have hasn't come through very clearly yet, so let me explain a little further.

The compiler inside CEN64 will be very different from anything you'd typically see in an HLE emulator. Most HLE compilers are not compilers at all - they're just binary translators. They take an input language (MIPS binary) and convert it into an output language (x86 binary). They often use neat tricks like register caching during the conversion process to make the output code cleaner.

CEN64, on the other hand, will feature a full-blown, optimizing, source-to-binary compiler and managed runtime. There are a few reasons for this madness:
  • One being that, because of the nature of a compiler, it is likely that I will encounter a lot of bugs along the way, which makes writing a cycle-accurate emulator that much more difficult. I'm designing the CEN64 compiler so I can run it in a standalone mode and perform regression testing outside of the emulator. Especially as I begin developing passes and getting into more of the compiler backend, this will help to catch a lot of bugs and allow me to pinpoint things quicker since I can just run a regression suite as a git post-commit hook.
  • Secondly, and probably of more importance, is that I need to interleave simulation code alongside the translated MIPS instructions. The simulation code that will get weaved in with the instructions will vary greatly in size and complexity depending on what parts of the CPU pipeline the MIPS code is exercising. Trying to interleave simulation code alongside the output of a binary compiler would be nothing short of a complete disaster and certainly result in suboptimal code being generated. A compiler, on the other hand, can pragmatically take in all this input and optimize across all of it in a generic way. Once I spot additional opportunities for optimization, I don't need to dig around in the generated binary recompiler code... I can just write another compiler pass that further optimizes the graphs, perhaps taking some assumptions as input.
  • Lastly, looking forward, keeping the compiler quasi-generic and extensible means that there is a greater degree of reusability. There is nothing that ties the compiler to strictly N64 emulation (rather, all of the specific logic is deferred to compiler passes or the runtime). If and when the day comes, this could become a framework for either myself or others to build very fast and efficient cycle-accurate simulators.
Anyways, with all that out of the way, I can reveal a bit about the compiler as it stands today so you can see how translating from the C interpreters will be quite easy.

Here's a blob of useless sample code that gets compiled right now:

Code: Select all

(include rsp.cen)

(namespace rsp

(deobj cpu
  v128 vreg,
  u32 sreg
)

)

(defun none start()
  assign (5+1) to foo
  call foo(buz,1,(1<<2))
  if (buz) ()
  assign 6 to bar
)
That's it. The language has variables, function calls, binary expressions, conditional branching, and some niceties like comments and include directives. Nothing super fancy here. But if you look at vr4300/pipeline.c, you will see that, quite frankly, that's all that code is really doing. Adding anything else, such as pointers, would only serve to complicate the compilation and optimization process.

The compiler then digests that code into an internal IR:

Code: Select all

[Object, type=rsp:cpu, members=[
	v128 vreg
	u32 sreg
]]
[Function, name=:start, return=Nothing, args=[]
	Intrinsic [entry]
	Branch [cond=Always, target=0]
	Label [2]
	Intrinsic [panic]
	Branch [cond=Always, target=1]
	Label [0]
	Store [:foo <= Add[
		Integer [5]
		Integer [1]
	]]
	Call [:foo
		Load [:buz]
		Integer [1]
		LShift[
			Integer [1]
			Integer [2]
		]
	]
	Branch [cond=???, target=3]
	Label [3]
	Store [:bar <= Integer [6]]
	Branch [cond=Always, target=2]
	Label [1]
	Intrinsic [exit]
]
The compiler will then convert the graph to SSA form (not yet started) and run semantic analysis and optimization passes on the graph.

Code: Select all

Running pass: List build (Builds the compiler's internal function/object/label lists)
Running pass: Insert nodes into ::start (Inserts entry/exit nodes into ::start)
Running pass: Type checker (Checks for correct type use within the graph)
Running pass: Object analysis (Computes field alignment, object size, etc.)
Object: rsp:cpu, alignment = 16, size = 20
[Object, type=rsp:cpu, members=[
	v128 vreg
	u32 sreg
]]
It then lowers the graph to x86_64 binary code optimized for your CPU. The days of separate SSE2/SSSE3/SSE4.1/AVX/Native builds are gone... there will be only one portable binary that is capable of generating optimized code for the host CPU. :D

Code: Select all

   0:   48 89 a7 00 01 00 00    mov    QWORD PTR [rdi+0x100],rsp
   7:   48 8d a7 00 01 00 00    lea    rsp,[rdi+0x100]
   e:   53                      push   rbx
   f:   55                      push   rbp
  10:   41 54                   push   r12
  12:   41 55                   push   r13
  14:   41 56                   push   r14
  16:   41 57                   push   r15
  18:   48 89 fd                mov    rbp,rdi
  1b:   66 0f ef c0             pxor   xmm0,xmm0
  1f:   48 31 c0                xor    rax,rax
  22:   66 0f ef c9             pxor   xmm1,xmm1
  26:   48 31 c9                xor    rcx,rcx
  29:   66 0f ef d2             pxor   xmm2,xmm2
  2d:   48 31 d2                xor    rdx,rdx
  30:   66 0f ef db             pxor   xmm3,xmm3
  34:   48 31 db                xor    rbx,rbx
  ...
So hopefully now, you can see that I plan to reuse almost all of the logic in the interpreter cores. I just have to transcribe it into the new language.

After I get the interpreters running, I can then use them to actively profile for hot sections of MIPS code and flag them for compilation. A JIT thread will pick up these hints and compile very optimized code for those segments using the existing compiler infrastructure (the interpreters will continue to be used for everything else - no need to compile the world).

The ideas only go more and more crazy from there.

Snowstorm64
Posts: 302
Joined: Sun Oct 20, 2013 8:22 pm

Re: Notes on development of the next CEN64 core.

Post by Snowstorm64 » Sun Jul 03, 2016 3:20 pm

After reading your post multiple times, I think that's a genius way to achieve accurate emulation while delivering full speed. The only thing that concerns me is that it looks too awesome to be true... are there any downsides (other than translating the code into the new language)?! :P
I hope you'll finally manage to make the dream of the perfect emulation come true this time. ;)

Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Sun Jul 03, 2016 3:28 pm

This promise is mind blowing!!! :o

What looks even more impressive is that it could bring very good performance to cycle emulation of older consoles (I'm thinking about embedded systems, where cycle emulation could become a reality).

This is a huge project you've got there, one that could change the face of CAE (Cycle Accurate Emulation, let's create a new acronym).

This also means someone with low-level skills could (if you teach them) detect hot spots and investigate optimizations to improve overall performance.

Keep the good job, it looks awesome!!

User avatar
Nacho
Posts: 66
Joined: Thu Nov 07, 2013 9:25 am

Re: Notes on development of the next CEN64 core.

Post by Nacho » Sun Jul 03, 2016 4:58 pm

So, if I understood correctly, you're creating the ultimate emulator? :O Basically, a cycle-accurate JIT?
Testing CEN64 on: Intel Core i5 520M 2.4 GHz. SSE2 SSE3 SSE4.1 SSE4.2 SSSE3, but no AVX. Ubuntu Linux

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Sun Jul 03, 2016 5:36 pm

Thank you for the kind words everyone!
Nacho wrote:Basically, a cycle accurate JIT?
Narann wrote:One that could change the face of CAE (Cycle Accurate Emulation, let's create a new acronym).
Yes, cycle-accuracy for all! :mrgreen: I'm not certain that it will have enough oomph for embedded use cases (at least N64 - I could see SNES being a thing), but time will tell.
Snowstorm64 wrote:are there any downsides (other than translating the code in the new language)?! :P
Sure, I think there are some.

Firstly, I'm not going to beat around the bush: the language wasn't designed around elegance -- it's a really ugly language at best and has very limited use cases. One of the big reasons for why we emulate things is preservation, and it kinda sucks that something like this will be used to preserve what the hardware does.

Secondly, the design implies some kind of 'lock-in' to an environment that can support all the requirements of the compiler and runtime. Any architecture or OS that wishes to benefit from the emulation must port everything (in the case of a different architecture, this means writing a new backend). This may not seem like a huge deal from the outset, but it does prevent the emulator from running on things like iOS (which does not permit programmers to acquire executable pages).

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Sun Jul 03, 2016 10:37 pm

MarathonMan wrote:One of the big reasons for why we emulate things is preservation,
I totally agree with that; experience proves that emulators are here to stay.
MarathonMan wrote:and it kinda sucks that something like this will be used to preserve what the hardware does.
But what other options do you have? It can be ugly as long as it's simple and properly documented. The point is not to create a C++ killer.
MarathonMan wrote:iOS (which does not permit programmers to acquire executable pages).
Is it a permissions limitation or an OS limitation? I mean, if you root it, can you do it? And if so, would you break something? If yes, then you shouldn't bother with software-limited hardware. Low-level stuff like this is never "nice" (is there any emulator dynarec that works on more than one architecture and is nicely coded?). From what I've seen, documentation (comments, a big-picture overview) is often the only way to go. Try to keep the overall code clean and confine the necessarily dirty parts to specific, separate locations.

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Mon Jul 04, 2016 8:45 am

Narann wrote:Is it a permissions limitation or an OS limitation? I mean, if you root it, can you do it? And if so, would you break something?
If you root it, you can get around it, I believe. That was just an example, though :)

User avatar
Snowstorm64
Posts: 302
Joined: Sun Oct 20, 2013 8:22 pm

Re: Notes on development of the next CEN64 core.

Post by Snowstorm64 » Mon Jul 04, 2016 10:54 am

MarathonMan wrote: Sure, I think there are some.

Firstly, I'm not going to beat around the bush: the language wasn't designed around elegance -- it's a really ugly language at best and has very limited use cases. One of the big reasons for why we emulate things is preservation, and it kinda sucks that something like this will be used to preserve what the hardware does.

Secondly, the design implies some kind of 'lock-in' to an environment that can support all the requirements of the compiler and runtime. Any architecture or OS that wishes to benefit from the emulation must port everything (in the case of a different architecture, this means writing a new backend). This may not seem like a huge deal from the outset, but it does prevent the emulator from running on things like iOS (which does not permit programmers to acquire executable pages).
If I understood what you mean, then... well, to be fair, CEN64's current code isn't especially readable or portable either, because of the heavy use of SSE/AVX intrinsics. Theoretically, ANSI C-compliant code would be ideal (and would also help portability!), but since we need cycle-accurate emulation at an acceptable speed, the intrinsics are a necessary evil. So if writing the new core as you've just described is what it takes to deliver perfect CAE at good speed... just do it! Maybe one day we'll have computers powerful enough to do perfect CAE without any of those compromises, but for now let's preserve everything before all the N64 consoles begin to die. ;)
OS: Debian GNU/Linux Jessie (8.0)
CPU: Intel i7 4770K @ 3.5 GHz
Build: AVX (compiled from git)

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Mon Jul 04, 2016 12:02 pm

Snowstorm64 wrote:before all the N64 consoles begin to die. ;)
N64s were made with Grade A Nintendium though...

http://orig06.deviantart.net/927c/f/201 ... 6879v5.png

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Mon Jul 04, 2016 2:57 pm

MarathonMan wrote:N64s were made with Grade A Nintendium though...
N64 PCB will die before the case... :mrgreen:

User avatar
Nintendo Maniac 64
Posts: 185
Joined: Fri Oct 04, 2013 11:37 pm

Re: Notes on development of the next CEN64 core.

Post by Nintendo Maniac 64 » Mon Jul 04, 2016 5:39 pm

Narann wrote:
MarathonMan wrote:N64s were made with Grade A Nintendium though...
N64 PCB will die before the case... :mrgreen:
I can attest to that via the most modern use of Grade A Nintendium, the Wii remote: I had one a couple of years ago that seems to have randomly died on me even though the remote itself looks perfectly fine (I opened it up and everything; it looks pristine).
CEN64 Forum's resident straight-male kuutsundere
(just "tsundere" makes people think of "Shana clones" *shivers*)

CPU+iGPU: Pentium G3258 @ 4.6GHz/1.281v
dGPU: Radeon HD5870 1GB
RAM: Vengeance 1600 4x4GB
OS: Windows 7

User avatar
The Extremist
Posts: 29
Joined: Sun Nov 03, 2013 6:11 pm
Location: Canadian Prairie

Re: Notes on development of the next CEN64 core.

Post by The Extremist » Fri Jul 08, 2016 7:25 am

From what I can gather, "Nintendium" is actually ABS plastic. Same stuff as Lego, though different moulding temperatures and additives give it different properties.

User avatar
wareya
Posts: 16
Joined: Tue May 19, 2015 5:44 pm

Re: Notes on development of the next CEN64 core.

Post by wareya » Fri Jul 08, 2016 4:18 pm

ABS and PBT are both extremely common. PBT is considered a luxury in computer peripherals but it's supposed to be more brittle than ABS, which is a bad thing for taking blunt damage.

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Fri Oct 21, 2016 9:13 am

Whenever I pick this up, I always end up doing something other than what I planned. Ugh.

Most recently, I have bolstered the semantic analysis, which is quite boring, if I do say so myself. There is now type checking (TODO: implicit casting), improved variable parsing and handling, and more. Of course, because I love to over-optimize things, the memory requirements have been further reduced and the compiler itself got a little speed boost.

Hopefully there's just one last minor thing blocking me from really getting into the meat of SSA construction. After SSA construction is finished, I should have enough pieces constructed to start compiling and executing generic programs. After that, it's just optimization passes and further improvements to semantic analysis (probably along with new language constructs at some point in time...)

Code: Select all

#(include rsp.cen)

(namespace rsp

(deobj cpu
  v128 vreg,
  u32 sreg
)

(deobj nested
  u32 poopie,
  rsp:cpu cpuobj
)

)

(defun none foo(i32 asdf)
)

(defun none start()
  devar zbaz as i32
  devar nonshadowedfoo as i32
  assign (5+1) to :zbaz
  call foo(zbaz) #,1,(1&&2))
  if ((nonshadowedfoo && zbaz) || (1<=2)) (
    devar innerscope as u16
    if (call inner(1,2)) (
      assign 1 to innerscope
    ) elseif (zbaz) (
      assign 2 to nonshadowedfoo
    ) else (
      # yadayadya
      // comments
      /*whoooo
      hoo*/
    )
    assign 69 to zbaz
  )
  assign 6 to zbaz
)

Code: Select all

Running pass: List build (Builds the compiler's internal function/object/label lists)
Object: rsp:cpu, alignment = 16, size = 20
Object: rsp:nested, alignment = 16, size = 36
Running pass: Insert nodes into ::start (Inserts entry/exit nodes into ::start)
Running pass: Semantic analysis (Checks for correct semantics and resolves variables)
Running pass: SSA conversion (Converts the graph to SSA form)
[Object, type=rsp:cpu, members=[
	v128 vreg
	u32 sreg
]]
[Object, type=rsp:nested, members=[
	u32 poopie
	rsp:cpu cpuobj
]]
[Function, name=:foo, return=Nothing, args=[asdf]
]
[Function, name=:start, return=Nothing, args=[]
	Intrinsic [entry]
	Branch [type=Always, target=0]
	Label [2]
	Intrinsic [panic]
	Branch [type=Always, target=1]
	Label [0]
	Variable [type = i32, name = zbaz]
	Variable [type = i32, name = nonshadowedfoo]
	Store [zbaz <= Add[
		Integer [5]
		Integer [1]
	]]
	Call [:foo
		Load [zbaz]
	]
	Branch [type=Conditional, target=4, condition=Not[LogicalOr[
		LogicalAnd[
			NotEqual[
				Load [nonshadowedfoo]
				Integer [0]
			]
			NotEqual[
				Load [zbaz]
				Integer [0]
			]
		]
		LessOrEqual[
			Integer [1]
			Integer [2]
		]
	]]]
	Variable [type = u16, name = innerscope]
	Branch [type=Conditional, target=6, condition=Not[NotEqual[
		Call [:inner
			Integer [1]
			Integer [2]
		]
		Integer [0]
	]]]
	Store [innerscope <= Integer [1]]
	Branch [type=Always, target=5]
	Label [6]
	Branch [type=Conditional, target=7, condition=Not[NotEqual[
		Load [zbaz]
		Integer [0]
	]]]
	Store [nonshadowedfoo <= Integer [2]]
	Branch [type=Always, target=5]
	Label [7]
	Label [5]
	Store [zbaz <= Integer [69]]
	Branch [type=Always, target=3]
	Label [4]
	Label [3]
	Store [zbaz <= Integer [6]]
	Branch [type=Always, target=2]
	Label [1]
	Intrinsic [exit]
]

Program size is: 349 bytes
Compiler arena usage statistics:
	current: 20480 bytes
	maximum: 28672 bytes

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Sat Apr 22, 2017 8:57 pm

Code: Select all

(defun none start()
  devar foo as i32

  if (1) (
    assign 1 to foo
  ) elseif (2) (
    assign 2 to foo
  ) else (
    assign 3 to foo
  )
  assign (foo + 1) to foo
)

---

Running pass: List build (Builds the compiler's internal block/function/object lists)
Running pass: Insert nodes into ::start (Inserts entry/exit nodes into ::start)
Running pass: Semantic analysis (Checks for correct semantics and resolves variables)
Running pass: SSA conversion (Converts the graph to SSA form)
[Function, name=:start, return=Nothing, args=[]
Basic Block [0] - preds [], succs: [3]
	Intrinsic [entry]
	Branch [type=Always, target=3]
Basic Block [2] - preds [9], succs: [1]
	Intrinsic [panic]
	Branch [type=Always, target=1]
Basic Block [1] - preds [2], succs: []
	Intrinsic [exit]
Basic Block [3] - preds [0], succs: [6,5]
	Variable [type = i32, name = foo]
	Branch [type=Conditional, target=6(T)/5(NT), end=4, condition=NotEqual[
		Integer [1]
		Integer [0]
	]]
Basic Block [6] - preds [3], succs: [4]
	Variable [type = i32, name = foo@5]
	Store [foo@5 <= Integer [1]]
	Branch [type=Always, target=4]
Basic Block [5] - preds [3], succs: [8,7]
	Branch [type=Conditional, target=8(T)/7(NT), end=4, condition=NotEqual[
		Integer [2]
		Integer [0]
	]]
Basic Block [8] - preds [5], succs: [4]
	Variable [type = i32, name = foo@4]
	Store [foo@4 <= Integer [2]]
	Branch [type=Always, target=4]
Basic Block [7] - preds [5], succs: [4]
	Variable [type = i32, name = foo@3]
	Store [foo@3 <= Integer [3]]
	Branch [type=Always, target=4]
Basic Block [4] - preds [6,8,7], succs: [9]
	Variable [type = i32, name = foo@2]
	Variable [type = i32, name = foo@1]
	Phi [foo@1 <= foo@5, foo@4, foo@3]
	Store [foo@2 <= Add[
		Load [foo@1]
		Integer [1]
	]]
	Branch [type=Always, target=9]
Basic Block [9] - preds [4], succs: [2]
	Branch [type=Always, target=2]
]

Code: Select all

(defun none start()
  devar foo as i32

  if (1) (
    assign 1 to foo
  ) elseif (2) (
    assign 2 to foo
  ) else (
    assign 3 to foo
  )
  assign (foo + 1) to foo
)
SSA is working now. Should be able to start codegen very soon.

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Sun Apr 23, 2017 7:20 am

:o

User avatar
grivy
Posts: 6
Joined: Sat Oct 05, 2013 5:33 am

Re: Notes on development of the next CEN64 core.

Post by grivy » Wed Apr 26, 2017 6:19 pm

So, for my understanding, I read things like VM, JIT, optimized for host CPU etc.

Am I right in thinking that this language/compiler is, at its foundation, built on principles similar to Java/C#, but very specialized and more low-level so it can apply tricks like queuing faster versions of code blocks when conditions allow for them? Or are there too many optimizations or differences in general for the comparison to be anything but silly?

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Wed Apr 26, 2017 10:48 pm

It is pretty similar in many respects, yes.

But instead of making the execution slower in favor of a dynamic language, it will (hopefully) make the execution faster using some horribly designed syntax. :lol: :D

User avatar
Narann
Posts: 154
Joined: Mon Jun 16, 2014 4:25 pm
Contact:

Re: Notes on development of the next CEN64 core.

Post by Narann » Thu Apr 27, 2017 3:37 am

Plot twist: this language ends up being used by lots of low-level devs, and there's documentation for it. :lol:

User avatar
Nacho
Posts: 66
Joined: Thu Nov 07, 2013 9:25 am

Re: Notes on development of the next CEN64 core.

Post by Nacho » Sat Jun 10, 2017 6:51 pm

Wait a minute... why is using LISP as the code to be compiled a good idea?

I guess the next CEN64 will take a compiled ROM, decompile it into some LISP-madness ball of code, and then JIT the hell out of it. Nice. But why LISP?
Testing CEN64 on: Intel Core i5 520M 2.4 GHz. SSE2 SSE3 SSE4.1 SSE4.2 SSSE3, but no AVX. Ubuntu Linux

User avatar
MarathonMan
Site Admin
Posts: 691
Joined: Fri Oct 04, 2013 4:49 pm

Re: Notes on development of the next CEN64 core.

Post by MarathonMan » Sat Jul 01, 2017 9:59 am

The language only looks LISP-y; it's not actually LISP.

The reason why the language looks like that is because it's trivial to lex and parse:
https://git.cen64.com/?p=cen64.git;a=bl ... fd;hb=HEAD

The entirety of the language is currently parsed by what is a little over 1,000 lines of C!
