Contributing to CEN64 + Emulator architecture expanation

Discuss any unrelated topics here.
Post Reply
User avatar
Nacho
Posts: 66
Joined: Thu Nov 07, 2013 9:25 am

Contributing to CEN64 + Emulator architecture expanation

Post by Nacho » Wed Jul 08, 2015 2:08 pm

Hi!

You may remember me from some post at the forums, but I'd like to introduce myself:

I'm Nacho, from Spain, former physicist, computer enthusiast and N64 gamer. I'm somewhat proficient in C and I would like to contribute to CEN64 in a more deep way than a mere ROM testing.

That being said, I find the code somewhat shady, mainly due to the lack of documentation specifically pointed to the emulator itself.

So, what's the best way to start looking into the actual code? What can I do to understand better the architecture of CEN64?
Testing CEN64 on: Intel Core i5 520M 2.4 GHz. SSE2 SSE3 SSE4.1 SSE4.2 SSSE3, but no AVX. Ubuntu Linux

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Contributing to CEN64 + Emulator architecture expanation

Post by MarathonMan » Wed Jul 08, 2015 5:34 pm

Bless your soul. :P

There's an incredibly numerous number of ways to contribute, and it mostly boils down to: what do YOU want to do?

Do you want to focus on accuracy, performance, compatibility, UI/UX, etc.? If you're focused on any of the first three, you have your choice of the RCP (RSP, RDP, or any of the controllers - AI/VI/PI/SI, etc.), or VR4300. There's also potential improvements in the areas of architectural support(ARM) or OS/library support. Or, maybe you just want to get a specific ROM booting. I'm not sure. :D

Sorry about the lack of documentation. I am definitely free to answer questions and can start providing documentation, but it's beneficial to the both of us if you pick a starting point and I can start filling in those relevant holes first. Also free free to ping me on irc: @marathonm on efnet in #n64dev.

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Contributing to CEN64 + Emulator architecture expanation

Post by MarathonMan » Wed Jul 08, 2015 6:52 pm

On a high level, the architecture is structured like so, very similar to the RCP:

Code: Select all

bus <=========+
              |- AI (Audio Interface) ==> OpenAL
              |- VI (Video Interface) ==> OpenGL
              |- PI (Parallel Interface -- ROM, 64DD, etc.)
              |- SI (Serial Interface -- controller, etc.)
              |- MI (MIPS interface) ==> VR4300
              |- RI (RAM interface) ==> RDRAM
              |- RSP
              |- RDP
Every component has a connection to the bus, and the bus has a connection to every component. Just like the RCP, components are accessed through a memory-mapped I/O interface:

Code: Select all

int bus_read_word(void *component, uint32_t address, uint32_t *word);
int bus_write_word(void *component, uint32_t address, uint32_t word, uint32_t dqm);
component is just a pointer to yourself (i.e., &vr4300, or &rdp), the address and word should be obvious. Addresses must be word-aligned. dqm is just a write mask and can be used to write only part of the word, if you only want to write a byte or something -- it's effectively used to prevent the need for functions that write bytes, halfwords, doublewords, etc.

In general, functions return 0 on success and non-zero (or NULL) on failure.

For understanding the N64 memory map, refer to: http://infrid.com/rcp64/docfiles/n64maps.txt

User avatar
Nacho
Posts: 66
Joined: Thu Nov 07, 2013 9:25 am

Re: Contributing to CEN64 + Emulator architecture expanation

Post by Nacho » Thu Jul 09, 2015 11:15 am

That's really a good starting point :P

Regarding what I want to do, I think the question is more like "What I'm able to do?", since I want to take a deeper look at the insides of the beast and seeing what can be improved/implemented. By example: some crazy assembler stuff discussed on the previous days are clearly out of my league, but lurking trough the code and tweaking instructions may be useful. The more the people understanding the emulator, the bigger the possibilities of making significant improvements.

And even if I'm not able to do some important contribution, it would lead to some interesting discussion, documentation and community building, which is always good in such a project like this.

By the way...

Code: Select all

$ git clone git://git.cen64.com/cen64.git
Clonar en «cen64»...
fatal: unable to connect to git.cen64.com:
git.cen64.com[0: 104.236.100.222]: errno=Refused connection

Testing CEN64 on: Intel Core i5 520M 2.4 GHz. SSE2 SSE3 SSE4.1 SSE4.2 SSSE3, but no AVX. Ubuntu Linux

User avatar
Snowstorm64
Posts: 303
Joined: Sun Oct 20, 2013 8:22 pm

Re: Contributing to CEN64 + Emulator architecture expanation

Post by Snowstorm64 » Thu Jul 09, 2015 1:15 pm

For a good start, you can try to implement EEPROM/SRAM/FlashRAM support in the emulator, I believe it all takes are in which address they are found and how to read data or store the changes in a specified file. You may want to take a look at the CEN64's old core code to understand how it works.
OS: Debian GNU/Linux Jessie (8.0)
CPU: Intel i7 4770K @ 3.5 GHz
Build: AVX (compiled from git)

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Contributing to CEN64 + Emulator architecture expanation

Post by izy » Fri Jul 10, 2015 10:34 am

I began to give a look at bus_read_word/bus_write_word
Looking into the first call made to the file ri/controller.c i noticed that there are calls to byteswap_32 being made there. byteswap_32 uses __builtin_bswap32 which uses bswap r32. Bswap is a slow instruction. It isn't a bad thing to break the dependency between the two bswap-instructions. Just for fun, at least :D
The original code is:

Code: Select all

  word = byteswap_32((byteswap_32(orig_word) & ~dqm) | word);
but i would write it the other way round:

Code: Select all

  word = byteswap_32(word) ^ orig_word ^ (byteswap_32(dqm) & orig_word);
both versions have 2 calls to BSWAP and 3 logic operations (NOT, AND, OR -> XOR, XOR, AND). However, in my version - if possible - both BSWAPs could theoretically execute (or at least start to) at the same time.

Code: Select all

// Writes a word to RDRAM.
int write_rdram(void *opaque, uint32_t address, uint32_t word, uint32_t dqm) {
  struct ri_controller *ri = (struct ri_controller *) opaque;
  unsigned offset = address - RDRAM_BASE_ADDRESS;
  uint32_t orig_word;

  memcpy(&orig_word, ri->ram + offset, sizeof(orig_word));
  #if !defined(__i486__) && !defined(__x86_64__)
  word = byteswap_32(word);
  dqm = byteswap_32(dqm);
  word ^= orig_word;
  word ^= (dqm & orig_word);
  #else
  asm volatile(
    "bswap        %k[word]                  ;"
    "bswap        %k[dqm_]                  ;"
    "xor          %k[orig], %k[word]        ;"
    "and          %k[orig], %k[dqm_]        ;"
    "xor          %k[dqm_], %k[word]        ;"
    : [word] "+&r" (word), [dqm_] "+&r" (dqm)
    : [orig] "r" (orig_word)
    : "cc"
  );
  #endif
  memcpy(ri->ram + offset, &word, sizeof(word));
  return 0;
}
The same code could be used for the function write_pif_ram().
Also another cosmetic thing,
$ grep -re '.*offset.*REGS_BASE_ADDRESS.*'
grep says you should decide if "offset" is unsigned or uint32 (like "uint32_t address")

Anyway i answer because i actually really would like to know something else. I noticed that bus_read_word/bus_write_word call "resolve_mapped_address". It is a binary search function.
Since it accesses the memory and i don't think the tree gets new nodes at runtime (but even if it does maybe nothing changes), isn't possible to use a LUT? A lookup array could access the right node/item in constant time or O(1). Maybe it isn't a function called often and makes no difference or there is any particular reason why it has to be a tree?

User avatar
MarathonMan
Site Admin
Posts: 692
Joined: Fri Oct 04, 2013 4:49 pm

Re: Contributing to CEN64 + Emulator architecture expanation

Post by MarathonMan » Fri Jul 10, 2015 2:42 pm

Good find with the ordering of the bswap instructions. That's a relatively hot section of code, and the same technique can also be applied during RSP DMAs and the like.

Regarding the tree; that's almost something that I'm willing to take a penalty on as it gives me a lot of flexibility for both adding support for new peripherals (like the 64DD), as well as inserting things into the address space without ripping my hair out. If you look into the mapping further, you'll see that a lot of the functions end up calling read_*_regs or write_*_regs. These functions, in turn, switch *again* on the low order address bits to pick a MMIO'd register within the controller's address region. I've been toying to see how adding extra mappings in the tree instead of switching for a second time locally improves or reduces performance, for example. If I use a LUT, I basically lock myself into a specific approach unless I feel like adding a bunch of padding and reconsidering the shift amounts and masks... not fun!

And FYI - no, there is no need for it to be dynamic (the physical address map is fixed). However, the tree is actually a red-black tree, so regardless of whether or not I add nodes at a later time, the tree will always remain balanced.

Also, I've actually tried implementing a LUT, but it's quite hard in this case as the physical address map is quite sparse (http://infrid.com/rcp64/docfiles/n64maps.txt). You end up either creating an enormous LUT, or a LUT which looks into another LUT, etc. to get around the sparse-ness of the physical address layout. So in the end, I don't even think the potential performance gain of a LUT is all that great in this case. And it thrashes the caches a bit harder than the tree does (since all the tree nodes are allocated contiguously).

Another thing I did to make the tree less expensive is make note of the fact that RDRAM accesses are extremely frequency compared to all other accesses. Thus the tree is really only consulted for non-RDRAM accesses, which are both infrequent and you can expect to pay a penalty for anyways:

Code: Select all

// Issues a read request to the bus.
int bus_read_word(void *component, uint32_t address, uint32_t *word) {
  const struct memory_mapping *node;
  struct bus_controller *bus;

  memcpy(&bus, component, sizeof(bus));

  if (address < RDRAM_BASE_ADDRESS_LEN)
    return read_rdram(bus->ri, address, word);

  else if ((node = resolve_mapped_address(&bus->map, address)) == NULL) {
    debug("bus_read_word: Failed to access: 0x%.8X\n", address);

    *word = 0x00000000U;
    return 0;
  }

  return node->on_read(node->instance, address, word);
}

User avatar
izy
Posts: 25
Joined: Tue Jun 02, 2015 11:34 am

Re: Contributing to CEN64 + Emulator architecture expanation

Post by izy » Sun Jul 12, 2015 10:31 am

MarathonMan wrote:If I use a LUT, I basically lock myself into a specific approach
Adding code doesn't remove yours. If you're interested, the Tree could be kept and used for tests. All you'd need is to add a few "ifdef"s in order to decide which code (tree/lut) to use.
MarathonMan wrote:Also, I've actually tried implementing a LUT, but it's quite hard in this case as the physical address map is quite sparse
I've implemented the version with the LUT. It works and so far it was very easy to code. (for me, though).
MarathonMan wrote:You end up either creating an enormous LUT, or a LUT which looks into another LUT
Indeed. A plain LUT with 4 billions of items would have been quite big.
My LUT is big but it dropped from 4 billions of items, down to 65536. And then from 512 KB of required memory, down to 64 KB. And then from 64KB down to 32KB. It's a single LUT, not two.
I've built a compression function in order to keep the LUT small.
The compression function i've wrote is made of three lines of code in C. They shall translate to six instructions in assembly.

Of course if you change the amount of memory segments in order to implement new segments & new stuffs, the LUT could grow and a different compression function would be required (that's very likely but not automatic)
MarathonMan wrote:I don't even think the potential performance gain of a LUT is all that great in this case
At the moment my versions with the LUT are six to seven times faster than the original function resolve_mapped_address with the tree. The more segments you add, the more advantageous is going to be a sort of "lookup by index array" over a search function.
The performance gain regarding the LUT alone is noticeable if you call it in a loop for tests, you notice it takes 1 seconds in place of 6; but maybe the program won't see a big difference if resolve_mapped_address is seldom used.
MarathonMan wrote:And it thrashes the caches a bit harder than the tree does (since all the tree nodes are allocated contiguously).
In a lookup table/array all items are allocated contiguously. It could be maybe of help to allocate the LUT somewhere followed by the array struct memory_mapping mapping[] at a very next address, in order not to jump between very different addresses. Perhaps before accessing the LUT also a prefetch against struct memory_mapping mapping[] could be added (but i'm unsure if it's going to be useful, not tested --don't know).

The first compression function i've wrote generates a 16-bits number and the LUT has 65536 elements (2**16). I've reduced later the amount of items. I was holding 65536 pointers, they used up to 65536*sizeof(struct memory_mapping*) bytes of memory. The pointed elements are inside the sequential array struct memory_mapping mapping[]. I replaced the pointers with the index value. Since you've implemented at the moment 19 memory maps (static const struct bus_controller_mapping mappings[NUM_MAPPINGS] = {...19 items...}); i used an uint8_t to hold the indexes and this caused the LUT to decrease its size down to 64KB (65536*sizeof(uint8_t) = 65536).
I wrote the code based on your (cen64s) memory segments. The memory mappings at n64maps.txt are different. Just to name two cases, these segments seem to be unimplemented in cen64.

Code: Select all

 0x1FD0 0000 to 0x7FFF FFFF Cartridge Domain 1 Address 3
 0x8000 0000 to 0xFFFF FFFF External SysAD Device
If i add the support for those segments then the required memory for the LUT could increase.
For example.
- If i want to add to the compression function the support for "Cartridge Domain 1 Address 3", nothing at all changes. The compression function will be the very same and the LUT won't increase its size. It's free.
- If i need to add the support for "External SysAD Device" then the compression function changes a lot and a more complex compression function will be required. The generated hash will be of 17 bits thus doubling the size of the LUT. Because the amount of items is going to be 2^(16+1). On the other hand, a special case inside the compression function could be implemented to avoid to extend the LUT at the expense of some more code (which is anyway going to create a slow down every time the CPU executes it, but trashing less data cache).

It's also possible to save bits sometimes... I've given a small penalty/delay to the segment "0x1FC007C0 to 0x1FC007FF PIF RAM" in order to save some memory. I removed a bit from the compression function and the new LUT is now made of 2^15 elements (from 2^16 down to 2^15). It's a possible trade off optimization. Consume a new bit here, save an old bit there...
This approach had the following consequences:
- halved the items of the LUT
- halved its size (now either 32KB for indexes or 256KB for pointers)
- all ranges get a small speed up (because of the smaller table)
- only accessing the range 0x1FC007C0-0x1FC007FF will get a delay
- the code is more complex to manage
- overall this implementation is faster than the other one where the short range of PIF RAM gets no penalty

Post Reply

Who is online

Users browsing this forum: No registered users and 1 guest