
Hi Everyone!

After completing most of the VI work, I went back to fixing bugs on the N64 core. Today we will look at some of them. Instead of just listing every one, I will pick a few and describe them in more detail, as I think this is more interesting.

We start with the most important one: the RDP command interface rewrite.

You can see in the left image that some letters are missing, and this bug affected nearly every game: polygons and sprites were popping in and out randomly. While this may seem like a minor issue, you will soon see why it is so important.

The RDP receives draw commands that span one or more 64-bit "words". Let's look at some examples:

You can see that there are single-word commands for loading a texture or filling a rectangle with a solid color, but also ones that span multiple words. Texture Rectangle, for example, needs a second word for the texture coordinates. This is the command that was broken in the screenshot.

Triangle commands are much worse. Even a "Fill Triangle" already needs 4 words, and the worst case, a triangle with color shading, texturing and z-buffer, needs 22 words. How do we fetch all these commands?
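To keep these sizes straight, one could tabulate them roughly like this. This is only a sketch in C: the command names and the selection are my own shorthand, and the word counts are simply the ones quoted in this post, not a complete list of RDP commands.

```c
/* Sketch: number of 64-bit words per RDP command, covering only the
   examples mentioned in this post. Names are illustrative shorthand. */
typedef enum {
    CMD_FILL_RECTANGLE,       /* 1 word: rectangle filled with a solid color */
    CMD_LOAD_TEXTURE,         /* 1 word: load texture data                   */
    CMD_TEXTURE_RECTANGLE,    /* 2 words: second word holds texture coords   */
    CMD_FILL_TRIANGLE,        /* 4 words: edge coefficients only             */
    CMD_SHADE_TRIANGLE,       /* 12 words: edges plus color shading          */
    CMD_SHADE_TEX_Z_TRIANGLE  /* 22 words: shading + texturing + z-buffer    */
} RdpCommand;

static int rdp_command_words(RdpCommand cmd)
{
    switch (cmd) {
    case CMD_FILL_RECTANGLE:       return 1;
    case CMD_LOAD_TEXTURE:         return 1;
    case CMD_TEXTURE_RECTANGLE:    return 2;
    case CMD_FILL_TRIANGLE:        return 4;
    case CMD_SHADE_TRIANGLE:       return 12;
    case CMD_SHADE_TEX_Z_TRIANGLE: return 22;
    }
    return 1; /* fallback for commands not covered by this sketch */
}
```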

Typically a game prepares a render list with several hundred up to several thousand commands per frame. That is too much data to store directly in the RDP, so it is kept either in RDRAM or in the RSP.

So we need to fetch this data, and there is a simple solution typically used in software emulators that don't emulate the RDP's memory access being shared with the CPU and other components: fetch one word and evaluate it. If the command needs more words, fetch those too.
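Such a naive fetch-and-evaluate loop might look roughly like this. This is a sketch, not actual emulator code: `read_word64`, `command_id`, `command_length_in_words` and `execute_command` are assumed helpers.

```c
#include <stdint.h>

/* Hypothetical memory read: one 64-bit command word from RDRAM or RSP memory. */
extern uint64_t read_word64(uint32_t addr);

/* Hypothetical helpers: identify the command from its first word and map it
   to its length in 64-bit words (as in the table sketched above). */
extern int  command_id(uint64_t first_word);
extern int  command_length_in_words(int id);
extern void execute_command(const uint64_t *words, int count);

/* Naive approach: fetch one word, decide how many more are needed, then
   fetch those as well. Each extra fetch is a separate memory access, which
   is fine in a software emulator but costly on real hardware. */
void process_commands_naive(uint32_t start, uint32_t end)
{
    uint32_t addr = start;
    while (addr < end) {
        uint64_t buffer[22];                  /* worst case: 22 words      */
        buffer[0] = read_word64(addr);        /* first access              */
        int words = command_length_in_words(command_id(buffer[0]));
        for (int i = 1; i < words; i++)       /* additional accesses       */
            buffer[i] = read_word64(addr + 8u * (uint32_t)i);
        execute_command(buffer, words);
        addr += 8u * (uint32_t)words;         /* 8 bytes per 64-bit word   */
    }
}
```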

While this works well in theory, it has a major downside: it's not what the real hardware does. How do I know? Because it doesn't work well in hardware. Even for small commands like a Texture Rectangle, you would need two separate memory accesses, each paying for bus arbitration and memory latency.

This doesn't sound too bad, but imagine you have fetched the first command word and, before you can fetch the second, the VI fetches data for video output. The VI must always have priority, because otherwise the pixels would not be ready in time to be shown on screen. Now you are waiting for the VI to finish its read before your RAM access delay even begins. This slows down the rendering of small elements considerably.

So what is the solution?

We must somehow buffer this incoming data and fetch more than we might need for the next command.

The old solution in the core was to always fetch one full command's worth of data, up to 22 words, and work from it until the available data was not enough to execute the next command. At that point a new block would be fetched, sometimes refetching words from before, so that up to 22 words were available again.

While this worked in 99% of situations, it had a problem I hadn't considered. I had assumed that the RSP or CPU must be aware that the RDP might be slow for some reason and would keep the data of the requested commands valid until they received feedback from the RDP that the data had been processed. But that is not the case.

Imagine a game wanting to draw 4 sprites, which requires 8 words. It issues the commands to the RDP, then works ahead, and somewhere later in the game logic it needs to draw another 4 sprites. So it may reuse the memory from before to store the next draw commands. As that data has already been fetched by the RDP, this should be no issue, right?

Well, unfortunately it was an issue in the core: sometimes words were refetched, and if the command had been modified in between, the RDP would receive different data.

If, after the modification, the command no longer has a valid command ID, it is simply not executed at all, and you get a missing sprite or some other glitch for one frame. But what if the command is modified in a way the RDP cannot handle? In the worst case, the RDP command processor hangs. Imagine a game wanting to draw a rectangle (2 words) that by accident gets interpreted as a "Shade Triangle" (12 words). The RDP would wait for the CPU or RSP to deliver the missing data, but they never will.

The solution to this issue was to use a command FIFO. FIFO stands for "first in, first out" and means that data is fed in on one side (RAM) and read out in the same order on the other side (RDP). Whenever there is free space in the FIFO and the CPU or RSP has issued commands to the RDP, we fill the FIFO. Whenever the current draw command in the RDP completes, we can simply take the next data from the FIFO.

The reason I didn't do it this way in the first place is the handling of incomplete commands, e.g. a command needing 12 words when only 6 are available. In that case the command interpreter has to wait until the FIFO has been filled again. But it turned out this wasn't that difficult to design in, and the result is outstanding:

- the flickering sprites are gone

- random hangs in games are gone

- some games that previously showed garbage now render better
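Put together, the FIFO-based scheme could be sketched in software terms roughly like this. This is a simplification of the actual VHDL: all names and the FIFO depth are mine.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 64   /* assumed depth for the sketch, not the core's real value */

/* A simple command FIFO: the RAM side pushes 64-bit words in, the RDP-side
   command interpreter pops them out in the same order. */
typedef struct {
    uint64_t data[FIFO_DEPTH];
    int read_pos, write_pos, count;
} CommandFifo;

/* RAM side: whenever there is free space and the CPU/RSP has commands
   queued, another word is pushed into the FIFO. */
static bool fifo_has_space(const CommandFifo *f) { return f->count < FIFO_DEPTH; }

static void fifo_push(CommandFifo *f, uint64_t word)
{
    f->data[f->write_pos] = word;
    f->write_pos = (f->write_pos + 1) % FIFO_DEPTH;
    f->count++;
}

/* RDP side: peek at the first word of the next command to decide how many
   words it needs, without removing it yet. */
static bool fifo_peek(const CommandFifo *f, uint64_t *out)
{
    if (f->count == 0)
        return false;
    *out = f->data[f->read_pos];
    return true;
}

/* Only start the next command once the FIFO holds all of its words. If the
   command is incomplete (e.g. it needs 12 words but only 6 are available),
   the interpreter simply waits until more data has been filled in. */
static bool fifo_try_pop_command(CommandFifo *f, uint64_t *out, int words_needed)
{
    if (f->count < words_needed)
        return false;               /* stall: wait for the FIFO to fill up */
    for (int i = 0; i < words_needed; i++) {
        out[i] = f->data[f->read_pos];
        f->read_pos = (f->read_pos + 1) % FIFO_DEPTH;
        f->count--;
    }
    return true;
}
```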


Another important change concerns the bug that caused this:

The logo in the upper image is cut off. At least that is what was reported as a bug, and it looked that way at first glance. Now that you see how it should look, you will also notice that there are more issues, like the (R) being in the wrong position relative to the rainbow coloring below the text. This also seems like a minor issue, but again, we will see the full implications later.

I had just finished most of the VI work and assumed this was a VI bug, so I did what I always do first when a bug is reported: check with my software emulator. If things work there, I have a known-good implementation I can examine to see how things are supposed to work. This doesn't always work out, especially because the emulator has plenty of bugs as well and I don't always spend the time to fix them, and if the issue is a timing edge case, the emulator doesn't help much anyway.

Fortunately, this was not such a case, so I could start. As the emulator rendered the scene fine, I set up the VI simulation in the VHDL simulator. This simulation takes the RDRAM contents from the software emulator as well as the VI register settings and writes one image to a file, as if the file were a TV screen. The resulting pixels can be compared automatically to tell me which ones are wrong. It turned out that the VI was working fine.

I could have stopped here, because it was not a VI bug, but when something itches you sometimes have to scratch, so I went deeper. The next step was to check whether the rendering itself was wrong, so I started the RDP simulation in the VHDL simulator. It takes the RDRAM contents from the emulator and the list of RDP commands the emulator processed over one frame, executes the same commands, and then compares all writes going from the RDP to RDRAM. This basically simulates the rendering of one frame without needing to simulate the CPU, RSP or other parts of the system.

Again the simulation matched 100%, so I was stuck once more. What could be the reason? It means the RDP commands or the RDRAM contents must already be wrong in the core. Looking at the image, the textures and colors seem fine, so it's more likely that the commands are wrong. Thinking about the image again from this new perspective, it looks like an issue with the scaling of polygons.

The conversion from 3D space to 2D screen space is typically done in the RSP, so I looked there. This time I ran the full simulation of the whole core for 2 frames, capturing every single result the core produces:

- every CPU instruction with program counter, opcode and result

- every RSP instruction with program counter, opcode and result

- every pixel rendered

- every pixel output via VI

- every audio sample

This simulation is incredibly slow: it takes at least 5 minutes to simulate one output frame. As many games run at 30 or even fewer frames per second, you often have to simulate 2 or 3 output frames to even get one rendered frame. If you want the whole path from CPU/RSP through rendering and output via the VI, this can easily be 30 minutes of simulation time. But then you have everything captured and only need to find the needle in the haystack: about 3 million CPU instructions and 1 million RSP instructions. Doesn't that sound like fun?

I was lucky that the RSP instruction trace compared quite well against the emulator's RSP output. Of course the timing is completely different and there were thousands of differences, but at some point the differences became so large that the compare tool couldn't find anything matching anymore.

That point was a vector instruction in the RSP where the emulator calculated a value around 60, while the core produced a value around 90. I searched for where this number came from, and again I was lucky. The whole scenario I simulate here comes from a savestate taken from the emulator at this particular point in the game's intro, and this savestate naturally contains the entire RDRAM of the system.

As the savestate was taken at a moment without movement, all calculations for polygon positions should be static, as nothing is moving, right? Well, if you load the savestate in the emulator, they are indeed static, but on the core the polygons were transformed incorrectly, so the positions must have changed.

That is exactly what happened here: the value used for the calculation in the RSP came from RAM and was static in the emulator, but changed in the core. Now the task was to find out why it changed, because it should not. I found that a floating-point operation in the CPU was producing wrong results, which were then stored in RAM.

Now we have it! Finding the root cause is 90% of the work; we just need to find out why it happens. As a floating-point operation is rather simple at a high level, this wasn't so difficult. It was a multiplication a*b=c, but b differed between the emulator and the core. So even easier: the incoming value was wrong, not the calculation itself.

But wait! The first difference was the floating-point result, so how can b already be different? Let's look at the instruction flow in detail:

The core was using the wrong value for floating-point register 16, even though it was updated with the correct value just before. How is this possible?

This is relatively simple to explain with the 5-stage pipeline of the CPU: several instructions are in flight in parallel, with one being executed at a time while others are already preparing required data and yet others are storing results of previous calculations. In this case, the pipeline has to wait (stall) for the loaded value to be available before it can execute the multiply.

But if that was broken, how could the core pass all the FPU tests and run most games fine? The reason is a legacy mode of the N64 CPU. While the N64 FPU has 32 registers of 64-bit width, the previous generation of this CPU only had 16 FPU registers. To emulate that behavior, the CPU has a special mode that combines the 32 bits of two registers to reproduce a register set that is only 16 registers large.

This means that in this mode, to change the upper 32 bits of register 16, the CPU instead writes the lower 32 bits of register 17, but then uses register 16 in the calculation. The problem in the core's CPU instruction decoding was that it did not stall the multiply: the multiply needs register 16, but due to this remapping it was register 17 that was written right before, so the dependency went unnoticed.
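In software terms, the broken stall check and its fix could be sketched roughly like this. The register-pair details of the real legacy mode are simplified here, and all names are mine, not the core's signals.

```c
#include <stdbool.h>

/* Sketch of the load-use hazard check for FPU registers.
 *
 * In the legacy (16-register) FPU mode, the 64-bit register N (N even) is
 * built from two 32-bit halves, and writing its upper half actually targets
 * register N+1. So a load that "writes register 17" and a multiply that
 * "reads register 16" do touch the same 64-bit value. */
static bool fpu_load_use_hazard(int load_dest, int mul_src, bool legacy_16reg_mode)
{
    if (legacy_16reg_mode) {
        /* Compare the even/odd register pair, not the raw numbers:
         * 16 and 17 both map to pair 8, so the dependency is detected
         * and the multiply is stalled until the load has completed. */
        return (load_dest >> 1) == (mul_src >> 1);
    }
    /* In the full 32-register mode a plain compare is enough. */
    return load_dest == mul_src;
}

/* The original bug, in these terms: the check behaved like the non-legacy
 * branch even in legacy mode, so load_dest = 17 vs. mul_src = 16 produced
 * "no hazard", the multiply was not stalled, and it read a stale value. */
```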

With the stall condition fixed, the logo looked right. But as you can already guess, this was only the tip of the iceberg. Another game I suspected could be affected was Glover: when the camera moved in its cutscenes, it moved incorrectly, and I had always assumed this was an FPU bug. It's fixed now. After a test build was available, plenty of games were found in a short time that improved, some of which had even been crashing because of this bug.


The last fix I want to show is relatively simple compared to the previous ones:

Games storing the rendered image in 32-bit framebuffer mode had issues:

- wrong colors on some assets, mostly involving transparencies

- seams in the polygons

The first issue was found quickly: when reading back a pixel from the framebuffer in 32-bit color mode so it can be blended with the new pixel to be drawn, the color channels were mixed up.

The second issue was related to the way the rendered pixels are stored in memory. In 16-bit render mode, each pixel is 2 bytes wide, so to calculate the framebuffer address you multiply the x position by 2 and add it to the start address of the line.

In 32-bit mode, however, you need to multiply the x position by 4 to get the byte address. That was all implemented and working. The issue turned out to be the Z-buffer, which stores the depth information for each pixel and is always 2 bytes wide.

To save resources, I used the same indexing for the Z-buffer as for the framebuffer, but forgot that the two can have different offsets, so the rendering used the wrong Z-buffer value, resulting in some pixels not being drawn. With that fixed, games rendering in 32-bit mode seem to be fine.
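As a small illustration of the addressing involved, here is a sketch of the two calculations. The variable names and base addresses are mine, not the core's signals.

```c
#include <stdint.h>

/* Byte address of a pixel in the color framebuffer.
 * 16-bit mode: 2 bytes per pixel; 32-bit mode: 4 bytes per pixel. */
static uint32_t framebuffer_addr(uint32_t fb_base, uint32_t line_start,
                                 uint32_t x, uint32_t bytes_per_pixel /* 2 or 4 */)
{
    return fb_base + line_start + x * bytes_per_pixel;
}

/* Byte address of the matching Z-buffer entry. The Z-buffer is always
 * 2 bytes per pixel regardless of the color depth and has its own base
 * address and line offset. Reusing the framebuffer indexing (x * 4 in
 * 32-bit mode) for the Z-buffer is exactly the mistake described above. */
static uint32_t zbuffer_addr(uint32_t z_base, uint32_t z_line_start, uint32_t x)
{
    return z_base + z_line_start + x * 2u;
}
```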


That's it for the detailed descriptions this time. There were at least 10 more fixes over the last week, but I cannot showcase them all here and would rather spend the time on improving the core.

With the attached core, you will get all of them and games should run better than before.

Thank you all and have fun!


Comments

Anonymous

Robert thank you for all of your hard work. My n64 games are getting quite the workout in HD finally!