FPGAzumSpass

How fast is the RSP? (Patreon)

Published:

2023-08-02 08:03:28

Imported:

2023-12

Content

Hi Everyone,

todays progress update will go into some technical details of the RSP and I think they can be presented best by comparing the RSP against other systems I worked on in the past.

If you still have the tasks from the last post in mind, then I can tell you that 2 of them have been completed:

- implement Scalar Unit access to the RSP/RDP registers

- implement Vector Unit (decoding, memory access, processing)

That leads to the great result of the RSP passing all of the n64-systemtests for the RSP:

Does that mean the RSP is complete?

Unfortunatly not yet. While it's calculations should be fine as the test proves, it's actually the instruction decoding step itself that is still missing 2 key features. They will probably not cost a lot of ressources, but are kind of tricky to build so that they could leads to random issues. That's why I decided to postpone the implementation of them until some games work, so that regressions can be observed more easily than with the strict execution scheme of the n64-systemtest.

Nontheless I want to tell you about them now and today, as this gives me the chance to introduce you to the world of RSP and it's calculation power.

We will talk in MIPS today. While MIPS is the processor architecture of the N64 CPU and (because of that) also the name of the yellow rabbit in Mario 64, it is also an abbreviation of "million instructions per second". So if a processor can do 1 MIPS, that means it can do for example 1 million adds of two numbers per second.

These numbers can tell us about the raw calculation power of a processor, even thought there are also other factors like the amount of bits (8,16,32,64) for each of these operations, other components like the memory speed or special instructions like division that take multiple cycles.

Let's keep it simple and just assume the maximum values possible, without looking at cache sizes and memory restrictions. I know that others try to value those things in when calculting the MIPS number, but as workloads of programs and games and be very different, I will not dive into that here.

Instead we compare some systems main CPU with the bare numbers for now:

These 3 CPUs have one simple thing in common: they use a CPU that is capable of executing 1 instruction per cycle, so the reachable MIPS result just from the clock speed of the CPU.

For the GBA, that's it already. There is no other component in the system that can do any calculations that could be read back in a meaningful way, so this calculation power is the top limit. Always keep that in mind if you ever play a 3D title on the GBA. What the developers reached there with so few processing power is amazing.

Both PSX and N64 instead have additional chips that help them calculate things that would require more processing power than the CPU has. Or in case of the N64 only one: the RSP.

The PSX has 3 of them:

- MDEC is a fixed logic processor for image and movie decompression

- GTE is a math processor with fixed instructions for all kind of 2D and 3D calculations (movement, lights and color, mapping 3D to 2D space, ...)

- SPU is a fixed logic sound processor for 24 channels and various extra features

The RSP in the N64 is only one processor, but it consists of 2 subsystems:

- Scalar Unit that works like a standard 32 Bit MIPS processor (just like the PSX CPU at higher clock speed)

- Vector Unit that can do 8 * 16 Bit calculations in parallel

Other than the fixed function chips of the PSX, the RSP can do whatever calculation the developer wants. This is often referred to as microcode: a special code for 3D calculations is loaded into the RSP and for the next run you could make it act as if it was behaving comparable to a GTE. A different microcode and it can do movie decompression or audio.

This makes the RSP very flexible, but you can already see that it must cover up what 3 chips did in the PSX, so let's see how fast the RSP is and compare it against the single chips of the PSX.

- the RSP Scalar unit is easy in this regard. It's just the MIPS CPU running at 62.5Mhz, so it will yield 62.5 MIPS. Because the RSP memory is so tighly coupled, this is even the case when memory operations are involved, with only some minor exceptions

- the RSP Vector units power is also very easy to calculate: 8 operations in parallel at 62.5 Mhz means a total of 500 MIPS, an insane large number for this generation of consoles. However, there are some drawbacks: most operations are only 16 Bit and the Vector Unit has no operand forwarding. 16 Bit means that some calculations that need large numbers or high precision need multiple instructions for 1 calculation. What the operand forward means we will discuss later in detail but for now: it makes it more difficult to write fast code or if you don't take care, your code might only run at 25% speed

- Both units can run in parallel, as the RSP can decode and process 2 instructions per cycle, leading to a total 562.5 MIPS

On the PSX side:

- The GTE can do 3 operations in parallel at 33.3Mhz, so that yields 100 MIPS. Due to the fixed function pipeline you could not utilize all calculations units all the time, so the real performance is lower, but we go with that number for now

- The MDEC can decode image blocks at 33.3Mhz with 2 multiplications and 2 additions per cycle and also prepare some new block in parallel as well as color convert the decoded output in parallel. So it kind of reaches a performance of 200 MIPS, but with very limited use cases

- It's very hard to get any numbers for the SPU, but let us try: it can calculate 44100 Samples per second for 24 channels. For each channel there are at least 20 calculations involved, making this a total of 21 MIPS. Reverb and Channel summing is also required, so we assume at least 30 MIPS here to cover the functionality.

Of course this is all very inaccurate and should only give a rough estimate, but it does show 2 things:

- The RSP is very powerful as a single chip and if completely used to it's full potential can outperform all the special chips in the PSX

- The gap isn't very large, which means that developers also must make use of the RSP in a good way, otherwise it will even fall back. That's why some developers like Rare developed their own microcode

This whole concept of fixed function and programmable function special chips also has significant influence on emulation. Fixed function chips are much faster to emulate as shortcuts can be made when functions are known.

One example: the MDEC block decoding typically takes 1024 multiplications. Martin Korth (also known as NO$, author of various emulators and amazing documentation for e.g. GBA and PSX) published a way to do the same in only 80 multiplications with the restriction that it only works on all known PSX games.

On the other hand, having a general purpose chip like the RSP, while requiring immense processing power to emulate in software when done low-level, is much easier to understand and research because it can be programmed freely and tests can be written for it.

In case of the FPGA, the hard/slow to emulate argument doesn't even exist, because the space requirement in the FPGA is much easier to fulfill:

Even with the RSP not being fully complete yet, there is no way it would ever require the total ressources of the 3 chips from the PSX.

Just in case you are wondering why the MDEC is so small: A lot of the functionality is done by DSP blocks(multiplications and additions) and memory blocks being read with a special access pattern and both do not contribute to the logic size in a FPGA, but would in the original chip.

So with all that learned, what are the 2 things the RSP implementation in the core is still missing?

The first is parallel execution of Scalar and Vector unit. The RSP could execute both in parallel, but there are several restrictions on when that works and when it doesn't. For example if you exchange data between both units, this occupies both calculation slots.

As there are about 30 different operations and there are plenty of combinations from things running after each other, there can be random race conditions when this feature is added. Furthermore it's not really researched which constrains really exist, so it would also require to write some testroms to find out. Overall this seems like a good task to do when some games run. The performance impact without it is likely not that large anyway.

The second missing feature has to do with the mentioned operand forwarding, which does not exist in the RSP. Now you probably ask yourself how something could be missing that doesn't even exist, but let me try to explain what operand forward even means.

The main MIPS CPU, as well as the RSP and both of it's units do calculations in multiple steps. here is how it looks for the vector unit:

- Stage 1: instruction fetch from memory/cache

- Stage 2: decode instruction and fetch operands from register bank (values for the calculation)

- Stage 3: calculate

- Stage 4: clamping and rounding of the result

- Stage 5: store values to register bank

One instruction will go through all stages, so there are 5 instructions "in flight" all the time. With every clock cycle, the instructions advances one Stage ahead.

So if we consider some ADD operation "C=A+B", then A and B would be fetched in Stage 2, C would be calculated in Stage 3, rounded in Stage 4 and stored back in stage 5.

But what if the next instruction after "C=A+B" would be "D=C+A"?

The result of C needs 3 cycles until it is written back to the register bank and can be fetched in Stage 2 and the whole processor has to wait for the result of C to be finished, before the "D=C+A" operation can leave the decode stage. Waiting means doing nothing, so it will spend up to 3 cycles with doing nothing in the stages 2-4.

If you would write code where each instruction requires the result of the previous one, the RSP vector unit would only run at 25% speed, which would be horrible.

The main N64 CPU and also the RSP Scalar unit however have operand forward. That means that results from stage 3 and 4 can be feed back immidiatly and used in the next cycle. So the processor never has to wait for the last result to be stored back before using it.

This function is crucial for a general purpose processor, as it can never know which code will be executed. (As a sidenote: this is also one of the reasons why the AO486 CPU is slower than it should be when running at 90Mhz: it's missing operand forward)

For the vector unit however, having operand forward would have a huge impact in size, because it would not only be required once, but instead for every vector unit, making it 8 times larger in case of the RSP. That's why the RSPs Vector unit doesn't have operand forward and instead has to wait when the result required is not yet fully passed through the pipeline.

This is a much smaller issue for the RSP than you would think, as the microcode can be handcrafted and optimized to take care of that and only let that happen in cases where it's really required.

I ran a test in Mario Kart 64 with my software emulator and could find 111.170 cycles of wait in 1.000.000 cycles of RSP activity, so about 11% slowdown. Banjo Kazooie only had 60.419 wait cycles in that test. Given the raw power of the RSP, 6-11% slowdown isn't really much of a deal.

In any case, this kind of wait is still missing in the core. The reason is that waiting on a calculation to complete needs tracking of which instruction will write which result and also needs to know which operand is really needed to see if a wait cycle is required.

This can be complicated with all edge cases, as there could be holes in the logic, making the RSP calculate with old results, resulting in all kind of bugs, even ones that cannot be observed directly.

That's why I decided to shut down all these holes for now by always waiting 3 cycles after each vector instruction, effectivly bringing the RSPs Vector unit down to 125 MIPS. In know this needs to be done, but there is a good time to do it: once some games work ok and only have slowdown due to the waiting, I can safely implement it and take out the risk to debug such a complicated case together with other bugs, making me blind about the real issue.

That's it for today.

I will start with working on the RDP soon. It will be very basic in the first weeks, so don't expect any ingame shots soon, but rather homebrew demos and testroms for e.g. texturing or color mixing.

Once I have it at a good level, we can start with trying games, but there will be a lot of initial bugfixing before the first one can really start. I'm still excited that it might not be that far in the future.

Have fun!

Content

Files