

Hi everyone,

today we will look at the (maybe?) final part of the CPU articles, with a feature I very much feared implementing because of its complexity and side effects.

Let's start with the status before it was implemented and look in detail at what happens there:

This is the output of a test I have written to check the timing of loading data from RAM in the CPU. It contains several subtests, and we will look at some of them in detail.

You would assume that loading data from RAM costs some time for the RAM access itself, maybe some overhead in the CPU, and that's it. Unfortunately, it is way more complicated.

Let us look at the simplest subtest first: 10NOP

In this subtest nothing useful is executed; the CPU just runs 10 NOPs without doing anything. It is important to have a baseline for the timing: start a timer, do nothing 10 times, then read the timer out again. This costs 16 cycles.
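As a rough sketch in MIPS assembly, the subtest boils down to this (the timer access is not shown in the article, so it is only a placeholder comment here):

    # read timer -> start value
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    # read timer -> end value; elapsed: 16 cycles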

With that in mind we can do the first real test: LWUSE1

In this test, 2 of the 10 NOPs are exchanged: one for loading data from RAM and one for actually using the loaded value. The name of the test comes from the ordering:

LOAD - NOP - USE

This is the typical way to load data and use it, and we can see that it costs 22 cycles in total. Because the USE costs the same as a NOP, that leaves 6 additional cycles for the RAM read. That is all I initially wanted to know from the test.
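In MIPS assembly, the exchanged part of the sequence could look like this (the register names and the concrete USE instruction are my choice for illustration, not taken from the actual test):

    lw   $t0, 0($a0)     # LOAD: read a word from RAM into $t0
    nop                  # one unrelated instruction in between
    addu $t1, $t0, $t0   # USE: first instruction that reads $t0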

At this state of the core, all LWUSE test cases have that timing, because a load from RAM just costs 6 additional cycles. However, the hardware shows that there is a difference depending on how the data is used after the loading. But why?

Let's look again at the CPU pipeline overview from the last article: we have 5 CPU stages that work in parallel. Stage 4 does the load from RAM.

Whenever a stage cannot complete its work in 1 cycle, the pipeline is typically stalled. That means that all stages pause their work until the stalling stage has completed its multi-cycle work.

This is easiest understood with the Instruction Fetch stage: if the next instruction word is not available and must be fetched from RAM, all other stages must pause, because no new instructions can be delivered to them for execution.

This also works the same way in the other direction: if the Execute stage is stalled because it is waiting for the result of a multiplication, the Decode stage cannot deliver its part to the Execute stage, because there is no space yet to store and handle it.

This is the typical behavior of an in-order pipeline with 1 unit in every stage.

The same should happen in theory with the Load/Store stage: if data is loaded from RAM, the pipeline should be stalled until the data is received and execution can continue. But the implementation in the MIPS CPU is more advanced, as we will see.


Let's look at the next subtest: LW NO USE.

The name comes from executing a load instruction that loads data from RAM into a CPU register, but the register is then never used at all (or not for a very long time).

We can see in the PS1 column that it only takes 2 additional cycles to execute this instruction. How is that even possible? The RAM is not that fast.

The solution is both simple and very complicated: if the data isn't used afterwards, you don't have to stall the pipeline. You wouldn't even have to execute the read at all.

As we don't know the program flow before executing it, we cannot know if the loaded data will ever be used, so we cannot skip the reading completely.

But when is the stall needed? Well, whenever an instruction that follows the read uses the register that the loaded value is stored in.


If we look at the pipeline again, those instructions may already be in flight. Let's take the first example for that: LOAD - NOP - USE:

The LOAD is already in the Load/Store stage, while the USE instruction is still in the Instruction Decode stage.

The USE instruction could be ADD or SUB or MUL or anything else that requires the register that was loaded. Or better: that is currently being loaded, because both are executed in parallel.

As the register value is read from the register file in the Instruction Decode stage, the load must complete in this example before the pipeline can even advance 1 more step.
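Laid out over the 5 pipeline stages, the situation looks roughly like this (the name of stage 5 is my assumption based on the usual MIPS layout):

    Stage 1 (Instruction Fetch)  : next instruction
    Stage 2 (Instruction Decode) : USE   <- wants to read the register file here
    Stage 3 (Execute)            : NOP
    Stage 4 (Load/Store)         : LOAD  <- RAM read in progress
    Stage 5 (Writeback)          : previous instruction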


Now let's look at the LW NO USE case:

You can see that there is no usage of the loaded value in the next instructions, so the pipeline has no reason to stall.
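As a sketch (again with register names of my own choosing):

    lw   $t0, 0($a0)     # LOAD into $t0
    nop                  # none of the following instructions read $t0,
    nop                  # so the RAM read can finish in the background
    nop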

Why does the LOAD still cost 2 additional cycles in this case? This is pure speculation now, as the details are not documented, so I will explain it from the view of my own implementation, which also needs these two cycles:

- The first cycle is lost directly after the LOAD is started. The reason is that an instruction could already be sitting in the Decode stage, but it does not yet know about the interlock on the register that is currently being read. So we need to stall here, just in case the value that is currently being fetched is already needed in the same cycle. This happens in the "LOAD - NOP - USE" case from above.

- The second cycle is lost when the value read from RAM is received. As the CPU is not stalled, every cycle could complete some calculation whose result has to be stored in the register file. Since only one value can be stored in the register file per cycle, the CPU cannot complete an instruction like ADD and store the received value from RAM in the register file in the same cycle. Therefore it must stall for 1 cycle, leaving a slot open to write the value back (see the sketch below).
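Here is a sketch of that write port conflict (a made-up example, not from the test):

    lw   $t0, 0($a0)     # the result arrives from RAM several cycles later
    nop
    nop
    addu $t2, $t3, $t4   # independent ALU op: if its writeback falls into the
                         # same cycle in which the RAM value arrives, both need
                         # the single register file write port -> 1 stall cycle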


The result is that we need a baseline of 3 cycles for every LOAD: 1 cycle for the instruction itself and 2 cycles for the stalls. That is the cost even if the value is never used.

In the worst case, we have to keep the pipeline stalled over the full read, which results in 7 cycles for the LOAD: 1 cycle for the instruction and 6 cycles for the RAM access.

Between these two values, everything just depends on when the loaded value is needed:

If the value is needed after 1 or 2 instructions (LWUSE1 and LWUSE2), it costs the full time of 7 cycles. With every additional instruction before the usage, the cost shrinks by 1 cycle until it hits the minimum with 6 instructions between LOAD and USE -> LWUSE6.
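Summarized per subtest, the LOAD then costs:

    LWUSE1 -> 7 cycles
    LWUSE2 -> 7 cycles
    LWUSE3 -> 6 cycles
    LWUSE4 -> 5 cycles
    LWUSE5 -> 4 cycles
    LWUSE6 -> 3 cycles (the minimum, same as LW NO USE)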

Now we finally understand what happens and can come up with a solution.


There is one exception that you might have noticed, and I want to explain it: LWUSE0

In this case, the read value would be used in the very next instruction already.

You can see that it costs the same as if the value is not used at all. That is due to special handling in the MIPS CPU: because the USE instruction is already in the Execute stage, it cannot receive the value from the LOAD anymore (unless you add extra handling), so the MIPS CPU forbids that special case.

This is called the "Load Delay Slot" and means that the loaded value cannot be used in the instruction that follows right after; there must be some instruction between the two.

If you try to do that anyway, the Execute stage will instead use the value that was in the register before the load was done.
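A short example of the pitfall (register names are mine):

    lw   $t0, 0($a0)       # load a new value into $t0
    addu $t1, $t0, $zero   # load delay slot: still sees the OLD value of $t0!
    addu $t2, $t0, $zero   # one instruction later: sees the new value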


With that also being clear now, we can implement it. I will not bore you with all the details, but let the amount of work speak for itself:

It took me 1 week, with 5 commits and a total of about 600 lines of VHDL code in the CPU either touched, removed or added. That's about 20% of the CPU receiving a rewrite or overhaul.

The result:

All cases mentioned before now behave the same as the original hardware!


But what is that? There is one subtest that does not pass. What happens there?

Until yesterday I had no idea what the reason could be, but while thinking about this article I had an idea, and it turned out to be true:

There is some special handling when doing subsequent LOADs, and the stall timing seems to be related to which registers are accessed. If 2 LOADs that target the same register are executed back-to-back, they cost 1 cycle more than if those LOADs fill two different registers.

That's why the subtest LW3X1, which loads 3 times in a row into different registers, costs 2 cycles less.
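As a sketch of the two variants (registers chosen by me):

    # loading into the same register: 1 extra stall cycle per back-to-back pair
    lw   $t0, 0($a0)
    lw   $t0, 4($a0)
    lw   $t0, 8($a0)

    # loading into different registers (as in LW3X1): no extra cycles
    lw   $t0, 0($a0)
    lw   $t1, 4($a0)
    lw   $t2, 8($a0)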

I have already identified a possible way to save that additional cycle, but it is not implemented yet, and I haven't tested whether it would work.


What's the conclusion of this whole change? 

Doing it the simple way of always stalling on a LOAD makes every LOAD cost 7 cycles, while this pipelined, out-of-order, parallel-read approach, whatever you want to call it, can bring that down to 3 cycles, making it more than twice as fast in the best case.

This should speed up games a lot, right?

Unfortunately not. Today's compilers are very efficient at shifting LOADs as early as possible to make the distance to the USE as large as possible, but compilers in the '90s were not that advanced.
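For illustration, this is the kind of instruction scheduling a modern compiler would do (a made-up example, not from any game):

    # naive order: USE follows right after the load delay slot -> full stall
    lw   $t0, 0($a0)
    nop
    addu $t2, $t0, $t1

    # scheduled order: independent work fills the gap, and the LOAD gets
    # cheaper with every instruction the USE moves away
    lw   $t0, 0($a0)
    addu $t3, $t4, $t5
    addu $t6, $t7, $t8
    addu $t2, $t0, $t1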

So unless you have an assembler-optimized game or routine, the difference is rather small: about 2% on average is what I measured, with extreme cases reaching 5%.

However: being 2% off in total timing compared to the original hardware isn't something that should be ignored, and I'm happy that I was able to implement this advanced CPU feature.

Have fun!

Comments

btxyzptlk

I never imagined just how complicated some of this stuff could be! As I read the article, I kept thinking "wow this will be so difficult to implement" ... and then I read the last couple sentences. And of course it's already done and stable. Truly, I appreciate all of this. Your tech writing is remarkably clear too. It's such a privilege to see the results of your research summarized so well. Thank you again for all the work you do!