
Hi Everyone,

I have been working on the last obscure parts of the MIPS CPU recently and I might be finished with it now.

The recent articles already covered two parts of the CPU, the Write Buffer and the Instruction Cache, and there will be at least two more: today we are looking at the data cache, and soon the out-of-order data reading will be the topic.

We will start this time with an overview of the whole CPU. I didn't do this in the last articles, because the modules were well separated, but as we learn about more parts of the CPU, it's useful to have the full picture:

This is the start for today. It does not cover all modules of the CPU; instead I will add them to the image as I introduce them.

The MIPS CPU has a 5-stage pipeline. That means that every instruction the CPU executes passes through 5 stages and has a latency of 5 clock cycles until it's fully done.

Every stage does a part of the processing to split the work. The main purpose for this design is the clock speed. The less work you do between two CPU clock events, the shorter the path in the silicon will be and the higher the reachable clock rate will be.

Stage 1 - Instruction Fetch: The instruction to be executed will be fetched in this stage, either from the Instruction Cache or directly from memory.

Stage 2 - Instruction Decode: the fetched instruction will be evaluated: which operation shall be executed? The CPU registers that take part in the calculation are fetched and everything is prepared so the execution can be done fast.

Stage 3 - Execution: the main calculation happens here, e.g. adding values and conditional jumps, but also the preparation for memory load and store by calculating the effective memory address.

Stage 4 - Load/Store: This stage is dedicated to memory access. It uses the precalculated address from stage 3 and can request a read or write operation to RAM, IO registers (GPU, SPU, ...) or the BIOS. If the instruction has no memory access, it is just forwarded to the next stage.

Stage 5 - Writeback: the result values from either the calculation in stage 3 or the memory load in stage 4 will be written back to the CPU register file here, so they can be fetched again in stage 2 for the next operation.

The pipeline is typically filled, meaning it has 1 instruction in flight for each stage, resulting in an execution speed of 1 instruction per cycle, even if each single instruction takes 5 cycles to complete.

When a stage takes more than one cycle, e.g. when reading a multiply result while the multiplication still needs more time, the pipeline will stall. That means all stages remain in their current state until the requested operation has completed.
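To make the pipelining idea more concrete, here is a small C sketch (purely an illustration, not how the core is written): it prints the classic pipeline diagram, using the common MIPS stage names IF/ID/EX/MEM/WB for the five stages described above.

#include <stdio.h>

/* Instruction i (0-based) sits in stage s during cycle i + s.
   Printed as a table, this is the classic pipeline diagram: after
   the 5-cycle fill phase, one instruction finishes every cycle. */
int main(void) {
    const char *stage[5] = { "IF", "ID", "EX", "MEM", "WB" };
    int n_instr  = 5;
    int n_cycles = n_instr + 5 - 1;

    for (int i = 0; i < n_instr; i++) {
        printf("instr %d: ", i + 1);
        for (int c = 0; c < n_cycles; c++) {
            int s = c - i; /* stage occupied in this cycle, if any */
            printf("%-5s", (s >= 0 && s < 5) ? stage[s] : ".");
        }
        printf("\n");
    }
    return 0;
}

In this picture, a stall simply means that an instruction stays in its stage for extra cycles, pushing every younger instruction back by the same amount.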

We already looked at the Write Buffer in a previous article: this FIFO memory stores the data to be written to memory, so that the CPU does not have to stall and wait for the memory write to complete. Instead the CPU can continue with the next instructions while the memory write happens in the background in parallel.
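As a rough idea of what such a write buffer does, here is a minimal FIFO sketch in C. The names and the 4-entry depth are illustrative assumptions, not taken from the real core: the CPU side pushes a write and keeps executing, while the memory side drains entries in the background.

#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 4                 /* illustrative depth */

typedef struct { uint32_t addr, data; } WriteEntry;

typedef struct {
    WriteEntry entry[WB_DEPTH];
    int head, tail, count;
} WriteBuffer;

/* CPU side: queue a write and keep executing. Returns false when the
   buffer is full -- the one case where the CPU still has to stall. */
bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_DEPTH)
        return false;
    wb->entry[wb->tail] = (WriteEntry){ addr, data };
    wb->tail = (wb->tail + 1) % WB_DEPTH;
    wb->count++;
    return true;
}

/* Memory side: drain the oldest entry whenever the bus is free. */
bool wb_pop(WriteBuffer *wb, WriteEntry *out) {
    if (wb->count == 0)
        return false;
    *out = wb->entry[wb->head];
    wb->head = (wb->head + 1) % WB_DEPTH;
    wb->count--;
    return true;
}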

The instruction cache was part of the Spyro article. Its main purpose is to avoid waiting for the slow memory to deliver the next instruction word; instead a part of the memory is duplicated inside the CPU, so it can be accessed at 1 instruction per clock cycle.


Let's see what we are still missing. Here is what Wikipedia has to say about the PSX CPU:

So we got the CPU running at about 33.8 MHz with one ALU and shifter, both being in stage 3. We also see the 4 KB instruction cache and a 1 KB data cache.

Wait, the PSX CPU has a data cache? Well, kind of.

The purpose of a data cache in a MIPS CPU is to make stage 4 execute in 1 clock cycle. The idea is to have a small copy of the RAM inside the CPU and when a recently fetched value from RAM is requested again, it can be delivered from data cache instead of reaching out to RAM.

But this is not how it works in the PSX; the psx-spx documentation instead tells us the real usage:

There is a 1 KByte memory dedicated to stage 4, but it is not used as a general cache. Instead it is set up as a special area of 1 KByte in size where every write and read is executed without any latency. You can imagine it as 1 KByte of fast RAM inside the CPU: the scratchpad.
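In code terms, stage 4 only has to check whether an address falls into one fixed window. The address range is the documented scratchpad location from psx-spx; the function name is mine, and KSEG mirroring is ignored for simplicity:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* The scratchpad is 1 KByte at physical address 0x1F800000-0x1F8003FF.
   Accesses inside this window never have to go out to RAM. */
bool is_scratchpad(uint32_t addr) {
    return addr >= 0x1F800000u && addr <= 0x1F8003FFu;
}

int main(void) {
    printf("%d\n", is_scratchpad(0x1F800010u)); /* 1: fast on-chip access  */
    printf("%d\n", is_scratchpad(0x00100000u)); /* 0: normal RAM, 7 cycles */
    return 0;
}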

Let's add it to the image:

What that also means is that the CPU has no real data cache for RAM values.

This is a major issue for games, because most calculations are done on values stored in RAM. Fetching a value from RAM takes 6 additional cycles, so a total of 7 cycles for a memory load instruction.

If you imagine a game that needs to fetch a value from RAM every 7 instructions, those 7 instructions would take 6 + 7 = 13 cycles instead of 7, roughly halving the performance of the CPU.

I made some measurements of that influence a while ago with Gran Turismo 1. In the VHDL simulation the behavior of the CPU with stalls can be easily observed and profiled.

So I measured over a total of 4293448 clock cycles and the result is as follows:

- CPU stalled for 1917836 cycles (44.6%)

- CPU stalled for data fetch: 1125418 cycles (26.2%)

The other stall cycles are caused by, e.g., instruction fetch, DMA and waiting for GTE calculations.

What this shows is that with a perfect data cache, the game could run about 26% faster. In this particular situation it's running at 24 frames per second; with 26% more, it could be 30 fps.

So a long while ago I tried the following and introduced an optional data cache:

The data cache was placed outside the CPU. On each memory request it looked up whether it had a cached value and, if so, skipped the memory read and instead delivered the cached value 1 cycle later.

This way, a RAM read from the cache would be done in 2 instead of 7 cycles. A good improvement, and many games could profit a lot from it.
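To illustrate the lookup, here is a minimal direct-mapped cache in C. The 1 KByte size matches the discussion above; the one-word line size and all names are my own simplification, not the actual implementation:

#include <stdint.h>
#include <stdbool.h>

/* Minimal direct-mapped cache: 1 KByte = 256 lines of one 32-bit
   word each (for simplicity). */
enum { NUM_LINES = 256 };

static bool     line_valid[NUM_LINES];
static uint32_t line_tag[NUM_LINES];
static uint32_t line_data[NUM_LINES];

/* Look up a word-aligned address. On a hit the value comes from the
   cache; on a miss the caller reads RAM and then fills the line. */
bool cache_read(uint32_t addr, uint32_t *value) {
    uint32_t index = (addr >> 2) % NUM_LINES; /* which line            */
    uint32_t tag   = addr >> 10;              /* which 1K block of RAM */
    if (line_valid[index] && line_tag[index] == tag) {
        *value = line_data[index];            /* hit: no RAM access    */
        return true;
    }
    return false;                             /* miss: go to RAM       */
}

/* After a miss, remember the value that came back from RAM. */
void cache_fill(uint32_t addr, uint32_t value) {
    uint32_t index = (addr >> 2) % NUM_LINES;
    line_valid[index] = true;
    line_tag[index]   = addr >> 10;
    line_data[index]  = value;
}

A real version also has to keep the cache coherent: CPU writes and DMA transfers to RAM must update or invalidate the matching line, otherwise the cache would deliver stale data.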

Sure, it breaks the accuracy to the original console, but sometimes a stuttering game is just worse than not having full accuracy, so having the option certainly is good.


Recently I set out to overhaul the CPU to add another piece of functionality that is often overlooked. I will write a full article about it soon, but in a nutshell:

The CPU can start a read from RAM and, as long as the result is not required by any of the subsequent instructions (and several other conditions are met), the CPU does not stall yet, but lets the RAM read run in the background.

So the CPU can execute several independent instructions until the RAM read finishes, effectively reducing the RAM read penalty from 7 cycles down to a minimum of 3 cycles, because it still needs the time to store the received value from RAM into the CPU register file.
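At its core this is a small scoreboard check: remember which register an in-flight load will write and when, and only stall when a younger instruction needs that register too early. A minimal sketch, handling just one outstanding load and ignoring the "several other conditions" (all names are illustrative):

#include <stdbool.h>

/* One outstanding RAM read: which register it will write,
   and when the data will be available. */
typedef struct {
    bool pending;      /* a RAM read is in flight              */
    int  dest_reg;     /* register the load will write         */
    int  ready_cycle;  /* cycle at which the data arrives      */
} LoadScoreboard;

/* Checked for every following instruction: stall only if it reads
   the register that the outstanding load has not delivered yet. */
bool must_stall(const LoadScoreboard *sb, int src_reg, int cycle) {
    return sb->pending && src_reg == sb->dest_reg && cycle < sb->ready_cycle;
}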

With this change planned for the CPU, I realized that the optional data cache was no longer well positioned in the system. The read would still stall the CPU, and because of the delayed storing of read data, the minimum read penalty when reading a cached value would also be 3 cycles, as opposed to the 2 cycles a cache hit took before.

As I didn't want the optional data cache to be slower than before, I instead thought about a way to make it faster:

The original MIPS CPU would have a data cache at the same spot where we have the scratchpad in the PSX CPU, so why don't we place it there...additionally.

With the data cache in the CPU, it can act as if the scratchpad was accessed, but for the full RAM range. The game doesn't have to know about it: the RAM read will just be redirected if the value is cached, and stage 4 of the CPU does not have to stall, the same as if the scratchpad was used.
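Put together, the stage-4 read path then looks roughly like this. This is a sketch, not the actual VHDL structure; scratchpad_read and ram_read are hypothetical helpers, and cache_read/cache_fill are the ones from the cache sketch above:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for the real hardware paths. */
bool     is_scratchpad(uint32_t addr);
uint32_t scratchpad_read(uint32_t addr);             /* on-chip, no stall */
uint32_t ram_read(uint32_t addr);                    /* slow 7-cycle path */
bool     cache_read(uint32_t addr, uint32_t *value);
void     cache_fill(uint32_t addr, uint32_t value);

/* Stage-4 read path with the data cache at the scratchpad's spot:
   a cached RAM value is delivered as fast as a scratchpad access. */
uint32_t stage4_read(uint32_t addr) {
    uint32_t value;
    if (is_scratchpad(addr))
        return scratchpad_read(addr);  /* scratchpad: no stall, as before */
    if (cache_read(addr, &value))
        return value;                  /* cache hit: no stall either      */
    value = ram_read(addr);            /* miss: pay the full RAM penalty  */
    cache_fill(addr, value);
    return value;
}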

This more or less shows what the PSX would have been capable of with just a properly sized integrated data cache:

With the new data cache being on average 10% faster than the old data cache, the difference to the original design is massive. Depending on the game, you can sometimes see 50% or even more speedup when activating it.

Be aware that this is all just optional, so if you want the original experience in these games, that is totally fine.

Going through all these components and understanding how they work together is a really important step in the development and if such a feature can be added in this process, I will not complain. In the end, I like cool features :)

Have fun!

Comments

Mike Shenton

Another great, detailed read. Interestingly, I always thought the longer the pipeline, the higher the clock speed, as found in Intel's much-maligned NetBurst architecture. The whole idea being they could aim for 10 GHz.

FPGAzumSpass

Yes, that's what I wanted to say with "The main purpose for this design is the clock speed. The less work you do between two CPU clock events, the shorter the path in the silicon will be and the higher the reachable clock rate will be."

Anonymous

Can you explain the decoder for FMV next? Is it done in the CPU or is there a decoder chip? I find it easy to spot compression artifacts.