FPGAzumSpass

Spyro: Year of the Cache (Patreon)

Published:

2022-09-10 08:21:16

Imported:

2023-12

Content

Hi Everyone,

as promised in the last update, we will take a deeper look at the fix for the PAL versions of Spyro 2 and 3.

This one will go into details of the MIPS CPU, so we will start at the top and work our way into the details.

First I have to say that the main credit for this fix goes to stenzek, the author of DuckStation. Stenzek made an incredible useful wiki page called "Difficult to Emulate Games". This page not only lists such games, but also gives a hint to the root cause of a bunch of problems. This has proven helpful many times already for me and the PSX core.

We will also use it this time:

We can see that 2 root causes for issues are listed for the Spyro games:

The first is Libcrypt, which is a copy protection method that uses special sectors on the disc to identify if an original disc has been used. In the core, we know about these sectors using a .sbi file and can handle the CD response accordingly. Markun made the implementation of .sbi loading for the PSX core and it's working fine, so we should be ok with this first task. Thanks a lot again!

So let's look at the second task: It talks about icache, which is short for instruction cache. The instruction cache is a 4 KByte large memory in the MIPS CPU which is used to store instructions. It's main purpose is to hold the instruction fetch speed at 1 instruction per cpu clock cycle, while the external memory(BIOS or RAM) could not keep up with that speed.

Each instruction in the MIPS CPU always has a length of 32 bit and the instruction cache is subdivided into 256 cache lines with 4 instruction words of 32 bit each.

The instruction in the PSX MIPS CPU is direct mapped. That means that for each address in the RAM, it is known exactly in which cache entry the same information will be stored. E.g:

- RAM address 0 is stored in cache line 0 (Word 0 of cache line)

- RAM address 4 is stored in cache line 0 (Word 1 of cache line)

- RAM address 16 is stored in cache line 1 (Word 0 of cache line)

- RAM address 4096 is stored in cache line 0 (Word 0 of cache line)

You can see that multiple addresses in the RAM will map to the same address in the cache. Because of that, additional information must be stored, so the CPU knows to which exact address in RAM each cache line points to.

This additional information is called Tag and holds the upper address bits of the address mapping to the RAM.

That is also the reason to group the cache into smaller lines: by having several instructions in one cache line it is not required to store the mapping information for every 32 bit value in the cache, but only for every cache line.

Whenever the CPU needs to execute cached instructions from an address that is not cached yet, it needs to fetch the data from RAM first, but will then store it in the instruction cache so it can be used faster next time.

If the cache line to be used is already occupied by previous data from a different address in RAM, it would overwrite the data part and exchange the Tag to the new address.

But that is not yet enough: after a CPU reset, all Tags would have either a defined or random state, but whatever is stored, the CPU could assume that the corresponding address and data is already fetched and valid.

To prevent that, each cache line also needs a valid bit attached:

At startup or when clearing the cache, the valid bits are reset and once data has been fetched, the bit for the Tag can be set.

With a design containing these features, we can run pretty much every game of the PS1 library. But not Spyro 3 PAL. Why?

Let's look again at the hint: "you need to handle partial line fetches from non-16-byte-aligned addresses"

We know what a cache line is: a part of the cache with 4*32 bit = 16 byte size.

But what means partial?

The CPU executes instructions sequentially, increasing a instruction pointer (or program counter) after each instruction. A typical execution flow using a cache line could look like this:

In this case all 4 stored instructions from the cache line are executed in order, starting with word 0.

However, the CPU can also jump to different addresses in the memory and execute code from there.

There is no limitation in the jump destination, so it could for example jump into the middle of a cache line and execute from there:

This is valid and as long as the cache line is already fetched, it doesn't matter at all. But if the cache line is not fetched yet, it must be fetched now.

For the first example of executing instructions from the full cache line, this is obvious and easy: the CPU first fetches word 0, then word 1, word 2 and word 3.

If the CPU starts to execute from the middle of the cache line, there are two possibilities. Either the CPU also fetches the full line, leaving the CPU with nothing to do until word 2 arrives, or it can start with fetching word 2.

The MIPS engineers decided the second would be the better way to do it, as it yields a higher performance. The difference when using RAM with burst reads isn't very substantial, but using slower storage, like the BIOS, it can be useful.

That leaves us with a problem now: we start fetching from word 2 and then fetch word 3 and then the CPU would then continue to the next cache line. As a result, word 0 and word 1 are permanently unfetched.

Whenever the CPU would execute instructions from those cache memory cells later on, it would fetch invalid instructions. But the Tag only points to the whole line, what can we do?

The solution is to have 1 valid bit per 32 bit word in the cache line, so 4 in total.

This way, the CPU can check if not only the Tag address in the cache is correct, but also if the individual instruction it wants to execute has been fetched before.

Why would this even be a problem in the first place?

If the CPU in the MiSTer PSX core would just always fetch a whole cache line, it should also work, it's just more information, which doesn't hurt, right?

Well, the problem can be found in the timing of the fetch:

If the data in RAM is modified after the initial cache line fetch, the old 1-valid cache line would already hold data in the first cells that is assumed to be valid by the CPU. When these first cells are executed the next time, they are not fetched again from RAM, but instead the old instruction is executed.

With the 4 valid bits for each cache line, the cell content is assumed to be still invalid and it's loaded from RAM before it's executed.

That's the whole magic of this special function. With these 4 valid bits per cache line implemented, Spyro 2+3 PAL can be started.

Have fun playing the game!

Additional notes for curious readers:

Some of you might remember that this bug once had a "Won't fix" note in the GitHub Bug entry. I will try to explain why:

Due to the SDRAM in MiSTer with 16 data lines being different than the EDO DRAM with 32 data lines used by the PS1, I implemented a SDRAM handling in MiSTer that runs at 3 times the CPU clock with burst reads.

Without going to much into detail: in MiSTer we always fetch 4*32 bit at once, which is exactly the size of a cache line. This is no coincidence, the whole design has been made to be sure both random access speed as well as sequential transfer speed (e.g. from DMA or cache read) can be fulfilled, no matter what.

Until recently, I made the wrong assumption that I would have to break that up to support 2*32bit or 3*32bit reads for the instruction cache. That's because I interpreted the comment from stenzek wrong.

It says "you need to handle partial line fetches", but what it really means is "you need to handle cache line valid bits for partial line fetches"

It's a small difference, but it has thrown me off for quite a while, but I'm happy I finally understood it:

We can still fetch the whole data for a cache line and just don't set the valid bits for the words that should not be fetched usually. The data written to the cache line will not matter at all with the valid bit not set, so the game can not have a problem with it.

So the solution was much simpler than I initially thought and I'm glad I took a second look at it.