
Hi Everyone,

As the last bug hunting article was so well received, I thought I'd do another one on a recent bugfix. This time I will take the chance to explain a bit more about the PSX internals, as the path to fixing the bug is more technical.

If you think it's too much, you can skip to the result after the last image.


I'm currently going through the GitHub bug reports to see if I can reproduce them, hoping that some of them might have a quick and easy fix.

So I read about a bug in "Rascal" and tried it myself: "Text in options menu is glitching up the screen and not displayed in full"

It's not so obvious from this static image, but there should be text in the lower section of the screen, and some of this text is moving up. You can also see a strange black box on the right side, between the gamepad and key symbols, that should not be there.

Although the moving sprite positions made me a little suspicious, I still thought it sounded like a simple rendering problem, and I wanted to check whether the problem was located somewhere in the GPU.

The main reason I thought that is that the issue doesn't occur in my software emulator, where it looks correct:

This is a very good situation for debugging, because I know that the list of "draw calls" for drawing primitives like sprites or polygons is correct in the emulator, as it is exactly what the emulator sends to its GPU.

I made an export in the emulator and captured these draw calls into a file. It is mostly a collection of write operations (right) with 32-bit data words and a timestamp (left).

What you see here is only a small part of the full list. I picked two draw calls with different lengths.

For example, draw call 1, starting with the value 0x64808080, is of type "64", which is a textured rectangle. The command word itself also carries the color (0x808080), and the following data words contain position, texture and size information.
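For readers who want to follow along, here is a minimal Python sketch of how the four words of such a rectangle command could be unpacked. It assumes the commonly documented GP0(0x64) layout (color in the low 24 bits of the first word, then a vertex word, a texture coordinate plus CLUT word, and a width/height word). The function names and every word after 0x64808080 are made up for illustration.

```python
def sign_extend_11(value):
    """Interpret the low 11 bits of value as a signed screen coordinate."""
    value &= 0x7FF
    return value - 0x800 if value & 0x400 else value


def decode_textured_rect(words):
    """Decode a 4-word GP0(0x64) packet (assumed layout, see lead-in)."""
    command = words[0] >> 24
    if command != 0x64:
        raise ValueError(f"not a 0x64 rectangle: {command:#04x}")
    color = words[0] & 0xFFFFFF
    return {
        # Color is stored as 0xBBGGRR in the low 24 bits of the command word.
        "r": color & 0xFF,
        "g": (color >> 8) & 0xFF,
        "b": (color >> 16) & 0xFF,
        # Vertex word: Y in the upper half, X in the lower half.
        "x": sign_extend_11(words[1] & 0xFFFF),
        "y": sign_extend_11(words[1] >> 16),
        # Texture word: CLUT (palette) id in the upper half, then V and U bytes.
        "clut": words[2] >> 16,
        "v": (words[2] >> 8) & 0xFF,
        "u": words[2] & 0xFF,
        # Size word: height in the upper half, width in the lower half.
        "width": words[3] & 0xFFFF,
        "height": words[3] >> 16,
    }


# Hypothetical packet: only the first word comes from the capture above.
packet = [0x64808080, 0x00200010, 0x7FC01008, 0x00100020]
print(decode_textured_rect(packet))
```

Each field is just a shift and a mask, which is why a captured draw list like this can be checked word by word.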

The whole draw list from the file, together with the Video RAM content containing the textures, fully defines how the Video RAM should look after all those draw calls are executed. This is because only the GPU can write to the Video RAM. That makes it a perfect situation for validation.

So I ran a simulation of the PSX core's GPU that does exactly that: it loads the Video RAM contents and executes the draw calls saved in the file. While doing that, the simulation exports all data written to the Video RAM into a file again, which can then be checked against the emulator. If there is any difference, one of them must be wrong.
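The comparison step itself is simple. As a sketch, assuming each log has already been parsed into a list of (address, value) Video RAM writes in execution order (the real file format is not shown here, and `first_mismatch` is a hypothetical helper name), finding the first divergence could look like this:

```python
from typing import List, Optional, Tuple

VramWrite = Tuple[int, int]  # (VRAM address, written value)


def first_mismatch(expected: List[VramWrite], actual: List[VramWrite]) -> Optional[int]:
    """Return the index of the first differing write, or None if the logs match."""
    for index, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return index
    if len(expected) != len(actual):
        # One log is a prefix of the other: the first missing or extra write differs.
        return min(len(expected), len(actual))
    return None


emulator_log = [(0x1000, 0x7FFF), (0x1002, 0x0000), (0x1004, 0x1234)]
core_log = [(0x1000, 0x7FFF), (0x1002, 0x0000), (0x1004, 0x1234)]
print(first_mismatch(emulator_log, core_log))
```

Reporting the index rather than just "equal or not" points directly at the draw call that produced the first wrong pixel.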

Unfortunately, in this case the outputs matched. That's bad news, because it means that drawing in the GPU is correct and the draw list itself must be generated incorrectly in the PSX core. So instead of a well-contained bug in the GPU, the issue could be anywhere.

Let's go one step back to see how these draw lists are transferred to the GPU.

Draw lists are usually prepared by the CPU and stored in RAM until the point in time where they are executed. For example, in a 60 fps game the draw list could be sent to the GPU at the end of each frame. To get the data from RAM to the GPU, there are two possibilities:

The CPU can read the data from RAM and write it to the GPU. This is rather slow, because every word is a single RAM access that suffers from the latency penalty.

Alternatively, the DMA can copy the data from RAM to the GPU in blocks. This is up to about 10 times faster, because the access latency only applies to the first data word. Furthermore, the DMA can read from RAM and write to the GPU in the same clock cycle.

The GPU, however, has only a small input buffer to store the data, so the DMA cannot push the whole draw list at once; it must wait for the GPU to be ready again after each block.

To support that, there is a special mode for the DMA to handle these draw lists: the Linked-List mode.

Data is stored in RAM in a format that contains the pure draw data plus one additional header word per block, holding the size of this block and the RAM address of the next block.

The DMA starts by fetching the header of the first block, which gives it the size of this transfer and the RAM address of the next block. Then it copies the first block of data to the GPU and continues at the address given in the header for the next block.

This continues until a block carries a special stop marker in its header, marking it as the last block. The DMA turns off after that and can inform the CPU with an interrupt that it has finished.
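The walk described above can be sketched in a few lines of Python. This assumes the header layout commonly documented for the PSX (word count in bits 31-24, address of the next block in bits 23-0, with 0xFFFFFF as the end marker, which hardware is usually described as detecting via bit 23) and models RAM as a dictionary of word-aligned byte addresses. `walk_linked_list` is a hypothetical name; the sketch ignores timing entirely and only shows which words end up at the GPU.

```python
END_MARKER_BIT = 0x800000  # bit 23 of the next-address field marks the last block


def walk_linked_list(ram, start):
    """Follow a DMA linked list in `ram` and return the payload of each block.

    `ram` maps word-aligned byte addresses to 32-bit words.
    """
    blocks = []
    address = start
    while True:
        header = ram[address]
        word_count = header >> 24          # payload size in 32-bit words
        next_address = header & 0xFFFFFF   # where the next header lives
        payload = [ram[address + 4 * (i + 1)] for i in range(word_count)]
        blocks.append(payload)             # these words would be sent to the GPU
        if next_address & END_MARKER_BIT:
            return blocks
        address = next_address


ram = {
    0x100: (0 << 24) | 0x200,      # empty block, continue at 0x200
    0x200: (2 << 24) | 0x300,      # two payload words
    0x204: 0xAAAA0000,
    0x208: 0xBBBB0000,
    0x300: (1 << 24) | 0xFFFFFF,   # last block, one payload word
    0x304: 0xCCCC0000,
}
print(walk_linked_list(ram, 0x100))
```

Note that a block with a word count of 0 is perfectly legal in this format and simply contributes nothing to the GPU.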

This sounds relatively simple, and you would think that such a static list is also easy to debug: if you know the first block, you know all blocks, as they are linked, so you can just collect all the data in RAM and compare it.

Unfortunately, it's not that simple in reality. The DMA cannot transfer this list in one go, because the GPU will be busy working on each command. The DMA has to wait until the GPU is ready again before it can start transferring the next block.

But DMA and CPU both work on the same RAM, so if the DMA just waited actively, the CPU would have to wait as well, and games would suffer severe slowdowns, as there would be no time to prepare the linked list for the next frame. Instead, the DMA goes to sleep between two GPU blocks, and the CPU takes over and works while the DMA is waiting for the GPU.

Maybe you can already guess what is happening next?

Clever game developers know roughly how fast the GPU and DMA are and want to save memory. So instead of keeping several linked lists, they modify the linked list while the DMA is still working on it, assuming the DMA is already further ahead in the list.

But if it is not, and the DMA and CPU work on the same blocks at around the same time, you can get all kinds of effects, ranging from wrong positions or colors of primitives to permanently wrong texture data in VRAM or even GPU crashes.

As the GPU in the PSX core is benchmarked to be faster than the original hardware in nearly all cases, this situation should not occur. So what happens instead?

A linked list can also look like this: its blocks contain zero payload.

In this case, each block is just skipped and the DMA moves on to the next block, until it finds one with actual draw data inside.

In the case of Rascal, the linked list starts with about 250 empty blocks. This sounds strange, as it just wastes memory and processing time, but it is not uncommon. Several games do this, probably to make programming easier with a fixed list where they can just shift primitives in and out.

The crucial part is what happens in the DMA after reading such an empty block: it goes to sleep just as if there were data. It pauses for a while before trying to deliver the next block, even though the GPU isn't working at all. So after each of these 250 empty blocks, the CPU takes over and can already work on the "new" linked list.
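To get a feeling for why this matters, the arithmetic can be written down. Assuming, purely for illustration, a fixed DMA pause of `pause_cycles` after every block (the real value is not known), the CPU gets one time slice per empty block before the first real draw data reaches the GPU. The function name and pause values are hypothetical, and the sketch ignores the slice-length effects of memory accesses.

```python
def cpu_cycles_before_first_draw(block_sizes, pause_cycles):
    """Count CPU cycles granted by DMA pauses before the first non-empty block.

    block_sizes: payload word counts of the linked-list blocks, in order.
    pause_cycles: hypothetical length of each DMA pause, in CPU cycles.
    """
    empty_prefix = 0
    for size in block_sizes:
        if size > 0:
            break
        empty_prefix += 1
    return empty_prefix * pause_cycles


# Roughly the Rascal situation: about 250 empty blocks in front of the real data.
sizes = [0] * 250 + [4, 4, 7]
for pause in (4, 16, 64):  # hypothetical pause lengths
    print(pause, cpu_cycles_before_first_draw(sizes, pause))
```

Because the result scales linearly with the pause length, even a small error in the modeled pause multiplies into a large difference in how far the CPU gets before the DMA reaches the real data.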

What now determines whether the CPU will be too fast and interfere with the DMA processing? The amount of time the DMA pauses before it takes over again. But this time is not known.

There are some test ROMs from Jakub Czekański (JaCzekanski), the developer of the Avocado PS1 emulator, that try to measure it:

The tests show that while a DMA transfer of an empty linked list is ongoing, the work in the CPU takes about 60% longer. The problem is that this depends on what kind of work the CPU is doing, so it's only true for this particular test.

We need to jump to some CPU details at this point.

Internal instructions (e.g. add or jump) take only 1 cycle in the PlayStation's MIPS CPU.

If we assume for now that the DMA pause is 4 cycles long, the CPU could execute 4 such instructions before the DMA takes over again, as seen in the first execution chain in the image:

In the lower 3 lines, you can see some cases where data is read from RAM, which takes 7 cycles in this example.

Because the CPU cannot be paused while a memory transfer is still ongoing, and the DMA could not access the RAM in the meantime anyway, the CPU time slice gets longer.
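The timing in these examples can be reproduced with a small model. It is a sketch under simplified assumptions taken from the example numbers above (a 4-cycle DMA pause, 1-cycle internal instructions, 7-cycle RAM accesses), and `cpu_time_slice` is a hypothetical helper, not code from the core. The DMA wants the bus back once the pause has elapsed, but it only gets it between instructions, so a RAM access that has already started stretches the slice.

```python
ALU_CYCLES = 1  # internal instruction such as add or jump
MEM_CYCLES = 7  # RAM access, using the example figure from the text


def cpu_time_slice(ops, pause_cycles):
    """Return (cycles used, instructions executed) before the DMA takes over.

    ops is a sequence of "alu" or "mem". The CPU may start a new instruction
    only while the pause has not yet elapsed; a started RAM access always
    runs to completion.
    """
    cycles = 0
    executed = 0
    for op in ops:
        if cycles >= pause_cycles:
            break  # DMA request pending: hand the bus back
        cycles += MEM_CYCLES if op == "mem" else ALU_CYCLES
        executed += 1
    return cycles, executed


print(cpu_time_slice(["alu"] * 8, 4))                   # only internal instructions
print(cpu_time_slice(["alu", "mem", "alu", "alu"], 4))  # the read stretches the slice
```

In this intended behavior, a run of consecutive reads still yields exactly one access per slice, because the pause has elapsed by the time the first read completes.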

At this point we should come back to Rascal and what this game is doing. We already know that it is preparing a linked list and to do that it will likely need many memory transfers. So it would not be surprising if the CPU gets more time, because it waits for the RAM access to finish, right?

Observing this behavior in the simulation revealed that something is wrong: the game reads and writes several times within one DMA pause. This should not be possible, as the DMA should take over after the first memory transfer, because the CPU time slice has ended by then.

So here is the bug, and it's in the CPU: the logic that lets the DMA take over from the CPU only does so when there is no ongoing memory transfer and no new memory transfer:

What I forgot to consider when designing the CPU is that after a memory transfer instruction completes, the next such instruction can start immediately, because it is already waiting in the CPU pipeline.

So the CPU only stalls for the data read, but continues the very moment the requested data is received. With that new request, the DMA cannot take over, and this can repeat again and again, starving the DMA.

The fix is relatively easy: make the CPU stall when there is a DMA request.
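The difference between the old and the fixed arbitration can be illustrated with a cycle-by-cycle toy model of a CPU that issues back-to-back RAM reads. This is a hypothetical sketch of the rule described above, not the actual HDL of the core; `run_until_dma`, the 7-cycle access time and the cycle at which the DMA requests the bus are example values chosen for illustration.

```python
MEM_CYCLES = 7  # example RAM access time


def run_until_dma(n_loads, dma_request_cycle, fixed):
    """Simulate back-to-back loads; return (cycle, loads issued) at DMA takeover."""
    cycle = 0
    loads = 0
    busy_until = 0  # cycle at which the current RAM transfer finishes
    while True:
        transfer_active = cycle < busy_until
        # The next load waits in the pipeline and requests RAM as soon as it is free.
        new_request = not transfer_active and loads < n_loads
        if cycle >= dma_request_cycle:
            if fixed:
                # Fix: the CPU stalls new requests while the DMA is waiting.
                grant = not transfer_active
            else:
                # Old logic: requires no ongoing AND no new transfer, which
                # back-to-back loads never allow until they run out.
                grant = not transfer_active and not new_request
            if grant:
                return cycle, loads
        if new_request:
            busy_until = cycle + MEM_CYCLES
            loads += 1
        cycle += 1


print("old:  ", run_until_dma(n_loads=5, dma_request_cycle=4, fixed=False))
print("fixed:", run_until_dma(n_loads=5, dma_request_cycle=4, fixed=True))
```

With the old rule, the DMA in this model only gets the bus after all five loads have finished (cycle 35), and with an endless stream of loads it would never get it. With the fix, it takes over as soon as the first load completes (cycle 7).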

With that fix, Rascal gets significantly less CPU time while the DMA is pausing and the linked list will stay correct.

It also turns out that not only Rascal is fixed by this change, but also some other games, as the tireless testers on the Discord channel quickly found out: Vigilante 8, DoDonPachi and Final Fantasy Tactics.

Overall, it was quite a journey to go through several components in detail to track this issue down, so I thought it would fit well for a report here.

I know that I pushed it hard with the length and technical details this time, but hopefully it was still worth reading and I didn't lose you halfway through.


The fix is already available in the unstable builds:

https://github.com/MiSTer-unstable-nightlies/PSX_MiSTer/releases/tag/unstable-builds

If no big regression is found, this will be the base for the new release in a few days.

Have fun!
