Hi Everyone,

today we are going to investigate the details of several components of the PSX again, so we can understand a particular bug that "Grand Theft Auto (PAL)" showed.

The bug report on GitHub says the following:

"Engine noises, tire squeals are completely off. All you can hear is a strange ticking sound."

When trying the game, this can clearly be heard. A good number of sound effects are either missing completely or cut off very quickly. As sound effects come from the Sound Processing Unit (SPU), we are going to start the search there.

The SPU has 24 voices that play back ADPCM compressed data from the SPU RAM, so let's have a look at the RAM content first.

Luckily, we have the support of emulators here. DuckStation plays all sounds in-game, so its SPU RAM content must be good. It also offers an option to export the SPU RAM to a file. We can use that to compare it to the content of the core, taken from a savestate.

The reverb and capture areas can be ignored for the comparison, as they are wraparound buffers whose content changes all the time, so they cannot be compared directly.

But even the ADPCM sample area shows large differences when comparing the core against DuckStation, so there must be a problem with data in the SPU RAM.

To see why this can happen, we take a look at how data is written to the SPU RAM. Just as with the GPU and VRAM, the SPU RAM is also not directly accessible by the CPU:

Instead, data is copied by either the CPU or DMA to a small buffer in the SPU and from there transferred to the SPU RAM.

This buffer in the SPU is only 32 * 16 Bit in size, which means that copying large amounts of data must be done in many chunks.
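
To get a feeling for what this buffer size means, here is a minimal Python sketch that counts how many buffer fills a transfer needs. The 4096-byte transfer size is only an assumed example, not a value taken from the game.

```python
import math

BUFFER_WORDS = 32                 # buffer entries, from the text above
WORD_BYTES = 2                    # each entry is 16 Bit wide
BUFFER_BYTES = BUFFER_WORDS * WORD_BYTES


def buffer_fills(transfer_bytes: int) -> int:
    """Number of times the SPU buffer must be filled for one transfer."""
    return math.ceil(transfer_bytes / BUFFER_BYTES)


example_bytes = 4096              # assumed example size, not from the game
print(BUFFER_BYTES, buffer_fills(example_bytes))
```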

The SPU itself will copy the data from this buffer to the SPU RAM also in small steps. To see how that works, we need to look at how the SPU works in general.

You can clearly see the goals of the hardware developers when you look at how the SPU works. The whole clock design of the PSX is made to match the sound sampling perfectly.

The SPU is made to work at 44100 Hz, with each sound sample having 768 cycles to be calculated. 44100 Hz x 768 = 33,868,800 Hz, which is the clock frequency used by the CPU, so everything lines up perfectly. (This is special, because it is not a given. Unlike the audio, the video output is completely asynchronous to the CPU and SPU, which leads to some problems, but that is a topic for another episode.)

The SPU must calculate all 24 voices, reverb, capture buffers and the mixing in of CD audio in these 768 cycles = 22.68 microseconds.
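
The clock arithmetic can be checked with a few lines of Python. This is just a sketch of the numbers given in the text; the variable names are my own.

```python
SAMPLE_RATE_HZ = 44_100       # SPU output sample rate
CYCLES_PER_SAMPLE = 768       # CPU cycles available per sound sample

# The CPU clock is exactly the sample rate times the cycles per sample.
cpu_clock_hz = SAMPLE_RATE_HZ * CYCLES_PER_SAMPLE

# Time budget for one sample, in microseconds (roughly 22.7).
sample_period_us = CYCLES_PER_SAMPLE / cpu_clock_hz * 1_000_000

print(cpu_clock_hz)
print(sample_period_us)
```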

Most of these calculations are limited by the SPU RAM access speed. As random access in the SPU RAM has to be assumed, each RAM access has a fixed timing of 8 cycles.

Most of the SPU time is spent on the voice calculation, which is done sequentially for all 24 voices, with each voice requiring two memory accesses.

Where is the memory transfer in this scheme now? You can see that each voice has a time slot of 1 SPU RAM access, which is dedicated for data transfer.

The engineers decided that this is a better way of transferring data than doing it as one big block somewhere in this 768-cycle process, maybe because it allows for a faster reaction time.

As the SPU RAM is 16 Bit wide, this now tells us that the transfer speed is 24 * 16 Bit per 768 cycles. Why is that so important? Because we got a hint!

Stenzek made a change to DuckStation to fix "broken sound effects" in "Grand Theft Auto London".

Given that this is also a GTA game and has to do with sound, I don't believe it's a coincidence. So let's look in detail what the change does. It's fairly small, so we should be able to also use it, right?

The change tells us that the transfer time per 16 Bit is halved and that fixes the problem. And indeed: I tried it in my software emulator and that does fix the sound!

Success! Or maybe not?

What is this transfer time? Well, we looked at it before and found that the transfer time based on the documentation should be 24 * 16 Bit in 768 cycles.

This means we should be able to transfer 16 Bit in 768 / 24 = 32 cycles.
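
As a small sketch, the same numbers in Python, together with the resulting throughput. The bytes-per-second figure is not from the documentation, it just follows from the values above.

```python
CYCLES_PER_SAMPLE = 768
TRANSFER_SLOTS_PER_SAMPLE = 24   # one transfer slot per voice
SAMPLE_RATE_HZ = 44_100
WORD_BYTES = 2                   # SPU RAM is 16 Bit wide

# One 16 Bit word can be written per transfer slot.
cycles_per_word = CYCLES_PER_SAMPLE // TRANSFER_SLOTS_PER_SAMPLE

# Resulting sustained throughput into SPU RAM.
bytes_per_second = TRANSFER_SLOTS_PER_SAMPLE * WORD_BYTES * SAMPLE_RATE_HZ

print(cycles_per_word, bytes_per_second)
```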

Wait, didn't DuckStation have a transfer time of 32 BEFORE the change? So is the documentation wrong?

Where should this additional speed even come from? The SPU only has 48 cycles = 6 RAM accesses per 768 cycles left, so a transfer time of 16 cycles seems impossible?
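
To make the problem concrete, here is a rough budget check in Python. It only uses the numbers from above and assumes that every additional transferred word costs one more 8-cycle RAM access.

```python
CYCLES_PER_SAMPLE = 768
RAM_ACCESS_CYCLES = 8
FREE_CYCLES = 48                    # unused cycles per sample, as stated above

words_now = 24                      # one transfer slot per voice
words_needed = CYCLES_PER_SAMPLE // 16   # words per sample at 16 cycles per word

extra_accesses = words_needed - words_now
extra_cycles = extra_accesses * RAM_ACCESS_CYCLES
free_accesses = FREE_CYCLES // RAM_ACCESS_CYCLES

print(extra_accesses, extra_cycles, free_accesses)
print("fits" if extra_cycles <= FREE_CYCLES else "does not fit")
```

Under this assumption, doubling the speed would need 24 more accesses per sample, while only 6 are free.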

We need to look further:

I wrote a test for checking the transfer speed from RAM to SPU RAM using DMA, just as the game does it.

You can see the result of my PAL PSX with PSIO above, but what does it tell us?

The first test "SPU1BLOCK1" will copy 1 * 32 Bit only. It more or less checks for the DMA latency.

The second test "SPU1BLOCK16" will copy 16 * 32 Bit, which will be received on the SPU side as 32 * 16 Bit, the size of a full buffer. Together with the first test, it lets us check how fast the transfer from DMA to SPU is.

97 cycles - 37 cycles = 60 cycles for 15 additional 32 Bit words, so 4 cycles per 32 Bit word. Sometimes this varies and costs 4 cycles more, probably due to an SDRAM refresh.
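
The same calculation as a short Python sketch, using the two measured values from the test. The variable names are my own.

```python
cycles_1_word = 37      # measured: "SPU1BLOCK1", 1 * 32 Bit
cycles_16_words = 97    # measured: "SPU1BLOCK16", 16 * 32 Bit

# The first test already contains the DMA latency, so only the
# 15 additional words count for the per-word cost.
extra_words = 16 - 1
cycles_per_dma_word = (cycles_16_words - cycles_1_word) / extra_words

print(cycles_per_dma_word)
```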

The third test "SPU16BLOCK16" will copy 16 blocks of 16 x 32 Bit each. 

The DMA only begins transferring each subsequent block from RAM to the SPU after the copy buffer inside the SPU has been fully written to SPU RAM, so this test lets us check how fast the buffer in the SPU is actually transferred to SPU RAM.

Because the last block is basically for free, as we don't have to wait for it to be written to SPU RAM, we only need to wait for 15 * 32 transfers.

15 blocks * 32 words * 32 cycles per word = 15360 cycles. If we add 16 * 100 cycles per DMA we end up at 16960, so ~17000 cycles.

So it all fits together and the test confirms that the transfer speed is really 32 cycles per word, not 16 as the fix for DuckStation suggests. For validation's sake only, this is what the test shows for DuckStation:
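
This estimate can be written as a small Python function. It is only a sketch of the reasoning above: the function name and parameters are my own, and the 100 cycles per DMA block is the rounded value from the second test.

```python
def estimate_transfer_cycles(blocks: int,
                             words_per_block: int = 32,
                             cycles_per_word: int = 32,
                             dma_cycles_per_block: int = 100) -> int:
    """Estimated cycles for a multi-block DMA to the SPU.

    The last block is not waited for, so only blocks - 1 buffer
    drains into SPU RAM are counted.
    """
    drain = (blocks - 1) * words_per_block * cycles_per_word
    dma = blocks * dma_cycles_per_block
    return drain + dma


print(estimate_transfer_cycles(16))
```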

You can see that DuckStation is faster in every test: the initial DMA cost, the DMA transfer speed and also the SPU transfer speed.

So while this helped with the bug, it's wrong and I don't think we should apply it as a fix. Even if we wanted to: because the core runs on hardware and has real memory latencies, there is not even a simple way to reach the increased transfer speed DuckStation uses.

In the next step, let's verify the core is correct:

The initial DMA cost (Test 1) is higher than on a real PSX, but that is somewhat expected because CPU register writes are currently too slow, which is a known issue that needs a fix at some point. We gracefully ignore that for now.

The SPU RAM transfer speed (Test 3) is in the same ballpark, so that looks good.

The DMA to SPU transfer speed, however, is wrong. It's only 2 cycles per 32 Bit, but should be 4 cycles. Thankfully, slowing down a memory transfer isn't hard, so let's implement it directly.

The result is here:

The transfer speed is now correct, which is good.

I don't understand why the total transfer time for 16 blocks is still too short. I can only assume that the real hardware doesn't accept a new DMA as soon as the buffer in the SPU is empty, but rather needs more time?

This could be the topic of further research, but for now we know that the core is "about right" and also (more importantly) faster than a real PSX, and thanks to DuckStation we already know that faster is better in this case.

As this is all verified now, we need to dig deeper.

After tracing the SPU RAM for a long time, I found that the difference in the SPU RAM is introduced at this point in the game:

A loading screen. 

Well, it could have been expected, but sometimes thinking more about it gives additional clues.

A loading screen usually has CD transfer going on in the background to fetch the required data, so maybe this could be a reason?

I made a savestate right before the SPU data is transferred so we can look at it in the simulation. 

The whole data transfer can be watched nicely there... but what is that? The DMA to SPU is stopped when it still has more than 400 (0x0191) blocks to deliver:

We need to look for anything next to the DMA stop to find why the game could decide to stop the DMA.

Given that the program flow itself should be fine, we might look for actions that could stop this flow. Good candidates are interrupts, as they force the program flow to do something else.

Can we see any interrupt before the DMA stop?

Indeed, there is an interrupt from the CD subsystem right before the DMA stops. It's coming from command 0x0A = "Init", sent by the CPU to the CD controller.

The documentation says about this command:

"Multiple effects at once. Sets mode=00h (or not ALL bits cleared?), activates drive motor, Standby, abort all commands."

The documentation also tells us that the command takes some time: at minimum about 60000 cycles, on average about 80000 cycles and at maximum... what?

What does 00xxxxxh mean? It's unknown? It can be anything? What does it depend on? The documentation says clearly that it does "abort all commands", so the previous command should not matter, right?

Let's see...before Init is requested by the game, it requested 0x09 = "Pause". The documentation says about the Pause command:

"Aborts Reading and Playing, the motor is kept spinning, and the drive head maintains the current location within reasonable error."

Doesn't sound very complicated? It sounds more like it does next to nothing and just stops reading data. Because of that, it should execute very fast...

Not really. Depending on single or double speed it will be roughly 1 million or 2 million cycles. That is far longer than you would expect.
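
To put these cycle counts into perspective, here is a small Python sketch that converts them to milliseconds using the CPU clock of 33,868,800 Hz. The cycle counts are the rough values from the text, not exact measurements.

```python
CPU_CLOCK_HZ = 33_868_800


def cycles_to_ms(cycles: int) -> float:
    """Convert CPU cycles to milliseconds."""
    return cycles / CPU_CLOCK_HZ * 1000


for name, cycles in [("Init (average)", 80_000),
                     ("Pause (short)", 1_000_000),
                     ("Pause (long)", 2_000_000)]:
    print(f"{name}: {cycles_to_ms(cycles):.1f} ms")
```

So the "Pause" command takes roughly 30 to 60 ms, while "Init" on its own is done after only a few milliseconds.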

In the simulation it can be seen that the "Init" command cancels the "Pause" command, just as the documentation says: "abort all commands".

But what if that is not true? What if the processing time of the "Pause" command is actually required?

We can do a small change to the "Init" command to not ignore the processing time of the previous command, but instead only add additional processing time for the Init command.
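
The idea can be sketched in Python as follows. This is not the code of the core, only a simplified model of the two behaviours; the function name, the parameters and the example timestamps are all made up for illustration.

```python
from typing import Optional

INIT_CYCLES = 80_000   # average "Init" time from the documentation


def init_done_at(now: int,
                 pending_done_at: Optional[int],
                 abort_pending: bool) -> int:
    """Cycle at which "Init" completes.

    abort_pending=True  : old behaviour, the running command is cancelled.
    abort_pending=False : new behaviour, "Init" is added on top of the
                          remaining time of the running command.
    """
    busy = pending_done_at is not None and pending_done_at > now
    if abort_pending or not busy:
        return now + INIT_CYCLES
    return pending_done_at + INIT_CYCLES


# Hypothetical example: "Pause" would finish at cycle 1,000,000 and
# "Init" is requested at cycle 50,000.
print(init_done_at(50_000, 1_000_000, abort_pending=True))
print(init_done_at(50_000, 1_000_000, abort_pending=False))
```

In this made-up example, the interrupt from "Init" arrives about 950000 cycles later with the new behaviour, which gives a running DMA to the SPU that much more time to complete.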

The result is as we would have hoped: the "Init" now takes much more time and the DMA to SPU can complete without a problem.

After building the FPGA core with this change, the sounds are now playing.

That means that the game seems to really depend on the "Pause" and "Init" commands to take a long time. Long enough to finish the whole data transfer.


So this bug entry brought us 2 fixes: one in the CD subsystem and one in the DMA transfer speed.

But now I have some bad news for you: while most of the sounds are working now, there are still some sound issues in this game, coming from the engines of the cars.

As we now know that the data in the SPU RAM is most likely correct, this might be something else in the SPU that will require separate research.

I need to leave you with this slightly disappointing result and hope you still like the fixes that were required nonetheless.

Have fun!


Edit: all changes are available in the unstable builds and the test is also available including source code at the same place as my previous tests:

https://github.com/RobertPeip/PSX

Comments

Anonymous

This is awesome. I love these detailed updates!

Anonymous

Awesome in deep analysis, as usual!