
Hi everyone,

as I mentioned in the previous posts, I've been working on DMA timing accuracy over the last month, with a small pause for the filtering and 24-bit mode. Today, I want to present the results of this journey.

We start with a small introduction: what is the DMA?

DMA stands for Direct Memory Access and allows data to be transferred from RAM to various devices (GPU, SPU and MDEC) and also in the opposite direction, from various devices (GPU, SPU, MDEC, CD, OTC) to RAM.

The main benefit is that the CPU doesn't have to copy the data itself, which would be much slower: the transfer rate is around 10 times higher with the DMA.

The DMA itself is configured per transfer. Most important are the source and destination, the number of words to be copied and the transfer type. There are 3 supported transfer types (a small register-level sketch follows the list):

- instant transfer of all the data at once, e.g. transferring a CD sector into RAM

- block mode to transfer the data in chunks, e.g. writing multiple blocks of 16 words to the GPU to store textures

- linked list mode, which can transfer e.g. polygon drawing lists to the GPU, with one polygon per transfer
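
To give an idea of how such a transfer looks from the CPU side, here is a small sketch in C of the three modes for the GPU channel. It is not taken from my tests: the register addresses and bit meanings follow the psx-spx documentation, the function name and values are made up for illustration, and it assumes the channel is already enabled in DPCR and the GPU has been set to the matching DMA direction.

```c
#include <stdint.h>

/* DMA channel 2 (GPU) registers, addresses as documented in psx-spx. */
#define D2_MADR (*(volatile uint32_t *)0x1F8010A0) /* start address in RAM */
#define D2_BCR  (*(volatile uint32_t *)0x1F8010A4) /* word / block count   */
#define D2_CHCR (*(volatile uint32_t *)0x1F8010A8) /* channel control      */

void dma_transfer_examples(const uint32_t *data, const uint32_t *list)
{
    /* 1) Instant (burst) transfer: all 64 words in one go. */
    D2_MADR = (uint32_t)data;
    D2_BCR  = 64;
    D2_CHCR = (1u << 28) | (1u << 24) | 1u;   /* trigger + start, from RAM */
    while (D2_CHCR & (1u << 24)) ;            /* wait until done           */

    /* 2) Block mode: 8 blocks of 16 words, paced by the GPU's requests. */
    D2_MADR = (uint32_t)data;
    D2_BCR  = (8u << 16) | 16u;
    D2_CHCR = (1u << 24) | (1u << 9) | 1u;    /* start, sync mode 1        */
    while (D2_CHCR & (1u << 24)) ;

    /* 3) Linked list mode: each node starts with a header word holding the
          word count and the address of the next node; 0x00FFFFFF ends the
          list. BCR is not used in this mode. */
    D2_MADR = (uint32_t)list;
    D2_CHCR = (1u << 24) | (2u << 9) | 1u;    /* start, sync mode 2        */
    while (D2_CHCR & (1u << 24)) ;
}
```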

As you can see, many combinations are possible, and testing everything was quite some work. I wrote 12 DMA tests with a total of around 180 subtests.

Let's start with something easy:

While there might be a lot of numbers on the screen, it's not all that difficult.

There are 5 columns:

- Subtest name: tells you what is tested

- Test Length: tells how many words are transferred in this subtest

- Cycles: tells how many CPU cycles the whole transfer sequence costs, including setup costs

- PS1LO/HI: tell how fast the real PS1 performs. Cycles should be between LO and HI to pass the subtest

OTC is a special DMA that has no real source device. Instead, it just writes an empty linked list to RAM to help prepare a rendering list. As such, it can write to RAM as fast as possible and doesn't have to wait for anything, so it's a good starting point for testing DMA performance in the direction to RAM.
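
As a small illustration (again a sketch following the psx-spx register layout, not code from my tests, and assuming the channel is enabled in DPCR), this is how a game typically uses the OTC channel to prepare an empty ordering table:

```c
#include <stdint.h>

/* DMA channel 6 (OTC) registers, addresses per psx-spx. */
#define D6_MADR (*(volatile uint32_t *)0x1F8010E0)
#define D6_BCR  (*(volatile uint32_t *)0x1F8010E4)
#define D6_CHCR (*(volatile uint32_t *)0x1F8010E8)

/* Clear an ordering table of 'entries' linked-list headers in RAM. The OTC
   writes backwards from the last entry and terminates the chain with
   0x00FFFFFF in the first entry. */
void clear_ordering_table(uint32_t *ot, uint32_t entries)
{
    D6_MADR = (uint32_t)&ot[entries - 1];  /* start at the last entry          */
    D6_BCR  = entries;                     /* number of words to write         */
    D6_CHCR = 0x11000002;                  /* start + trigger, step -4, to RAM */
    while (D6_CHCR & (1u << 24))           /* busy-wait until the channel stops */
        ;
}
```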

There is nothing too exciting here: you can see that the core was 2 cycles too slow, which has been improved. I also added logic to detect page switches and to account for the refresh timing of the PlayStation's SDRAM.

Now we do something more interesting:

This is the test for transferring data from RAM to MDEC, so the opposite direction.

What you can see is that the core was 6(!) cycles slower than the PSX. That's a huge amount. What's even more interesting is that the difference between only writing the DMA registers without starting the DMA (WAITDONE = 31 cycles) and actually running it with length 1 is just 4 cycles.

4 cycles for a random access memory read seems just too fast, knowing that the CPU requires 6 cycles there. So the DMA is 2 cycles faster than the CPU and has no overhead at all?

That all seemed very unlikely, but what should I do? The test shows that the PS1 is that fast, so I have to match it. I thought about where I could save some cycles and found a total of 7 cycles I could possibly save, with 2 of those 7 being a major pain. But let's start with the first 5.

- DMA: allow for early DMA stop when the last word is transferred from RAM to the device

That was the first change I made, and it reduces the overhead by 2 cycles. This is done by handing the memory bus over to the CPU while the DMA is still transferring the last word, instead of waiting for it to be done.

- DMA: start SDRAM reading earlier

As the name says: when starting a DMA, the SDRAM is requested one cycle earlier, so it delivers the data one cycle earlier, which removes 1 cycle of overhead.

- DMA: use data from SDRAM directly without FIFO for MDEC and GPU

This change saved another 2 cycles of overhead. Before, the data read from SDRAM went through a FIFO in the DMA, because the target device might be slower than the SDRAM transfer speed. However, this is only the case for the SPU, because it has a 16-bit data bus. The GPU and MDEC have a 32-bit bus and can take the data directly, so the FIFO can be skipped for them.


So there was only 1 cycle left to be saved. Two changes would have been possible:

- reading another cycle earlier from SDRAM by making the SDRAM request combinatorial in the same cycle the DMA switches over. This would mean a 33 MHz (CPU, DMA) to 100 MHz (SDRAM) clock domain crossing without latching the data first. That would risk the timing closure of the FPGA, so I wanted to avoid it

- using the data from SDRAM directly, without clocking it into the 33 MHz clock domain. That would also mean a clock domain crossing in a critical path

Neither solution made me happy, so I put them off for the moment and continued with further tests. That's where the twist happened.


I started with testing block mode, transferring several blocks of data of different lengths. I knew the DMA would pause between blocks for a few cycles, during which the CPU could do some work.

The test is built around a busy loop that asks the DMA if it is fully finished. However, asking the DMA means reading its register, which takes 5 cycles, over and over again until the DMA is done.

I wanted to avoid that, so I inserted 100 NOPs between starting the DMA and asking if it's finished. This should be plenty of time to transfer some blocks, because the CPU would stall while the DMA runs and only some of these NOPs would get executed.
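
Roughly, the sequence looks like the sketch below. This is a simplified illustration, not the actual test code from my repository: the register addresses follow psx-spx, the measurement uses hardware timer 2 at system clock rate, and the GPU channel, function and macro names are just examples.

```c
#include <stdint.h>

/* GPU DMA channel 2 and hardware timer 2, addresses per psx-spx. */
#define D2_MADR    (*(volatile uint32_t *)0x1F8010A0)
#define D2_BCR     (*(volatile uint32_t *)0x1F8010A4)
#define D2_CHCR    (*(volatile uint32_t *)0x1F8010A8)
#define TMR2_VALUE (*(volatile uint32_t *)0x1F801120) /* counts CPU clocks with mode 0 */
#define TMR2_MODE  (*(volatile uint32_t *)0x1F801124)

#define NOP()    __asm__ volatile ("nop")
#define NOP10()  do { NOP(); NOP(); NOP(); NOP(); NOP(); \
                      NOP(); NOP(); NOP(); NOP(); NOP(); } while (0)
#define NOP100() do { NOP10(); NOP10(); NOP10(); NOP10(); NOP10(); \
                      NOP10(); NOP10(); NOP10(); NOP10(); NOP10(); } while (0)

/* Measure a multi-block transfer to the GPU: start the DMA, burn 100 NOPs
   from the instruction cache instead of polling, then check the busy bit. */
uint32_t measure_block_dma(const uint32_t *src, uint32_t blocks)
{
    TMR2_MODE  = 0;                           /* count at system clock rate     */
    TMR2_VALUE = 0;

    D2_MADR = (uint32_t)src;
    D2_BCR  = (blocks << 16) | 16;            /* 'blocks' chunks of 16 words    */
    D2_CHCR = (1u << 24) | (1u << 9) | 1u;    /* start, block mode, from RAM    */

    NOP100();                                 /* padding instead of a busy loop */

    while (D2_CHCR & (1u << 24))              /* only now ask if it is finished */
        ;
    return TMR2_VALUE;                        /* total cycles for the sequence  */
}
```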

To my surprise, it turned out that the total cost of a multiblock DMA in this condition was: zero!

I searched my test up and down for the mistake, but there wasn't any. The DMA was costing nothing, or in other words, the CPU was as fast as if the DMA wasn't running.

This is the point where you start to question everything you know: could it even be possible? Could it be? Could the DMA run in parallel to the CPU?

Let's go to the very start:

The CPU and DMA both access main RAM, so while the DMA is running, the CPU cannot read instructions from RAM and cannot read or write data in RAM. The same applies to e.g. GPU access: the CPU cannot access the GPU while the DMA is transferring data. The bus is busy.

What does the documentation say?

Well, you can interpret that as you like. I always read it as: while theoretically possible, it's not supported by the system.

What do emulators do? 

As far as I found, they all pause the CPU while the DMA is running.

But still, the test shows that this is likely not true. So instead of the test I wanted to write, I wrote a different one:

The test does nothing more than starting a very simple DMA and running different amounts of NOPs on the CPU.

And indeed: you can see that the core consumes 1 additional cycle for every additional NOP between 2 and 11. 

(NOP 0 and 1 are special cases where the DMA busy flag is read before the DMA has started, so it costs one additional read and the total cost is therefore higher. We should ignore these cases here.)

On the PSX however, executing 2 NOPs or 8 NOPs on the CPU makes absolutely no difference. So at this point it's proven that the CPU and DMA run in parallel.

How is that even possible?

The CPU can run instructions from its cache, and as long as it doesn't access memory, it doesn't have to wait.
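
As a small illustration of what "not accessing memory" means: code like the following runs entirely from the instruction cache and only touches the on-chip scratchpad (address per psx-spx; the snippet is illustrative, not one of my tests), so it never makes an external bus request and never collides with a running DMA.

```c
#include <stdint.h>

/* 1 KB of on-chip scratchpad RAM, address per psx-spx. */
#define SCRATCHPAD ((volatile uint32_t *)0x1F800000)

/* Runs from the instruction cache and only touches the scratchpad,
   so it makes no external bus request and never waits for the DMA. */
uint32_t sum_scratchpad(void)
{
    uint32_t sum = 0;
    for (int i = 0; i < 64; ++i)
        sum += SCRATCHPAD[i];
    /* A single read from main RAM or an I/O port here would stall the CPU
       until the DMA releases the bus. */
    return sum;
}
```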

I wrote another test to show that the CPU can indeed execute as many instructions as it wants in parallel, even working on scratchpad memory and the GTE without any stall. However, it will stall as soon as it makes any external request. What the documentation says about on-chip I/O ports or CDROM waitstates is not true: the CPU will stall on these accesses... with one exception.

You may remember the Write Buffer article:

https://www.patreon.com/posts/mips-write-71140383

The CPU can write some values (up to 4) without stalling while the DMA is running. These values are not written to RAM or I/O devices at that point; instead, the writes are performed once the DMA has finished.
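
Sketched as code (illustrative only, the RAM address and function name are arbitrary), the effect looks like this:

```c
#include <stdint.h>

/* While a DMA transfer holds the bus, up to four stores can sit in the CPU's
   write buffer without stalling; they are actually written out to RAM once
   the DMA releases the bus. */
void buffered_writes_during_dma(void)
{
    volatile uint32_t *p = (volatile uint32_t *)0x80100000; /* arbitrary RAM address */

    p[0] = 1;   /* buffered, no stall */
    p[1] = 2;   /* buffered, no stall */
    p[2] = 3;   /* buffered, no stall */
    p[3] = 4;   /* buffer now full    */
    p[4] = 5;   /* presumably stalls here until the DMA has finished */
}
```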

I opened a pull request with all these findings so that they make it into the documentation, and it has already been merged:


Implementing this behavior so late in the project was not straightforward, but it also helped me with the previous problem.

You can see that the core now fulfills the timing. But how? The overhead was only reduced by 5 cycles, not by 6.

In fact, the DMA cannot transfer a block of 1 word in 4 cycles as assumed before. It only looked that way because the CPU was running in parallel. Instead it needs 6 cycles, as you can see from the NOPS 2 and NOPS 8 tests, which also fits perfectly with the time the CPU requires to read a word from RAM. Digging deeper here really helped me out.


Conclusion:

With the implementation of the overhead reduction and the CPU-DMA parallel processing, all worst case timings could be fulfilled. Now the DMA only needed to add waitstates for things that are slower on a real PSX than on the core, which was a much easier task.

These waitstates are skipped if the High Turbo mode is used for additional performance.

All tests mentioned here and many more can be found on my GitHub.

90% of these changes are already in the last release of the core; the remaining ones will make it into the next one.

Thanks for reading and have fun!

Comments

Thorias

Thank you again for the nice explanation of how the PSX guts work together! Looking forward to the next core release with everything 100% implemented!

Anonymous

Loving these technical details, and still hoping for more posts like this!