DDR3 Tests finished (Patreon)
Hi everyone,
In the last post I told you that I wanted to write a DDR3 test core. This is done now, and I want to talk about what this core can do, how it works and what the results mean for the N64 core.
The repository can be found here:
https://github.com/RobertPeip/MiSTerDDR3Test
I also attached an RBF of it at the end of this post. Be aware that it was only tested with HDMI out; I don't know if analog out on a CRT will work properly.
Let's look at what the core tests:
Once you start up the core, you will find seven captions with numbers that are constantly updated. It should look like this.
- Clockrate: will measure the currently used clock using a different clock, just to verify that the clock is stable and has the correct speed
- Transfers: counts up by 1 for each finished transfer. Mostly useful to see that the measurement is still running
- Delay Min: shows the lowest latency measured for a single transaction over time. The value is in clock cycles. It usually saturates to the lowest value almost instantly.
- Delay Max: same for maximum measured delay. This number will increase over time when some random events occur that lead to long latency.
- Delay Avg: average delay over the last 65536 transfers. The left number is the rounded-down integer, the right value is the exact total delay in hexadecimal and can be used to get a more precise figure. E.g. in the image: an average delay of 15 clock cycles, but more precisely it's 0xFCD13, which works out to around 15.8 clock cycles (0xFCD13 / 65536; see the small model after this list).
- Bytes / s: will display how many Bytes could be written or read in the last second. Will update once per second. Measures the real throughput without any overhead.
- Burstwait: a small detail that shows for how many clock cycles a burst read was interrupted in the middle of a transfer. It has the same update interval as Delay Avg. I used it to check whether I can rely on burst reads to deliver data without pauses once the first data word has arrived. Result: I cannot.
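To make it a bit more concrete how Delay Min/Max/Avg come about, here is a minimal C model of the bookkeeping the core does in logic. This is not the code from the repository, just a sketch; all names and types are made up.

```c
#include <stdint.h>

/* Hedged C model of the delay bookkeeping the core does in hardware.
 * Not the code from the repository; names and types are illustrative. */
typedef struct {
    uint32_t min;          /* Delay Min: lowest single-transfer delay seen       */
    uint32_t max;          /* Delay Max: highest single-transfer delay seen      */
    uint64_t sum;          /* running total of the current 65536-transfer window */
    uint32_t transfers;    /* transfers counted in the current window            */
    uint64_t window_total; /* Delay Avg raw value, e.g. 0xFCD13 in the image     */
} delay_stats;

void stats_reset(delay_stats *s)   /* happens when the OSD is opened */
{
    s->min = UINT32_MAX;
    s->max = 0;
    s->sum = 0;
    s->transfers = 0;
    s->window_total = 0;
}

void stats_add(delay_stats *s, uint32_t delay_cycles)
{
    if (delay_cycles < s->min) s->min = delay_cycles;
    if (delay_cycles > s->max) s->max = delay_cycles;

    s->sum += delay_cycles;
    if (++s->transfers == 65536) {
        /* window full: latch the total; the average is window_total / 65536,
         * e.g. 0xFCD13 / 65536 ~= 15.8 clock cycles */
        s->window_total = s->sum;
        s->sum = 0;
        s->transfers = 0;
    }
}
```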
When you open the OSD, the test pauses and all measurements are reset, meaning you can reset Delay Min/Max by opening and closing the OSD.
There are also some options in the OSD:
- Direction: test reading or writing performance
- Address Mode: you can read/write to a static address (always the same address is written or read), let the address count up or down, or have it completely random within a 4 Mbyte area
- Clock MHz: switch between 62.5, 85, 100 or 125 MHz at runtime
- Burst: only applies to reads. Changes the number of 64-bit words that are read in a single transfer, with sizes of 1, 2, 4, 8 ... up to 128
Now let's talk about the results.
The N64 RDRAM delivers data at 500 MHz with 9 bit. Let's ignore the 9th bit for now and assume it's 8 bit, because the 9th bit can only be used for rendering anyway, and we have special options for it that we will discuss when the RDP is developed.
8 bit at 500 MHz is the same as 64 bit at 62.5 MHz, so exactly what our DDR3 would do when running at 62.5 MHz. This is the most interesting setting for us. What can we reach here?
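For reference, both work out to the same theoretical peak: 8 bit * 500 MHz = 4000 Mbit/s = 500 Mbyte/s, and 64 bit * 62.5 MHz is also 4000 Mbit/s = 500 Mbyte/s.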
The maximum data rate we reach with random writes is 455 Mbyte/s. The latency is extremely low, thanks to the Altera DDR3 interface queuing up writes very efficiently and the raw bandwidth of the DDR3 being much higher than that of our interface to it.
For random reads with a medium burst size (32) we still get 360 Mbyte/s and an average latency of 11.5 clock cycles.
Is that enough for the N64? I don't know. It's hard to tell how efficient the bus arbitration on the N64 really is. I doubt it's faster than that due to all the overhead, but unfortunately I cannot run the same test on my N64.
Maybe we can just go the safe way and increase the clock rate of the RAM?
This is the bandwidth with large bursts at 125 MHz. In terms of bandwidth it easily outperforms even the theoretical values of the N64. Maybe we should just go that way and be safe?
We need to talk about another reason why that is useful: latency.
At 62.5 MHz the average latency is 11.5 cycles => 16 ns * 11.5 = 184 ns
At 125 MHz the average latency is nearly 18.5 cycles => 8 ns * 18.5 = 148 ns
So by increasing the clock to 125 MHz we also improve the average latency by 36 ns, or roughly 20%, which is a significant amount.
But what latency is actually required?
I wrote some tests for RDRAM access that I can run on my N64. They perform some action and measure the time with the CPU's internal clock counter, which runs at 93.75 MHz (a sketch of the measurement idea follows after the list):
- fetching data for the CPU
- fetching a single instruction for the CPU
- fetching 8x32-bit instruction words for the CPU instruction cache
- DMA transfer RAM <-> RSP memory
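For illustration, a measurement along these lines can be written like this. This is a hedged sketch and not my actual test code; the address and the handling of the measurement overhead are placeholders.

```c
#include <stdint.h>

/* Hedged sketch of how such a measurement can be done on the N64 itself;
 * this is not my actual test code. The VR4300's COP0 register 9 (Count)
 * is read before and after an uncached access to RDRAM. */
static inline uint32_t read_count(void)
{
    uint32_t value;
    __asm__ volatile("mfc0 %0, $9" : "=r"(value));   /* COP0 Count register */
    return value;
}

/* Measure one uncached 32-bit read. A KSEG1 address (0xA0xxxxxx) bypasses
 * the caches, so the read really goes out to RDRAM. */
uint32_t measure_single_read(void)
{
    volatile uint32_t *ram = (volatile uint32_t *)(uintptr_t)0xA0100000u; /* placeholder address */
    uint32_t before = read_count();
    (void)*ram;                         /* the access we want to time */
    uint32_t after  = read_count();
    return after - before;              /* includes the overhead of the counter reads */
}
```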
The CPU latency is the worst, because the access has to go through the 32-bit bus that connects the CPU and the RCP. The latency for reading a single word from RAM is therefore as high as 30 clock cycles. The same applies to a single instruction fetch.
Instruction cache fetching takes 42 clock cycles in the best case.
DMA also has an initial cost of around 30 cycles, so maybe 26-28 cycles for the RAM access itself. I need to improve my tests here to get more accurate values.
All these values are also somewhat random due to e.g. refresh, so what I listed here are best-case access times only. The result is that the best-case latency is around 260 ns, which is way above our average with the DDR3.
But what about our worst case?
The DDR3 in MiSTer is shared, so the HPS/Linux and the scaler also need to access it. For bandwidth this is no issue, but a read is blocked while one of them is currently working with the DDR3. That's why our worst-case access delay is relatively high, and higher than what the N64 shows.
My plan for that is to do cycle counting. I already did something similar with the GBA core: a memory access can make the system slower than it should be, and whenever that happens the core gains some "credit" that is used to make following accesses faster.
One example (a small model in code follows after this list):
- Core starts at 0 credits
- random read is requested and takes only 18 clock cycles. We don't deliver it to the CPU for another 12 clock cycles, to match the original timing of 30 clock cycles
- random read is requested and takes 50 clock cycles. We deliver it instantly, and because we are 20 cycles late, we gain a credit of 20.
- random read is requested and takes 20 clock cycles. We deliver it instantly, which is 10 cycles too fast, so we reduce our credit to 10.
- random read is requested and takes only 18 clock cycles. We don't deliver it for another 2 clock cycles, so that together with our credit the 30 clock cycles are complete again and we are back at cycle accuracy
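Here is a small C model of that credit logic, written to follow the example above exactly. It is a sketch of the idea, not the core's actual implementation; TARGET_CYCLES and extra_wait are names I made up for illustration.

```c
/* Hedged C model of the credit idea, not the core's actual implementation.
 * TARGET_CYCLES is the original N64 timing we want to reproduce
 * (30 clock cycles in the example above). */
#define TARGET_CYCLES 30

static int credit = 0;   /* cycles we are still "late" from earlier accesses */

/* Returns how many extra cycles to hold back the data before handing it to
 * the CPU. ram_cycles is how long the DDR3 access really took. */
int extra_wait(int ram_cycles)
{
    int diff = TARGET_CYCLES - ram_cycles;  /* >0: faster than original, <0: slower */

    if (diff <= 0) {
        credit += -diff;     /* slower than the original: remember it as credit */
        return 0;            /* deliver instantly */
    }
    if (credit >= diff) {
        credit -= diff;      /* spend credit instead of waiting */
        return 0;            /* deliver instantly */
    }
    diff -= credit;          /* spend the remaining credit ... */
    credit = 0;
    return diff;             /* ... and wait for the rest to match the original timing */
}
```

Running the four accesses from the example through this model gives exactly the listed behaviour: wait 12 extra cycles, then gain 20 credit, spend 10 of it, and finally wait the remaining 2.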
Is this a dirty trick? Yes of course. But it's the only way we have on the MiSTer due to the shared DDR3.
The consequences of it are also far smaller than you might think. As the RDRAM is shared inside the N64, the CPU can never be sure what the timing of a RAM access really is:
- It could be delayed due to VI reading out image data to give it out to the screen
- RAM could be refreshed
- RAM could be on a different page from some other transfer
Since the core will at least compensate for any timing differences, it will still end up on the right track in the end.
The GBA core without SDRAM, PSX and AO486 all live with these random access times, the latter two even without any compensation.
One last result from the test core, to dispel any doubts:
This is the maximum bandwidth our DDR3 provides at 125 MHz: 872 Mbyte/s, in parallel with the HPS and the scaler.
The N64 in 9-bit mode has a theoretical maximum of 562.5 Mbyte/s that it can never actually reach. There is enough headroom :)
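For reference: 9 bit * 500 MHz = 4500 Mbit/s = 562.5 Mbyte/s, compared to the 872 Mbyte/s measured above.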
Have fun!