FPGAzumSpass

TLB progress (Patreon)

Published:

2023-11-22 15:05:08

Imported:

2023-12

Downloads

Download N64_20231121.zip

Content

Hello Everyone!

I started work on the TLB implementation and made good progress over the last week. A bunch of games that did not run before are already playable now:

Castlevania, Goemon, Bomberman, Paper Mario, Rakugakids, Killer Instinct, Mario Golf and some more.

Some others however don't work yet. Why is that? Let's see.

Virtual address memory access

The name TLB is somewhat misleading for the overall function. TLB stands for "Translation Lookaside Buffer", but the buffer is only one part of the functionality, which would be better named "virtual address mapping".

Let us assume a game wants to read the memory content at RDRAM address 0x40. That alone sounds easy, but there are several possibilities:

- It can read from 0xA0000040 and it will be an uncached read

- It can read from 0x80000040 and it will be a cached read

- It can read from 0x12345040 and it could be a virtual address read...or a bug

The uppermost 3 bits of the address are not really used for addressing memory, but instead tell the CPU which access method it should use. The first two methods are simple: they just define whether the cache is used and then read the memory from a flat model in which 4 or 8 Mbyte of RAM can be accessed.
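As a rough illustration of the three cases above, the decoding of the upper address bits could be sketched like this. It is a simplified model, assuming 32-bit addresses and kernel mode; the names `classify_address` and `PHYS_MASK` are made up for this example and do not come from the core.

```python
from typing import Optional, Tuple

PHYS_MASK = 0x1FFFFFFF  # clearing the upper 3 bits leaves the physical address


def classify_address(vaddr: int) -> Tuple[str, Optional[int]]:
    """Return (access method, physical address or None if the TLB must decide)."""
    top3 = (vaddr >> 29) & 0x7
    if top3 == 0b100:   # 0x80000000-0x9FFFFFFF: direct mapped, cached
        return "cached", vaddr & PHYS_MASK
    if top3 == 0b101:   # 0xA0000000-0xBFFFFFFF: direct mapped, uncached
        return "uncached", vaddr & PHYS_MASK
    # Every other range is treated as a virtual address in this sketch.
    return "mapped", None


for addr in (0xA0000040, 0x80000040, 0x12345040):
    print(hex(addr), classify_address(addr))
```

The first two addresses both resolve to physical address 0x40 and differ only in the cache setting, while the third cannot be resolved without a table lookup.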

With virtual addresses, this is different. You can have much more virtual memory, e.g. 1 Mbyte for every task, and where that memory is located in RAM is none of the task's business. Each task can simply assume it has 1 Mbyte of RAM, and some memory management takes care that the overall system doesn't run out of RAM. This can be easier from a programming perspective, but it costs overhead.

Buffer part of the TLB

Keeping this overhead small is the task of the TLB. It provides a table of 32 entries, each allowing one memory block of configurable size to be mapped somewhere else in RAM.
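A minimal sketch of such a table, assuming each entry only stores a virtual base, a block size and a physical base. The real entries carry more fields that are not modeled here, and `TlbEntry` and `translate` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

TLB_ENTRIES = 32


@dataclass
class TlbEntry:
    virt_base: int  # start of the block in the virtual address space
    size: int       # configurable block size in bytes
    phys_base: int  # where the block is located in RAM


def translate(tlb: List[Optional[TlbEntry]], vaddr: int) -> Optional[int]:
    """Return the physical address, or None if no entry matches."""
    for entry in tlb:
        if entry is not None and entry.virt_base <= vaddr < entry.virt_base + entry.size:
            return entry.phys_base + (vaddr - entry.virt_base)
    return None


tlb: List[Optional[TlbEntry]] = [None] * TLB_ENTRIES
# Example mapping: one 4 Kbyte block at virtual 0x12345000 placed at 2 Mbyte in RAM.
tlb[0] = TlbEntry(virt_base=0x12345000, size=0x1000, phys_base=0x00200000)

print(hex(translate(tlb, 0x12345040)))
```

The loop checks the entries one after another only for readability. In hardware all 32 comparisons would happen in parallel.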

Because it is tightly coupled to the CPU, the extra cost of the memory remap is near zero once the task is running. Only the memory allocation itself creates overhead, but that easily stays in the range of a 1-2% performance loss in a typical application.

Because there are 32 entries, you could even give a task multiple memory blocks or switch between tasks without having to update the table.

However, the TLB is really a buffer, not a cache, and this difference is important. Whenever a memory address has no matching entry in the TLB, the CPU raises a TLB miss exception, which diverts the program flow to a defined exception handler in software. That handler then has to set up the TLB and retry the execution. A real cache, like the instruction or data cache in the CPU, would fetch the required memory without giving control back to a software handler.
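The division of work between hardware and software can be sketched as follows. This is a toy model, assuming fixed 4 Kbyte blocks and an unlimited table for brevity; `page_table`, `ram` and all function names are invented for the example.

```python
PAGE_SIZE = 0x1000  # assumed fixed block size for this sketch


class TlbMiss(Exception):
    """Stands in for the TLB miss exception raised by the CPU."""

    def __init__(self, vaddr: int):
        super().__init__(hex(vaddr))
        self.vaddr = vaddr


tlb = {}                          # virtual page -> physical page, empty at start
page_table = {0x12345: 0x00200}   # hypothetical bookkeeping kept by the software
ram = {0x00200040: 0xCAFE}        # sparse stand-in for RDRAM


def cpu_read(vaddr: int) -> int:
    """Hardware side: translate or raise, never refill on its own."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    if vpage not in tlb:
        raise TlbMiss(vaddr)
    return ram.get(tlb[vpage] * PAGE_SIZE + offset, 0)


def miss_handler(exc: TlbMiss) -> None:
    """Software side: install the mapping for the faulting address."""
    vpage = exc.vaddr // PAGE_SIZE
    tlb[vpage] = page_table[vpage]


def read_with_retry(vaddr: int):
    """Retry the read after each miss; return (value, number of misses)."""
    misses = 0
    while True:
        try:
            return cpu_read(vaddr), misses
        except TlbMiss as exc:
            misses += 1
            miss_handler(exc)


print(read_with_retry(0x12345040))  # first access takes one miss
print(read_with_retry(0x12345040))  # second access hits directly
```

A cache would hide the refill inside `cpu_read`. Here the refill is visible to the program as an exception and a retry.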

Difficulty in the core

I have often talked about the TLB being one of the difficult things to do for the core on the DE10-Nano, and the reason is the CPU timing. The N64 CPU runs at 93.75 MHz.

This clock speed in itself is already very high for a CPU on the Cyclone V, but there are some things that make it even worse:

- the N64 CPU has full operand forwarding: it can almost always use the result of a previous operation in the next instruction immediately

- the N64 CPU is 64 bit wide and 64 bit single cycle shifters or add operations are slower than 32 bit ones

- the N64 CPU has no extra cost for jumps: even after a jump, the instruction at the new address is immediately fetched so that the instruction decode can process it in the next cycle

- the virtual address decoding is free of cost when the TLB contains the requested address

But the worst thing is that all of these combine into one single huge problem: the critical path in the CPU. All these things need to happen one after another within a single clock cycle of only about 10.67 ns (1 / 93.75 MHz).
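The relation between clock speed and the time budget per cycle is simple arithmetic, sketched here. The 12 ns path length is a made-up value for illustration only, not a measurement from the core.

```python
CPU_CLOCK_MHZ = 93.75  # N64 CPU clock from the text


def period_ns(clock_mhz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def max_clock_mhz(critical_path_ns: float) -> float:
    """Highest clock at which a path of this length still fits into one cycle."""
    return 1000.0 / critical_path_ns


print(f"cycle budget: {period_ns(CPU_CLOCK_MHZ):.2f} ns")
# Hypothetical: extra logic stretches the critical path to 12 ns.
print(f"max clock with a 12 ns path: {max_clock_mhz(12.0):.2f} MHz")
```

Under that assumption, a path that is only about 1.3 ns too long would already push the achievable clock from 93.75 MHz down to about 83.33 MHz.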

The functionality in itself isn't that bad. I implemented it in my software emulator in about 2 days. What makes it so difficult to develop for the FPGA is that every step must be carefully considered, otherwise the maximum clock frequency we can run the CPU at would drop.

The progress

So what is already done now?

The whole TLB itself is implemented. Games can store the table contents and they can use it. However, the lookup is still slow for 2 reasons:

- Currently every memory access needs to look up the TLB again, so it's not free of cost. This is still an open task and must be solved to reach the original speed

- the virtual address cannot yet be used together with instruction cache or data cache

The result is that at the moment only games that use the TLB very rarely will work OK. Every game that uses it constantly would be so slow that it takes very long to load, or it might not work at all. In that case the CPU behaves as if it were running at only about 2 MHz, and I cannot blame a game if it crashes with such a slow CPU. Of course, there may also still be bugs in the implementation.

Still, if you want to try it with the games listed above, a test build is attached. I cannot say how stable those games are currently, but they looked fine for the first few minutes.

That's it for today. A more detailed article with implementation details will follow once the implementation is complete.

Have fun!
