
Hi Everyone!

I've received a lot of requests for an overview of the project and what is still to do, so this post will cover it in detail.

First, let's sum up the progress in the last week:

- A hang in several wrestling games and in NBA 99 was fixed

- Noise dither and alpha noise implemented; can be seen in the Mario warp effect

- Texture clamp and mirror in copy mode fixed; used for the text in StarCraft 64

- RGBA correction implemented: subpixel-accurate Gouraud shading, fixing several polygon edge bugs

- The interlace field is now reported correctly, fixing interlaced mode for several games

- SNAC was implemented by Blue1

- Emulated Transfer Pak was implemented; fully supports Pokemon Red, Blue and Yellow for now

- Fixed a Dual Texturing mode bug that caused texts and textures in games to be missing or wrong


As you can see, it's a mix of bugfixes and new features, and I think it will stay that way for a while. You might expect that at this point, with a good percentage of games working, there would not be many features left, but unfortunately there are still plenty. Some games just don't use all of them. So let's look at the components in detail and see what is missing.


CPU - TLB

The most obvious missing feature of the CPU is the Translation Lookaside Buffer (TLB). It's a memory remapping feature in the CPU that around 15% of games use, and those games will just not work as long as it is not implemented. As it is required for some big titles like GoldenEye, it is probably the most wanted feature of the core, but I have to disappoint you if you think it would be the highest priority.

The reason is mainly that the CPU in the MiSTer core is THE most critical part of the whole core, because it runs at 93 MHz. It was a lot of work, and also a small miracle by the Quartus software, to even make it fit at all. The TLB would sit on the most critical path inside the CPU, making it even harder.

This alone is likely still possible to handle, but it will make every build more difficult once it is in. What does that mean? Every other feature will cost more time and debugging will be more difficult. Because of that, I will postpone the feature until all other important CPU, RSP and RDP tasks are done. I fear you will have to wait some more months until it is finally there, but trust me, for the overall project progress it's the better choice.
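
The remapping idea behind the TLB can be sketched as a toy software model. This is purely illustrative: real VR4300 entries also carry page masks, ASIDs and valid/dirty bits per page half, and all names and values below are invented.

```python
class TlbEntry:
    """Simplified TLB entry: one virtual page mapped to one physical frame."""
    def __init__(self, vpn, pfn, page_size=4096):
        self.vpn = vpn          # virtual page number
        self.pfn = pfn          # physical frame number
        self.page_size = page_size

class Tlb:
    def __init__(self, entries):
        self.entries = entries

    def translate(self, vaddr):
        # Check every entry for a matching virtual page.
        for e in self.entries:
            if vaddr // e.page_size == e.vpn:
                return e.pfn * e.page_size + vaddr % e.page_size
        return None  # TLB miss -> exception on real hardware

tlb = Tlb([TlbEntry(vpn=0x10, pfn=0x80)])
assert tlb.translate(0x10004) == 0x80004  # mapped page: offset preserved
assert tlb.translate(0x20000) is None     # unmapped page: miss
```

In hardware, all entries are compared in parallel in a single cycle, which is exactly why this lands on the CPU's critical path.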


CPU - Write queue

The CPU in the N64 has a write queue for outgoing writes to memory or I/O registers. Typically a CPU has to stall on a write to memory until the write has been executed, because the next instruction could already issue another write or read. With a write queue, the write is not executed instantly but fed into a buffer that executes it as soon as possible. Up to four 64-bit words can be stored in there, so the CPU can work ahead instead of waiting for the memory.

The core currently lacks that feature, so CPU performance is weaker than it should be. As I already implemented the same feature in the PSX core's CPU, I know what has to be done and will likely work on it soon.
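
The queue behaviour described above can be sketched as a tiny FIFO model. Only the 4-entry depth comes from the text; the interface and everything else here are invented for illustration.

```python
from collections import deque

class WriteQueue:
    """Toy model of a 4-entry write queue between CPU and memory."""
    DEPTH = 4

    def __init__(self):
        self.fifo = deque()

    def cpu_write(self, addr, value):
        # The CPU only stalls when all four slots are occupied.
        if len(self.fifo) >= self.DEPTH:
            return False          # queue full -> CPU must stall
        self.fifo.append((addr, value))
        return True               # CPU continues immediately

    def memory_tick(self, memory):
        # The memory side drains one queued write per available cycle.
        if self.fifo:
            addr, value = self.fifo.popleft()
            memory[addr] = value

mem = {}
wq = WriteQueue()
for i in range(4):
    assert wq.cpu_write(0x100 + i, i)   # four writes queue without stalling
assert not wq.cpu_write(0x200, 99)      # a fifth write would stall the CPU
wq.memory_tick(mem)                     # memory executes the oldest write
assert mem[0x100] == 0
```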


CPU - Read pipeline interlock

The CPU executes instructions one after another in a 5-stage pipeline. Memory reads are executed from either cache or RDRAM in stage 4, while the next instruction could already be calculating in stage 3 and need the result of that memory read. In that case the CPU has to stall until the read has been executed and can continue once the value has been fetched.

That part is already implemented, otherwise the CPU wouldn't work at all. However, the CPU has 32 registers and each read places the fetched value in one of them. If the programmer or compiler took care of this load-use interlock, they might not use the read value immediately after fetching it, but place some other instructions in between. In such a case the CPU doesn't have to stall, because the value is not immediately needed.

This interlock condition is currently not built into the CPU: it always stalls on a read, assuming the value is always needed. That was easier to design, as the CPU doesn't need to care what the next instruction will do. Depending on how optimized the code is, not having it costs CPU performance. It should be straightforward to implement but might break things in edge cases; that's why it was better to wait until games run stably before adding it, so that a regression caused by the feature is found and fixed easily.
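
The hazard check that is missing can be sketched in a few lines. This is a simplified illustration of the load-use rule, not the core's actual logic; register numbers are invented.

```python
def needs_stall(load_dest, next_instr_sources):
    """Stall only if the next instruction reads the register that the
    in-flight load is about to write (a load-use hazard)."""
    return load_dest in next_instr_sources

# lw r8, 0(r4) followed by addu r9, r8, r10 -> r8 is needed, must stall
assert needs_stall(8, (8, 10))

# lw r8, 0(r4) followed by addu r9, r5, r10 -> r8 not needed, continue
assert not needs_stall(8, (5, 10))
```

The current core effectively behaves as if `needs_stall` always returned true.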


CPU - Cache commands

The CPU has 2 caches, one for instructions and one for data. Each of them can be managed with special cache commands. For example:

- a cache line can be cleared

- a cache line can be preloaded

- a cache tag can be overwritten

Some of these features are required to handle cache coherency with the RAM, and some of them seem to exist only to make the feature set complete, without any real-world use. For example, why would you exchange the tag of a cache line without invalidating it? It's like saving a phone number on your phone with the wrong name attached to it.
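
The phone-number analogy can be made concrete with a toy cache line. The operation names below are descriptive placeholders, not the real VR4300 CACHE opcode names.

```python
class CacheLine:
    """Toy cache line to illustrate the three cache commands above."""
    def __init__(self):
        self.valid = False
        self.tag = 0
        self.data = [0] * 8

    def invalidate(self):          # "clear" a line
        self.valid = False

    def fill(self, tag, data):     # "preload" a line from RAM
        self.valid, self.tag, self.data = True, tag, list(data)

    def store_tag(self, tag):      # overwrite the tag only
        self.tag = tag             # line stays valid -> now mislabeled!

line = CacheLine()
line.fill(tag=0x1000, data=range(8))
line.store_tag(0x2000)
# The line still claims to be valid, but now answers for the wrong address:
assert line.valid and line.tag == 0x2000
```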

Therefore I didn't implement around 80% of the instruction cache commands and around 40% of the data cache commands as I considered them useless in real world applications.

Am I just lazy? No, the reason for this decision is that the instruction cache is the critical path in the CPU, and we have to live with our relatively old and slow FPGA. We cannot fit everything there, but have to carefully choose what is really needed.

I spent a lot of time optimizing this path in the CPU, as it always seemed impossible in the past. I might write a whole article just about this single critical path once the TLB is also there, but for now I'll give you one example: a huge part of the instruction cache is implemented twice, so that two paths can be checked for a cache hit in parallel instead of one single longer path. This decreases the maximum delay of that path by 1 nanosecond. Yes, we really need to think in such small numbers with a CPU that only has 10.66 ns per clock cycle. A 10% gain was worth it, and hopefully you understand why I need to be cautious about adding things in this path.


RSP - Dual pipeline

The RSP has a "scalar" part, which is just a normal MIPS CPU like the PSX CPU, but also a "vector" part, both integrated and strongly coupled. They are coupled so strongly that the instruction decoder already decides whether the next instruction is for the vector unit or for the scalar part.

That's not all. Because both parts exist, the designers decided to make it possible to execute instructions in both in parallel. To do that, the instruction fetch and decode must support fetching and decoding two instructions in parallel and need complicated logic to decide whether the next two instructions can run in parallel or not.

This part is completely missing in the core at the moment: it can only execute in either the scalar or the vector unit each clock cycle. In theory, adding this feature could make the RSP up to 100% faster. In practice this is often not the case. Most of the time, the RSP execution speed is no bottleneck at all, as it spends most of its time waiting for memory or for the RDP to finish the next draw command... or just being switched off, waiting for the next frame.

So I wouldn't expect real-world differences above 5-20% in typical use cases. Still, it's one big, important feature and needs to be done at some point.
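
The core of the dual-issue decision can be sketched as a simple pairing rule. This is heavily simplified: the real RSP also has to check register dependencies and ordering, and the instruction representation here is invented.

```python
def can_dual_issue(instr_a, instr_b):
    """Simplified dual-issue rule: one scalar and one vector instruction
    may execute in the same clock cycle, two of the same kind may not."""
    return instr_a["unit"] != instr_b["unit"]

# A scalar and a vector instruction can pair up:
assert can_dual_issue({"unit": "scalar"}, {"unit": "vector"})

# Two vector instructions must run one after the other:
assert not can_dual_issue({"unit": "vector"}, {"unit": "vector"})
```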


RDP - Level of Detail

This is one of the big features still missing in the RDP. Depending on the distance to the camera, a different texture can be chosen by the RDP. This is mostly used to reduce texture noise for distant textures, but can also be used for effects like the changing Bowser/Peach portrait in Mario 64. The N64, however, doesn't really do a distance calculation; instead it uses the difference in the texture coordinates from the current to the next pixel in the x and y directions.

Unfortunately, that means the texture coordinates of the next pixel must already be available for the current pixel, which requires a render pipeline that is one step longer. Furthermore, the LoD calculation itself also costs one clock cycle, so in total two additional pipeline steps. I need to figure out how to implement this without it costing way too many resources, so implementing this feature will take more time.
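
The coordinate-delta idea can be illustrated with a small sketch: the mip level follows from how far the texture coordinates step between neighbouring pixels. This is a simplification of the real RDP behaviour; the function and its rounding are invented for illustration.

```python
import math

def lod_level(s, t, s_next_x, t_next_x, s_next_y, t_next_y):
    """Derive a LoD level from the texture-coordinate step between the
    current pixel and its x/y neighbours (no true distance calculation)."""
    dx = max(abs(s_next_x - s), abs(t_next_x - t))   # step in x direction
    dy = max(abs(s_next_y - s), abs(t_next_y - t))   # step in y direction
    step = max(dx, dy, 1)
    return int(math.log2(step))   # one mip level per doubling of the step

# One texel per pixel -> finest level 0
assert lod_level(0, 0, 1, 0, 0, 1) == 0

# Four texels per pixel -> level 2
assert lod_level(0, 0, 4, 0, 0, 4) == 2
```

The need for the *next* pixel's coordinates is visible right in the signature, which is exactly what forces the longer pipeline.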


RDP - YUV textures

The RDP allows plenty of different texture formats, and most of them have been implemented by now. The only major one still missing is YUV textures. Those are not simple textures that give you color information for each pixel; instead they have three components that need to be combined in multiple steps.

The reason this is complicated is that the solution the original developers found was very clever: the color is not decoded instantly when the texture is fetched, but instead in a second step in the color combiner. Therefore it's more work, and as not many games use it at all, it has lower priority.
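
To show why YUV needs an extra arithmetic step at all, here is a standard-style YUV-to-RGB conversion. The exact coefficients and the split of this work between texture unit and color combiner differ on the real hardware; this is only meant to show the per-pixel math involved.

```python
def yuv_to_rgb(y, u, v):
    """Convert one YUV sample (0..255 per component, chroma centered
    at 128) to RGB. Illustrative coefficients, not the RDP's exact ones."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344 * (u - 128) - 0.714 * (v - 128)
    b = y + 1.772 * (u - 128)
    clamp = lambda c: max(0, min(255, int(round(c))))
    return clamp(r), clamp(g), clamp(b)

# Neutral chroma (u = v = 128) leaves a pure grey value unchanged:
assert yuv_to_rgb(100, 128, 128) == (100, 100, 100)
```

A plain RGBA texel needs none of this: it is already a color when it leaves the texture memory.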


VI - Async video out

The video output currently still uses a close-to-original, but not exact, refresh rate, derived from the RSP clock instead of a dedicated video clock. The reason is that asynchronous logic inside an FPGA is always tricky: you have to be careful when exchanging information from one clock domain to another, otherwise you get glitches or even logic hanging in states that normally could not happen.

This makes development and debugging more difficult, and therefore I postponed it. The most obvious need for it is analog output to a CRT: inexact timing can lead to the image not being centered, or being blurry for interlaced content.


Is that all?

No! These are only the big features; there are also several smaller tasks that take less than a day each, too many to list here. Some are missing features of the console, some are just housekeeping of the core itself.

Also there are two further large blocks of work:

First are bugfixes. Plenty of games still have smaller or larger issues that need to be researched until the cause is found. This is an ongoing process and will surely take several months.

Second are timing measurements and improvements. All components are currently implemented on a cycle-by-cycle basis using the original clock rate. The only exceptions are the current VI and the DDR3, which are accessed at double clock speed for several reasons; you can look at the separate article I posted about this.

Having the components run this way leads to an intrinsic accuracy that is already relatively close to the original, but not exact. To get closer, some parts have to be measured.

One example: when the CPU stalls for a memory read, does it stall for 1, 2 or 3 clock cycles more than the memory access takes?

For the PSX I wrote plenty of such timing tests in MIPS assembly and compared the core with a real console, leading to very close to 100% timing accuracy for the CPU, GTE and main RAM. I plan to do the same for the N64 core.
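
The comparison side of such a test harness can be sketched like this. All test names and cycle counts below are invented; the real tests run as MIPS assembly on console hardware.

```python
# Hypothetical per-test cycle counts: real hardware vs. the core.
hardware = {"lw_stall": 12, "sw_queue": 3, "cache_miss": 40}
core     = {"lw_stall": 13, "sw_queue": 3, "cache_miss": 40}

# Collect every test where the core's timing deviates from hardware.
mismatches = {name: (hardware[name], core[name])
              for name in hardware if hardware[name] != core[name]}

assert mismatches == {"lw_stall": (12, 13)}
```

Each mismatch then points at one specific timing behaviour to adjust in the core.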


So that's it for today. I have to apologize for not adding nice images to all these topics, but that is often difficult to do. If you have any questions about the progress and my plans, feel free to ask.

Attached is the latest version of the core.

Have fun!


Comments

David Moylan

Thanks for all your great work on this core. The N64 was one of my favourite consoles. I read about your challenges with the TLB. Do you think the required timing and implementation might be the point where you reach beyond the physical limitations of the Cyclone V FPGA or is it just going to take a lot of dedicated time to get it successfully working? Cheers!

FPGAzumSpass

We cannot have the original TLB in a timing-accurate way, but there are possibilities to get it so close that it hopefully can never be noticed. I have some ideas, but they still need to prove themselves. It could be that I do multiple implementations until the best one is found.

Anonymous

First of all, thank you for your incredible work! I’m speechless with what you’ve achieved. Probably not the right place to submit an issue, but regarding the first build to include async video, “testbuild_2023_11_03” broke 480i output on my setup. It was working correctly with the previous one. I’m using Analog I/O board 5.5 on a CRT via RGB scart cable. Now whenever a game uses 480i mode (or 576i on PAL), the screen flickers heavily (including mister OSD menu). I have to use “test build_2023_10_29” which doesn’t flicker at all. “testbuild_2023_11_03” included “fix for interlace analog output (by MikeS)”, but broke my setup. Hope this info helps to find out if there is anything wrong (or if it’s something related to my setup…). Thanks!!

FPGAzumSpass

Hi, yes, the 480i analog output is still not correct. MikeS is mostly working on it, and I try to support him. Hopefully we'll have a solution in the next few days.