Hi everyone,

I spent the last week working on one of my favorite topics: increasing the speed of a system.

This has always been exciting for me, as you can probably tell from all the things I have done in that direction in the past, like the AO486 speedup as well as the fast-forward and turbo options in most of my cores.

I just love tinkering with how to make a system run faster than it's supposed to. The use cases usually come by themselves, as even in a retro system extra computing power usually finds some kind of use.

The PSX core already got a turbo option that adds caches and speeds up some parts of the system, but it was not enough for me.  

This time it also got plain and simple double CPU speed. The post could end here, but unfortunately it's not all that easy, so let's look into it.


The PSX CPU usually runs at 44100*768 Hz = 33.8688 MHz.

I probably mentioned this before, but I still find it amazing: the CPU clock is an exact multiple of the sound sampling frequency, so SPU and CPU line up perfectly.

This new special PSX core instead runs the CPU at 67.7376 MHz (2x), but keeps the original speed (1x) for all other components.
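The clock numbers above can be checked with a few lines of Python. This is only the arithmetic from the post written out; the variable names are made up for illustration.

```python
SAMPLE_RATE_HZ = 44_100     # audio sampling frequency
CLOCK_MULTIPLIER = 768      # CPU cycles per audio sample at 1x (from the post)

cpu_1x_hz = SAMPLE_RATE_HZ * CLOCK_MULTIPLIER   # original CPU clock
cpu_2x_hz = 2 * cpu_1x_hz                       # CPU clock of the 2x core

print(f"1x CPU clock: {cpu_1x_hz / 1e6:.4f} MHz")
print(f"2x CPU clock: {cpu_2x_hz / 1e6:.4f} MHz")
# At 2x the CPU gets twice as many cycles per audio sample,
# while the rest of the system still sees the 1x clock.
print(f"CPU cycles per audio sample at 2x: {cpu_2x_hz // SAMPLE_RATE_HZ}")
```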

It may sound like a simple change in the clock frequency, but unfortunately it's not. Normally the CPU communicates with the other parts of the system on a shared, synchronized clock, as all components connected to the CPU also use this 33 MHz clock for their register interface, and so does the RAM interface.

The workable way to handle a doubled CPU clock is to decouple the CPU from the rest of the system, letting it run at its own 2x speed and converting signals to the 1x system speed before they "leave" the CPU. This needed some rework in the CPU, but turned out to be a good approach, as it runs stable now as far as I can see.

But there is a downside: I will not be able to integrate this change into the main PSX core and switch it at runtime. This means that this 2x CPU clock version of the PSX core will only exist as a second core/rbf next to the normal PSX core. It's more of a strange version that is only of use for some games anyway, so I think this will be ok.

If you have a game that suffers from severe slowdowns that not even the normal Turbo option can solve, but you still really want to play it, give it a try.

Otherwise, you can probably forget this special build.


As an example we will look at one special game today to illustrate where slowdown is coming from in most PSX games and what the root causes are.

This example is Shadow Man, a game infamous for its horrible performance on the PSX, often dipping to as low as 6 fps in the NTSC version. The PAL version is a little bit better and can mostly hold at least 10 fps. Not that this would be great either.

With a speed like that, it's obvious that we would want at least double the performance to make it playable. The Turbo option took the first step with a gain of about 50%, but that was just not enough, as you can see:

What is this game even doing to be so slow?

Well, it turns out to be the same thing that nearly all games with slowdown suffer from: CPU performance. The CPU in the PSX is really slow for doing 3D calculations.

Being based on a 1989 CPU design, it is not just running at a low clock speed, but also has very little instruction cache with a simplistic cache logic, and it completely lacks a data cache.

Think about early 3D gaming on a PC with a 33 MHz CPU, or just look at the minimum requirements of the PC version of Shadow Man, and it should be obvious:

Pentium 200 MHz... as minimum!

So you would expect that they had to optimize the game a lot for the PSX to make it work properly, but unfortunately this is not what happened.


Let's look at 3 examples from the game to see what the issues are. To research that, I loaded a savestate in the VHDL simulation and could analyze all the details of what the game does.

You can see that rendering the image is not a big deal for the GPU. It takes about 21 ms and could therefore nearly reach the full 50 fps.

The geometry calculation, on the other hand, takes way more time: 48 ms here, and this is already a capture with the double clock speed CPU!
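To put the two timings side by side, here is a small sketch that converts a per-frame stage time into the frame rate that stage alone would allow. The helper name `max_fps` is made up, and the sketch looks at each stage in isolation; it makes no claim about how much the two stages overlap in the real game.

```python
def max_fps(stage_time_ms: float) -> float:
    """Upper bound on frames per second if this stage were the only cost."""
    return 1000.0 / stage_time_ms

gpu_ms = 21.0        # rendering time per frame (from the capture above)
geometry_ms = 48.0   # geometry time per frame, already with the 2x CPU

print(f"GPU alone:      {max_fps(gpu_ms):.1f} fps")
print(f"Geometry alone: {max_fps(geometry_ms):.1f} fps")
```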


Geometry calculation is done with the support of the GTE coprocessor. The CPU can hand tasks to the GTE that would take much more time when calculated on the CPU itself.

One of these tasks is converting a 3D space position into 2D screen coordinates. This requires some multiplication and a division, both being slow in the CPU.
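As a rough illustration of why this costs multiplications and a division, here is a simplified perspective projection in floating point. It is not the GTE's actual fixed-point algorithm: it assumes the point is already in camera space (rotation and translation are left out), and the function name and the parameters `h`, `ofx` and `ofy` are chosen just for this sketch.

```python
def project_point(x: float, y: float, z: float,
                  h: float, ofx: float = 0.0, ofy: float = 0.0):
    """Project a camera-space point (x, y, z) onto a screen plane at distance h.

    ofx/ofy are the screen offsets (for example the screen center).
    """
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    scale = h / z                 # the one division per point
    screen_x = ofx + x * scale    # one multiplication per axis
    screen_y = ofy + y * scale
    return screen_x, screen_y

# A point twice as far away as the projection plane lands at half its x/y.
print(project_point(100.0, 50.0, 200.0, h=100.0))
```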

The GTE can calculate this operation for 1 point in 3D in only 15 clock cycles, so it's great that Shadow Man is using plenty of them, right?

Each of these little spikes here is one of these RTPS commands for the GTE.

There are plenty of them, about 7000 per frame calculation. Each of them costs 15 clock cycles, so that's 105000 clock cycles per frame.

Let's assume we want to hit 30 fps: at the original clock speed we would only have slightly more than 1 million CPU cycles per image.

But there is nothing we can do about it, it simply has to be done? Well, yes and no.

You typically want to draw triangles in 3D, which consist of 3 points. Does each of these have to be converted with its own RTPS command?

No! The GTE offers an RTPT command, which converts 3 points in one run. Executing this command takes slightly more time than a single RTPS, but only 21 clock cycles, compared to 45 clock cycles for running 3 RTPS commands.

Using RTPT instead would have cut down this conversion time from 105000 to about 50000 clock cycles.
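The cycle counts above written out as a quick calculation. The sketch assumes that all of the roughly 7000 points could be grouped into triples for RTPT, which is an idealization; the constant names are made up.

```python
CPU_HZ = 33_868_800        # original 1x CPU clock
TARGET_FPS = 30
POINTS_PER_FRAME = 7000    # approximate RTPS count per frame (from the capture)
RTPS_CYCLES = 15           # one point per command
RTPT_CYCLES = 21           # three points per command (figure from the post)

cycle_budget = CPU_HZ // TARGET_FPS                 # CPU cycles per frame at 30 fps
rtps_total = POINTS_PER_FRAME * RTPS_CYCLES         # one RTPS per point
rtpt_commands = -(-POINTS_PER_FRAME // 3)           # ceiling division: triples needed
rtpt_total = rtpt_commands * RTPT_CYCLES

print(f"Cycle budget per frame: {cycle_budget}")
print(f"RTPS only:  {rtps_total} cycles ({100 * rtps_total / cycle_budget:.1f}% of budget)")
print(f"RTPT only:  {rtpt_total} cycles ({100 * rtpt_total / cycle_budget:.1f}% of budget)")
```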


The second example shows another oversight in the geometry calculation:

There is a function in the game that runs the same code over and over again.

Unfortunately, the developers didn't take care that a subfunction also fits into the cache alongside it, so the two keep overwriting each other, leading to significantly more execution time.

Each run of this function costs 2851 clock cycles in total, of which 888 cycles are the CPU waiting on cache fetches. So this part could have been 45% faster.

How? By taking care that the subfunction is not placed in RAM at an address that leads to cache collisions.

The main function is running from address 0x14A50, while the subfunction runs around address 0x13A50. The PSX cache is so simple that it cannot hold both at the same time. To work around that, timing-critical functions can be moved around in memory to make sure this doesn't happen.
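Where the 45% comes from, as a short calculation. It assumes the best case in which all 888 stall cycles disappear once the code stays in the cache.

```python
total_cycles = 2851    # measured cycles per run of the function
stall_cycles = 888     # of those, cycles spent waiting on cache fetches

cycles_without_stalls = total_cycles - stall_cycles
speedup = total_cycles / cycles_without_stalls
stall_share = stall_cycles / total_cycles

print(f"Cycles without stalls: {cycles_without_stalls}")
print(f"Share of time lost to stalls: {100 * stall_share:.0f}%")
print(f"Speedup if stalls vanish: {100 * (speedup - 1):.0f}% faster")
```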
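The collision can be shown with a small sketch. It assumes the commonly documented layout of the PSX instruction cache, 4 KiB direct-mapped with 16-byte lines; these parameters are not stated in the post, and the helper names and the relocated address are made up for illustration.

```python
CACHE_SIZE = 4096                      # assumed: 4 KiB instruction cache
LINE_SIZE = 16                         # assumed: 16 bytes (4 instructions) per line
NUM_LINES = CACHE_SIZE // LINE_SIZE    # 256 lines, one possible slot per address

def cache_line_index(addr: int) -> int:
    """Cache line an address maps to in a direct-mapped cache."""
    return (addr // LINE_SIZE) % NUM_LINES

def collides(addr_a: int, addr_b: int) -> bool:
    """True if two addresses from different memory blocks compete for one line."""
    same_line = cache_line_index(addr_a) == cache_line_index(addr_b)
    different_block = addr_a // CACHE_SIZE != addr_b // CACHE_SIZE
    return same_line and different_block

MAIN_FUNC = 0x14A50
SUB_FUNC = 0x13A50        # exactly 0x1000 (4 KiB) below the main function
SUB_FUNC_MOVED = 0x13250  # hypothetical relocated address, only for illustration

print(hex(cache_line_index(MAIN_FUNC)), hex(cache_line_index(SUB_FUNC)))
print("original placement collides:", collides(MAIN_FUNC, SUB_FUNC))
print("relocated placement collides:", collides(MAIN_FUNC, SUB_FUNC_MOVED))
```

The two addresses differ by exactly the assumed cache size, so every line of one function evicts the matching line of the other. A real relocation would also have to consider the full length of both functions, not just their start addresses.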

I did this extensively for my timing tests to make sure each test runs 100% out of cache.


The third example is the world design itself and how it is managed. Let's take a look at what is rendered in which order:

Thankfully the PSX has no GPU performance issues here, otherwise the extensive amount of overdraw would really hurt. Overdraw is when a pixel on the screen is drawn several times in a row, which requires a higher fillrate to keep up.

But there are things drawn that are covered up again later and would never have to be there in the first place.

A very good example can be seen in the upper right image. The game renders a hidden passage below the village, which is no longer visible once the ground is drawn over it.

Not only is there no reason to render it at this position in the world, because you cannot see it anyway as long as the covering is not removed, it also means that the game is calculating its full geometry the whole time.

The same goes for the church: the interior is calculated and rendered from this far away, even with the door still closed.
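To make the term concrete, here is a toy sketch that measures overdraw for a list of axis-aligned rectangles drawn on top of each other. It has nothing to do with how the PSX GPU works internally; the function name and the rectangle format `(x0, y0, x1, y1)` with exclusive end coordinates are made up for this example.

```python
def overdraw_factor(rects, width: int, height: int) -> float:
    """Average number of writes per covered pixel (1.0 means no overdraw)."""
    writes = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in rects:
        # Clip each rectangle to the screen and count one write per pixel.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                writes[y][x] += 1
    total_writes = sum(sum(row) for row in writes)
    covered = sum(1 for row in writes for count in row if count > 0)
    return total_writes / covered if covered else 0.0

# A full-screen background plus a second full-screen layer: every pixel drawn twice.
print(overdraw_factor([(0, 0, 4, 4), (0, 0, 4, 4)], width=4, height=4))
```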


These are just some quick examples that show the lack of optimization in this game, which could have helped to make it run much better. At least the game gave me some use for the new 2x CPU core. Together with the existing Turbo mode it can work around the performance issues with raw processing power and make the game more playable.

I attached the core for you to play around with. Please handle it with care and only use it when you can live with bugs or crashes. It's not the normal way the PSX is supposed to run.

Have fun!

Comments

Anonymous

Sounds like a fun project to see how much performance you can get out of the core, even if it has to be an experimental build. Thanks!

Muriel Melvin

That thing about the CPU running at a multiple of 44.1kHz kills me. 😆 One of these days I need to try out these performance improvements with LSD Dream Emulator. The frame rate is so bad in there that it’s really hard to play for any length of time.