Porting an arcade core to the Turbo Chameleon 64 – Part 1 – 2020-02-26
There are many, many arcade re-creation cores in existence now – and only a handful have been ported to the Turbo Chameleon 64. When I discovered that Rampage exists for both MiSTer and MiST my interest was piqued, because this is a game I played on the Amiga as a kid, and while the Amiga version’s not a bad conversion, the original arcade game is significantly better.
I was looking today at ways of improving the throughput of the EightThirtyTwo CPU. The design as it stands is very simple, and doesn’t make any attempt to perform result forwarding or instruction fusing. These are both strategies for improving the performance of certain constructs, and I wasn’t sure which of these two techniques I should use.
In brief, without either mechanism implemented, when the CPU encounters code such as:
li 0
mr r0
it has to wait until the first instruction has finished writing to the tmp register before moving its new contents into the pipeline, and only then finally writing it to r0.
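The cost of that wait can be sketched with a toy issue model. This is only an illustration, not the EightThirtyTwo's real pipeline timing: the `Instr` struct, the `REG_TMP` index and the latency figures are all hypothetical. A `result_latency` of 1 stands in for a core with result forwarding, and a larger value stands in for a core that must wait for the register write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register numbering: r0-r7 are 0-7, tmp is 8. */
enum { REG_NONE = -1, REG_TMP = 8, NUM_REGS = 9 };

/* One instruction, reduced to the register it writes and the one it reads. */
typedef struct {
    int dest;
    int src;
} Instr;

/* Returns the number of cycles needed to issue every instruction in prog.
 * A result becomes readable result_latency cycles after its producer
 * issues; an instruction that reads it earlier has to stall. */
int count_issue_cycles(const Instr *prog, size_t n, int result_latency)
{
    int ready[NUM_REGS] = {0}; /* cycle at which each register is readable */
    int cycle = 0;             /* earliest cycle for the next instruction */

    assert(result_latency >= 1);

    for (size_t i = 0; i < n; i++) {
        int issue = cycle;

        /* Stall until the source operand is available. */
        if (prog[i].src != REG_NONE && ready[prog[i].src] > issue)
            issue = ready[prog[i].src];

        if (prog[i].dest != REG_NONE)
            ready[prog[i].dest] = issue + result_latency;

        cycle = issue + 1;
    }
    return cycle;
}
```

Describing the pair above as `{REG_TMP, REG_NONE}` followed by `{0, REG_TMP}`, the model issues both instructions in two cycles when the latency is 1, and in four cycles when the latency is assumed to be 3.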
In my last post I touched briefly on the 832a assembler which I wrote as the first part of my solution to improving the code density of compiled C code.
An assembler that takes a single source file and spits out a ready-to-run binary file is not particularly difficult to write, but it’s not particularly useful either – in order to be useful we need to be able to link together multiple code modules.
I’ve joked a few times in this series about being too lazy to write an assembler – but it would be more true to say that the stop-gap solution I was using was adequate, so my time was better spent on the more enjoyable aspects of the project. I am now feeling the limitations of using the GNU assembler to produce a bytestream for a target it knows nothing about, and to improve either the performance or code density of the vbcc backend’s output any further, I need to address the problem I’ve had so far with cross-module references…
Here’s an optimisation I should have spotted much sooner: sign extension.
In the C programming language char and short types are converted to integers before being subjected to mathematical operations – so-called “integer promotion”. There are circumstances where this can be avoided – and vbcc has a hook for precisely this purpose, allowing CPUs which have byte- or word-oriented versions of their arithmetic operations to avoid promotion. EightThirtyTwo doesn’t provide this luxury.
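For readers unfamiliar with integer promotion, here is a small standard C sketch of the behaviour described above. The function names are made up for illustration and have nothing to do with vbcc's internals.

```c
#include <assert.h>
#include <stdint.h>

/* Adding two 8-bit operands yields an int-sized result, because both
 * operands are promoted to int before the addition takes place. */
static_assert(sizeof((uint8_t)1 + (uint8_t)1) == sizeof(int),
              "uint8_t operands are promoted to int");

/* The sum is computed at int width, so it is not truncated to 8 bits. */
int add_promoted(uint8_t a, uint8_t b)
{
    return a + b;
}

/* Converting the result straight back to 8 bits discards the upper bits.
 * This is the kind of case where a byte-wide add would give the same
 * answer, so the promotion is not strictly needed. */
uint8_t add_truncated(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}
```

With these definitions, `add_promoted(200, 100)` is 300, while `add_truncated(200, 100)` is 44 (300 modulo 256).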
My project time over the last few weeks has been spent improving the vbcc backend for the EightThirtyTwo CPU, trying to find ways of making the generated code more efficient. Many of these changes have been simple tweaks, like avoiding sign-extensions when comparisons are only concerned with equality. I’ve also put some effort into recognising when a previous operation has left a value in the tmp register and thus when reloading it can be avoided.
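Why equality tests can skip sign extension can be checked exhaustively for 8-bit values. The sketch below assumes each byte sits zero-extended in a 32-bit register; the exact conditions the backend tests for are not given here, and `sext8` is a hypothetical helper rather than anything taken from the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 8 bits of a 32-bit register value. */
int32_t sext8(uint32_t reg)
{
    uint32_t low = reg & 0xFFu;
    return (int32_t)(low ^ 0x80u) - 0x80;
}

/* For every pair of byte values, comparing the zero-extended values for
 * equality gives the same answer as comparing the sign-extended ones. */
void check_equality_ignores_sign_extension(void)
{
    for (uint32_t a = 0; a < 256; a++)
        for (uint32_t b = 0; b < 256; b++)
            assert((a == b) == (sext8(a) == sext8(b)));
}

/* Ordering is a different matter: 0x80 is -128 as a signed char, so the
 * sign-extended and raw comparisons can disagree. */
int less_than_signed(uint32_t a, uint32_t b)
{
    return sext8(a) < sext8(b);
}

int less_than_raw(uint32_t a, uint32_t b)
{
    return a < b;
}
```

Calling `check_equality_ignores_sign_extension()` passes silently, whereas `less_than_signed(0x80, 0x01)` is 1 and `less_than_raw(0x80, 0x01)` is 0, which is why only equality comparisons are safe candidates for this shortcut.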
The biggest improvement of all, however, has come from changing how the code generator handles loading and storing values. Previously I had reserved two GPRs – r0 and r1 – for the code generator, but on an architecture with only eight general purpose registers, one of which is the PC and one of which is busy being the stack, depriving the compiler of two registers hurts performance – so I’ve re-worked the code generator so that it reserves just one register for its own use.
Since my last post I’ve put some further work into the EightThirtyTwo vbcc backend, and I now have it working well enough to compile a working Dhrystone benchmark. That’s the good news. Here’s the bad news:
User time: 1541
Microseconds for one run through Dhrystone: 61
Dhrystones per Second: 16223
VAX MIPS rating * 1000 = 9231
OK, that’s not terrible, but it’s not as high as I was hoping.
One of the main goals with a pipelined CPU is keeping the pipeline as full as possible at all times. The biggest obstacle to this goal is the existence of ‘hazards’ – situations where what’s about to happen at the beginning of the pipeline depends on an outcome that’s yet to be determined further down the line. This is a particular weakness of the EightThirtyTwo CPU: because we have only eight registers, and because many operations involve the tmp register, there’s a good chance that any given instruction will depend on the result of the previous one. One way around this is to use result forwarding where, for example, a result calculated in the ALU can be sent directly to the ALU’s inputs for the next instruction rather than being first written to and then read again from the register file.
I haven’t yet attempted to implement this for the EightThirtyTwo. Instead, I’ve attempted something far less sane for a lightweight CPU – dual-threading!