Porting an arcade core to the Turbo Chameleon 64 – Part 2 – 2020-03-06
For the Chameleon64 port of Rampage I needed some way of showing an On-Screen Display. While the core doesn’t have many options to worry about, I did want to be able to display a “Loading” message and an “Error” message if loading the ROM failed.
There are several different ways we could approach this, but as you’ll recall from part 1, we’re very short of block RAM for this project, so minimising block RAM usage will be my priority.
Tunneling debugging information over JTAG – 2020-03-04
One of my primary platforms for FPGA tinkering is the Turbo Chameleon 64 cartridge – which comes in two flavours: the original V1 hardware which features a Cyclone III FPGA and the V2 hardware which has a very similar Cyclone 10LP FPGA (basically the same thing in a newer package).
While this cartridge is intended as an expansion for the venerable Commodore 64 8-bit computer from the 1980s, it can nonetheless run other more general-purpose cores, so most of my projects have Chameleon64 targets. The one downside of this hardware is the lack of general purpose IOs. It has no built-in serial port, and nowhere really convenient to attach a USB-serial dongle either. It’s possible to misuse the IEC port for this purpose, but then I need to remember to disable it before distributing a finished core (I doubt a 1541 disk drive would appreciate having RS232 data spewed at it). There’s also a USB debugging protocol built into the cartridge, which I haven’t yet explored – mostly because so many of my projects can be built for multiple platforms, I’m reluctant to put a large amount of effort into supporting features only available on one of them.
I discovered the other day, however, that it’s possible to tunnel a UART-style connection over JTAG.
Porting an arcade core to the Turbo Chameleon 64 – Part 1 – 2020-02-26
There are many, many arcade re-creation cores in existence now – and only a handful have been ported to the Turbo Chameleon 64. When I discovered that Rampage exists for both MiSTer and MiST my interest was piqued, because this is a game I played on the Amiga as a kid, and while the Amiga version’s not a bad conversion, the original arcade game is significantly better.
I was looking today at ways of improving the throughput of the EightThirtyTwo CPU. The design as it stands is very simple, and doesn’t make any attempt to perform result forwarding or instruction fusing. These are both strategies for improving the performance of certain constructs, and I wasn’t sure which of these two techniques I should use.
In brief, without either mechanism implemented, when the CPU encounters code such as:
    li 0
    mr r0
it has to wait until the first instruction has finished writing to the tmp register before moving its new contents into the pipeline, and only then finally writing it to r0.
In my last post I touched briefly on the 832a assembler which I wrote as the first part of my solution to improving the code density of compiled C code.
An assembler that takes a single source file and spits out a ready-to-run binary file is not particularly difficult to write, but it’s not particularly useful either – in order to be useful we need to be able to link together multiple code modules.
I’ve joked a few times in this series about being too lazy to write an assembler – but it would be more true to say that the stop-gap solution I was using was adequate, so my time was better spent on the more enjoyable aspects of the project. I am now feeling the limitations of using the GNU assembler to produce a bytestream for a target it knows nothing about, and to improve either the performance or code density of the vbcc backend’s output any further, I need to address the problem I’ve had so far with cross-module references…
Here’s an optimisation I should have spotted much sooner: sign extension.
In the C programming language char and short types are converted to integers before being subjected to mathematical operations – so-called “integer promotion”. There are circumstances where this can be avoided – and vbcc has a hook for precisely this purpose, allowing CPUs which have byte- or word-oriented versions of their arithmetic operations to avoid promotion. EightThirtyTwo doesn’t provide this luxury.
My project time over the last few weeks has been spent improving the vbcc backend for the EightThirtyTwo CPU, trying to find ways of making the generated code more efficient. Many of these changes have been simple tweaks, like avoiding sign-extensions when comparisons are only concerned with equality. I’ve also put some effort into recognising when a previous operation has left a value in the tmp register and thus when reloading it can be avoided.
The biggest improvement of all, however, has come from changing how the code generator handles loading and storing values. Previously I had reserved two GPRs – r0 and r1 – for the code generator, but on an architecture with only eight general purpose registers, one of which is the PC and one of which is busy being the stack, depriving the compiler of two registers hurts performance – so I’ve re-worked the code generator so that it reserves just one register for its own use.
Since my last post I’ve put some further work into the EightThirtyTwo vbcc backend, and I now have it working well enough to compile a working Dhrystone benchmark. That’s the good news. Here’s the bad news:
    User time: 1541
    Microseconds for one run through Dhrystone: 61
    Dhrystones per Second: 16223
    VAX MIPS rating * 1000 = 9231
OK, that’s not terrible, but it’s not as high as I was hoping.