That’s more like it…

The EightThirtyTwo ISA – Part 13 – 2020-01-12

My project time over the last few weeks has been spent improving the vbcc backend for the EightThirtyTwo CPU, trying to find ways of making the generated code more efficient. Many of these changes have been simple tweaks, like avoiding sign-extensions when comparisons are only concerned with equality. I’ve also put some effort into recognising when a previous operation has left a value in the tmp register and thus when reloading it can be avoided.

The biggest improvement of all, however, has come from changing how the code generator handles loading and storing values. Previously I had reserved two GPRs – r0 and r1 – for the code generator. On an architecture with only eight general purpose registers, one of which is the PC and one of which is busy being the stack pointer, depriving the compiler of two registers hurts performance, so I’ve re-worked the code generator so that it reserves just one register for its own use.

r0 was often used as a passing register when fetching values from external objects – so the address would be written to r0 and a ‘ld’ instruction issued. In many cases I found that I could avoid this by placing the address in tmp (which is necessary anyway, in order to move it to r0), and using the ‘ldt’ instruction instead.

Likewise, I had reserved r1 for storing values to memory. Again, in many cases this can be avoided, by writing the address to tmp and using ‘stmpdec’ instead. One subtlety which makes this trickier than it needs to be is that stmpdec predecrements the tmp register before storing – so we have to add 4 to the contents of tmp before storing. There’s no ‘stmp’ equivalent, since I decided early on that, only having enough encoding space for one of the two instructions, the predecrement version would be more useful. I may yet revisit this decision, but for now ‘stmpdec’ works fine when the memory object is referenced symbolically.
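To make the predecrement compensation concrete, here is a toy Python model of the behaviour described above. It is only a sketch: the `stmpdec` function, the dictionary standing in for memory and the address 0x1000 are all made up for illustration, and nothing here is taken from the real simulator or backend.

```python
# Toy model of a predecrementing store; not the real EightThirtyTwo simulator.
memory = {}

def stmpdec(tmp, value):
    """Predecrement tmp by 4, store value at the new address, return the new tmp."""
    tmp -= 4
    memory[tmp] = value
    return tmp

target = 0x1000             # hypothetical address of the object being written
tmp = target + 4            # compensate for the predecrement up front
tmp = stmpdec(tmp, 0xCAFE)  # the value lands at 'target', not at 'target + 4'
```

When the address is a symbol, the extra 4 can presumably be folded into the constant that is loaded into tmp, so the compensation costs nothing at run time.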

When the memory object is not referenced symbolically, things are trickier. Having fetched our target address we would need to add four to it in order to use ‘stmpdec’ – however using a register to do this defeats the purpose of reworking the code generator, so I needed a different solution. Instead, I simply exchange tmp and the source register, use the regular ‘st’ (store tmp to address-in-register) instruction, and then exchange them back. Often the final swap can be avoided, too, if the value being written isn’t referenced further in the program.
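The exchange trick can be sketched with the same kind of toy model. The helper names `exchange` and `st`, the choice of r2 and the address and value used are assumptions made for the example; only the sequence of steps comes from the description above.

```python
# Toy register file: tmp holds the target address, r2 holds the value to store.
state = {"tmp": 0x2000, "r2": 0x1234}
memory = {}

def exchange(state, reg):
    """Swap the contents of tmp and the named register."""
    state["tmp"], state[reg] = state[reg], state["tmp"]

def st(state, memory, reg):
    """Store tmp to the address held in the named register."""
    memory[state[reg]] = state["tmp"]

exchange(state, "r2")    # r2 now holds the address, tmp holds the value
st(state, memory, "r2")  # write the value to the target address
exchange(state, "r2")    # restore; can be skipped if r2's value is dead
```

After the sequence both tmp and r2 hold what they held before, and no extra register was needed.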

As a result of these changes, I’m now able to give the compiler an extra register to play with, and in conjunction with a few other more minor tweaks and optimisations, the Dhrystone score is looking somewhat more respectable:

User time: 1164
Microseconds for one run through Dhrystone: 46 
Dhrystones per Second:                      21477 
VAX MIPS rating * 1000 = 12220

OK, that’s still not stellar for a CPU running at 133MHz, but it’s definite progress.
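For a sense of scale, the figures above can be normalised against the clock speed. This sketch assumes the conventional Dhrystone baseline of 1757 Dhrystones per second for the VAX 11/780; the benchmark itself appears to use integer arithmetic, so its last digits differ slightly from the floating-point result here.

```python
VAX_DHRYSTONES_PER_SECOND = 1757   # conventional VAX 11/780 baseline
dhrystones_per_second = 21477      # from the run above
clock_mhz = 133                    # CPU clock quoted above

vax_mips = dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND
dmips_per_mhz = vax_mips / clock_mhz
microseconds_per_run = 1_000_000 / dhrystones_per_second

print(f"VAX MIPS:        {vax_mips:.2f}")
print(f"DMIPS per MHz:   {dmips_per_mhz:.3f}")
print(f"Microseconds:    {microseconds_per_run:.1f}")
```

That works out to roughly 12.2 VAX MIPS, or a little under 0.1 DMIPS per MHz.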

There is one more change I can make which I believe will make a significant difference: Currently static and external symbols are referenced using the following construct:

    ldinc r7
    .int  <symbol>

This adds 4 to the program counter, triggers a load based on its previous value and loads the four bytes that program flow skipped over into the tmp register. The problem is that, in changing program flow, it causes the instruction prefetch to be flushed, and refilling it takes two loads from memory. It also triggers a separate load for the value skipped over, and depending upon alignment that value might be split over two words – thus in the worst case this construct triggers four loads!
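The load count can be written down as a tiny cost model. This is a sketch under stated assumptions: memory is fetched in aligned 4-byte words, the prefetch refill always costs two loads as described above, and the function name and parameters are invented for the example.

```python
def ldinc_r7_loads(operand_address, word_bytes=4, refill_loads=2):
    """Estimate the memory loads triggered by 'ldinc r7' plus its inline operand."""
    # A 4-byte operand that is not word-aligned straddles two memory words.
    straddles = operand_address % word_bytes != 0
    operand_loads = 2 if straddles else 1
    return refill_loads + operand_loads

aligned_cost = ldinc_r7_loads(0x100)     # operand sits in a single word
unaligned_cost = ldinc_r7_loads(0x101)   # operand split over two words
```

With these assumptions an aligned operand costs three loads and an unaligned one costs four, which is the worst case mentioned above.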

In many cases the same thing could be achieved using a PC-relative load:

    li IMW1(PCREL(<symbol>)-1)   // 2 li instructions give us
    li IMW0(PCREL(<symbol>))     // a 12-bit reach.
    addt r7

This causes no extra loads whatsoever – so why am I not using this method instead? Because, without an architecture-specific assembler or linker, I have no way of resolving PC-relative symbols across compilation units. Provided all symbols are within a single compilation unit there’s no problem – however I don’t believe there’s any way for the backend to tell whether or not a symbol refers to an object within the same file. Just to evaluate the possible performance gains I may well write a small filter that replaces such ldinc constructs where possible.
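Such a filter might look something like the Python sketch below. It is an illustration only, not the filter that was eventually written: it assumes labels appear as `name:` at the start of a line, it only handles a bare symbol after `.int`, and it makes no attempt to check that the target is actually within the 12-bit reach of the two li instructions. The function name `replace_local_ldinc` and the sample symbols are hypothetical.

```python
import re

LABEL = re.compile(r"^\s*([A-Za-z_.$][\w.$]*):")
LDINC = re.compile(r"^\s*ldinc\s+r7\s*$")
INT = re.compile(r"^\s*\.int\s+([A-Za-z_.$][\w.$]*)\s*$")

def replace_local_ldinc(lines):
    """Rewrite 'ldinc r7 / .int sym' pairs whose symbol is defined in this file."""
    labels = set()
    for line in lines:
        match = LABEL.match(line)
        if match:
            labels.add(match.group(1))

    out = []
    i = 0
    while i < len(lines):
        if i + 1 < len(lines) and LDINC.match(lines[i]):
            match = INT.match(lines[i + 1])
            if match and match.group(1) in labels:
                sym = match.group(1)
                out.append(f"    li IMW1(PCREL({sym})-1)")
                out.append(f"    li IMW0(PCREL({sym}))")
                out.append("    addt r7")
                i += 2          # skip both the ldinc and its .int operand
                continue
        out.append(lines[i])    # anything else passes through untouched
        i += 1
    return out

example = [
    "    ldinc r7",
    "    .int  _local",
    "    ldinc r7",
    "    .int  _extern",
    "_local:",
    "    .int 0",
]
result = replace_local_ldinc(example)
```

Here the reference to `_local` is rewritten because its label is in the same file, while the reference to `_extern` is left as an ldinc for the linker to resolve.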

Retro Ramblings

