Good news… and bad news!

The EightThirtyTwo ISA – Part 12 – 2019-12-07

Since my last post I’ve put some further work into the EightThirtyTwo vbcc backend, and it’s now good enough to compile a working Dhrystone benchmark. That’s the good news. Here’s the bad news:

User time: 1541
Microseconds for one run through Dhrystone: 61
Dhrystones per Second: 16223
VAX MIPS rating * 1000 = 9231

OK, that’s not terrible, but it’s not as high as I was hoping.
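For context, here is a rough sketch of how the last figure relates to the Dhrystones-per-second score. It assumes the conventional baseline of 1757 Dhrystones per second for a VAX 11/780; the function name is made up for illustration, and the benchmark's own integer arithmetic lands a fraction of a percent away from this floating-point version.

```python
# Assumption: 1757 Dhrystones/s is the usual VAX 11/780 reference score.
VAX_DHRYSTONES_PER_SECOND = 1757


def vax_mips(dhrystones_per_second):
    """Convert a Dhrystones-per-second score into a VAX MIPS rating."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


# Score taken from the benchmark output above.
print(f"{vax_mips(16223):.2f} VAX MIPS")
```

So the result is a little over nine times the reference machine.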

So what are the reasons for this performance level, and can we do anything about it?

One of the major reasons for the rather lacklustre performance is simply the design of the CPU. Nearly all operations involve the tmp register, which causes many hazards and thus pipeline bubbles, so the cycles-per-instruction count isn’t great. It may be possible to move tmp from the register file into the ALU itself to alleviate this problem, but I’m not yet sure what impact this would have on logic footprint, so for the moment I’m going to concentrate on the generated code and see what issues I can address there.

One of the first constructs I see when I look at the assembly output created for dhry_1.c is this:

     ldinc    r7
     .int     _Ptr_Glob
     mr       r1

     ldinc    r7
     .int     _rec2
     st       r1

     ldinc    r7
     .int     _rec2
     mr       r1

The ldinc/.int pair is the easiest way to get a pointer to another object into a register, and the only way if the object is in a different compilation unit, but it’s not the fastest or most efficient. It causes the program counter to change, which flushes the instruction prefetch and refetches at the new address, and also triggers a data load. Also, we’re currently repeating the load of _rec2, so we need to find a way to prevent that.

If the objects in question are in the same compilation unit (and currently I don’t think there’s any way to tell whether they are – but that’s a different story!), a better option would be:

     li       IMW1(PCREL(_rec2-1))
     li       IMW0(PCREL(_rec2))
     addt     r7
     mr       r1
     li       IMW1(PCREL(_Ptr_Glob+4-1))
     li       IMW0(PCREL(_Ptr_Glob+4))
     addt     r7
     stmpdec  r1 // Store r1 to the address in tmp, predecrementing tmp, hence the +4 above

That’s eight instructions rather than six, but it’s also eight bytes, as opposed to eighteen, and doesn’t change the program counter, so doesn’t interfere with prefetch. There’s just a store operation instead of three loads and a store, so this method is much more efficient.
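The size comparison can be checked with a small script. This is only a sketch: it assumes every instruction occupies one byte and every .int literal four, which is what the totals quoted above imply, and the helper name is invented for illustration.

```python
# Assumptions: each instruction is one byte and each .int literal is four
# bytes, which matches the totals quoted above.
INSTRUCTION_BYTES = 1
INT_LITERAL_BYTES = 4


def code_size(lines):
    """Return (instruction_count, byte_count) for a list of assembly lines."""
    instructions = 0
    size = 0
    for line in lines:
        text = line.split("//")[0].strip()  # drop comments and padding
        if not text:
            continue
        if text.startswith(".int"):
            size += INT_LITERAL_BYTES  # inline data, not an instruction
        else:
            instructions += 1
            size += INSTRUCTION_BYTES
    return instructions, size


ldinc_version = [
    "ldinc r7", ".int _Ptr_Glob", "mr r1",
    "ldinc r7", ".int _rec2", "st r1",
    "ldinc r7", ".int _rec2", "mr r1",
]
pcrel_version = [
    "li IMW1(PCREL(_rec2-1))", "li IMW0(PCREL(_rec2))", "addt r7", "mr r1",
    "li IMW1(PCREL(_Ptr_Glob+4-1))", "li IMW0(PCREL(_Ptr_Glob+4))", "addt r7",
    "stmpdec r1",
]

print(code_size(ldinc_version))  # (6, 18)
print(code_size(pcrel_version))  # (8, 8)
```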

Following this we have:

     ldinc    r7
     .int     _rec2 + 4
     mr       r1
     li       IMW0(0)
     st       r1

     ldinc    r7
     .int     _rec2 + 8
     mr       r1
     li       IMW0(2)
     st       r1

     ldinc    r7
     .int     _rec2 + 12
     mr       r1
     ...

We’re writing to _rec2+4, then _rec2+8, then _rec2+12, but loading the address afresh each time. So we should attempt to recognise that instead we can simply increment r1. I’m now 90% convinced that I should replace the now-largely-redundant sth instruction with store-and-post-increment, which would also be useful here.
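As a first step towards recognising this pattern, something like the following could scan the assembly output for consecutive address loads that share a base symbol. It is a hypothetical sketch rather than anything in the backend: the function names are invented, it only understands the "symbol + offset" form shown above, and it merely flags candidates without checking that r1 is left untouched between the loads.

```python
import re

# Matches literals such as ".int _rec2" or ".int _rec2 + 8".
LITERAL = re.compile(r"^\.int\s+(\w+)(?:\s*\+\s*(\d+))?$")


def address_loads(lines):
    """Yield (symbol, offset) for each 'ldinc r7' / '.int' pair, in order."""
    stripped = [line.strip() for line in lines]
    for first, second in zip(stripped, stripped[1:]):
        match = LITERAL.match(second)
        if first.split() == ["ldinc", "r7"] and match:
            yield match.group(1), int(match.group(2) or 0)


def increment_candidates(lines):
    """List (symbol, delta) for consecutive loads based on the same symbol."""
    loads = list(address_loads(lines))
    return [
        (sym, off - prev_off)
        for (prev_sym, prev_off), (sym, off) in zip(loads, loads[1:])
        if sym == prev_sym
    ]


listing = [
    "ldinc r7", ".int _rec2 + 4", "mr r1", "li IMW0(0)", "st r1",
    "ldinc r7", ".int _rec2 + 8", "mr r1", "li IMW0(2)", "st r1",
    "ldinc r7", ".int _rec2 + 12", "mr r1",
]
print(increment_candidates(listing))  # [('_rec2', 4), ('_rec2', 4)]
```

Each reported pair is a place where the second load could, in principle, become an increment of the register by the given delta.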

Another issue:

     mt  r0
mr r2
mt r2
and r2
cond NEQ

In copying r0 to r2, the code places the value in tmp, but then copies it to tmp again before anding it with itself to set condition flags. This kind of thing could be solved with peephole optimisation of the assembly output, and vbcc does have provision for that – but more advanced analysis could determine whether r2 is used at all after the test, and if not, apply the test directly to r0 instead. I suspect I will probably end up writing a more comprehensive optimiser, involving another intermediate code format that’s easier to scan and traverse, cataloguing object lifespans and suchlike.
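The simple peephole case might look something like this. It is a minimal sketch working on plain text lines, not vbcc's actual peephole mechanism, and it assumes that mr leaves tmp unchanged, so an mt of the same register straight afterwards is redundant. The function names are hypothetical.

```python
def parse(line):
    """Split an assembly line into (mnemonic, operand), ignoring // comments."""
    parts = line.split("//")[0].split()
    if len(parts) != 2:
        return None  # labels, blank lines and anything unusual are left alone
    return parts[0], parts[1]


def drop_redundant_mt(lines):
    """Remove 'mt rN' when it directly follows 'mr rN'.

    Assumption: 'mr rN' leaves tmp unchanged, so tmp already holds rN's value.
    """
    out = []
    for line in lines:
        prev = parse(out[-1]) if out else None
        cur = parse(line)
        if cur and prev and cur[0] == "mt" and prev[0] == "mr" and cur[1] == prev[1]:
            continue  # tmp already contains this register's value
        out.append(line)
    return out


before = ["mt r0", "mr r2", "mt r2", "and r2", "cond NEQ"]
print(drop_redundant_mt(before))
# ['mt r0', 'mr r2', 'and r2', 'cond NEQ']
```

Because a label between the two instructions fails to parse as an instruction, the pair is left alone in that case, which keeps the rewrite conservative. Retargeting the test at r0 would need the liveness analysis described above.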

What I will say about the Dhrystone project is that it’s been very valuable in shaking bugs out of the code generator and verifying its correctness.
