The Quest for Code Density

The EightThirtyTwo ISA – Part 19 – 2020-03-25

One of my main goals when starting the EightThirtyTwo project was to minimise the amount of block RAM I needed to devote to firmware; indeed, this mattered even more to me than raw speed.

So how does the architecture measure up to that goal?

Firstly, the instruction set is quirky enough to present a compiler with significant challenges in generating efficient code. The limited number of general purpose registers, the lack of any kind of indexed store instruction and the bottleneck created by the temp register all require the programmer to be smart about the order in which things happen. The more mainstream, register-rich orthogonal ISAs give compilers an easier time since they can often rearrange instructions without side-effects; this is very much not the case for EightThirtyTwo.

For this reason, when evaluating EightThirtyTwo’s code density I will consider the density of compiled code separately from hand-assembled code – not least because the former depends a great deal on the compiler and how good a job the backend author managed to do.

The process of gradually improving the backend has been one of reading some generated assembly code (often from the Dhrystone benchmark) and looking for less-than-efficient (sometimes even brain-dead) code patterns that could be replaced with something smarter.

VBCC has a nice in-built interface for this replacement: before the code generator proper is called, it calls a backend function with a preliminary Intermediate Code list for the function about to be generated. The backend can then analyse the code and fill in a custom “addressing mode” structure, giving the code generator information about, say, using a post-increment addressing mode. Precisely what you store in the addressingmode structure is up to the author; one of the attributes I store is whether or not a calculation result is needed subsequently or can be considered disposable. In the latter case a byte load or store to a calculated address can often use the single-instruction ldbinc or stbinc rather than the instruction pairs byt; ld or byt; st, because the post-increment side-effect is harmless.
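As a toy illustration of that disposable-result decision (the function name is my own invention, not VBCC's – the real logic lives in the backend's addressing-mode analysis):

```python
def choose_byte_load(address_needed_later):
    """Pick instructions for a byte load through a calculated address.

    If the address register's value is disposable, the post-increment
    side-effect of ldbinc is harmless and we save an instruction; if the
    address is needed again afterwards, fall back to the byt; ld pair.
    (Toy model using the mnemonics described in the text.)
    """
    if address_needed_later:
        return ["byt", "ld"]
    return ["ldbinc"]

print(choose_byte_load(False))  # disposable address: one instruction
print(choose_byte_load(True))   # address still live: two instructions
```

The store case is symmetrical, substituting stbinc and byt; st.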

The code generator has free rein to do what it likes with the tmp register, and also reserves r0 from the general-purpose register pool for its own use. This means that VBCC only knows the contents of r1 through r6 – and will thus quite often emit code to calculate a value that’s already in either tmp or r0. To avoid this unnecessary duplication, I added some code to track the contents of those two registers and replace such recalculations with a simple “mr” or “mt, mr” as appropriate. This is one of the most complex parts of the code generator, but the payoff is worth it.
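A toy model of that tracking might look like the following – the class and method names are invented for illustration, and the real code is part of the C backend, but the shape of the decision is the same:

```python
class ScratchTracker:
    """Track which values currently live in tmp and r0.

    When a value VBCC asks for is already held in one of the two
    registers the backend reserves for itself, emit a cheap move
    instead of recomputing it:
      - value in tmp: a single "mr rN" (tmp to rN)
      - value in r0:  "mt r0" then "mr rN" (r0 to tmp, tmp to rN)
    (Toy sketch using the mnemonics described in the text.)
    """

    def __init__(self):
        self.held = {"tmp": None, "r0": None}

    def note(self, reg, value):
        """Record that reg now holds value (call when code is emitted)."""
        self.held[reg] = value

    def fetch(self, value, dest):
        """Return a cheap move sequence if value is cached, else None."""
        if self.held["tmp"] == value:
            return ["mr %s" % dest]
        if self.held["r0"] == value:
            return ["mt r0", "mr %s" % dest]
        return None  # not cached: caller must emit the full calculation
```

The interesting (and fiddly) part in the real backend is keeping the tracked contents honest – any instruction that clobbers tmp or r0 has to invalidate the corresponding entry.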

Since early January I’ve been keeping a log of both the size and the reported speed from the Dhrystone simulation, and the code generator improvements since then have brought the code size down by around 19% and improved the execution speed by around 53%.

But how does it compare with other architectures?

To test this fairly I should collect together a suite of test programs of different types from different sources, and build them for various different platforms. I’m not going to do that – at least not at this stage!

Instead, just to get a first-level feel for how the code density compares, I’ve compiled the Dhrystone demo for 832, ZPU, MIPS and M68k, using size optimisations in all cases. I’ve also compressed each binary with lz4 and gzip, out of curiosity. (A binary format that compresses particularly well would indicate high redundancy, which one would expect to correlate inversely with code density.)

  • MIPS: 5240 bytes. LZ4 compressed: 3940 bytes. Gzip compressed: 2878 bytes
  • ZPU: 5544 bytes. LZ4 compressed: 4034 bytes. Gzip compressed: 3173 bytes
  • M68K: 4502 bytes. LZ4 compressed: 3359 bytes. Gzip compressed: 2643 bytes
  • EightThirtyTwo: 4822 bytes. LZ4 compressed: 3360 bytes. Gzip compressed: 2644 bytes

So there are some big surprises there. Firstly, it’s pleasing to note that 832 is beating both MIPS and ZPU. We have a way to go before we can match M68k – however, the compressed size is almost identical, which is all the more interesting when you consider that the hand-rolled lz4 decompression routine I wrote in 832 assembler months ago ended up being almost exactly the same size as its M68K counterpart.

Calculating compression ratios directly from these numbers would be somewhat misleading, since Dhrystone incorporates about a kilobyte of static text which is identical in all four binaries. One would therefore need to subtract the size of this text (1172 bytes) and its compressed size (700 and 509 bytes for LZ4 and gzip, respectively) before working out ratios. Having done that, the 832 code compresses further even than MIPS or ZPU, despite the binary already being smaller – perhaps an indication that there’s still room for significant improvement in the code generator?
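For the curious, that adjustment works out as follows – this just reproduces the arithmetic described above, using the sizes from the list:

```python
# (raw, lz4, gzip) sizes in bytes, taken from the list above
sizes = {
    "MIPS": (5240, 3940, 2878),
    "ZPU": (5544, 4034, 3173),
    "M68K": (4502, 3359, 2643),
    "EightThirtyTwo": (4822, 3360, 2644),
}

# Static text common to every binary, and its compressed sizes
TEXT, TEXT_LZ4, TEXT_GZ = 1172, 700, 509

for name, (raw, lz4, gz) in sizes.items():
    code = raw - TEXT  # size of the actual code, text removed
    print("%-14s  lz4 %.3f  gzip %.3f"
          % (name, (lz4 - TEXT_LZ4) / code, (gz - TEXT_GZ) / code))
```

Lower ratios mean the code compresses further, i.e. carries more redundancy.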
