Assembling is the easy bit!

The EightThirtyTwo ISA – Part 16 – 2020-02-08

In my last post I touched briefly on the 832a assembler which I wrote as the first part of my solution to improving the code density of compiled C code.

An assembler that takes a single source file and spits out a ready-to-run binary file is not particularly difficult to write, but it’s not particularly useful either – in order to be useful we need to be able to link together multiple code modules.

Thus, 832a now has a cousin: 832l, the linker.

Traditionally, compiled and assembled object code is split into chunks, known as sections, of various types. Typical types might be:

  • .code or .text – the actual code of the program. Normally this would only be read, not written to, unless your program employs self-modifying code tricks. If it doesn’t employ such tricks then .code / .text sections can reside in a ROM if need be.
  • .rodata – read-only data – constants that are only read, not written to, so can reside in ROM if need be.
  • .data – mutable data – the program’s working set. Must reside in RAM for obvious reasons.
  • .bss – uninitialised data. Space is reserved for this in RAM but there’s no corresponding data on disk. Part of the start-up code’s job is to clear the BSS region so that it’s initialised with zeroes.

In addition to these, I want to support .ctor and .dtor sections. These contain just a single pointer to a constructor or destructor function, which allows a code module to be automatically initialised on startup and cleaned up on exit. The linker must collect these together into a list and sort them in order of priority. The start-up code is responsible for calling them one by one.
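The collection step can be sketched in a few lines. This is only an illustrative model, not 832l's actual code: the section names, priority field and dictionary layout are all made up for the example.

```python
# Sketch: gather constructor pointers from .ctor sections and order them
# by priority so the start-up code can call them one by one.
# The section records here are hypothetical, not 832l's real format.

def collect_ctors(sections):
    """Return constructor addresses ordered by ascending priority."""
    ctors = [(s["priority"], s["address"]) for s in sections
             if s["name"].startswith(".ctor")]
    ctors.sort(key=lambda entry: entry[0])  # lower priority value runs first
    return [addr for _, addr in ctors]

sections = [
    {"name": ".ctor.200", "priority": 200, "address": 0x1040},
    {"name": ".ctor.100", "priority": 100, "address": 0x1000},
]
print(collect_ctors(sections))  # [4096, 4160] - call order, lowest priority first
```

A matching routine walking the list in reverse order would serve for the .dtor sections at exit.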

For the moment I don’t need to worry about ROM / RAM distinctions, but the basic feature set I want to support is as follows:

  • Linking of multiple code modules. (Wouldn’t be much of a linker without this feature!)
  • Removal of unused code. To make this easier I’m currently creating a unique section for every function in the C backend. The linker starts by visiting the first section of the first object file, resolves any references and recursively repeats the process for the section containing each reference’s target. It keeps track of which sections have been “touched”, and discards any that haven’t.
  • Weak linkage. This allows a function or variable to be overridden by another version in another code module. For instance, my startup code contains an interrupt stub function with weak linkage, which is overridden if another object contains an interrupt function, but prevents the system crashing on interrupt if there’s no such definition.
  • Constructors and Destructors, as mentioned above.
  • Efficient representation of resolved symbols. This is the big one, and the one I couldn’t support using the previous GCC-based toolchain without writing a full GCC backend.
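The unused-code removal described above is essentially a reachability walk over the section graph. A minimal sketch, with a made-up section table (the names and the `refs` field are illustrative, not 832l's real data structures):

```python
# Sketch: mark every section reachable from the entry section; anything
# left unmarked was never referenced and can be discarded.

def touched_sections(sections, entry):
    """Return the set of section names reachable from 'entry'."""
    touched = set()
    stack = [entry]
    while stack:
        name = stack.pop()
        if name in touched:
            continue
        touched.add(name)
        # Follow every reference this section makes to other sections.
        stack.extend(sections[name]["refs"])
    return touched

sections = {
    "_start":    {"refs": ["main"]},
    "main":      {"refs": ["puts"]},
    "puts":      {"refs": []},
    "unused_fn": {"refs": ["puts"]},  # nothing references it, so it's dropped
}
live = touched_sections(sections, "_start")
print(sorted(live))  # ['_start', 'main', 'puts'] - unused_fn is absent
```

Because the C backend emits one section per function, discarding an untouched section discards exactly one unused function.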

With the previous toolchain solution, the only way to load the address of an external reference was to use a ldinc idiom, like so:

    ldinc r7
    .int _externalsymbol

This works just fine, but requires five bytes, and since it causes program flow to skip over the embedded value, it hurts performance by triggering as many as four loads, depending on alignment.

A better solution would be to use a string of “li” instructions to build the required value six bits at a time. That’s easy enough if that value is known at assembly time, but what if it isn’t? Traditionally you would have to emit enough instructions to contain the worst-case address, which for this architecture would mean leaving room for a 32-bit value, requiring six consecutive ‘li’ instructions. Not ideal given that in many cases the target will be well within the 12-bit range afforded by two consecutive ‘li’s – and even a single ‘li’ can give you a PC-relative reach of +31 / -32 bytes.
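The length of the chain depends only on how many sign-extended six-bit chunks the displacement needs. A sketch of that calculation, assuming the chaining behaviour described above (each extra ‘li’ shifts the accumulated value left six bits before inserting six more):

```python
# Sketch: minimum number of sign-extended 6-bit 'li' chunks needed to
# encode a given signed displacement. Illustrative, not 832l's code.

def li_chain_length(value):
    """Smallest n such that value fits in a signed 6*n-bit field."""
    for n in range(1, 7):  # a 32-bit value needs at most six chunks
        bits = 6 * n
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return n
    raise ValueError("value does not fit in 32 bits")

print(li_chain_length(-20))         # 1: inside the single-'li' -32..+31 window
print(li_chain_length(1000))        # 2: inside the 12-bit range of two 'li's
print(li_chain_length(0x7FFFFFFF))  # 6: worst case, a full 32-bit value
```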

To do this efficiently the linker would “relax” the reference to its smallest possible size once the addresses of the symbols are known. The problem is that shrinking the references changes the addresses!

To solve this, I ended up using a multi-pass approach:

  • To begin with, I assign the minimum possible size to every reference.
  • I then calculate a tentative address for every section and symbol.
  • Next, I calculate actual sizes for the references based on those tentative symbol addresses, noting whether or not any references expanded in the process.
  • Finally, I assign new addresses to sections and symbols based on the new reference sizes.

I repeat the last two steps until no references expand, which generally happens after between two and five passes.

(Note: this process would be prone to oscillation if I were always to assign the best-possible size to a reference; one reference shrinking could potentially increase the distance between two symbols if there’s a “.align” directive between them. To avoid this I only allow references to expand, not to shrink. This will mean I very occasionally miss the optimal solution, but that’s an acceptable trade-off for it actually working.)
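The passes above can be modelled as a fixed-point loop. This is a simplified sketch, not 832l's actual code: sections are laid out end to end, displacements are measured from the section start rather than the true reference site, and sizes only ever grow.

```python
# Sketch: expand-only reference relaxation, iterated to a fixed point.

def chunks(value):
    """Sign-extended 6-bit chunks needed to encode a displacement."""
    n = 1
    while not (-(1 << (6 * n - 1)) <= value < (1 << (6 * n - 1))):
        n += 1
    return n

def relax(sections):
    for sec in sections:
        for ref in sec["refs"]:
            ref["size"] = 1                  # minimum possible size first
    while True:
        addr = 0
        for sec in sections:                 # tentative layout
            sec["address"] = addr
            addr += sec["base_size"] + sum(r["size"] for r in sec["refs"])
        symbols = {sec["name"]: sec["address"] for sec in sections}
        expanded = False
        for sec in sections:                 # grow references that no longer fit
            for ref in sec["refs"]:
                needed = chunks(symbols[ref["target"]] - sec["address"])
                if needed > ref["size"]:     # expand only, never shrink
                    ref["size"] = needed
                    expanded = True
        if not expanded:                     # fixed point reached
            return

sections = [
    {"name": "a", "base_size": 10,   "refs": [{"target": "c"}]},
    {"name": "b", "base_size": 4000, "refs": []},
    {"name": "c", "base_size": 8,    "refs": []},
]
relax(sections)
print(sections[0]["refs"][0]["size"])  # 3: the ~4000-byte reach needs 18 bits
```

Because growing a reference can only push later symbols further away, each pass can only expand references, so the loop must terminate once everything fits.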

So how much difference does this make?

So far it seems to reduce the size of a compiled executable by roughly 10%. Speed is improved, too.

This change, in combination with some other tweaks I’ve made (including one quite major one involving alignments), has improved the Dhrystone score significantly:

Microseconds for one run through Dhrystone: 39 
Dhrystones per Second: 25432
VAX MIPS rating * 1000 = 14471

That’s running from Block RAM. The biggest surprise, however, was how much difference this made when running from SDRAM instead. The VAX MIPS rating from SDRAM was previously between 8,000 and 9,000. Now I get:

Microseconds for one run through Dhrystone: 47 
Dhrystones per Second: 21222
VAX MIPS rating * 1000 = 12075

Code for the CPU, assembler, linker and C backend can be found on GitHub as usual:

(The separate EightThirtyTwoDemos project will need updating to the new toolchain, so isn’t yet functional.)
