In my last post I outlined a very simple test program written in C and showed the ZPU assembly language that the ZPU GCC toolchain creates.
Let’s take a closer look at the program and see what’s going on.
As I mentioned, the ZPU doesn’t have a traditional register file – instead it has a stack, similar to the way languages such as Forth and PostScript work. What I’ll do here is go through the test program line by line and look at what each instruction does to the stack:
main:
im -1
This pushes an immediate value, in this case -1, to the stack. The im instruction is interesting: its operand is only 7 bits, but if two im instructions follow each other, the second shifts the value on top of the stack 7 bits left and puts its own operand in the low 7 bits. This means that loading a full 32-bit value requires five consecutive im instructions! That’s not actually anywhere near as nuts as it might sound. Firstly, any RISC architecture with an instruction length of 32 bits or less requires more than one instruction to load an arbitrary 32-bit immediate value. Since the ZPU’s opcodes are only a single byte long, loading a 32-bit immediate requires five bytes of code, while a CPU with a 32-bit instruction size would require eight! Secondly, the first im instruction is sign-extended across the whole word, so values at either end of the 32-bit range can be loaded with significantly fewer than five instructions. This is why I chose 0xffffff80 as the address of my hardware register in the test program – that address can be loaded in two instructions. 0xffffffc0 would have been a better choice still; it can be loaded in just one.
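To make the chaining rule concrete, here is a minimal Python sketch of it. The function names im_chain and encode_im are mine, purely for illustration; the sketch models only the behaviour described above (7-bit operands, sign extension on the first im, shift-and-or on each following one) and is not taken from the ZPU toolchain.

```python
def im_chain(operands):
    """Value left on top of the stack by a run of consecutive im instructions.

    Each operand is the raw 7-bit field (0..127) of one im opcode.
    """
    first = operands[0] & 0x7F
    # The first im is sign-extended from 7 bits across the whole word.
    value = first - 0x80 if first & 0x40 else first
    # Each following im shifts left by 7 and fills the low 7 bits.
    for op in operands[1:]:
        value = (value << 7) | (op & 0x7F)
    return value & 0xFFFFFFFF


def encode_im(value):
    """Shortest list of 7-bit im operands that loads a 32-bit value."""
    value &= 0xFFFFFFFF
    signed = value - (1 << 32) if value & 0x80000000 else value
    count = 1
    # n chained im instructions can hold a signed value of 7*n bits.
    while not -(1 << (7 * count - 1)) <= signed < (1 << (7 * count - 1)):
        count += 1
    return [(signed >> (7 * i)) & 0x7F for i in reversed(range(count))]


for target in (0xFFFFFF80, 0xFFFFFFC0, 0x12345678):
    ops = encode_im(target)
    print(hex(target), "needs", len(ops), "im instruction(s):", ops)
```

Running it shows the three cases from the text: two instructions for 0xffffff80, one for 0xffffffc0, and the full five for an arbitrary value such as 0x12345678.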
Stack contents after this instruction has completed: <-1> …
pushspadd
Pops the top value off the stack, adds it to the stack pointer, and pushes the result. (pushspadd is an emulated instruction in the small core, so it is slow.)
Stack contents: <old stack pointer -1> …
popsp
Uses the top value from the stack as the new stack pointer.
Stack contents: <free space> <free space> …
The above code has basically allocated space on the stack for two words – which is not amazingly useful – you could achieve the same thing with “im 0; nop; im 0” – but it makes more sense with larger allocations.
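As a sanity check on that claim, here is a small Python sketch comparing the two approaches. It rests on two assumptions of mine: that the stack grows downwards in 4-byte words, and that pushspadd scales the popped value by the word size (which is how I read “old stack pointer -1” above). The last step models the instruction that takes the top of the stack as the new stack pointer, which is popsp in the ZPU instruction set. The function names are hypothetical.

```python
WORD = 4  # bytes per stack slot


def allocate_with_pushspadd(sp, words):
    """Model 'im <words>; pushspadd; popsp' and return the new stack pointer."""
    mem = {}
    # im <words>: push the (negative) word count.
    sp -= WORD
    mem[sp] = words
    # pushspadd (assumed semantics): replace the top of stack with sp + top * 4.
    mem[sp] = sp + mem[sp] * WORD
    # popsp: the top of stack becomes the new stack pointer.
    sp = mem[sp]
    return sp


def allocate_with_pushes(sp, count):
    """Model 'count' separate pushes, e.g. 'im 0; nop; im 0' for count == 2."""
    for _ in range(count):
        sp -= WORD
    return sp


start = 0x1000
print(hex(allocate_with_pushspadd(start, -1)))  # im -1; pushspadd; popsp
print(hex(allocate_with_pushes(start, 2)))      # im 0; nop; im 0
```

Under those assumptions both sequences leave the stack pointer two words below where it started. The pushspadd version stays three instructions long however many words are requested (until the immediate needs more than one im), whereas the push version grows with the allocation.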
im 0
storesp 8
This pushes the immediate value 0, then writes it to the stack, 8 bytes (2 words) in.
Stack contents before storesp: <0> <free space> <free space> …
Stack contents after storesp: <free space> <0> …
.L2:
loadsp 4
Copies the value four bytes (1 word) into the stack, and pushes it to the top.
Stack contents: <0> <free space> <0> …
im 1
addsp 12
Pushes 1, then adds to it the value 12 bytes (3 words) into the stack, leaving the result on top of the stack.
storesp 12
Stores the result of the previous addition 12 bytes into the stack.
Stack contents: <0> <free space> <1> …
im -128
Pushes -128, which is the hardware register’s address, 0xffffff80.
Stack contents: <0xffffff80> <0> <free space> <1> …
store
Pops two values from the stack, and writes the second value to the address pointed to by the first.
Stack contents: <free space> <1> …
loadsp 4
im 1
addsp 12
storesp 12
im -128
store
This is a repeat of the previous six instructions – GCC has unrolled the loop.
Stack contents: <free space> <2> …
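The stack traces above can be reproduced with a toy model. The following Python sketch is my own illustration, not ZPU project code: the class TinyStack and the function loop_body are hypothetical names, each im call stands for a whole assembler-level “im” line rather than a single 7-bit opcode, and writes made by store are simply recorded in a list instead of reaching real hardware.

```python
MASK = 0xFFFFFFFF


class TinyStack:
    """Toy model of the ZPU stack: byte addresses, 32-bit words, grows downwards."""

    def __init__(self, sp=0x1000):
        self.mem = {}        # stack memory, keyed by byte address
        self.sp = sp         # address of the word on top of the stack
        self.io_writes = []  # (address, value) pairs written by store

    def read(self, addr):
        return self.mem.get(addr, 0)

    def push(self, value):
        self.sp -= 4
        self.mem[self.sp] = value & MASK

    def pop(self):
        value = self.read(self.sp)
        self.sp += 4
        return value

    def im(self, value):
        self.push(value)

    def loadsp(self, offset):
        # Copy the word 'offset' bytes into the stack onto the top.
        self.push(self.read(self.sp + offset))

    def addsp(self, offset):
        # Add the word 'offset' bytes into the stack to the top of stack.
        self.mem[self.sp] = (self.read(self.sp) + self.read(self.sp + offset)) & MASK

    def storesp(self, offset):
        # Write the top of stack 'offset' bytes into the stack, then pop it.
        self.mem[self.sp + offset] = self.read(self.sp)
        self.sp += 4

    def store(self):
        # Pop an address, then a value, and write the value to the address.
        addr = self.pop()
        value = self.pop()
        self.io_writes.append((addr, value))


def loop_body(z):
    z.loadsp(4)
    z.im(1)
    z.addsp(12)
    z.storesp(12)
    z.im(-128)
    z.store()


z = TinyStack()
z.push(0)  # the counter, initialised to 0 by "im 0; storesp 8"
z.push(0)  # the free-space slot above it
loop_body(z)
loop_body(z)  # the unrolled second copy
print("counter:", z.read(z.sp + 4))
print("register writes:", [(hex(a), v) for a, v in z.io_writes])
```

After the two passes the counter slot holds 2, matching the stack contents shown above, and the model has recorded the old counter values 0 and 1 being written to 0xffffff80.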
impcrel .L2
poppcrel
These two instructions form a PC-relative “jump” instruction. impcrel is more of a compiler directive than a true instruction, and poppcrel is an emulated instruction, so is slow on the small version of the core.
There are some compiler flags we can use to tell GCC to avoid emulated instructions – this can make a big difference to speed, and in situations where code space is tight, can remove the need to include the emulation code.
CFLAGS= -mno-poppcrel -mno-compare -mno-eq -mno-byteop -mno-shortop -mno-callpcrel \
	-mno-call -mno-lshiftrt -mno-ashiftl -mno-ashiftrt -mno-neqbranch -mno-pushspadd \
	-mno-neg -mno-mod -mno-div -mno-mult
These flags will cause the GCC-generated code to avoid the use of the named instructions. One of the things I want to try is to see which of these I can include in the small core while still keeping the core under 1,000 logic elements.