2019-08-09 – Part 1 – The Pipe Dream
In 2019 there are any number of off-the-shelf CPU cores which can be used in FPGA projects, some imitating long-established CPUs such as x86, MC68000, Z80, MIPS, ARM, etc – and some more specifically targeted at the FPGA space, such as NIOS, Microblaze, ZPU, Moxie and suchlike. So why on earth would I consider creating a brand new one from scratch?
As always, because I wanted to learn something, and because even though there are so many existing options, there are still applications for which none of them is ideal.
My design goal is to create a CPU that’s reasonably small – not much bigger than ZPUFlex, so somewhere in the region of 1500 logic elements – while being somewhat faster and offering better code density. Ideally I want something that can supplant the ZPU for control module applications and allow me to reduce the amount of block RAM I have to devote to the code.
One of my favourite CPU cores at the moment is f32c – a core that’s small and fast, and supports the 32-bit MIPS instruction set. It can achieve something like 180 DMIPS on a Cyclone III when running from block RAM with all the shiny enabled, or more like 30 when running from SDRAM in its more modest compact form. MIPS is quite a nice instruction set to work with, but the code density is pretty awful since it’s a load-store architecture and instruction words are always 32 bits long.
ZPU code spends a lot of time faffing around shuffling data on the stack, which can hurt performance, but the code density is surprisingly not too terrible despite this, due to the instruction length being fixed at 8 bits.
I’ll always have a soft spot for Motorola 68000 code, since I grew up on the Amiga, and its code density is good to excellent, depending on whether you’re looking at compiled code or hand-crafted assembly. 68000 softcores have quite a large footprint in an FPGA, however.
Z80 code often has very good code density, but its lack of 32-bit registers can be a bit limiting.
I also watched a YouTube video recently talking about how the x86 instruction set’s MOV instruction is itself Turing-complete and it set me to wondering just how small an instruction set could be without having a disastrous effect on code density. (Needless to say, writing programs using just MOV yields spectacularly bad code density!)
My gut feeling, and thus the starting point for my experiments, is that my best bet for minimising code size is going to be to use a fixed 8-bit instruction word length with 32-bit registers – so based on those two parameters, the EightThirtyTwo CPU project is born.
[On a totally unrelated note, I just googled that name to see if it was already in use and stumbled across a short film that Babylon 5 fans will find very interesting! https://www.youtube.com/watch?v=uKz4pvq8kZU ]
One of the key decisions to make when designing an ISA from scratch is how many registers to implement, and thus how many bits of the instruction word will be devoted to them. Too few and we’ll suffer from register starvation and spend all our time stack shuffling. Too many and we’ll use so much of our 8-bit instruction word that we won’t have enough encoding space for instructions.
I’ve chosen to use eight registers, which means three bits are needed to encode the register number. The five bits left over aren’t enough to include immediate values as well as encoding the instruction type. We also lack sufficient bits to include both a source and a destination register in our instruction words, so I decided to specify a “temporary” register which will be used to build immediate values and as a source for two-operand instructions. The instruction format will thus be: [ooooorrr]
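To make the [ooooorrr] layout concrete, here is a quick Python sketch of how such a word splits into its two fields. It is purely illustrative: the function names are made up for this example, and it says nothing about what any particular opcode number will mean.

```python
def split_fields(word):
    """Split an 8-bit word laid out as [ooooorrr] into (opcode, register).

    Hypothetical helper for illustration only.
    """
    if not 0 <= word <= 0xFF:
        raise ValueError("instruction words are 8 bits wide")
    opcode = word >> 3        # upper five bits: ooooo (0..31)
    register = word & 0b111   # lower three bits: rrr (0..7)
    return opcode, register


def join_fields(opcode, register):
    """Inverse of split_fields: pack an opcode and a register number."""
    if not 0 <= opcode < 32:
        raise ValueError("opcode must fit in five bits")
    if not 0 <= register < 8:
        raise ValueError("register must fit in three bits")
    return (opcode << 3) | register


if __name__ == "__main__":
    print(split_fields(0b10110101))
    print(bin(join_fields(22, 5)))
```

With only three bits for a register there is room for exactly one register operand per instruction, which is why the second operand has to come from the implied “temporary” register.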
We need a “Load immediate” instruction which carries immediate data in the instruction word. The simplest way to handle this is to specify that if bit 7 is set, the rest of the word contains immediate data, like so: [1iiiiiii]. This halves the available opcodes, though, from 32 to 16 (plus the load immediate opcode, so 17 in total). If this turns out not to be enough, then we could specify the Load Immediate format as [11iiiiii], halving the immediate data range but opening up the [10ooorrr] range for opcodes, which gives us 25 in total.
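The opcode arithmetic for the two candidate encodings can be checked with a small Python sketch. Everything here is an assumption made for illustration: the function names, the `imm_prefix_bits` parameter, and the choice to return the immediate as raw bits (how those bits end up in the temporary register, and whether they are sign-extended, hasn't been decided at this point).

```python
def decode(word, imm_prefix_bits=1):
    """Classify an 8-bit instruction word.

    imm_prefix_bits=1 models the [1iiiiiii] load-immediate format,
    imm_prefix_bits=2 models the [11iiiiii] alternative.
    Hypothetical helper; returns raw fields only.
    """
    if not 0 <= word <= 0xFF:
        raise ValueError("instruction words are 8 bits wide")
    imm_bits = 8 - imm_prefix_bits
    prefix = word >> imm_bits
    if prefix == (1 << imm_prefix_bits) - 1:     # prefix bits all set
        return ("li", word & ((1 << imm_bits) - 1))
    return ("op", word >> 3, word & 0b111)       # [..ooorrr] style word


def opcode_count(imm_prefix_bits=1):
    """Count distinct opcodes, including load immediate itself."""
    load_immediate_words = 1 << (8 - imm_prefix_bits)
    regular_words = 256 - load_immediate_words
    # Each regular opcode uses up eight words, one per register.
    return regular_words // 8 + 1


if __name__ == "__main__":
    for prefix_bits in (1, 2):
        print(prefix_bits, opcode_count(prefix_bits))
```

The same word can mean different things under the two schemes: 0b10010011 is a load immediate under [1iiiiiii], but falls into the [10ooorrr] opcode range under [11iiiiii].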
In part 2 I’ll look at the instructions themselves and how different choices will impact the code density.