The EightThirtyTwo ISA – Part 8 – 2019-10-20
Over the last few weeks I’ve been working on a VHDL implementation of the EightThirtyTwo ISA, using the excellent combination of GHDL and GtkWave to develop, debug, rip up, start over and ultimately produce something capable of sending “Hello, World!” to a UART.
This has been a steep learning curve, since I’ve not tinkered with simulation at this level before (previously I’ve always used SignalTap for debugging), nor have I worked on a pipelined CPU before.
The design I’ve produced hits my target logic budget pretty well – roughly 1,500 logic elements when built for Cyclone III. While it would be nice to get the size down a bit further, it should be noted that the register file is implemented as logic – so there’s no block RAM usage – and we also have full load/store alignment.
The design is split into the following units:
- The Decoder is a piece of combinational logic which takes the source opcode and outputs signals specifying two register sources for the ALU, the ALU operation and a bitmap of enable signals for other parts of the CPU. I have adjusted the instruction encoding significantly since last time, in an attempt to minimise the logic footprint of this decoder; a poor choice of encoding can easily double its size.
- Synchronous logic in the Decode stage writes the decoded ALU operation and the contents of the appropriate registers to the ALU inputs.
- The ALU takes two 32-bit wide register inputs, a 6-bit wide immediate input (used only for the li instruction), and has two 32-bit wide outputs. The ALU is responsible for arithmetic, bitwise and shift instructions, but also handles assembling multi-byte immediate values, and post-incrementing and pre-decrementing addresses. The first ALU output can be written either to a register or to tmp. In most cases the second of the two outputs simply passes through its corresponding input, which can be written to tmp or to the Load/Store unit – this is useful for “swap”, for store instructions, and for the special case “add r7” instruction which places r7’s old value in tmp. The multiply instruction, on the other hand, produces a 64-bit wide result, so it places the upper half on the second output, from where it is written to tmp.
- The Execute stage is responsible for advancing the program counter, and determining when it’s safe to do so. I haven’t yet attempted to mitigate hazards at all – we simply stall the PC and insert bubbles into the pipeline – but in the longer term it might be possible to do some rudimentary instruction fusing here, shaving a couple of cycles off constructs such as “li 0; mr r0”.
- The Memory stage triggers load/store operations, and also writes the results from the ALU back to the register file.
- Finally, there is what would be the Writeback stage in a classic design, although here its job is largely done by the Memory stage. All we do here is wait for load/store operations to finish and, for load operations, write the result to tmp.
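To make the ALU's two outputs more concrete, here is a rough behavioural sketch in Python (not the VHDL itself). The function name `alu`, the operation names, the choice of which input is passed through, and the unsigned treatment of the multiply are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

def alu(op, in1, in2):
    """Return (out1, out2) for a small, hypothetical subset of ALU operations.

    Both inputs are treated as unsigned 32-bit values (an assumption).
    """
    in1 &= MASK32
    in2 &= MASK32
    if op == "add":
        # out1 carries the result; out2 passes in2 through unchanged,
        # so the old value remains available to be written to tmp.
        return (in1 + in2) & MASK32, in2
    if op == "mul":
        # The full product can be up to 64 bits wide:
        # lower half on out1, upper half on out2 (destined for tmp).
        product = in1 * in2
        return product & MASK32, (product >> 32) & MASK32
    raise ValueError("operation not modelled in this sketch: " + op)

add_result = alu("add", 4, 0x100)
mul_result = alu("mul", 0xFFFFFFFF, 2)
```

Here `add_result` is `(0x104, 0x100)`: the sum on the first output and the untouched second input on the second. `mul_result` is `(0xFFFFFFFE, 1)`, the lower and upper halves of the product 0x1FFFFFFFE.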
I’ve made a few more changes to the ISA, too: Firstly, while I don’t yet have any kind of interrupt support, I’ve been considering how to add this. Lacking any instructions capable of saving or restoring CPU flags, I’ve decided simply to map them into the top four bits of the Program Counter. This reduces the available program address space from 4GB to 256MB – I can live with that, and it means that saving the program counter at the start of an interrupt or subroutine automatically saves the flags too. While this does mean that a subroutine can’t return a result to the caller by way of status flags, this would have been difficult anyway because pulling the return address from the stack will affect the zero flag.
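The arithmetic behind that address-space figure can be sketched as follows. This is only an illustration in Python: the helper names and the order of the individual flag bits within the top nibble are hypothetical, not taken from the actual design.

```python
FLAG_SHIFT = 28                      # flags occupy bits 31..28 of the PC
ADDR_MASK = (1 << FLAG_SHIFT) - 1    # 0x0FFFFFFF: 28 address bits remain

def pack_pc(address, flags):
    """Combine a 28-bit program address and four flag bits into one word."""
    return ((flags & 0xF) << FLAG_SHIFT) | (address & ADDR_MASK)

def unpack_pc(word):
    """Split a saved PC word back into (address, flags)."""
    return word & ADDR_MASK, (word >> FLAG_SHIFT) & 0xF

# 2**28 bytes is 256MB, down from 2**32 bytes (4GB) with a full-width PC.
address_space_mb = (ADDR_MASK + 1) // (1024 * 1024)

saved = pack_pc(0x1234, 0b1010)      # what would land on the stack
restored = unpack_pc(saved)
```

Saving the single word `saved` (0xA0001234 here) preserves both the return address and the flags, and `restored` recovers them as `(0x1234, 0b1010)`.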
The other significant change to the ISA is that I decided against implementing the ltmpinc instruction, which would have triggered a load operation using the contents of tmp as an address, incremented tmp, and stored the result in the nominated register. It would have been the only load instruction that needed to write to the register file; all other register file writes come directly from the ALU, so it would have added significantly to the complexity. Instead, since it’s essentially free to implement, I’ve added a “ldidx” instruction, which uses the sum of tmp and the nominated register as a load address, reads from memory and writes the result to tmp.
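The difference between the two instructions shows up in which destinations each one needs. The sketch below models both in Python; the dictionary-based `memory`, the function signatures, and the increment of 4 for ltmpinc are assumptions for illustration only.

```python
MASK32 = 0xFFFFFFFF

def ldidx(tmp, reg_value, memory):
    """Model of ldidx: load from tmp + register, result goes to tmp.

    Returns the new value of tmp. Only tmp is written, so the load path
    never needs access to the register file.
    """
    address = (tmp + reg_value) & MASK32  # wrap at 32 bits (assumed)
    return memory[address]

def ltmpinc(tmp, memory, step=4):
    """Model of the dropped ltmpinc: load from tmp, then increment tmp.

    Returns (new_tmp, register_result). The second element is the awkward
    part: loaded data that would have to be written to the register file.
    The step of 4 is an assumption.
    """
    return (tmp + step) & MASK32, memory[tmp & MASK32]

# Hypothetical memory contents, keyed by address.
memory = {0x1000: 0x11111111, 0x1008: 0xDEADBEEF}
new_tmp = ldidx(0x1000, 8, memory)
```

With tmp holding 0x1000 and the nominated register holding 8, `new_tmp` is the word at 0x1008, which is 0xDEADBEEF in this example.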
I’ve also created a second repository, similar to my earlier ZPUDemos, to contain test projects for this CPU. The CPU itself is a subproject.