Two threads are better than one…

The EightThirtyTwo ISA – Part 10 – 2019-11-16

One of the main goals with a pipelined CPU is keeping the pipeline as full as possible at all times. The biggest obstacle to this goal is the existence of ‘hazards’ – situations where what’s about to happen at the beginning of the pipeline depends on an outcome that’s yet to be determined further down the line. This is a particular weakness of the EightThirtyTwo CPU: because we have only eight registers, and because many operations involve the tmp register, there’s a good chance that any given instruction will depend on the result of the previous one. One way around this is to use result forwarding, where, for example, a result calculated in the ALU can be sent directly to the ALU’s inputs for the next instruction rather than being first written to and then read back from the register file.

I haven’t yet attempted to implement this for the EightThirtyTwo. Instead, I’ve attempted something far less sane for a lightweight CPU – dual-threading!

The idea behind dual threading is simply that two independent instruction streams are guaranteed not to have dependencies upon each other, so the number of hazards is massively reduced, and total throughput increases, even though each thread is slowed a little. Each stream requires its own fetch logic, its own decoder and its own register file, so needless to say the logic footprint of dual-threaded mode is significantly larger than single-threaded mode, but nowhere near twice the size – more like 50-60% extra. (If this seems like a lot, remember that EightThirtyTwo currently uses logic rather than block RAM for its register file, and we now have two of them.)

The next question to consider is how to handle startup when there are two independent program counters: Do we designate two different addresses as startup/reset vectors, or is there a better way?

After some thought, I decided to extend the solution I’d chosen for interrupts – namely, using the same startup address for both threads, but with different condition codes set. This is similar to how the Posix fork() function works – the two processes diverge based upon the return value of fork().

The condition flags when program flow hits location 0 are now defined as follows:

  • Carry clear, Zero clear => first thread. This can be detected with “cond SGT”
  • Carry set, Zero clear => second thread. This can be detected with “cond SLT”
  • Carry clear, Zero set => interrupt. This can be detected with “cond GE”, provided “Carry clear, Zero clear” has already been tested for, since GE would also match that case.
  • Carry Set, Zero Set => not yet defined.

The startup code in multithreaded mode thus looks like this:

#include "assembler.pp"

	.section .text.startup

	cond	SGT	// Z flag and C flag both clear -> Thread 1
		li	IMW0(PCREL(.start1))
		add	r7

	cond	SLT	// Z flag clear, C flag set -> Thread 2
		li	IMW0(PCREL(.start2))
		add	r7

	cond	GE	// Z flag set (by elimination), C flag clear
		li	IMW0(PCREL(.interrupt))
		add	r7

	// By elimination, Z flag set, C flag set - currently reserved.
	cond NEX
	li	IMW0(PCREL(.spin))
	add	r7


I found that dual thread mode complicated the hazard logic somewhat, and significantly reduced the maximum speed of the design, which prompted me to do some redesigning. The hazard logic is now somewhat less convoluted and more readable, and on my DE2 board I now have a demo dual-threaded project running at 133MHz. I haven’t yet determined the maximum reliable speed in single-threaded mode, but I’m pretty sure the critical path is now the multiplier.

One other improvement I’ve made to the CPU’s design – and this one benefits single-threaded mode too – is that the cond NEX instruction now pauses the CPU instead of merely skipping instructions unconditionally until the next cond or PC-writing instruction. This was primarily to prevent an idle thread from slowing down its partner by spinning needlessly and constantly requesting instruction fetches. Thread 1 will be re-awoken by an interrupt (whether or not the rest of the interrupt logic is enabled) – I haven’t yet implemented re-awakening for thread 2, but I would like to arrange for it to be re-awoken when thread 1 returns from the interrupt handler.

So am I likely to use the processor in dual-threaded configuration in a real project? In all honesty, probably not – the logic footprint is larger than I’d like in dual-thread mode – but I’ve learned enough in the process for it to be well worthwhile as a project.

My next goal for the EightThirtyTwo CPU is to make the VBCC backend complete enough that I can compile and run a Dhrystone benchmark. I only have the haziest idea of what the performance will be, but I’m expecting it to be somewhere between 30 and 50 DMIPS per thread when running from block RAM. [Edit: as of 2019-11-29 I don’t have the benchmark working yet, but looking at the code I’m getting from the C backend, I suspect that estimate is very ambitious. I believe the CPU is capable of performance in that ballpark, but probably only when hand-coded in assembler!]
