Efficient sign extension

The EightThirtyTwo ISA – Part 14 – 2020-01-17

Here’s an optmisation I should have spotted much sooner: sign extension.

In the C programming language char and short types are converted to integers before being subjected to mathematical operations – so called “integer promotion”. There are circumstances where this can be avoided – and vbcc has a hook for precisely this purpose, allowing CPUs which have byte- or word-oriented versions of their arithmetic operations to avoid promotion. EightThirtyTwo doesn’t provide this luxury.

In theory there’s no reason why I couldn’t extend the byt and hlf modifiers to apply to ALU operations as well as just loads and stores – but I don’t want to increase the CPU’s logic footprint at this stage – and it’s not always possible to evade the need for promotion anyway – so we need a way to implement it efficiently.

If the value being promoted is an unsigned char or unsigned short then things are easy; if the value has just been loaded from memory into a register then the upper bits of the register will have been zeroed automatically, and we don’t need to do anything further.

The difficulty lies with signed chars and signed shorts, since the highest bit of the value being promoted must be duplicated throughout the upper bits of the register.

My first attempt at achieving this was very simple:

    li 24
    shl r0
    sgn
    shr r0

a mere four bytes and it works just fine – it shifts an 8 bit value 24 bits to the left, aligning the most significant bit of the original value with that of the register. Then it shifts the same amount in the opposite direction, having first set the ‘sgn’ modifier, which modifies the ‘shr’ instruction to be ‘arithmetic shift right’ rather than ‘logical shift right’.

The problem with this is that because my shifter only shifts one bit per cycle, it’s very slow!

My next attempt made use of the multiply instruction. Because I’m using an FPGA I have embedded multipliers at my disposal so there’s no reason not to have a nice fast multiply instruction. Using multiplies, the same algorithm looks like this:

    li IMW4(0x01000000)
    li IMW3(0x01000000)
    li IMW2(0x01000000)
    li IMW1(0x01000000)
    li IMW0(0x01000000)
    mul r0
    li IMW1(0x100)
    li IMW1(0x100)
    sgn
    mul r0
    mr r0

Eleven bytes, but much faster. (The mul instruction performs 32 bit x 32 bit -> 64 bit multiply. The upper 32 bits end up in the tmp register, so after the second multiply the sign-extended value is in tmp.)

When code includes many conversions, however, building all those 25-bit immediate values 6 bits at a time with ‘li’ adds up and makes the code larger than it needs to be.

I was hit by a flash of inspiration a couple of days ago and found a much better way to do this which is faster still than the ‘mul’ method while being no larger than the version using shifts:

First, add 0xffffff80 to the original signed char value. If the original value is negative, then this operation will overflow. I was expecting at this point to detect the overflow with ‘cond LE’ and perform a different final manipulation for signed and unsigned values. As it happens, it’s simpler than that:

For negative values, the addition leaves the upper 25 bits clear, but for positive values it leaves them set. This is the exact opposite of what we want – so in both cases we just need to flip the upper 25 bits; we thus xor with 0xffffff80. Very conveniently, this is the same value used in the first step, so it’s already in tmp!

    li IMW1(0xffffff80)
    li IMW0(0xffffff80)
    add r0
    xor r0

Just four bytes and much, much faster than the version that uses shifts. I really should have spotted that sooner!

So case closed? Well… no – nothing’s ever quite that simple!

Both the shift and the multiply methods of sign extension have the beneficial side-effect of overwriting the upper bits regardless of their existing contents, whereas the addition/xor trick only works if the upper bits are all zero; moving to the addition/xor method revealed another bug in my code generator – I was failing to mask off those upper bits when converting wide values to a narrower type. Having addressed that issue, the new sign-extension code seems to be working correctly, and the Dhrystone score at 133MHz is now up to 12.723 DMIPS.

Retro Ramblings

Musings on FPGA and Retro Computing

Leave a Reply Cancel reply