Gotta go fast…

Writing a new SDRAM controller – part 3 – 2021-07-25

In the previous instalments I talked about improving throughput by interleaving read and write transactions to different banks, as well as access patterns I need to avoid and some subtle timing issues which need to be considered.

The challenges remaining to be solved are ensuring that the controller responds in good time to incoming requests on multiple ports, and making the controller run fast enough that it meets timing at the desired speed. For the PC Engine core the desired clock speed is 128MHz.

Typically the SDRAM controllers used in retrocomputing projects will use a state machine which cycles through a number of pre-determined states. Often one of these will launch Active commands, another will send Reads or Writes, and another will latch the incoming data. The controller will generally take a port-oriented view of incoming transactions, priority-encoding the various ports’ request signals, checking that the appropriate bank isn’t in use, and serving the highest priority port whose bank is free.
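In pseudo-Verilog, that traditional approach looks something like this (a heavily simplified sketch with hypothetical state and signal names, not any particular controller):

always @(posedge clk) begin
    case(state)
        st_idle:	// priority-encode the ports, checking bank availability
            if(rom_req && !bank_busy[rom_bank])
                state <= st_active;
            else if(vram_req && !bank_busy[vram_bank])
                state <= st_active;
        st_active: state <= st_readwrite;	// send the Active command
        st_readwrite: state <= st_latch;	// send Read or Write
        st_latch: state <= st_idle;	// latch the incoming data
    endcase
end

Every transaction has to wait for the state machine to come back around to the idle state, which is the throughput limitation a pipelined approach avoids.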

I wanted to do things a little differently.

Instead of a state machine, I wanted to use more of a simple CPU-like pipeline model, using something analogous to hazards and bubbles. Instead of a port-oriented view, I take a bank-oriented view of incoming requests – the result of which is that on every alternate cycle I can begin a transaction to any bank which isn’t currently in use.

There are two advantages to this:

  • Firstly it helps maximise throughput, since a bank isn’t left waiting for a state-machine slot to come round (some controllers mitigate this by taking “shortcuts” through the states, though this adds complexity in keeping track of which requests are in flight and ready to be latched. Specifically, I found it made it very difficult to add delays to avoid the read-followed-by-write issue I talked about previously).
  • Secondly it moves some of the priority encoding away from the most time-critical point – which as we shall see later, is important.

The pipeline has several stages, which I call “RAS”, “CAS”, “Mask” and “Latch”. Some of these are more than one cycle long, but because we only launch transactions on alternate cycles (since we have to obey the SDRAM chip’s minimum time between Active commands) we only need one set of storage registers per stage.

  • In the RAS stage we select a bank, and write the port, the column address and any write data to registers, then set a timeout which will block that bank until it expires.
  • The last cycle of the RAS stage passes this data to the CAS stage.
  • The CAS stage sends a read or write command to the chip. The address isn’t needed any further, but the rest of the transaction data is passed on to the Mask stage.
  • If this was a read transaction, the last cycle of the Mask stage passes the port on to the Latch stage; otherwise the cycle is complete.
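As a rough sketch (with hypothetical register names, and most of the detail omitted), the hand-off between stages looks something like:

always @(posedge clk) begin
    if(ras_lastcycle) begin	// RAS passes transaction data to CAS
        cas_port <= ras_port;
        cas_casaddr <= ras_casaddr;
        cas_wr <= ras_wr;
    end
    if(cas_lastcycle) begin	// CAS passes the remainder to Mask
        mask_port <= cas_port;
        mask_wr <= cas_wr;
    end
    if(mask_lastcycle && !mask_wr)	// reads pass the port to Latch
        latch_port <= mask_port;
end

Because Active commands are at least two cycles apart, no stage ever holds two transactions at once, so a single set of registers per stage suffices.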

In the PC Engine core, none of the ports has an address space wider than 8 megabytes, which means they all fit within a single bank. This makes life somewhat easier – it means we can dedicate one bank each to the two VRAM areas, another to ROM and WRAM, and the fourth to ARAM. (This is roughly in order of how urgently we have to service each port.)
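For illustration, that mapping might be written as constants (names hypothetical):

// One bank per port group, roughly in order of urgency
localparam BANK_VRAM0 = 2'd0;
localparam BANK_VRAM1 = 2'd1;
localparam BANK_ROM_WRAM = 2'd2;
localparam BANK_ARAM = 2'd3;

Since each port’s address space fits within a single bank, the bank number can simply be hard-wired per port rather than taken from the upper address bits.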

The timing-critical paths in a typical retro-core’s SDRAM controller tend to lie between the incoming request signals and the SDRAM address pins. The reasons for this are twofold: firstly, there tends to be some complicated priority encoding in choosing which port will be serviced, which ultimately selects one of several inputs to multiplexers on the address lines. What’s worse, this is only the case for RAS (Active) cycles – for CAS (Read/Write) cycles the address bus must be driven with a different signal entirely. Oh, and there needs to be an initialisation sequence, too, which writes yet other values to the address bus. All this tends to mean there’s a tangle of combinational logic and multiplexers preceding the output registers driving the SDRAM address pins. Simplifying this is vital if we’re going to get anywhere near 128MHz.

Generally the goal is to get transactions underway as soon as possible, so the temptation is to avoid registering the incoming addresses and request signals – but then any delays on those signals contribute directly to the critical paths. To avoid this, I do in fact register these signals. Each bank has its own priority encoder (running in an “always @(posedge clk)” block – which, of course, can be converted to combinational logic should I wish, just by changing this to “always @(*)”). Thus, however many ports we’re servicing, we have a maximum of four requests for the main controller to service, and the per-bank request signals are nice clean, fresh registers with no timing baggage.
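A per-bank encoder along these lines (signal names hypothetical) keeps the arbitration well away from the address pins:

// Priority encoder for the bank shared by ROM and WRAM,
// running in a clocked block so its outputs are clean registers.
always @(posedge clk) begin
    bank2_req <= 1'b0;
    if(!bank2_busy) begin
        if(rom_req) begin	// ROM takes priority over WRAM
            bank2_req <= 1'b1;
            bank2_port <= PORT_ROM;
            bank2_addr <= rom_addr;
        end else if(wram_req) begin
            bank2_req <= 1'b1;
            bank2_port <= PORT_WRAM;
            bank2_addr <= wram_addr;
        end
    end
end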

The address passed to the SDRAM address pins in the RAS phase has to be selected and written as quickly as possible – but this isn’t the case for either the initialisation or CAS phases. In both those cases we know several cycles in advance what value will have to be written, which eases timing: we can simply write the value a cycle in advance to a holding register, and write that register’s contents as a default value to the address pins any time we’re not in the RAS phase.

In other words, instead of doing what amounts to:

case(state)
    ras: sdram_addr <= <complicated port-oriented priority encoding>;
    cas: sdram_addr <= casaddr;
    init_precharge_all: sdram_addr <= <bit 10 high>;
    init_set_mode: sdram_addr <= <mode value>;
endcase

… or worse still:

if(init)
    sdram_addr <= <whatever the init logic currently wants to write>;
else
    case(state)
        ras: sdram_addr <= <complicated port-oriented priority encoding>;
        cas: sdram_addr <= casaddr;
    endcase

We’re now doing:

sdram_addr <= next_a;
if(ras)
    sdram_addr <= <somewhat simpler bank-oriented priority encoding>;

if(cas_coming_up_soon)
    next_a <= casaddr;
else if(init_set_mode_coming_soon)
    next_a <= <mode value>;
else
    next_a <= <bit 10 high>;

Obviously these are massively over-simplified for illustration purposes, but note that much of the complexity has moved away from the critical sdram_addr signals, into much less critical registers.

Speaking of critical registers, the Cyclone series chips offer “Fast IO Registers” which are useful when sending signals off-chip. Basically this means the registers are physically very close to the pin itself, so any internal routing delay is minimised. There are limitations, however – there can’t be any combinational logic between the register and the pin, so for outputs we can’t do anything like:

assign sdram_addr = (state == ras) ? ras_addr : cas_addr;

since this would insert multiplexers between the ras_addr and cas_addr signals and the sdram_addr pins.
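To keep the Fast Output Register, the multiplexing has to move to the register’s input side, so the pin is driven directly from a register – something like:

// The mux now feeds the register's input, so sdram_addr itself
// can be packed into the pin's Fast Output Register.
always @(posedge clk) begin
    if(state == ras)
        sdram_addr <= ras_addr;
    else
        sdram_addr <= cas_addr;
end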

Likewise, for inputs it’s tempting to do something like this with incoming data:

always @(posedge clk) begin
    case(latch_port)
        PORT_ROM: rom_q <= SDRAM_DATA;
        PORT_VRAM: vram_q <= SDRAM_DATA;
        ...
    endcase
end

Again, this inserts multiplexers between the incoming data lines and the registers belonging to each port.

In both cases there’s nothing “illegal”, per se, about doing this, but it prevents the use of Fast I/O Registers, so we incur extra delays on the IO pins – which may or may not cause a problem.

If we want to take advantage of Fast Inputs we have to latch incoming data into a holding register, and then write its contents to the various ports a cycle later. It is possible, however, to avoid incurring that extra cycle’s delay, by placing a multiplexer on each port’s output, like so:

always @(posedge clk) begin
    sdram_data_reg <= SDRAM_DATA;

    case(latch_port)
        PORT_ROM: rom_q_reg <= sdram_data_reg;
        PORT_VRAM: vram_q_reg <= sdram_data_reg;
        ...
    endcase
end

assign rom_q = (latch_port == PORT_ROM) ? sdram_data_reg : rom_q_reg;
assign vram_q = (latch_port == PORT_VRAM) ? sdram_data_reg : vram_q_reg;
...

There’s one more piece of the puzzle I haven’t talked about yet, and that’s refresh – I will cover this next time.
