Part 2: Saving cycles
In the first part of this series I covered some basic tidy-ups to the code to make it easier to maintain. Now I’ll look at how we can speed things up.
The first avenue for speeding things up lies in this code:
    process(clk)
    begin
      if rising_edge(clk) then
        sd_data <= (others => 'Z');
        if sd_data_ena = '1' then
          sd_data <= sd_data_reg;
        end if;
        sd_addr <= sd_addr_reg;
        sd_ras_n <= sd_ras_n_reg;
        sd_cas_n <= sd_cas_n_reg;
        sd_we_n <= sd_we_n_reg;
        sd_ba_0 <= sd_ba_0_reg;
        sd_ba_1 <= sd_ba_1_reg;
        sd_ldqm <= sd_ldqm_reg;
        sd_udqm <= sd_udqm_reg;
      end if;
    end process;
This code copies internal versions of the SDRAM control signals to the SDRAM chip on a clock edge. This is fine, except that all the signals in question are already assigned on a clock edge within the state machine, so we’re effectively delaying them all by an extra clock. If we were running the SDRAM chip at a very high speed, and if these signals were derived from long combinational logic chains, the extra register stage would be worth having for stability, but in the Megadrive core it shouldn’t be necessary, so we’ll just comment out the “if rising_edge(clk) then” and its associated “end if;”. The only other thing we have to do is reduce by one the number of clocks the state machine waits after asserting _CAS, since the data will now arrive from the SDRAM chip one clock sooner.
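As a rough illustration of the same change, the output stage can also be written as concurrent assignments rather than a process with the edge test commented out. This is only a sketch: the entity name sdram_pins is hypothetical, and the 16-bit data and 13-bit address widths are assumptions, since the post does not give them. In the real core these assignments would sit inside the SDRAM controller's architecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper entity, for illustration only.
-- Bus widths (16-bit data, 13-bit address) are assumptions.
entity sdram_pins is
  port (
    -- Registered signals produced by the state machine
    sd_data_reg  : in    std_logic_vector(15 downto 0);
    sd_data_ena  : in    std_logic;
    sd_addr_reg  : in    std_logic_vector(12 downto 0);
    sd_ras_n_reg, sd_cas_n_reg, sd_we_n_reg : in std_logic;
    sd_ba_0_reg, sd_ba_1_reg                : in std_logic;
    sd_ldqm_reg, sd_udqm_reg                : in std_logic;
    -- Pins going to the SDRAM chip
    sd_data      : inout std_logic_vector(15 downto 0);
    sd_addr      : out   std_logic_vector(12 downto 0);
    sd_ras_n, sd_cas_n, sd_we_n : out std_logic;
    sd_ba_0, sd_ba_1            : out std_logic;
    sd_ldqm, sd_udqm            : out std_logic
  );
end entity;

architecture rtl of sdram_pins is
begin
  -- Drive the data bus only while writing; otherwise leave it tri-stated.
  sd_data  <= sd_data_reg when sd_data_ena = '1' else (others => 'Z');

  -- No extra register stage: the *_reg signals are already clocked
  -- inside the state machine, so they pass straight through.
  sd_addr  <= sd_addr_reg;
  sd_ras_n <= sd_ras_n_reg;
  sd_cas_n <= sd_cas_n_reg;
  sd_we_n  <= sd_we_n_reg;
  sd_ba_0  <= sd_ba_0_reg;
  sd_ba_1  <= sd_ba_1_reg;
  sd_ldqm  <= sd_ldqm_reg;
  sd_udqm  <= sd_udqm_reg;
end architecture;
```

One reason to prefer this form is that a process sensitive only to clk is re-evaluated in simulation only on clock events, whereas concurrent assignments follow their source signals directly.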
So that’s the low-hanging fruit – where else can we improve throughput?
The SDRAM chip is being clocked at a relatively low rate of 108MHz in this core, so a little over 9ns per cycle. Looking at the datasheets for the SDRAM chips used in the MIST and Chameleon 64 boards, the Active to Read delay (tRCD) and the Precharge delay (tRP) are 18ns or less, which means two clocks will be sufficient, while the SDRAM controller is currently allowing three. Three clocks would be necessary if the chip were being clocked faster, and they are also necessary for the chip used on the DE2 board, which has a minimum time of 20ns for both those parameters. So, in the interests of trimming as much time as possible, I’ve made the delays configurable on a per-board basis.
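The cycle counts above come from rounding each datasheet delay up to a whole number of clock periods. The sketch below works that out at elaboration time. The package sdram_timing_pkg, the function cycles_for, the entity sdram_timing_demo and the use of generics are all assumptions made for illustration; the post does not say how the per-board configuration is implemented. Only the names rasCasTiming and prechargeTiming are borrowed from the core's code.

```vhdl
-- Hypothetical helper package: converts a delay in picoseconds
-- into the number of whole clock cycles needed to cover it.
package sdram_timing_pkg is
  function cycles_for(t_ps : natural; clk_period_ps : positive) return natural;
end package;

package body sdram_timing_pkg is
  function cycles_for(t_ps : natural; clk_period_ps : positive) return natural is
  begin
    -- Integer ceiling division: round up to the next full cycle.
    return (t_ps + clk_period_ps - 1) / clk_period_ps;
  end function;
end package body;

use work.sdram_timing_pkg.all;

-- Hypothetical demo entity; the generics stand in for per-board settings.
entity sdram_timing_demo is
  generic (
    tRCD_ps       : natural  := 18000; -- Active to Read delay, 18 ns
    tRP_ps        : natural  := 18000; -- Precharge delay, 18 ns
    clk_period_ps : positive := 9259   -- 1 / 108 MHz, rounded down to whole ps
  );
end entity;

architecture sim of sdram_timing_demo is
  constant rasCasTiming    : natural := cycles_for(tRCD_ps, clk_period_ps);
  constant prechargeTiming : natural := cycles_for(tRP_ps, clk_period_ps);
begin
  -- Report the derived cycle counts once at the start of simulation.
  assert false
    report "rasCasTiming = " & integer'image(rasCasTiming) &
           ", prechargeTiming = " & integer'image(prechargeTiming)
    severity note;
end architecture;
```

With the defaults this yields 2 cycles for each delay; overriding both generics with 20000, as the DE2 figures would require, yields 3. Rounding the clock period down keeps the result on the conservative side.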
The next avenue for improving performance is to exploit bank interleaving. The SDRAM chips are split into four distinct banks which are able to operate more-or-less independently, so if requests arrive on different ports simultaneously but need to access different banks, it’s possible to do some setup for the next access in advance, precharging the previously open row and opening the next row on one bank while a read from another bank is in progress.
I’ve added a signal called preselectBank, which the state machine sets to high any time it knows the bus will be clear for the next cycle, and can thus accept a setup command for the next bank. Setting up the next bank is as easy as this:
    if preselectBankPause /= 0 then
      preselectBankPause <= preselectBankPause - 1;
    end if;

    if preselectBank='1' and preselectBankPause=0 then
      if nextRamState /= RAM_IDLE and (currentBank /= nextRamBank or ramAlmostDone='1') then
        -- Do we need to close a row first?
        if banks(to_integer(nextRamBank)).rowopen='1' and banks(to_integer(nextRamBank)).row /= nextRamRow then
          -- Wrong row active in bank, do precharge to close the row
          sd_we_n_reg <= '0';
          sd_ras_n_reg <= '0';
          sd_ba_0_reg <= nextRamBank(0);
          sd_ba_1_reg <= nextRamBank(1);
          banks(to_integer(nextRamBank)).rowopen <= '0';
          -- Ensure a gap of at least one clock between preselection commands
          preselectBankPause<=prechargeTiming-1;
        elsif banks(to_integer(nextRamBank)).rowopen='0' then
          -- Open the next row
          sd_addr_reg <= nextRamRow;
          sd_ras_n_reg <= '0';
          sd_ba_0_reg <= nextRamBank(0);
          sd_ba_1_reg <= nextRamBank(1);
          banks(to_integer(nextRamBank)).row <= nextRamRow;
          banks(to_integer(nextRamBank)).rowopen <= '1';
          -- Ensure a gap of at least one clock between this and next command
          preselectBankPause<=rasCasTiming-1;
        end if;
      end if;
    end if;
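The snippet above indexes an array called banks whose elements have rowopen and row fields, but its declaration isn't shown. A minimal guess at what such a declaration could look like follows. The package name, the type names, the reset constant and the 13-bit row width are all assumptions, not taken from the core.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package holding the per-bank bookkeeping types.
package sdram_bank_pkg is
  -- Assumed row address width; the real value depends on the SDRAM chip.
  constant ROW_BITS : positive := 13;

  -- One entry per bank: is a row currently open, and if so which one?
  type bank_state_t is record
    rowopen : std_logic;
    row     : std_logic_vector(ROW_BITS-1 downto 0);
  end record;

  -- The SDRAM has four banks, selected by the two bank address bits.
  type bank_array_t is array (0 to 3) of bank_state_t;

  -- All rows closed, e.g. after reset or a precharge-all command.
  constant BANKS_RESET : bank_array_t :=
    (others => (rowopen => '0', row => (others => '0')));
end package;
```

A controller could then declare "signal banks : bank_array_t := BANKS_RESET;" and index it with to_integer(nextRamBank) as the snippet above does, provided nextRamBank is a two-bit unsigned value.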
This, combined with a couple of other timing tweaks, such as dispatching the precharge and active commands directly from the idle state instead of from the first state in the read process, takes the response time about halfway towards where it needs to be to solve the glitching problems.
Once these relatively easy tweaks are done, the only way to improve throughput further is to introduce some caching. There are a number of ways this could be done, but the easiest for me was just to integrate the two-way cache from my TG68MiniSOC and ZPUDemos projects. The only difficulty here is that the cache is set up for 8-word bursts, and the SDRAM controller currently runs in 4-word burst mode, so I’ve changed the controller to use 8-word bursts, but to terminate the bursts early for any port other than the VRAM port. I’m not yet certain whether performance would be better if the core used 4-word bursts and cachelines instead, but for now it works pretty well.
The core is still not completely glitch-free, but it’s to the point where all important in-game elements are visible, making some games playable that really weren’t beforehand. Once I’ve tested the core a little more on both platforms, I’ll make binaries available for download.
Excellent news! How about synthesis constraints? Are you getting stable outputs from the synthesizer or are you resorting to ugly tricks to get working core files?
The music from JT12 will only sound correct if the synthesis is good. I stopped working on JT12 because the original project was not well specified and it often resulted in the Z80 timing going bad, which affected the interface with JT12. If you have a stable tinder I’d like to get a copy so I can finish my work on JT12.
I’ll try your core files when I get home!
Thanks for your efforts
Nothing has changed with regard to project timing, so I’m still getting builds with bad timing between the clocks. I will take a look at that when time allows. My gut feeling is that the problem is due to phase relationships between the different sub-clocks in the system, which might well be different each build – but I will investigate further.