Improving the Megadrive / Genesis core


The Megadrive/Genesis core has been plagued from the start with graphical issues that result from the SDRAM controller not responding quickly enough.  Over the last few days I’ve finally put some time into understanding the SDRAM controller used by the project, and into improving its throughput.

The existing SDRAM controller was based on the reference implementation supplied by Peter Wendrich for the Chameleon 64, with ports added to better match the needs of the Megadrive core – but there was quite a lot of dead code not used in this project, making it tricky to work on.  So the first thing I did was to refactor the controller, removing dead code, unused ports, and tidying up the code in general.

We have four access ports from the core, one of which is only used during bootup, so it can be ignored when considering performance.

The ports are as follows:

  • 1 16-bit wide write-only port for writing a ROM image into memory
  • 1 64-bit wide read-only port for reading instructions from a ROM image.  This is 64 bits wide so as to take advantage of burst reads.  When running program code, most accesses will be sequential, so saving three lots of command setup and reading 64 bits in one burst is a big win.
  • 1 16-bit wide read/write port from the CPU, for reading and writing data.
  • 1 16-bit wide read/write port from the chipset.  This port taking too long to respond is what’s causing the graphical issues with the Megadrive core.
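
For reference, here’s roughly what the two ROM-side ports look like at the entity level.  This is a sketch rather than the controller’s actual declaration – the signal names match those used in the mapping code further down, but the entity name, the data buses and the exact widths are assumptions.  The req/ack pair appears to be a toggle handshake (the pending logic below compares req against a registered ack):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sdram_ports_sketch is
port (
	clk       : in std_logic;
	-- ROM write port: 16 bits wide, write-only
	romwr_a   : in unsigned(24 downto 1);
	romwr_d   : in std_logic_vector(15 downto 0);
	romwr_req : in std_logic;   -- toggled to request an access...
	romwr_ack : out std_logic;  -- ...complete when the ack has toggled to match
	-- ROM read port: 64 bits wide, read-only, filled by a four-word burst
	romrd_a   : in unsigned(24 downto 3);
	romrd_q   : out std_logic_vector(63 downto 0);
	romrd_req : in std_logic;
	romrd_ack : out std_logic
);
end entity;
```

The 68K and VRAM ports follow the same pattern as the write port, with a data output added for reads.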

My game plan for improving throughput was to make use of bank interleaving.  The SDRAM we’re working with has four independent banks, so while we’re waiting for data to arrive from one bank it’s perfectly possible to prepare another bank for reading in advance.  This requires the data to be efficiently distributed between banks, which was not the case here: bits of the incoming address were mapped, in order from MSB to LSB, to the SDRAM’s bank bits, then the row bits and finally the column bits.  On the Chameleon’s SDRAM we have 2 bank bits, 13 row bits and 9 column bits – 24 bits in total, addressing 16-million-odd 16-bit words.

That makes 32 megabytes of RAM, split into 8 megabytes per bank, so the vast majority of games will end up in a single bank, and since the 68000’s address space is only 16 megabytes in size, the top two banks will never be accessed.
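
Spelling that arithmetic out (rowAddrBits and colAddrBits are the names used in the code below; bankAddrBits is my own label):

```vhdl
constant bankAddrBits : integer := 2;   -- 4 banks
constant rowAddrBits  : integer := 13;  -- 8192 rows per bank
constant colAddrBits  : integer := 9;   -- 512 columns (16-bit words) per row
-- 2 + 13 + 9 = 24 address bits, so 2**24 = 16,777,216 words of 16 bits:
-- 32 megabytes in total, or 8 megabytes per bank.
```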

To fix this, I adjusted the address mapping so that the bank bits fall between the row and column bits.  This means that the address space is striped across all four banks, in chunks of 1 kilobyte, massively improving the chance of concurrent accesses hitting different banks.
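
Side by side, the two schemes look something like this – a sketch reconstructed from the description above rather than lifted from either version of the source, with a standing for any of the port addresses:

```vhdl
-- Old mapping: bank bits at the top of the address, so a contiguous
-- 16 megabyte image only ever touches the bottom two banks.
bank <= a(colAddrBits+rowAddrBits+2 downto colAddrBits+rowAddrBits+1);
row  <= a(colAddrBits+rowAddrBits downto colAddrBits+1);
col  <= a(colAddrBits downto 1);

-- New mapping: bank bits between the row and column bits, striping the
-- address space across all four banks in 1 kilobyte chunks
-- (512 columns x 2 bytes per word).
bank <= a(colAddrBits+2 downto colAddrBits+1);
row  <= a(colAddrBits+rowAddrBits+2 downto colAddrBits+3);
col  <= a(colAddrBits downto 1);
```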

In order to improve the readability of the code, I’ve created signals within a record (the closest thing in VHDL to a structure in C) and separated the port address mapping from the priority encoding and command dispatching, like so:

	type ramPort_record is record
		ramport : ramPorts;
		bank : unsigned(1 downto 0);
		row : row_t;
		col : col_t;
		udqm : std_logic;
		ldqm : std_logic;
		pending : std_logic;
		burst : std_logic;
		wr : std_logic;
	end record;
	type ramPort_records is array(3 downto 0) of ramPort_record;
	signal ramPort_rec : ramPort_records;
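
The row_t and col_t types aren’t shown here; they’ll just be vectors sized by the controller’s generics, along these lines (an assumption, not copied from the source):

```vhdl
subtype row_t is unsigned(rowAddrBits-1 downto 0);
subtype col_t is unsigned(colAddrBits-1 downto 0);
```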


-- -----------------------------------------------------------------------
-- Create row, column, bank and pending signals for each port

	-- ROM Write port

	ramPort_rec(0).pending<='1' when (romwr_req /= romwr_ackReg) and (currentPort /= PORT_ROMWR) else '0';
	ramPort_rec(0).bank<=romwr_a((colAddrBits+2) downto (colAddrBits+1));
	ramPort_rec(0).row<=romwr_a((colAddrBits+rowAddrBits+2) downto (colAddrBits+3));
	ramPort_rec(0).col<=romwr_a(colAddrBits downto 1);
	-- ROM Read port

	ramPort_rec(1).pending<='1' when (romrd_req /= romrd_ackReg) and (currentPort /= PORT_ROMRD) else '0';
	ramPort_rec(1).bank<=romrd_a((colAddrBits+2) downto (colAddrBits+1));
	ramPort_rec(1).row<=romrd_a((colAddrBits+rowAddrBits+2) downto (colAddrBits+3));
	ramPort_rec(1).col<=romrd_a(colAddrBits downto 3)&"00";	-- Low bits forced to zero, aligning the four-word burst

	-- 68K RAM port

	ramPort_rec(2).pending<='1' when (ram68k_req /= ram68k_ackReg) and (currentPort /= PORT_RAM68K) else '0';
	ramPort_rec(2).bank<=ram68k_a((colAddrBits+2) downto (colAddrBits+1));
	ramPort_rec(2).row<=ram68k_a((colAddrBits+rowAddrBits+2) downto (colAddrBits+3));
	ramPort_rec(2).col<=ram68k_a(colAddrBits downto 1);

	-- VRAM port

	ramPort_rec(3).pending<='1' when (vram_req /= vram_ackReg) and (currentPort /= PORT_VRAM) else '0';
	ramPort_rec(3).bank<=vram_a((colAddrBits+2) downto (colAddrBits+1));
	ramPort_rec(3).row<=vram_a((colAddrBits+rowAddrBits+2) downto (colAddrBits+3));
	ramPort_rec(3).col<=vram_a(colAddrBits downto 1);

Since this is all just combinational logic and signal routing, it should have minimal impact on the controller’s size, if any at all.

Priority encoding is now as simple as this:

	process(ramPort_rec)
	begin
		ramPort_req<='0';
		ramPort_pri<=0; -- Default values set to avoid latches being created.
		for i in 0 to 3 loop
			if ramPort_rec(i).pending='1' then
				ramPort_req<='1';
				ramPort_pri<=i; -- Later iterations override earlier ones, so the highest numbered port wins.
			end if;
		end loop;
	end process;

This gives us a single signal, ramPort_req, which is high when one or more ports require service, and sets ramPort_pri to the highest numbered active port. Finally, we use the result of the priority encoding to multiplex the ports, like so:

	process(ramPort_req, ramPort_pri, ramPort_rec)
	begin
		-- Default values, again to avoid latches being created.
		nextRamState <= RAM_IDLE;
		nextRamPort <= PORT_NONE;
		nextRamBank <= "00";
		nextRamRow <= ( others => '0');
		nextRamCol <= ( others => '0');
		nextLdqm <= '0';
		nextUdqm <= '0';
		nextBurst <= '0';

	if ramPort_req='1' then
		nextRamState <= RAM_READ_1;
		if ramPort_rec(ramPort_pri).wr = '1' then
			nextRamState <= RAM_WRITE_1;
			nextLdqm <= ramPort_rec(ramPort_pri).ldqm;
			nextUdqm <= ramPort_rec(ramPort_pri).udqm;
		end if;				
		nextBurst <= ramPort_rec(ramPort_pri).burst;
		nextRamPort <= ramPort_rec(ramPort_pri).ramport;
		nextRamBank <= ramPort_rec(ramPort_pri).bank;
		nextRamRow <= ramPort_rec(ramPort_pri).row;
		nextRamCol <= ramPort_rec(ramPort_pri).col;
	end if;

end process;

Separating the three distinct aspects of what was previously a monolithic code section should make it easier to work on, especially if I end up adding extra ports. The next* signals (nextRamState, nextRamPort and so on) are used by a state machine to perform the actual fetching, and it’s this state machine where I was able to save some cycles and improve throughput. This I shall describe in detail next time.
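
For a flavour of what the state machine does with them – and this is a hypothetical sketch rather than the actual code, since the currentBank/Row/Col names are my own – the idea is simply that it latches the pre-computed next* values whenever it returns to idle:

```vhdl
process(clk)
begin
	if rising_edge(clk) then
		if ramState = RAM_IDLE then
			-- Pick up everything the dispatch logic prepared for the winning port
			ramState    <= nextRamState;
			currentPort <= nextRamPort;
			currentBank <= nextRamBank;
			currentRow  <= nextRamRow;
			currentCol  <= nextRamCol;
		end if;
	end if;
end process;
```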
