Improving the Megadrive / Genesis core

Part 3: Tweaking the VDP implementation
2018-04-20

In the second part of this series, I increased the throughput of the Megadrive core’s SDRAM controller, which gave nearly but not quite enough extra bandwidth to solve the sprite display problems.  To improve things yet further I need to look at the VDP implementation itself…

The VDP, or Video Display Processor, unsurprisingly handles the Megadrive's video output.  How the real thing is implemented I don't know, but the display portion of the VDP in the FPGA core is implemented as three FIFO queues, one each for the two background layers and one for the sprite layer.  These FIFOs each contain a complete scanline's worth of data, and are filled from within a state machine that requests data from memory.
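
For readers less familiar with how such a buffer looks in hardware, here is a minimal sketch of one scanline buffer as an inferred dual-port RAM, with a write port for the filling state machine and a read port for the display side. The entity, port names and widths are my own inventions for illustration - this is not the core's actual source:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch only: names and widths are assumptions, not the core's.
entity linebuffer is
	port (
		CLK     : in  std_logic;
		WR_EN   : in  std_logic;
		WR_ADDR : in  unsigned(8 downto 0);          -- up to 320 active pixels per line
		WR_DATA : in  std_logic_vector(5 downto 0);  -- one pixel's worth of data
		RD_ADDR : in  unsigned(8 downto 0);
		RD_DATA : out std_logic_vector(5 downto 0)
	);
end entity;

architecture rtl of linebuffer is
	type ram_t is array (0 to 511) of std_logic_vector(5 downto 0);
	signal ram : ram_t;
begin
	process (CLK)
	begin
		if rising_edge(CLK) then
			if WR_EN = '1' then
				ram(to_integer(WR_ADDR)) <= WR_DATA;  -- filled by the fetch state machine
			end if;
			RD_DATA <= ram(to_integer(RD_ADDR));      -- drained by the pixel-merging process
		end if;
	end process;
end architecture;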

The other end of each FIFO is read by another process, which merges the three layers appropriately to create the video signal, and clears the sprite channel’s FIFO behind it as it goes, in preparation for the next scanline.  The implication of this is that if the reading process gets ahead of the writing process, it will emit blank sprite data – which is precisely the symptom we’ve been seeing.
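
The merge itself is essentially a priority multiplexer over the three buffers' outputs, with the sprite layer on top. A very simplified sketch - ignoring the real VDP's per-tile priority bits and shadow/highlight modes, and using signal names of my own - might look like this:

-- Very simplified sketch (my own signal names): the real merge also honours
-- per-tile priority bits and shadow/highlight modes, omitted here.
-- TRANSPARENT stands for palette index 0.
if rising_edge(CLK) then
	if sprite_pix /= TRANSPARENT then
		pixel_out <= sprite_pix;	-- sprites win over both backgrounds
	elsif bga_pix /= TRANSPARENT then
		pixel_out <= bga_pix;		-- then background layer A
	else
		pixel_out <= bgb_pix;		-- background layer B underneath
	end if;
	obj_clear_en   <= '1';			-- clear the sprite entry just read,
	obj_clear_addr <= rd_addr;		-- ready for the next scanline
end if;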

The memory requests from the three channels’ state machines are marshalled by another process which arbitrates using simple priorities: Background layer B has the highest priority, followed by background layer A, then sprite data.  (There is also a second sprite process which handles a different aspect of sprite display, and a DMA process which has even lower priority.)

The way this marshalling happens gives us our first avenue for improving throughput:

Each state machine raises a “sel” signal when it requires data, and the marshalling process then sends requests to the SDRAM controller in priority order, like so:

if rising_edge(CLK) then
	case VMC is
	when VMC_IDLE =>
		vram_u_n_reg <= '0';
		vram_l_n_reg <= '0';
		vram_we_reg <= '0';

		if BGB_SEL = '1' and BGB_DTACK_N = '1' then	-- background B: highest priority
			vram_req_reg <= not vram_req_reg;	-- toggle the request line to the SDRAM controller
			vram_a_reg <= "00" & "1100000" & BGB_VRAM_ADDR;	-- fixed upper bits plus the channel's VRAM address

			VMC <= VMC_BGB_RD1;
		elsif BGA_SEL = '1' and BGA_DTACK_N = '1' then	-- then background A
			vram_req_reg <= not vram_req_reg;
			vram_a_reg <= "00" & "1100000" & BGA_VRAM_ADDR;

			VMC <= VMC_BGA_RD1;
		elsif SP1_SEL = '1' and SP1_DTACK_N = '1' then	-- then sprite data
...

Then, as each response comes in from the SDRAM controller, the marshalling process forwards the data to the requesting channel, asserts its acknowledge signal, and returns to the IDLE state to wait for the next request, like so:

	when VMC_BGB_RD1 =>		-- BACKGROUND B
		if vram_req_reg = vram_ack then
			BGB_VRAM_DO <= vram_q;
			BGB_DTACK_N <= '0';
				
			VMC <= VMC_IDLE;
		end if;
...

The sequence of events thus looks like this:

  • Clock 1: Video channel asks for data
  • Clock 2: Marshalling process passes request for data to SDRAM controller
  • Clock n: SDRAM controller serves data
  • Clock n+1: Marshalling process signals to video channel that data is ready
  • Clock n+2: Video channel can process data, marshalling process can serve another channel

Because the marshalling process is acting as a middle-man, it delays both the initial request and the result by one clock each; if the video channel were talking directly to the SDRAM we could eliminate both Clock 2 and Clock n+1 in the sequence above. We only have one VRAM port on the SDRAM controller, though - and only one cache - so we can't eliminate the marshalling process entirely. We can, however, eliminate the step at Clock n+1, by making each video channel state machine react directly to incoming data, rather than having the marshalling process forward the data. The way we do that is to create an "early_ack" signal for each channel, using combinational logic, like so:

early_ack_bga <= '0' when VMC=VMC_BGA and vram_req_reg=vram_ack else '1';
early_ack_bgb <= '0' when VMC=VMC_BGB and vram_req_reg=vram_ack else '1';
early_ack_sp1 <= '0' when VMC=VMC_SP1 and vram_req_reg=vram_ack else '1';
...

That alone is not sufficient, because the *_VRAM_DO signals are assigned by the marshalling process, so their contents still lag behind the incoming SDRAM data by one clock. To solve this, we multiplex those signals between the live incoming data and registered data, like so:

BGA_VRAM_DO <= vram_q when early_ack_bga='0' and BGA_DTACK_N = '1' else BGA_VRAM_DO_REG;
BGB_VRAM_DO <= vram_q when early_ack_bgb='0' and BGB_DTACK_N = '1' else BGB_VRAM_DO_REG;
SP1_VRAM_DO <= vram_q when early_ack_sp1='0' and SP1_DTACK_N = '1' else SP1_VRAM_DO_REG;
...

The video channel can now receive its data one clock sooner, but we want the marshalling process to be able to dispatch the next request sooner, too. To achieve this, we move the priority encoding into combinational logic and assign the result to a new VMC_NEXT signal:

	VMC_NEXT <= VMC_IDLE;
	if BGB_SEL = '1' and BGB_DTACK_N = '1' and early_ack_bgb='1' then
		VMC_NEXT <= VMC_BGB;
	elsif BGA_SEL = '1' and BGA_DTACK_N = '1' and early_ack_bga='1' then
		VMC_NEXT <= VMC_BGA;
	elsif SP1_SEL = '1' and SP1_DTACK_N = '1' and early_ack_sp1='1' then
		VMC_NEXT <= VMC_SP1;
...

We then assign this to VMC any time there's no active request being served, set the RAM address accordingly, and trigger a new access, like so:

if rising_edge(CLK) then
...
	if vram_req_reg = vram_ack then		-- no request outstanding, or the last one has just completed
		VMC <= VMC_NEXT;
		case VMC_NEXT is
			when VMC_IDLE =>
				null;
			when VMC_BGA =>
				vram_a <= BGA_VRAM_ADDR;
			when VMC_BGB =>
				vram_a <= BGB_VRAM_ADDR;
			when VMC_SP1 =>
				vram_a <= SP1_VRAM_ADDR;
...
		end case;
		if VMC_NEXT /= VMC_IDLE then
			vram_req_reg <= not vram_req_reg;	-- kick off the new access immediately
		end if;
	end if;
...

Any time a request is delayed because another channel is being serviced, it will now be dispatched one clock sooner than it would have been before.

The slowest remaining part of the sprite system was now the sprite channel's state machine, which writes to the sprite FIFO four times for every word of data received. Each of those writes was taking two clocks, and by far the simplest way to speed them up was to move that state machine, and the RAM containing its FIFO, onto the faster clock used by the SDRAM controller - so this small part of the VDP now operates at 108MHz instead of 54MHz.
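
As a rough illustration - the signal names are my own, not the core's - the change amounts to re-clocking the write side of the sprite FIFO, together with its unpacking state machine, from the SDRAM controller's clock:

-- Rough sketch only: the sprite FIFO's write port and the state machine
-- feeding it now run from the 108MHz memory clock.
process (MEMCLK)			-- was the 54MHz CLK
begin
	if rising_edge(MEMCLK) then
		if obj_wr_en = '1' then
			obj_fifo(to_integer(obj_wr_addr)) <= obj_wr_data;
		end if;
		-- ...the four-writes-per-word unpacking state machine,
		-- unchanged apart from the clock it runs from...
	end if;
end process;

Assuming the two clocks are derived from the same PLL - the usual arrangement in a design like this - this is a far simpler clock-domain change than a crossing between unrelated clocks would be.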

These changes still weren't quite enough to solve the glitching issues; by now I was seeing a new glitch that I hadn't come across before, where a thin, irregular stripe of transparent pixels would appear through certain sprites. I finally realised that the sprite FIFO's reading and writing processes were crossing over each other: the sprite channel's write process was being triggered halfway through the display of the preceding scanline, and with the changes made so far it was now fast enough to catch up with and overtake the video beam! This was easily solved by simply waiting a little longer before allowing the sprite process to start.
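
The fix is as simple as it sounds: conceptually it's just a later start condition for the sprite write process. The signal names and the idea of comparing against a horizontal counter are placeholders here, not the core's actual code:

-- Placeholder names: hold the sprite write process back until a later point
-- in the preceding scanline, so it can no longer overtake the video beam
-- on the line it is preparing.
if rising_edge(CLK) then
	if H_CNT = OBJ_FETCH_START then		-- a later horizontal position than before
		obj_fetch_en <= '1';
	end if;
end if;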

The end result should, I hope, be an end to incomplete sprite rendering in the Megadrive core.
