<h1>Improving the Megadrive / Genesis core</h1>
<p><strong>Part 3: Tweaking the VDP implementation</strong><br />
<strong>2018-04-20</strong></p>
<p>In the second part of this series, I increased the throughput of the Megadrive core's SDRAM controller, which gave nearly, but not quite, enough extra bandwidth to solve the sprite display problems. To improve things further I need to look at the VDP implementation itself&#8230;</p>
<p>The VDP, or Video Display Processor, unsurprisingly handles the Megadrive's video output. How the real chip is implemented I don't know, but the display portion of the VDP in the FPGA core is implemented as three FIFO queues: one for each of the two background layers, and one for the sprite layer. Each FIFO contains a complete scanline's worth of data, and is filled by a state machine that requests data from memory.</p>
<p>The other end of each FIFO is read by another process, which merges the three layers appropriately to create the video signal, clearing the sprite channel's FIFO behind it as it goes, in preparation for the next scanline. The implication is that if the reading process gets ahead of the writing process, it will emit blank sprite data &#8211; which is precisely the symptom we've been seeing.</p>
<p>The memory requests from the three channels' state machines are marshalled by another process which arbitrates using simple priorities: background layer B has the highest priority, followed by background layer A, then sprite data. (There is
also a second sprite process which handles a different aspect of sprite display, and a DMA process with even lower priority.)</p>
<p>The way this marshalling happens gives us our first avenue for improving throughput.</p>
<p>Each state machine raises a "sel" signal when it requires data, and the marshalling process then sends requests to the SDRAM controller in priority order, like so:</p>
<pre>
if rising_edge(CLK) then
	case VMC is
	when VMC_IDLE =&gt;
		vram_u_n_reg &lt;= '0';
		vram_l_n_reg &lt;= '0';
		vram_we_reg &lt;= '0';

		if BGB_SEL = '1' and BGB_DTACK_N = '1' then
			vram_req_reg &lt;= not vram_req_reg;
			vram_a_reg &lt;= "00" &amp; "1100000" &amp; BGB_VRAM_ADDR;

			VMC &lt;= VMC_BGB_RD1;
		elsif BGA_SEL = '1' and BGA_DTACK_N = '1' then
			vram_req_reg &lt;= not vram_req_reg;
			vram_a_reg &lt;= "00" &amp; "1100000" &amp; BGA_VRAM_ADDR;

			VMC &lt;= VMC_BGA_RD1;
		elsif SP1_SEL = '1' and SP1_DTACK_N = '1' then
...
</pre>
<p>As each response comes in from the SDRAM controller, the marshalling process asserts an acknowledge signal and returns to the IDLE state to await the next request, like so:</p>
<pre>
	when VMC_BGB_RD1 =&gt;		-- BACKGROUND B
		if vram_req_reg = vram_ack then
			BGB_VRAM_DO &lt;= vram_q;
			BGB_DTACK_N &lt;= '0';

			VMC &lt;= VMC_IDLE;
		end if;
...
</pre>
<p>The sequence of events thus looks like this:</p>
<ul>
<li>Clock 1: video channel asks for data</li>
<li>Clock 2: marshalling process passes the request to the SDRAM controller</li>
<li>Clock n: SDRAM controller serves the data</li>
<li>Clock n+1: marshalling process signals to the video channel that the data is ready</li>
<li>Clock n+2: video channel can process the data; marshalling process can serve another channel</li>
</ul>
<p>Because the marshalling process is acting as a middle-man, it delays both the initial request and
the result by one clock each; if the video channel were talking directly to the SDRAM we could eliminate both Clock 2 and Clock n+1 in the sequence above. We only have one VRAM port on the SDRAM controller, though &#8211; and only one cache &#8211; so we can't eliminate the marshalling process entirely. We can, however, eliminate the step at Clock n+1 by making each video channel state machine react directly to incoming data, rather than having the marshalling process forward it. To do that, we create an "early_ack" signal for each channel using combinational logic, like so:</p>
<pre>
early_ack_bga &lt;= '0' when VMC=VMC_BGA and vram_req_reg=vram_ack else '1';
early_ack_bgb &lt;= '0' when VMC=VMC_BGB and vram_req_reg=vram_ack else '1';
early_ack_sp1 &lt;= '0' when VMC=VMC_SP1 and vram_req_reg=vram_ack else '1';
...
</pre>
<p>That alone is not sufficient, because the *_VRAM_DO signals are assigned by the marshalling process, so their contents still lag behind the incoming SDRAM data by one clock.
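<p>As an aside, the request/acknowledge handshake seen in these snippets is a toggle scheme: the requester flips vram_req_reg to start an access, and the controller echoes the new value back on vram_ack when the access is complete, so "request pending" is simply vram_req_reg /= vram_ack. A minimal sketch of how the two sides interact &#8211; the controller-side code below is my own illustration of the scheme, not the core's actual SDRAM controller, and sdram_data_valid is a hypothetical "data ready" condition:</p>
<pre>
-- Requester side: toggle req to start an access, then wait.
vram_req_reg &lt;= not vram_req_reg;
-- ...the access is known to be finished once:
--   vram_req_reg = vram_ack

-- Controller side (sketch): when the fetched data is valid,
-- echo the request toggle back as the acknowledge.
if rising_edge(CLK) then
	if sdram_data_valid = '1' then	-- hypothetical completion condition
		vram_ack &lt;= vram_req;
	end if;
end if;
</pre>
<p>The advantage of a toggle handshake over a pulsed one is that it crosses clock domains and survives arbitrary latency without either side needing to count cycles.</p>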
To solve this, we multiplex those signals between the live incoming data and the registered data, like so:</p>
<pre>
BGA_VRAM_DO &lt;= vram_q when early_ack_bga='0' and BGA_DTACK_N = '1' else BGA_VRAM_DO_REG;
BGB_VRAM_DO &lt;= vram_q when early_ack_bgb='0' and BGB_DTACK_N = '1' else BGB_VRAM_DO_REG;
SP1_VRAM_DO &lt;= vram_q when early_ack_sp1='0' and SP1_DTACK_N = '1' else SP1_VRAM_DO_REG;
...
</pre>
<p>The video channel can now see its data one clock sooner, but we want the marshalling process to be able to dispatch the next request sooner, too. To achieve this, we move the priority encoding into combinational logic and assign the result to a new VMC_NEXT signal:</p>
<pre>
	VMC_NEXT &lt;= VMC_IDLE;
	if BGB_SEL = '1' and BGB_DTACK_N = '1' and early_ack_bgb = '1' then
		VMC_NEXT &lt;= VMC_BGB;
	elsif BGA_SEL = '1' and BGA_DTACK_N = '1' and early_ack_bga = '1' then
		VMC_NEXT &lt;= VMC_BGA;
	elsif SP1_SEL = '1' and SP1_DTACK_N = '1' and early_ack_sp1 = '1' then
		VMC_NEXT &lt;= VMC_SP1;
...
</pre>
<p>We then assign this to VMC any time there's no active request being served, set the RAM address accordingly, and trigger a new access, like so:</p>
<pre>
if rising_edge(CLK) then
...
	if vram_req_reg = vram_ack then
		VMC &lt;= VMC_NEXT;
		case VMC_NEXT is
			when VMC_IDLE =&gt;
				null;
			when VMC_BGA =&gt;
				vram_a &lt;= BGA_VRAM_ADDR;
			when VMC_BGB =&gt;
				vram_a &lt;= BGB_VRAM_ADDR;
			when VMC_SP1 =&gt;
				vram_a &lt;= SP1_VRAM_ADDR;
...
		end case;
		if VMC_NEXT /= VMC_IDLE then
			vram_req_reg &lt;= not vram_req_reg;
		end if;
	end if;
...
</pre>
<p>Any time a request is delayed because another channel is being serviced, it will now be dispatched one clock sooner than before.</p>
<p>The slowest remaining part of the sprite system was now the sprite channel's state machine, which writes four times to the sprite
FIFO for every word of data received. Each of those writes was taking two clocks, and by far the simplest way to speed that up was to move that state machine, and the RAM containing its FIFO, onto the faster clock used by the SDRAM controller &#8211; so this small part of the VDP now operates at 108MHz instead of 54MHz.</p>
<p>These changes still weren't quite enough to solve the glitching issues; by now I was seeing a new glitch I hadn't come across before, where a thin, irregular stripe of transparent pixels appeared through certain sprites. I finally realised this was due to the sprite FIFO's reading and writing processes crossing over each other. The sprite channel's write process was being triggered halfway through the display of the preceding scanline, and with the changes I'd made so far, the writing process was now fast enough to catch up with and overtake the video beam! This was easily solved by simply waiting a little later before allowing the sprite process to start.</p>
<p>The end result should, I hope, be an end to incomplete sprite rendering in the Megadrive core.</p>
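<p>For the curious, that final fix amounts to gating the sprite channel's start on the horizontal position, so the write process can never again be launched early enough to overtake the beam. A minimal sketch of the idea &#8211; the signal names H_CNT, SPRITE_START_POS and sp1_start are hypothetical, and the real core's counters and timing differ:</p>
<pre>
-- Start filling the sprite FIFO for the next scanline only once the
-- video beam is far enough through the current one that the (now
-- faster) write process cannot overtake the read process.
if rising_edge(CLK) then
	if H_CNT = SPRITE_START_POS then	-- hypothetical counter / threshold
		sp1_start &lt;= '1';
	elsif H_CNT = 0 then
		sp1_start &lt;= '0';
	end if;
end if;
</pre>
<p>Moving the threshold later trades away some of the write process's head start, which the 108MHz clock more than pays back.</p>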