Experimenting with TG68

Part 12a – a better cache

The TG68MiniSOC project has so far used a very simple cache for the CPU.  The SDRAM controller is set up to use four-word bursts, so the cache simply stores each complete burst.  Under this scheme, when data is read sequentially from RAM, only one read in four will need to wait for the SDRAM controller.  This is the simplest possible example of a Direct Mapped cache, with just a single cacheline.

There’s plenty of scope for improvement, however: a very common scenario is for a piece of code to run repeatedly in a tight loop, so a cache large enough to contain the entire loop can make a massive difference to performance.

The easiest way to make the cache larger would be to add more cachelines; however, this naive approach has a number of disadvantages.

The simple single-cacheline implementation stores the cached address and data in registers.  This works fine for a single cacheline, but as you add more, the amount of resources used for storage increases quite rapidly.

When there are multiple cachelines there has to be a method of determining which line holds which data.  The simplest approach is to use a few bits of the address to determine which cacheline should be used – so if we had, say, 16 cachelines available, we’d perhaps use the lowest 4 bits of the address to select a cacheline.  Like the initial simple implementation, this would be a Direct Mapped cache.  The disadvantage is that if the working set contains two memory locations which are each mapped to the same cacheline, the cache will frequently flush one to make room for the other.
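The thrashing problem can be seen in a few lines of Python.  This is purely an illustrative model (the class and its 16-line geometry are invented for the example, not taken from the project’s HDL): two addresses whose low bits match fight over a single slot and evict each other on every access.

```python
# Hypothetical behavioural model of a direct-mapped cache with 16 lines,
# indexed by the low 4 bits of the line address.
NUM_LINES = 16

class DirectMappedCache:
    def __init__(self):
        # Each slot holds the tag of the line it currently caches, or None.
        self.tags = [None] * NUM_LINES

    def access(self, line_addr):
        """Return True on a hit; on a miss, evict whatever was there."""
        index = line_addr % NUM_LINES   # low 4 bits select the cacheline
        tag = line_addr // NUM_LINES    # remaining bits identify the data
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag          # miss: the old entry is flushed
        return False

# 0x10 and 0x20 both map to slot 0, so alternating between them
# never hits, no matter how many lines the cache has in total.
cache = DirectMappedCache()
hits = [cache.access(a) for a in (0x10, 0x20, 0x10, 0x20)]
```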

Alternatively, we could give the cache complete freedom to allocate cachelines, based on, for example, which cacheline was least recently used.  This would be a Fully Associative cache, and the logic required to achieve this would be pretty complicated.

Generally a compromise somewhere between the two approaches is used, and for the TG68MiniSOC project I’m going to use a 2-way set-associative cache.  This is similar to the Direct Mapped approach, except that each memory location can be cached in either of a pair of cachelines, rather than in one predetermined cacheline.  We track which of each pair of cachelines was most recently used, and when another address wants to use the cacheline, we evict the Least Recently Used entry.
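Again as a behavioural sketch in Python (not the project’s HDL – the class name and geometry are made up for illustration), the lookup-and-evict policy looks like this.  Each index owns a pair of ways, and on a miss the way that was not most recently used is the victim:

```python
# Hypothetical model of a 2-way set-associative cache with LRU eviction.
NUM_SETS = 64   # matches the 64 sets per way described below

class TwoWayCache:
    def __init__(self):
        # For each set: the tags held by the two ways, and which way
        # was used most recently (so the other one is the LRU victim).
        self.tags = [[None, None] for _ in range(NUM_SETS)]
        self.mru = [0] * NUM_SETS

    def access(self, line_addr):
        """Return True on a hit; on a miss, evict the LRU way."""
        index = line_addr % NUM_SETS
        tag = line_addr // NUM_SETS
        ways = self.tags[index]
        for way in (0, 1):
            if ways[way] == tag:
                self.mru[index] = way
                return True
        victim = 1 - self.mru[index]    # evict the least recently used way
        ways[victim] = tag
        self.mru[index] = victim
        return False
```

With this policy, the two conflicting addresses from the direct-mapped scenario simply occupy the two ways of the same set and both hit from the second pass onwards; a third conflicting address is needed before anything gets evicted.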

The fastest design for a cache would have everything stored in registers.  This takes up too much space, however, so we have to employ BlockRAM.  On the Cyclone III we have M9K blocks available, which, as the name suggests, are 9kbit in size.  Since our bursts, and thus cachelines, are 4 words (64 bits) long, a single M9K can hold 128 cachelines, which we will treat as two overlapping sets of 64 cachelines.  Since the CPU reads data in 16-bit chunks, we’ll store each cacheline as four individually addressable 16-bit words.

We also need to store metadata about the cachelines (the address they pertain to, and the Least Recently Used flag) – otherwise known as the tag – so we’ll use a second M9K for that.  To make sure we have room for the tag, the least-recently-used flag and a valid flag, we’ll make the BlockRAMs 18 bits wide rather than 16; the fact that the BlockRAMs are 9K, not 8K, means we can do this without using extra resources.

So how do we decide which cacheline to use for a given address?  Again, we’ll take the simplest possible route, and break the address down as follows:

  • Bit 0 is irrelevant because we’re working in 16-bit words, and the CPU addresses bytes by way of the UDS/LDS signals.
  • Bits 2:1 specify which of the four words of a cacheline we’re interested in.
  • Bits 8:3 specify the six-bit address within the BlockRAM that will hold this cacheline.  The two cachelines we can choose between will have addresses {1’b0,addr[8:3]} and {1’b1,addr[8:3]} respectively.
  • Bits 25:9 have to be stored in the tag, which, it turns out, is no problem, since we have 18-bit wide words.  The highest bit will be used as a “most recently used” flag.  Using this scheme we can support an address space 64 megabytes in size.
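The field extraction above can be restated in Python as a quick reference (the function names are invented for illustration; in the actual HDL these are just bit-slices of the address bus):

```python
def split_address(addr):
    """Break a 26-bit byte address into the fields described above."""
    word = (addr >> 1) & 0b11          # bits 2:1 -> word within the line
    index = (addr >> 3) & 0b111111     # bits 8:3 -> set within the BlockRAM
    tag = (addr >> 9) & 0x1FFFF        # bits 25:9 -> stored in the tag RAM
    return word, index, tag

def candidate_lines(index):
    """The two BlockRAM addresses a given index may occupy:
    {1'b0, index} and {1'b1, index}."""
    return index, 0b1000000 | index
```

Note that bit 0 never appears: the cache works in 16-bit words, with byte selection handled by the CPU’s UDS/LDS signals.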

A first stab at the 2-way cache is in the repo, but needs some bug fixing – most importantly support for writes, and dealing with byte-writes.

To be continued…
