A Trip to the Warehouse
In my article about GDDR5X, I already went through the basic architecture of a DRAM, but it's worthwhile to go through it again, and add a metaphor to make it clearer.
Imagine you have a very large warehouse. It has 16 rooms. Each room has multiple aisles. Each aisle has lots of buckets in which products can be stored. The rooms, the aisles, and the buckets are nicely numbered.
There are conveyor belts in each aisle that can run both ways: they transport products from the aisle to the loading dock, or vice versa.
The warehouse employs one handler per room. When you need a product, you tell the handler for that room to go to a particular aisle. He'll switch on the conveyor belt in the right direction, fetch the product from the specified bucket, drop it onto the conveyor belt, and the product will be transported to the loading dock.
When you need one product, the total time to get it is the time for the handler to run to the right aisle, and then the time to fetch that product. This can take quite a while, especially the first part where he needs to run from one aisle to another.
If you need multiple products from the same aisle, you only need to run to the right aisle once. The handler can grab one product, drop it onto the belt, then another. So the average time per product goes down, and the average throughput goes up. What we see here already is that it helps to group requests for products that are stored together in the same aisle! But every time we need a product from a different aisle, we have to wait for the handler to run.
Wouldn't it be great if that time can be used for something else?
Well, if we have a lot of products to request, we could ask the handler of room A to run to aisle A1 while the handler from room B is fetching products from aisle B3. By the time we have the products from B3, the handler of room A may have reached aisle A1 and we can immediately start asking him to fetch his products. And while that happens, we can send the handler of room B to his next aisle.
If we plan this well, we can avoid downtime!
And if it takes a very long time for a handler to switch aisles, we just have more handlers run in parallel to their aisles. After all, we have 16 rooms, so 16 handlers. All of them can be running around to some aisle, while one is fetching products.
There are a few requirements to make this work as efficiently as possible:
- We need a long list with products that need to be fetched, so we can plan ahead and start sending handlers for different rooms to their aisles.
- The products need to be spread as evenly as possible among different rooms so that we can have multiple handlers running around. If all products are coming from the same room, the other handlers won't have anything to do.
- For products on the list that are from the same room, we should sort them in such a way that all products from the same aisle are grouped together. This way, once our handler is in that aisle, he can grab all those products in one go without incurring the penalty of having to switch aisles.
We can also see that the worst possible case consists of a list of products that are all from the same room but each from a different aisle.
Our handlers are only human: they'll get dehydrated because of all this running around. So, every once in a while, we need to send them to the kitchen for a glass of water. In some warehouses, there's a coordinated drinking time, where all handlers go to the kitchen together (the customer will just have to sit and wait until they come back), but some warehouses allow handlers to go to the kitchen one by one. In fact, the customer can even decide which handler should go to the kitchen next! So if the customer plays it right, he'll send a handler to the kitchen for a room from which he doesn't immediately need a product...
It should not be surprising that the warehouse that I've described above operates very close to the way all current DRAMs work.
In terms of organization:
- The room corresponds to a DRAM bank.
- The aisle is a row or a page (the words can be used interchangeably).
- A bucket, the location where a single product is stored, is a column.
- The product itself is a data atom.
And in terms of operations:
- A handler running to an aisle corresponds to opening a page, also called a page ACTIVATE command.
- Leaving an aisle corresponds to closing a page, also called a PRECHARGE.
- Fetching a product from its bucket is a READ, and putting a product into a bucket is a WRITE.
- And, finally, going for a drink is a REFRESH operation.
Warehouse Scheduling Headaches
We saw earlier that by arranging the requests for products in a clever way, you can make sure that all products on the list are delivered in the shortest time possible. However, as a customer, you'd need quite a bit of knowledge about how the warehouse is organized. It's probably better to hand off this work to a specialist. Let's call him the warehouse manager. He takes the lists of products from multiple customers, sends commands to the handlers, and gives the products to the customers.
The warehouse manager has a pretty complex job: he receives product lists from multiple customers at the same time. Once he has all the product lists, he needs to sort the combined list for optimal efficiency in general, but, at the same time, he also needs to take into account the special needs of his customers.
Imagine one customer, an online shop for bicycle parts, that needs 100 similar products that all come from the same aisle. A different online shop, say Amazon, needs 100 products of a wide variety, all stored in different rooms. And an online grocery needs vegetables from the produce aisle, which happens to be in the same room as the bicycle parts.
The manager might first send a handler to fetch all parts from the bicycle aisle, then send the other handlers to other aisles for the Amazon products. The first handler fetches all bicycle products. When done, he's sent to the produce aisle. Meanwhile, various handlers fetch the Amazon products. And, finally, the produce is fetched.
Everything was scheduled such that every handler had to do the minimum amount of running around and all products were fetched in the shortest overall time possible.
Also, the vegetables were stale by the time they arrived at the customer.
If the vegetable order had been given a higher priority, the manager might have interrupted the handler who was fetching bicycle parts after he had completed 50 of them, sent him to the produce aisle to fetch the vegetables, then sent him back to the bicycle aisle to fetch the remaining 50 parts.
It would have taken a little bit longer, and the flow of products on the conveyor belt out of the warehouse may have been interrupted for a short while, but the vegetables wouldn't have perished.
Our warehouse manager corresponds to a memory controller with a priority-based arbiter and scheduler.
Mapping of Address Bits to a DRAM
It should be clear by now that it's important to spread products around as evenly as possible among different rooms to ensure that multiple handlers can run around while one handler is fetching products.
Let's translate this to the world of a CPU that's connected to some hypothetical DRAM.
The DRAM in question has 64 byte atoms. Each page is 4KB. It has 16 banks. And it has a total amount of memory of 1GByte.
When looking at the address space in terms of bytes, that translates into:
- 1 GByte = 30 bits (2^30 = 1,073,741,824)
- 64 byte atom = 6 bits
- 16 banks = 4 bits
- A 4KB page = 64 atoms of 64 bytes = 6 column bits (12 - 6 = 6)
- Row address is 14 bits (30-4-6-6=14)
Since we always do transactions with atoms, addresses that will be sent to the RAM are (4 bank bits, 14 row bits, 6 column bits), for a total of 24 bits.
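To make the bit counting concrete, here's a minimal C sketch of the geometry above; the constant names are mine and purely illustrative:

```c
/* Hypothetical constants for the geometry above; names are illustrative. */
enum {
    ATOM_BITS = 6,                              /* 64-byte atoms              */
    COL_BITS  = 6,                              /* 4KB page / 64B = 64 atoms  */
    BANK_BITS = 4,                              /* 16 banks                   */
    ADDR_BITS = 30,                             /* 1 GByte address space      */
    ROW_BITS  = ADDR_BITS - BANK_BITS - COL_BITS - ATOM_BITS   /* = 14        */
};

/* The atom address sent to the DRAM: 4 bank + 14 row + 6 column bits = 24. */
_Static_assert(BANK_BITS + ROW_BITS + COL_BITS == 24, "atom address is 24 bits");
```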
And now comes a very interesting question: if you have an address space of 1GB, how do you map those address bits to the bank, row, and column bits?
It may not be obvious at first, but this mapping can have a huge impact on system performance.
Let's illustrate this with some mapping cases...
Mapping for Maximum Bank Rotation
The CPU deals with addresses of 30 bits. Let's map these addresses to the DRAM as follows:
- address[5:0] : mapped to a single atom, so it's not mapped to a row, column, or bank number.
- address[9:6] : mapped to bank bits B[3:0]
- address[15:10] : mapped to column bits C[5:0]
- address[29:16] : mapped to row bits R[13:0]
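As an illustration, decoding this mapping could look something like the C sketch below; the struct and function names are made up for this article, not taken from any real controller.

```c
#include <stdint.h>

/* Decoded DRAM coordinates of one 64-byte atom (illustrative types). */
typedef struct {
    unsigned bank;  /* B[3:0]  */
    unsigned col;   /* C[5:0]  */
    unsigned row;   /* R[13:0] */
} dram_addr_t;

/* Maximum bank rotation: the bank bits sit directly above the atom offset,
 * so consecutive atoms land in consecutive banks.
 */
static dram_addr_t map_max_rotation(uint32_t addr)
{
    dram_addr_t d;
    d.bank = (addr >> 6)  & 0xF;     /* addr[9:6]   */
    d.col  = (addr >> 10) & 0x3F;    /* addr[15:10] */
    d.row  = (addr >> 16) & 0x3FFF;  /* addr[29:16] */
    return d;
}
```

Stepping the address by 64 bytes increments the bank number, so 16 consecutive atoms touch all 16 banks before the column ever changes.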
Remember how we said that we want to make sure that handlers are running around while one is fetching products? This mapping ensures that this will be the case in an extreme way: when stepping linearly through a block of memory, such as the CPU simply stepping linearly through instructions, the address will go up one by one, and, for each 64 bytes, data from a different bank will be requested.
If the fetching logic is just a little bit smart, it will know that addresses have a tendency to go up linearly, and it will prefetch multiple 64 byte atoms up front. If it's prefetching 256 bytes, that means it can have 4 pages in 4 different banks active at the same time.
Using the warehouse analogy, this satisfies the need to keep multiple handlers busy at once, but it has a major flaw: it makes the handler fetch only 1 product from an aisle, and uses a handler from a different room to fetch the next one. We've seen that it makes sense to group product requests such that a handler can fetch a bunch of products without having to change aisles. In addition, if the time to run from one aisle to another takes longer than the time to fetch 4 products, prefetching 4 products will not be enough to ensure that a steady stream of products will be arriving on the conveyor belt! You'd need to prefetch even more products to make sure this happens. An additional issue is that handlers running around all the time requires energy. Nobody likes wasting power like this.
Mapping for Minimum Bank Rotation
What if we map the address to the DRAM in a completely different way:
- address[5:0] : mapped to a single atom, so it's not mapped to a row, column, or bank number.
- address[11:6] : mapped to column bits C[5:0]
- address[15:12] : mapped to bank bits B[3:0]
- address[29:16] : mapped to row bits R[13:0]
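A matching sketch for this mapping, reusing the hypothetical dram_addr_t type from the previous example:

```c
/* Minimum bank rotation: the column bits sit directly above the atom offset,
 * so 64 consecutive atoms come from the same page before the bank changes.
 */
static dram_addr_t map_min_rotation(uint32_t addr)
{
    dram_addr_t d;
    d.col  = (addr >> 6)  & 0x3F;    /* addr[11:6]  */
    d.bank = (addr >> 12) & 0xF;     /* addr[15:12] */
    d.row  = (addr >> 16) & 0x3FFF;  /* addr[29:16] */
    return d;
}
```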
For the same CPU workload as earlier, fetching from the address space in a linear fashion, things look very different now. For the first fetch, a single handler goes to a particular aisle, and then he stays there, fetching 64 products. All the other handlers are sitting idle, until the active handler almost reaches the end of the aisle (remember, there's a prefetch of 4 products). At that point, the prefetch will roll over to the next room. While the active handler is fetching the last products, the next handler has enough time to get to his aisle. By the time the CPU needs that data, the handler is ready to go.
For this particular workload, we have a perfect mapping. Almost all products are fetched from the same aisle. Handlers are only running around when strictly necessary. Energy usage is minimal.
Unfortunately, this system can break down for non-linear workloads. Let's see what happens if our CPU copies a block of data from one part of memory to another.
If it so happens that the source address and the destination address of our data fall into the same room (with 16 banks, we have 1 chance out of 16 for this to happen), the memory controller can reorder reads and writes to memory such that they are grouped as much as possible, but there are limits to that. In practice, our single handler will be running around between those 2 aisles all the time. Performance would go down dramatically. You could say that one chance out of 16 for this to happen isn't too bad, but all of this happens at a very low level, even below the operating system. People typically want predictable performance, not a game of Russian roulette where performance varies a lot based on the luck of the draw of where arrays of data happen to be stored in memory.
Our first memory mapping wouldn't have this issue: you'd still have the same chance of reads and writes ending up in the same bank, but since the mapping ensures that you'd rotate from one bank to the next all the time, the controller will never be stuck in one bank for a long time. With only a minor amount of buffering or reordering of reads and writes, you could prevent the memory pipeline from thrashing between different aisles.
Mapping for Best Performance
So let's propose yet another mapping:
- address[5:0] : mapped to a single atom, so it's not mapped to a row, column, or bank number.
- address[8:6] : mapped to column bits C[2:0]
- address[12:9] : mapped to bank bits B[3:0]
- address[15:13] : mapped to column bits C[5:3]
- address[29:16] : mapped to row bits R[13:0]
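A sketch of this mapping's decode, again with the hypothetical dram_addr_t type; the interesting part is the column field being assembled from two separate bit ranges:

```c
/* Column bits are split: C[2:0] below the bank bits, C[5:3] above them.
 * Linear accesses fetch 8 atoms from a page before rotating to the next bank.
 */
static dram_addr_t map_split_column(uint32_t addr)
{
    dram_addr_t d;
    d.col  = ((addr >> 6) & 0x7)               /* addr[8:6]   -> C[2:0]  */
           | (((addr >> 13) & 0x7) << 3);      /* addr[15:13] -> C[5:3]  */
    d.bank = (addr >> 9)  & 0xF;               /* addr[12:9]  -> B[3:0]  */
    d.row  = (addr >> 16) & 0x3FFF;            /* addr[29:16] -> R[13:0] */
    return d;
}
```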
This mapping ensures that multiple atoms are fetched from the same aisle, yet also makes sure that there is sufficient rotation between rooms so that thrashing between different aisles from the same room can be avoided with only slightly more buffering and reordering of transactions in the memory controller.
Bank Group Aware Mapping
Things can get even more complicated: these days, banks in a DRAM are grouped together. For example, you could have 4 groups of 4 banks each. This grouping matches where banks are located physically on the DRAM silicon. Banks that are close together share the same power network and belong to the same group. The DRAM has certain limits on instantaneous power consumption, so the rate of page activations within the same bank group, connected to the same power delivery network, needs to be lower than the rate allowed across different bank groups.
This can be achieved as follows:
- address[5:0] : mapped to a single atom, so it's not mapped to a row, column, or bank number.
- address[8:6] : mapped to column bits C[2:0]
- address[10:9] : mapped to bank bits B[3:2]
- address[11] : mapped to column bit C[3]
- address[13:12] : mapped to bank bits B[1:0]
- address[15:14] : mapped to column bits C[5:4]
- address[29:16] : mapped to row bits R[13:0]
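The corresponding decode sketch, still using the hypothetical dram_addr_t type and assuming, as the next paragraph does, that B[3:2] is the bank group number:

```c
/* Bank-group aware mapping: the bank group bits B[3:2] sit lowest, so linear
 * accesses rotate across bank groups before reusing a bank in the same group.
 */
static dram_addr_t map_bank_group_aware(uint32_t addr)
{
    dram_addr_t d;
    d.col  = ((addr >> 6)  & 0x7)              /* addr[8:6]   -> C[2:0]         */
           | (((addr >> 11) & 0x1) << 3)       /* addr[11]    -> C[3]           */
           | (((addr >> 14) & 0x3) << 4);      /* addr[15:14] -> C[5:4]         */
    d.bank = ((addr >> 12) & 0x3)              /* addr[13:12] -> B[1:0]         */
           | (((addr >> 9)  & 0x3) << 2);      /* addr[10:9]  -> B[3:2] = group */
    d.row  = (addr >> 16) & 0x3FFF;            /* addr[29:16] -> R[13:0]        */
    return d;
}
```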
If we assume that B[3:2] indicates the bank group number, the mapping above guarantees that, for a linear access pattern, the memory controller will rotate heavily between different bank groups before switching to a bank from the same bank group.
Multiple Memory Mapping
The above discussed cases where only one memory is connected to a CPU, but most CPUs will have multiple memories. When mapping memories to CPU addresses, this needs to be taken into account as well. If we have 2 memories that each have their own dedicated data bus into the CPU, then we also have the opportunity to double the bandwidth. But this can only be done when accesses are spread equally between those memories.
Expanding the last memory mapping, this can be done as follows:
- address[5:0] : mapped to a single atom, so it's not mapped to a row, column, or bank number.
- address[8:6] : mapped to column bits C[2:0]
- address[9] : selects DRAM 0 or DRAM 1
- address[11:10] : mapped to bank bits B[3:2]
- address[12] : mapped to column bit C[3]
- address[14:13] : mapped to bank bits B[1:0]
- address[16:15] : mapped to column bits C[5:4]
- address[30:17] : mapped to row bits R[13:0]
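A corresponding sketch with a hypothetical dual_dram_addr_t type that adds the DRAM select field:

```c
#include <stdint.h>

/* Decoded coordinates for the dual-DRAM case (illustrative). */
typedef struct {
    unsigned dram;  /* DRAM 0 or 1 */
    unsigned bank;  /* B[3:0]      */
    unsigned col;   /* C[5:0]      */
    unsigned row;   /* R[13:0]     */
} dual_dram_addr_t;

static dual_dram_addr_t map_dual_dram(uint32_t addr)   /* 31-bit address */
{
    dual_dram_addr_t d;
    d.col  = ((addr >> 6)  & 0x7)              /* addr[8:6]   -> C[2:0]      */
           | (((addr >> 12) & 0x1) << 3)       /* addr[12]    -> C[3]        */
           | (((addr >> 15) & 0x3) << 4);      /* addr[16:15] -> C[5:4]      */
    d.dram = (addr >> 9)  & 0x1;               /* addr[9]     -> DRAM select */
    d.bank = ((addr >> 13) & 0x3)              /* addr[14:13] -> B[1:0]      */
           | (((addr >> 10) & 0x3) << 2);      /* addr[11:10] -> B[3:2]      */
    d.row  = (addr >> 17) & 0x3FFF;            /* addr[30:17] -> R[13:0]     */
    return d;
}
```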
Since we have 2 memories, we now have 2GB of RAM and 31 address bits.
The selection between DRAM 0 and DRAM 1 has been assigned to a relatively low address bit, but not the lowest. This way, we still guide linear accesses such that multiple atoms will be fetched from the same page, yet at the same time, transactions get spread quickly between the different DRAMs.
Memory technologies like HBM allow up to 8 channels per stack, where each channel can be treated as a fully independent DRAM. For a configuration with 4 stacks, like AMD's Fiji silicon, this means that 5 bits will be used just to spread atoms between the 32 different channels. It's a safe bet that those 5 bits will be mapped to some of the lower bits of the address.
More Mapping Considerations
Up to now, we've assumed linear access patterns to memory. While this is very common, the reality will be more complex. There can be linked lists, there can be 2D arrays, there will almost always be multiple memory access streams per CPU (instruction fetches, data fetches, all kinds of MMU related fetches), there will be multiple CPUs, and there are almost always multiple layers of caches between the CPUs and the memory.
All of this will remove a lot of access regularity, which makes it even more helpful to assign the bank (and DRAM select) bits as low as possible.
In practice, memory controllers will have a static configuration option to freely control which address bits are assigned to which DRAM row, column and bank bit. A CPU maker will try different values and benchmark tons of workloads to see which setting has the best average performance, and use that value for production.
Note that this mapping needs to be static: if it changes while a CPU is running, the address of the data as seen by the CPU would suddenly change completely!
DRAM Transaction Scheduling
Here is a non-exhaustive list of things a memory controller needs to keep in mind when scheduling transactions to a DRAM. With a lot of conflicting requirements, it shows how hard it is to make a high-performance memory controller that works well for all kinds of workloads.
- Reordering transactions is usually essential to achieve high bandwidth efficiency. Reordering makes it possible to cluster transactions to the same page, to rotate between banks in the most efficient way, to plan for bank refresh, to work around bank group restrictions, etc.
- Increased latency due to reordering: while reordering transactions is very helpful to increase bandwidth, it can hurt the latency as seen by individual clients. In addition, the more transactions that can be reordered, and the higher the latency, the more memory is needed to store information about those transactions, as well as to reserve room for read data to come back. There's a trade-off between optimal bandwidth and the chip area needed to make it possible.
- Delaying the scheduling of a transaction can improve overall performance in some cases. When there is a steady, low rate of transactions coming from a certain client, instead of immediately scheduling those transactions, it may help to delay them so that they can be bundled and sent as one group that all go to the same page or bank. This can help make the stream of transactions more regular and thus increase bandwidth, at the expense of latency for that client.
- Fairness between clients: the highest overall throughput can sometimes be achieved by giving priority to a client with high bandwidth requirements that has a very regular access pattern. For example, in a multi-CPU environment, one CPU may be copying large blocks of data from one part of memory to another. However, this could starve other clients from getting their transactions serviced. A good DRAM controller needs to have mechanisms to make sure that this doesn't happen:
  - some clients may be given a relative share of the bandwidth
  - some clients may be given absolute bandwidth guarantees
  - some clients may be given latency guarantees
- Emergency scheduling: in some cases, a fairness control may not exist (because it's not generally needed) or may not work 100% (because it can be very difficult to enforce ironclad guarantees). Requests from some clients that usually don't stress the memory controller can sometimes become critical to make the system work. In those cases, the memory controller can provide an emergency signal: when triggered, it will increase the priority of requests for that client. A good example would be a CPU with a video output. The average bandwidth to drive a monitor is very low by today's standards, but it's very important that the video output pipeline gets a steady stream of pixels. There is almost always a FIFO to decouple the uncertainty of the memory controller from this steady stream, and to make it possible to fill it with large bursts. If, for some reason, this FIFO is at risk of running out of data, an emergency signal can be raised.
- Reducing the number of switches between reads and writes: when switching from a read to a write and back, a number of idle cycles need to be inserted on the data bus to ensure that there's never a situation where both the DRAM and the memory controller are actively driving the bus. Idle cycles mean reduced maximum throughput, so a memory controller will try to group write and read transactions together to avoid these idle cycles (see the sketch after this list).
- Coherency: while reordering transactions is very important for efficiency, the reordering logic needs to take into account coherency constraints. For example: when a CPU first writes data to memory, and then reads it back, and there is no other point of coherency in the system, then the reorder logic should never swap these read and write transactions. There can be many more such coherency constraints.
- Smart refresh: the memory controller needs to make sure that refresh cycles happen in time for each bank. When the DRAM only supports refresh of all banks at once, there aren't too many opportunities to be very smart about it. But newer DRAM technologies allow refresh of one bank while still allowing regular transactions on other banks. With smart scheduling, these refresh cycles can be hidden and throughput loss avoided.
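To give a flavor of the read/write grouping mentioned above, here is a deliberately naive sketch: it partitions a small reorder window so that reads are issued before writes, assuming address conflicts have already been filtered out. Real controllers use far more sophisticated policies; all names here are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A toy pending transaction: direction flag plus atom address. */
typedef struct {
    bool     is_write;
    uint32_t addr;
} txn_t;

/* Issue all reads in the window before all writes (stable order within each
 * group), so the data bus turns around at most once per window instead of
 * potentially once per transaction. Assumes no read in the window depends on
 * a write to the same address within that window.
 */
static void group_reads_first(const txn_t *in, txn_t *out, size_t n)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)        /* first pass: reads   */
        if (!in[i].is_write) out[k++] = in[i];
    for (size_t i = 0; i < n; i++)        /* second pass: writes */
        if (in[i].is_write)  out[k++] = in[i];
}
```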
The task of a high-performance scheduler is very complex and a topic of academic and commercial research. And since the performance of the execution logic of a chip is often determined by the ability of the memory system to feed it with data in a timely fashion, it's also a closely guarded company secret. So it's possible to describe what a memory controller must do at a high level, but the how remains an open question.
Here's just one example of a paper from CMU that describes some recent memory controller research.
Conclusion
Memory controllers are often treated as a simple box in a block diagram that shuffles data from a processor to DRAM and back. In reality, the memory controller may be one of the most complex and performance-critical pieces of the architectural puzzle. While this article only scratches the surface of the engineering considerations that go into the design of a modern memory controller, I hope that it clarifies why it is so complex and how some performance can be gained or lost.
For comments, corrections, questions, feel free to contact me at moninsider@gmail.com.
Revision History
Feb 20, 2016: first version