RAMpage FAQ
RAMpage Frequently Asked Questions
This page covers questions others have raised about RAMpage. Sometimes it's hard to cover all the issues people have doubts about in a paper, because space is limited. So if you have read any of the papers on this site and still have questions, come back here.
1. Is RAMpage really something new?
- You have the same number of levels of memory hierarchy as before. Aren't you just managing your L2 cache slightly differently?
- No. The organization is exactly as you'd expect for main memory, except we make some slightly different trade-offs for the fact that the SRAM main memory is smaller, and miss costs aren't as high.
- There is one complication: there is probably more operating system and device space that likes to be in physical RAM than we'd like to put in the SRAM level (see question 3), so it may be more realistic to speak of a 2-level main memory. But to call something organized this way a cache is to render the distinctions currently used meaningless: you might as well call DRAM an L3 or L4 cache, and disk an L4 or L5 cache.
- But hasn't all this been done before (there's been work on software-managed caches as far back as the 1980s)?
- Correct. This work draws on that precedent. But it goes a lot further. There are no cache tags, and misses are completely handled in software. The organization of the SRAM main memory really is not at all like a cache.
2. But a miss to DRAM is nowhere near the penalty of a page fault: how can you equate the two?
- A disk is hundreds of thousands to millions of times slower than CPU cycle time. How can the issues in handling page faults be compared with misses to DRAM?
- The first machine with demand paging was the Atlas, launched in 1962. Its minimum instruction execution time was 1.2µs, whereas its drum paging device could deliver a page in 2-14ms. That's a page fault cost of the order of 10,000 instructions (probably actually closer to 1,000, taking into account that typical instruction execution was longer than 1.2µs). We aren't quite there yet, but the Atlas illustrates that the point where we start to see performance wins for page faults could be a lot closer to today's miss penalties of hundreds of lost instructions.
- With a fast Rambus or similar high-speed bus-based DRAM (or an equivalent-speed SDRAM), you can expect miss penalties of the order of 50-80ns. For a processor capable of issuing 8 instructions per clock at 500MHz, that's a worst-case miss cost of around 200-300 instructions. That is probably not enough in itself to significantly outweigh the extra penalty of handling a miss in software, but it is enough to make it worth considering the trade-off of fewer misses at the expense of a higher penalty for handling each one. But scale the block size up to 512 bytes, and you get a miss penalty amazingly close to that of the Atlas: around 1µs for a fast Rambus, which is about 4,000 instructions if you are issuing at a rate of 4GHz.
- But a disk's latency to bandwidth ratio encourages very large page sizes. 4Kbytes is about the minimum, with 8K common, and sizes even in the Mbytes possible on some systems. Yet caches typically have line sizes in the range 32-128 bytes. How can you propose something more like paging for what others see as a cache?
- Good point. Two issues can affect the optimum page size aside from relative bandwidth and latency. TLB activity can be a problem if the page size is too small, and a large page size may be more than is needed for spatial locality, while reducing temporal locality by causing premature replacements. In practice, the trend is towards TLBs that can map a relatively large number of pages, which might make smaller pages in the SRAM-DRAM page interface viable (though DRAM-disk pages would still need to be bigger). For example, with a 512-entry TLB, it's possible to map over 6% of the pages in a 4Mbyte SRAM main memory with 512-byte pages. Also, the better replacement strategies possible with a software replacement model make it possible to keep the right pages in memory longer, reducing the effect of larger pages (blocks / lines) causing extra misses through failing to exploit temporal locality.
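The arithmetic behind the answers to this question can be checked with a few lines of Python. This is only a sketch that reuses the figures quoted above; the function and variable names are made up for illustration, and "lost instructions" is taken to mean peak issue slots, which overstates what a real program would lose.

```python
def lost_issue_slots(penalty_ns, issues_per_ns):
    """Peak number of instruction issue slots lost during a miss."""
    return penalty_ns * issues_per_ns


# Atlas (1962): 1.2 us minimum instruction time, 2-14 ms to deliver a page.
atlas_instr_us = 1.2
atlas_low = 2_000 / atlas_instr_us     # 2 ms expressed in microseconds
atlas_high = 14_000 / atlas_instr_us   # 14 ms expressed in microseconds

# 8 instructions per clock at 500 MHz is 4 issue slots per nanosecond (4 GHz).
issues_per_ns = 8 * 0.5

dram_low = lost_issue_slots(50, issues_per_ns)      # 50 ns miss  -> 200.0
dram_high = lost_issue_slots(80, issues_per_ns)     # 80 ns miss  -> 320.0
big_block = lost_issue_slots(1_000, issues_per_ns)  # 1 us miss   -> 4000.0

# TLB coverage: 512-entry TLB, 4 Mbyte SRAM main memory, 512-byte pages.
sram_bytes = 4 * 2**20
page_bytes = 512
tlb_entries = 512
pages = sram_bytes // page_bytes     # 8192 pages
tlb_fraction = tlb_entries / pages   # 0.0625, i.e. just over 6%

print(f"Atlas page fault: {atlas_low:.0f} to {atlas_high:.0f} minimum-length instructions")
print(f"DRAM miss: {dram_low:.0f} to {dram_high:.0f} issue slots")
print(f"512-byte block miss: {big_block:.0f} issue slots")
print(f"TLB maps {tlb_fraction:.2%} of {pages} SRAM pages")
```

The Atlas range comes out at roughly 1,700 to 11,700 minimum-length instructions, which brackets the "order of 10,000" and "closer to 1,000" figures given above.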
3. Do you really intend to put the OS in SRAM?
- You only have 4Mbytes of static RAM yet you talk about pinning the operating system there.
- No, this is not correct. Only data structures and code critical for context switches and handling replacements are proposed to be SRAM-resident. The VMP system for example in the 1980s had a special memory for its software replacement handler. Even in a monolithic kernel, it should in principle be possible to make this separation.
- What about page tables? You can't be serious about putting the entire page table in SRAM.
- That's why we are using inverted page tables. At the cost of slower lookups, an inverted page table has one entry per physical page. Only the SRAM-level mapping is pinned in SRAM; the DRAM page tables are based in DRAM.
- Even so, the amount of memory a typical modern OS wants in physical memory is large.
- Correct. We may be playing with miss penalties (in terms of lost instructions) of the era of the 1962 Atlas in some of our simulations, but we can't expect 1962 operating systems. Of course, if everyone did microkernels we'd be in less trouble, but we do need to consider some complexities to solve this problem in a real system. For example, the RAMpage main memory is only a part of the physical address space, taken out of low memory, and other things that require physical addressing map to the DRAM. There are added complexities in whether such addresses should also be representable in L1 and in the SRAM main memory, but it seems a good start to check whether there's a performance or hardware cost win before getting into such complexities (solvable, but not central to the problem of evaluating RAMpage).
4. How do you expect to fit the entire page table in SRAM?
- A page table is pretty large. For a typical modern address space, you won't have space for it in a 4MByte SRAM.
- This is not the intention. We propose using an inverted page table, which only maps the pages actually in SRAM. Since an inverted page table is structured around physical addresses, there is some performance penalty.
- Isn't the inverted page table going to be too slow then?
- We expect that to be offset by the fact that no TLB miss for a reference that can be served from the SRAM will need to go to a lower level of memory for the page translation. In fact, we guess this is likely to be a significant win, given that any reference to DRAM for a page table lookup will introduce a major extra penalty.
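To make the size argument concrete, here is a minimal Python sketch of an inverted page table. It is not the RAMpage implementation: the class, the method names and the `asid` (address space identifier) field are all hypothetical, and a Python dict stands in for whatever hash structure a real table would use. The point it illustrates is that the table has one entry per SRAM page frame, so its size depends on the SRAM size and not on the virtual address space.

```python
class InvertedPageTable:
    """One entry per SRAM page frame (illustrative sketch only)."""

    def __init__(self, num_frames):
        # frames[i] holds the (asid, vpn) pair resident in frame i, or None.
        self.frames = [None] * num_frames
        # Maps (asid, vpn) to a frame number; stands in for hash chains.
        self.index = {}

    def lookup(self, asid, vpn):
        """Return the frame holding this virtual page, or None on a miss."""
        return self.index.get((asid, vpn))

    def install(self, frame, asid, vpn):
        """Place (asid, vpn) in frame and return the evicted mapping, if any."""
        key = (asid, vpn)
        previous_frame = self.index.get(key)
        if previous_frame is not None and previous_frame != frame:
            # The page was resident elsewhere; free its old frame.
            self.frames[previous_frame] = None
        evicted = self.frames[frame]
        if evicted is not None and evicted != key:
            del self.index[evicted]
        self.frames[frame] = key
        self.index[key] = frame
        return evicted if evicted != key else None


# A 4 Mbyte SRAM with 512-byte pages needs 8192 entries in total.
table = InvertedPageTable(4 * 2**20 // 512)
table.install(0, asid=1, vpn=0x40)
hit = table.lookup(1, 0x40)    # frame 0
miss = table.lookup(2, 0x40)   # None: a software miss handler would run here
victim = table.install(0, asid=2, vpn=0x99)  # evicts (1, 0x40)
```

A lookup that returns None corresponds to a miss to DRAM, which in this scheme is handled in software, as with a conventional page fault.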
5. Your initial work has compared with a direct-mapped cache: isn't this a soft target?
- Direct-mapped caches are generally disappearing. Shouldn't you have started with at least a 2-way associative cache as your competition?
- Good point. That's why we subsequently moved to a 2-way associative L2. But we feel there is an important advantage in being able to achieve associativity with no hit-cost penalty, and adding associativity the traditional way introduces penalties.
6. You are getting results showing an improvement in the range of worse than 0 to 25%, depending on your measurement: is this really worth the effort?
- Your low-end CPU shows hardly any improvement (sometimes in fact negative, for smaller page sizes), so why bother?
- We're trying to map out the conditions under which RAMpage is a win. We don't claim it always will be. You can set up cases where having a cache isn't a win. But the cases where RAMpage doesn't clearly come out ahead are with low-end designs, below the minimum typically on sale in early 1998. This point has become less contentious over time, as the speed at which RAMpage becomes a win is no longer beyond that of commodity desktop CPUs.
- On the other hand, your best results assume context switches on misses, which has to be a tenuous conclusion, given that everything depends on the operating system being suitably implemented and fitting the model.
- Good point. Without context switches on misses, we see improvements of around 10%, which may be less impressive than improvements of 24-50%. However, we have not looked at other ways of achieving the same effect as context switches on misses, such as simultaneous multithreading (SMT), or faking the changes in a hardware abstraction layer.
7. What about superscalar issues?
- Can you really model superscalar execution, including out of order execution, speculation, branch prediction and so on, with traces?
- Not completely accurately, no. We are obviously missing some important behaviour, including nonblocking misses and prefetch instructions. It would be useful to do a full execution-driven model of RAMpage. However we argue that a trace-driven model is still sufficient to make a first cut at measuring viability.
- Wasn't there a paper at HPCA'97 that showed trace-driven simulation can be very inaccurate at modelling superscalar execution?
- Correct. The paper by Pai et al. however is mainly concerned with accuracy of measurement of parallel execution and assumes (for example) that all local data references hit in the L1 cache. It is therefore difficult to draw more general conclusions from this paper as to its applicability to uniprocessor execution.
- It will however be interesting to extend the RAMpage work to examining superscalar effects, since it is clearly the intent to aim the idea at future-generation designs.
8. What of improvements in DRAM?
- In any case, you aren't really modelling a particularly aggressive memory hierarchy are you?
- True, some high-end systems may be faster, particularly using highly interleaved memory systems. But the numbers we are looking at can be rescaled so that for example an 8GHz issue rate with a 10ns SDRAM bus cycle time or equivalent Direct Rambus can be thought of as modelling a faster processor-DRAM combination. The key issue is the CPU-DRAM speed gap.
- What about Double Data Rate SDRAM and other faster versions? Aren't you being unfair to the conventional hierarchy? Wouldn't Rambus, DDR SDRAM, etc. allow much lower miss penalties?
- The key issue is the latency trend in DRAM. If modes of DRAM access give improved bandwidth, RAMpage's larger SRAM page sizes can take advantage of this improvement more than the typically smaller cache block sizes in a conventional hierarchy.
- The pressure to use larger cache blocks arising from faster burst modes means that the total latency of a miss is likely to increase (even if the average time per byte decreases), increasing the value of context switches on misses.
- What about DRAM improvements with caches on the DRAM chip?
- Even these improvements must occasionally incur the full latency of a DRAM reference.
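The latency versus bandwidth argument in this question can be illustrated with a simple model in which a miss costs a fixed access latency plus the transfer time for the block. The sketch below is in Python; the 50ns latency and 0.5 bytes per ns transfer rate are assumed placeholder values chosen only for illustration, not measurements of any particular DRAM, and the function name is hypothetical.

```python
def miss_time_ns(block_bytes, latency_ns, bytes_per_ns):
    """Fixed access latency plus time to stream the block."""
    return latency_ns + block_bytes / bytes_per_ns


LATENCY_NS = 50      # assumed initial access latency
BYTES_PER_NS = 0.5   # assumed burst transfer rate (about 500 Mbytes/s)

for block_bytes in (32, 128, 512):
    total = miss_time_ns(block_bytes, LATENCY_NS, BYTES_PER_NS)
    per_byte = total / block_bytes
    print(f"{block_bytes:4d} bytes: {total:7.1f} ns total, {per_byte:.2f} ns per byte")
```

Under these assumptions the total miss time grows with block size while the cost per byte falls, which is the situation in which context switches on misses become more valuable. Raising `BYTES_PER_NS` without changing `LATENCY_NS` favours the larger blocks further.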
9. What of Simultaneous Multithreading (SMT)?
- Doesn't SMT address the same issue as context switches on misses?
- Yes, though at the expense of more complex hardware. The trade-offs between designing the simplest CPU possible (and hence being able to scale up the clock speed as aggressively as possible) and the speed gain of handling context switches in hardware need to be investigated.
- RAMpage allows full context switches on misses, as with a conventional page fault, whereas SMT needs threads which may not always be available.
- RAMpage could potentially be designed with operating system support for fast switches between lightweight threads.
- A combination of RAMpage and SMT would also be interesting to investigate.
