
 The RAMpage Memory Hierarchy

The RAMpage Project


This is a project to investigate the implications of treating what up to now has been considered L2 (or even possibly L3) cache as the main memory of a computer system, and treating DRAM as a paging device.

Rationale

The rationale is twofold. Caches in high-end systems are reaching multiple Mbytes in size (a trend which will later reach the mass market), while the cost of a miss from SRAM cache to DRAM is becoming so high that handling misses in software, and taking a context switch on a miss, are starting to look like reasonable options. Software cache miss handling is not a new idea and is probably still worth exploring, but it also seems worth exploring the next logical step: treating DRAM as a paging device, while using what used to be the lowest-level cache as the main memory.
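
To give a feel for why miss costs matter more as processors speed up, here is a minimal sketch that counts how many instruction issue slots pass while one DRAM access completes. The 50 ns DRAM latency is a purely illustrative assumption, not a figure from the project; the issue rates are the ones discussed under Status.

```python
def lost_issue_slots(issue_rate_ghz: float, dram_latency_ns: float) -> float:
    """Instruction issue slots that pass while one DRAM access completes."""
    # GHz * ns = (1e9 per second) * (1e-9 seconds), so the result is a plain count.
    return issue_rate_ghz * dram_latency_ns


# Hypothetical DRAM latency, chosen only for illustration.
ASSUMED_DRAM_LATENCY_NS = 50.0

for rate_ghz in (1.0, 4.0, 8.0):
    slots = lost_issue_slots(rate_ghz, ASSUMED_DRAM_LATENCY_NS)
    print(f"{rate_ghz:.0f} GHz issue rate: {slots:.0f} issue slots per DRAM access")
```

Under this assumption, the cost of a miss measured in lost instructions grows in direct proportion to the issue rate, which is the trend that makes a software miss handler or a context switch look affordable.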

Implications

Implications of the idea include:

  • fast hits in the final level of SRAM are easier to achieve: the only addressing needed at that level is page translation, which has to be done anyway (ideally with a TLB hit), so the cost of cache tag lookup and comparison is eliminated
  • lower cost of the final level of SRAM: no cache tags and the associated hardware for lookups and comparisons (in an increasing number of recent designs, tags and logic for L2 are on-chip, so the RAMpage design would free up valuable real estate for other CPU performance enhancements)
  • page placement is usually more efficient than cache placement, even in a fully associative cache, where performance requirements usually dictate that sub-optimal replacement strategies be used
  • on the down side, page table handling may become more complicated, and the interaction between the TLB and the rest of the memory hierarchy will have to be investigated
  • pages are bigger than cache blocks, and this difference may increase misses: with smaller pages the page tables would be too large, and even if the new hierarchy uses different page sizes for DRAM and for disk because of their different speed characteristics, the page size probably cannot be much smaller than the usual minimum of 4K
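
The page-size point in the last item can be made concrete with a little arithmetic. The sketch below counts the page frames in an SRAM main memory, assuming one mapping entry per frame (as in an inverted page table; that organization is an assumption here, not something this page specifies). The 4 Mbyte SRAM size is the one quoted under Status, and the page sizes smaller than 4K are hypothetical values near typical cache block sizes.

```python
def sram_page_frames(sram_bytes: int, page_bytes: int) -> int:
    """Number of page frames in the SRAM main memory.

    With one mapping entry per frame (an assumed, inverted-table-like
    organization), this is also the number of entries to be managed.
    """
    if page_bytes <= 0 or sram_bytes % page_bytes != 0:
        raise ValueError("page size must be positive and divide the SRAM size")
    return sram_bytes // page_bytes


SRAM_BYTES = 4 * 1024 * 1024  # 4 Mbyte SRAM main memory

# 128 and 1024 bytes are hypothetical small page sizes; 4096 is the usual minimum.
for page_bytes in (128, 1024, 4096):
    frames = sram_page_frames(SRAM_BYTES, page_bytes)
    print(f"{page_bytes:>4}-byte pages: {frames} frames to map")
```

Shrinking the page toward cache-block size multiplies the number of entries the page table and TLB must cover, which is the pressure that keeps pages larger than cache blocks.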

Research Questions

The implications of this change in memory hierarchy raise a number of research questions:

  • Under what circumstances do the performance gains outweigh the losses?
  • Would any additional hardware, such as a second-level TLB, improve performance?
  • How are page tables best organized in this new hierarchy?
  • How is DRAM best organized: much as before, or more like a disk cache?
  • If current performance doesn't justify the change, will the change be justified in future, given trends in the development of various components, including the CPU, SRAM and DRAM?
  • How does this change affect shared-memory multiprocessors: will they look more like distributed shared memory systems in future?

Status

So far this project has passed the initial investigation stage. I have done some of the initial work with the aid of three Masters students, and some very interesting possibilities have come out of the early work and subsequent follow-ups. Preliminary indications were that it should be possible to achieve a 15-20% speed improvement on a processor with an issue rate of 4GHz, with a 4Mbyte SRAM (as compared with using the same SRAM as an L2 cache).

In line with more recent designs, in which clock speeds in the 1-2GHz range are combined with the ability to issue multiple instructions per clock, more recent investigations have increased the maximum issue rate under consideration to 8GHz. Using more realistic parameters for the L1 and L2 caches (in the comparable conventional hierarchy) has shown that improvements in a conventional hierarchy can hold off the point where the CPU-DRAM speed gap becomes a problem, but do not stave off the problem indefinitely. With a relatively aggressive L1 cache of 512KB (256KB each of instruction and data cache) and a 2-way associative 4MB L2 cache, a conventional hierarchy running the simulated workload waited about 40% of the time for DRAM at an 8GHz issue rate. A RAMpage hierarchy taking context switches on misses was able to hide all the DRAM wait time, resulting in a speedup of more than a factor of 10 versus a 1GHz conventional hierarchy with a total of 32KB of L1 cache. In other words, RAMpage with context switches on misses is much more scalable as the CPU-DRAM speed gap grows.
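
As a rough reading of the 40% DRAM wait figure, the following sketch computes an idealized upper bound on the gain from hiding that wait entirely, at the same issue rate. It assumes nothing else changes: no context switch overhead, no software miss handling cost, and always another process ready to run. It is back-of-envelope arithmetic, not a result from the simulations, and it is not comparable with the factor-of-10 figure, which is measured against a different 1GHz baseline.

```python
def speedup_if_wait_hidden(wait_fraction: float) -> float:
    """Idealized speedup when a fraction of run time spent waiting is fully hidden."""
    if not 0.0 <= wait_fraction < 1.0:
        raise ValueError("wait_fraction must be in [0, 1)")
    # Run time shrinks from 1 to (1 - wait_fraction); speedup is the ratio.
    return 1.0 / (1.0 - wait_fraction)


# About 40% of time waiting for DRAM at an 8GHz issue rate (conventional hierarchy).
print(f"upper bound: {speedup_if_wait_hidden(0.40):.2f}x")
```

The bound rises steeply as the wait fraction grows, so the potential benefit of hiding DRAM latency increases as the CPU-DRAM speed gap widens.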


Papers

Scalability of the RAMpage Memory Hierarchy, South African Computer Journal, no. 25, August 2000, pp 68-73. (abstract and full paper)

Correction to RAMpage ASPLOS Paper, Computer Architecture News, vol. 27, no. 4, September 1999, pp 2-5. (abstract and full paper)

[P Machanick and P Salverda] Implications of Emerging DRAM Technologies for the RAMpage Memory Hierarchy, Proc. SAICSIT '98, Gordon's Bay, South Africa, November 1998, pp 27-40. (abstract and full paper)

[P Machanick, P Salverda and L Pompe] Hardware-Software Trade-Offs in a Direct Rambus Implementation of the RAMpage Memory Hierarchy, Proc. ASPLOS-VIII Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, October 1998, pp. 105-114. (abstract and full paper)

Preliminary Investigation of the RAMpage Memory Hierarchy, South African Computer Journal, no. 21, August 1998, pp 16-25; co-author Pierre Salverda (abstract and PostScript).

The Case for SRAM Main Memory, Computer Architecture News, vol. 24, no. 5, December 1996, pp 23-30. This version has some minor corrections: (MS Word for Mac binhex 45K, PostScript 18K).

Unpublished

How Multithreading Addresses the Memory Wall (abstract, pdf and BibTeX) – submitted for publication.

L1 Cache and TLB Enhancements to the RAMpage Memory Hierarchy (abstract, pdf and BibTeX) – submitted for publication.

Approaches to Addressing the Memory Wall (abstract, pdf and BibTeX).


Still have questions? After reading the papers, see if the RAMpage FAQ helps you with any issues that are still unclear.

Why an Elephant?

Why not? Elephants have a good memory, and they're known to go on the rampage.
