 Memory testbed

Memory efficiency

The general idea

A processor running at 2GHz (1 clock tick = 0.5ns) that can complete multiple instructions per clock tick can complete several instructions per nanosecond. A cache miss that has to be served from DRAM takes tens of nanoseconds to complete, which means a very big slowdown if cache misses are frequent.

Caches are generally organized into levels, typically two on most common current designs. The highest level (closest to the CPU: the first-level or L1 cache) is faster and smaller than the levels below it. Caches are organized into blocks (also called lines), which are fixed-size units transferred between levels.

Memory is also organized into pages, which are typically bigger than cache blocks (4KB and 8KB are common sizes). Page size is significant because skipping randomly over memory defeats some hardware optimizations; for example, only a small number of page address translations are kept in a hardware structure called the translation lookaside buffer (TLB).

Details of the memory hierarchy are not always available when it becomes necessary to do software optimization. This project requires writing a tool to explore characteristics of the memory hierarchy and discover as much as possible by time measurement.

The challenge

Different layers may be hard to tease apart. Some things to consider:

  • the L1 cache is smaller, so code with tight loops and small data structures will only use L1 after an initial warmup
  • the TLB is usually quite small, so a program that accesses data spaced far enough apart that each access falls in a separate page will quickly run through the TLB entries
  • the biggest difference in speed will likely appear when the data no longer fits in any cache and accesses go to DRAM
  • other cache properties, such as associativity, will be more of a challenge to measure

It is highly recommended that you use a language reasonably close to machine code, e.g., C or C++, for your testbed; otherwise it will be hard to be sure that memory accesses are doing what you expect them to do.