The University of Queensland Homepage
School of ITEE ITEE Main Website

 Tutorial-1 Solutions

The University of Queensland
School of Information Technology and Electrical Engineering – Semester 2, 2005

CSSE7013 - Advanced Computer Architecture
Tutorial 1 Partial Solutions

Learning Objectives

These solutions are a guide to answering the rest of the questions.

QUESTION 1—Quantitative Techniques

Use the following data [moved to the solution with extra space added for calculated numbers, shown in italics] for the original machine, a load-store architecture, for all parts of the question where calculations are required. In a load-store machine, all memory data references are either copies from memory to a register (loads), or copies from a register to memory (store). All ALU operations are on 2 source registers and 1 destination register. The timings assume no cache misses, and are averages based on the possibility that the pipeline may stall (for reasons we’ll see in later chapters). CPI of less than 1 (as here) is usually given as IPC as explained in the notes.

  1. A modification to the design is proposed whereby 90% of branches can be predicted accurately, resulting in a 50% reduction in branch CPI when a branch is predicted correctly. However, when a branch is predicted incorrectly, CPI for a branch is increased by 50%. The modification increases clock cycle time (slows down clock speed) by 15%.
    1. What is the CPI of the original machine with the given instruction mix?

      instruction categories

      dynamic frequency % clock cycles (no stalls) contribution to CPI (original)
      ALU 43 0.25 0.1075
      loads 21 0.5 0.105
      stores 12 0.25 0.03
      branches 24 1 0.24
      total CPI 0.4825
    2. What is the CPI of the modified machine with the given instruction mix?

      instruction categories

      dynamic frequency % clock cycles (no stalls) contribution to CPI (new)
      ALU 43 0.25 0.1075
      loads 21 0.5 0.105
      stores 12 0.25 0.03
      branches (predict) 21.6% 0.5 0.108
      branches (mis-predict) 2.4% 1.5 0.036
      total CPI 0.3865
    3. Which is faster, and by how much?
      Here we have to apply the CPU performance equation, and note that the instruction count IC has not changed, so we can see the relative speed change by looking at CPIold x tCC = 0.4825 x and CPInew x 1.15 x tCC =
      1.15 x 0.3865 = 0.444475tCC
      so we can see that the speedup of the new design is 0.4825/0.444475 = 1.086. What % faster is the new design as compared with the old?
    4. Now assume the instruction miss rate is 0.8% for the original and 1% for the modified architecture. In both cases, the data miss rate is 1% and a miss costs 200 extra cycles. Redo the CPI calculations for both machines; which is faster—and by how much?
      Useful to attempt this now but go back to it when we do memory systems if you don’t manage to do it yet.
  2. Recent designs rely on a very fast on-chip interface to minimize the cost of misses from the L1 cache to the L2 cache. A consequence of this design is that the L2 cache size can be limited, as opposed to earlier designs with off-chip L2 caches.
    Useful to attempt this now but go back to it when we do memory systems if you don’t manage to do it yet.
    1. Consider a 2-level cache, in which the penalty for references to DRAM is constant at 200 cycles, but the penalty for misses to L2 varies. In the 2 design variations, the following figures apply:
      • fraction of references which miss to L2 in both designs: 1%
      • fraction of references which miss from L2 in small, slower L2 design: 20% (relative)
      • penalty for misses to small, fast L2: 10 cycles
      • fraction of references which miss from L2 in bigger, slower L2 design: 10% (relative)
      • penalty for misses to bigger, slower L2: 20 cycles
    2. In the light of (i), comment on the trend towards faster but smaller on-chip caches.

QUESTION 2—Amdahl’s Law

  1. You are travelling to Sydney by car. Halfway there, you realize you have been going too slowly: in fact, you are taking twice as long as you should. Explain what if anything you can do to get to Sydney on time. Think about how Amdahl’s Law applies.
    If you have travelled half the distance at half the speed, you have used up all your time so you will need infinite speedup. Look at the Amdahl’s Law speedup formula: you should be able to make it apply to this case.
  2. It is proposed that a new fast train be introduced, linking Brisbane and Gold Coast, over a distance of 50km, with 2 stops on the way. Assume the train averages 150km/h while moving, but needs 10min at each stop to load and unload. What is the speedup versus a train which averages 75km/h but which does not stop on the way?
    We need tfast and tslow to work out speedup:
    tfast = tmoving + tstopped = 50/150 h + 2 x 10 min = 40 min
    tslow = tmoving = 50/75 h = 40 min
    so speedup = 1 (i.e., not improvement)

QUESTION 3—Performance Measurement

  1. To evaluate new computer architectures, you can use back of the envelope calculations (e.g., question 1), running real-world applications on a simulation of a new design or running traces of memory references from applications from an existing machine through a memory simulator. Trace file entries are type_of_operation memory_address where type of operation is usually one of fetch, read or write. Explain why any of the performance measures do or don’t apply in these situations:
    1. You are experimenting with variations on the pipeline timing but otherwise the design is very similar.
      Trace files could be used here but aren’t very accurate because they don’t capture details like which specific instructions executed resulting in a particular memory reference pattern. A more detailed architecture simulation would be better here.
    2. The pipeline design is just like an existing machine, but you are experimenting with variations in DRAM speed.
      Depending how much accuracy you need, either approach could apply.
    3. You are planning significant additions to the instruction set with 2 possibilities:
      1. the instructions can be generated by an experimental compiler
        You could extend an existing simulator if one existed, which is the best way to see how new instructions change performance – a trace file would not have been generated by the new instructions and since the information about what the instructions were in the run which was recorded is lost, you wouldn’t even know when he new instructions could have been used.
      2. the new instructions are too different from any existing ISA to create an experimental compiler easily
        This kind of case is hard to measure in detail – you may have to revert to back of the envelope studies until you can create a compiler, or hand-write machine code, then run it through a simulator.
  2. Another way to measure where time is being spent in a program is to use a profiler. A profiler uses various techniques like interrupting the program at random intervals, or taking measurements at specific control points (e.g., procedure call, on any change of control flow). What kinds of architecture measurements do you think profiling could apply to?
    A profiler is a much less precise form of measurement. You can pick up big hot spots e.g. a part of the program where the memory hierarchy is causing big delays. This can help e.g., with picking up program behaviours which are bad for caches or the TLB (see later when we look at memory).

QUESTION 4—More Exercise

  1. Work through the examples in Chapter 1.
  2. Work through questions at the end of Chapter 1. Warning: the exercises in the book can sometimes be hard e.g. because details are left out, requiring you to make assumptions.

Last update: Wednesday, 24 August 2005 at 5:01 PM