

HUGH=CHESSER= CSEB=1012U

W13-M

### Quiz/Exam Sample Question

Show the contents of the pipeline registers for instructions going through the pipeline (5 cycles after the first instruction begins), identify any data hazards:

```
lw $10,20($1)
sub $11,$2,$3
and $12,$10,$5
or $13,$11,$7
and $14,$8,$9
```

Pipeline - Cycle # 5



## **ALU Control Actions**

| Instruction opcode | ALUOp | Instruction operation | Function code | Desired<br>ALU action | ALU control<br>input |
|--------------------|-------|-----------------------|---------------|-----------------------|----------------------|
| LW                 | 00    | load word             | XXXXXX        | add                   | 0010                 |
| SW                 | 00    | store word            | XXXXXX        | add                   | 0010                 |
| Branch equal       | 01    | branch equal          | XXXXXX        | subtract              | 0110                 |
| R-type             | 10    | add                   | 100000        | add                   | 0010                 |
| R-type             | 10    | subtract              | 100010        | subtract              | 0110                 |
| R-type             | 10    | AND                   | 100100        | AND                   | 0000                 |
| R-type             | 10    | OR                    | 100101        | OR                    | 0001                 |
| R-type             | 10    | set on less than      | 101010        | set on less than      | 0111                 |

-**W**-13-**M** 

4

## Action of Pipeline Control Signals

| Signal name | Effect when deasserted (0)                                                                   | Effect when asserted (1)                                                                                |
|-------------|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| RegDst      | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).            |
| RegWrite    | None.                                                                                        | The register on the Write register input is written with the value on the Write data input.             |
| ALUSrc      | The second ALU operand comes from the second register file output (Read data 2).             | The second ALU operand is the sign-extended, lower 16 bits of the instruction.                          |
| PCSrc       | The PC is replaced by the output of the adder that computes the value of PC $+ 4$ .          | The PC is replaced by the output of the adder that computes the branch target.                          |
| MemRead     | None.                                                                                        | Data memory contents designated by the address input are put on the Read data output.                   |
| MemWrite    | None.                                                                                        | Data memory contents designated by the address input are replaced by the value on the Write data input. |
| MemtoReg    | The value fed to the register Write data input comes from the ALU.                           | The value fed to the register Write data input comes from the data memory.                              |

₩13-**]**M

## Control for Pipeline – Arranged by Pipeline Stage

|             | Execut |        | s calculatio<br>Il lines | n stage | Memory access stage control lines |              |               | Write-back stage control lines |               |
|-------------|--------|--------|--------------------------|---------|-----------------------------------|--------------|---------------|--------------------------------|---------------|
| Instruction | RegDst | ALUOp1 | ALUOp0                   | ALUSrc  | Branch                            | Mem-<br>Read | Mem-<br>Write | Reg-<br>Write                  | Memto-<br>Reg |
| R-format    | 1      | 1      | 0                        | 0       | 0                                 | 0            | 0             | 1                              | 0             |
| 1 w         | 0      | 0      | 0                        | 1       | 0                                 | 1            | 0             | 1                              | 1             |
| SW          | Х      | 0      | 0                        | 1       | 0                                 | 0            | 1             | 0                              | Х             |
| beq         | Х      | 0      | 1                        | 0       | 1                                 | 0            | 0             | 0                              | Х             |

Controls same as before for the single or multi-cycle implementations, rearranged according to pipeline stage

₩13-M

## Topics:

1. Memory, Caches

Patterson: 5.1, 5.2

# Memory Technologies

- Static RAM (SRAM)
  - flip flops (transistors)
  - 0.5ns 2.5ns, \$2000 \$5000 per GB
- Dynamic RAM (DRAM)
  - Capacitor transistor pairs requires refresh
  - 50ns 70ns, \$20 \$75 per GB
- Magnetic disk
  - Nonvolatile data stored in magnetic field
  - 5ms 20ms, \$0.20 \$2 per GB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

# Memory Technologies (2)





Source: Wikipedia











**DRAM** 

# Memory Heirarchy

- In order to reduce access time, while keeping costs down computers employ a mixture of memory types
- Memory is arranged in an heirarchy
  - Small amount of fast, expensive memory
  - Larger amount of slower, cheaper memory
- Utilizes the "principle of locality" cache memory
  - Temporal memory accessed recently tends to be accessed again
  - Spatial adjacent memory locations to ones accessed recently are also likely to be needed

# Taking Advantage of Locality

- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

# Memory Hierarchy Levels



- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
    - Hit ratio: hits/accesses

### If accessed data is absent

- Miss: block copied from lower level
  - Time taken: miss penalty
  - Miss ratio: misses/accesses
    - = 1 hit ratio
- Then accessed data supplied from upper level

- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses X<sub>1</sub>, ..., X<sub>n-1</sub>, X<sub>n</sub>

| X <sub>4</sub>   |
|------------------|
| X <sub>1</sub>   |
| $X_{n-2}$        |
|                  |
| X <sub>n-1</sub> |
| X <sub>2</sub>   |
|                  |
| X <sub>3</sub>   |

| $X_4$            |  |
|------------------|--|
| X <sub>1</sub>   |  |
| $X_{n-2}$        |  |
|                  |  |
| X <sub>n-1</sub> |  |
| X <sub>2</sub>   |  |
| $X_n$            |  |
| X <sub>3</sub>   |  |

b. After the reference to  $X_n$ 

- How do we know if the data is present?
- Where do we look?

## **Direct Mapped Cache**

- Location determined by address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)



- # of Blocks is a power of 2
- Use low-order address bits

## Tags and Valid Bits

- How do we know which particular block is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0

# Cache Example (1)

- 8-blocks, 1 word/block, direct mapped
- Initial state

| Index | V | Tag | Data |
|-------|---|-----|------|
| 000   | N |     |      |
| 001   | N |     |      |
| 010   | N |     |      |
| 011   | N |     |      |
| 100   | N |     |      |
| 101   | N |     |      |
| 110   | N |     |      |
| 111   | N |     |      |

# Cache Example (2)

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Miss     | 110         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | N |     |            |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Υ | 10  | Mem[10110] |
| 111   | N |     |            |

<u>-W</u>13-<u>M</u>

# Cache Example (3)

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 26        | 11 010      | Miss     | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Υ | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Υ | 10  | Mem[10110] |
| 111   | N |     |            |

# Cache Example (4)

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 22        | 10 110      | Hit      | 110         |
| 26        | 11 010      | Hit      | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | N |     |            |
| 001   | N |     |            |
| 010   | Υ | 11  | Mem[11010] |
| 011   | N |     |            |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Υ | 10  | Mem[10110] |
| 111   | N |     |            |

# Cache Example (5)

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 16        | 10 000      | Miss     | 000         |
| 3         | 00 011      | Miss     | 011         |
| 16        | 10 000      | Hit      | 000         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | Y | 10  | Mem[10000] |
| 001   | N |     |            |
| 010   | Υ | 11  | Mem[11010] |
| 011   | Υ | 00  | Mem[00011] |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Υ | 10  | Mem[10110] |
| 111   | N |     |            |

**W13-**M

# Cache Example (6)

| Word addr | Binary addr | Hit/miss | Cache block |
|-----------|-------------|----------|-------------|
| 18        | 10 010      | Miss     | 010         |

| Index | V | Tag | Data       |
|-------|---|-----|------------|
| 000   | Υ | 10  | Mem[10000] |
| 001   | N |     |            |
| 010   | Y | 10  | Mem[10010] |
| 011   | Υ | 00  | Mem[00011] |
| 100   | N |     |            |
| 101   | N |     |            |
| 110   | Υ | 10  | Mem[10110] |
| 111   | N |     |            |

## Address Subdivision



## Example: Larger Block Size

- 64 blocks, 16 bytes/block
  - To what block number does address 1207 map?
- Block address = [1207/16] = 75
- Block number = 75 modulo 64 = 11

| 31 |         | 10 | 9 4        | 3 0        |
|----|---------|----|------------|------------|
|    |         | 01 | 001011     | 0111       |
| 31 | 22 bits | 10 | 9 6 bits 4 | 3 4 bits 0 |
|    | Tag     |    | Index      | Offset     |
|    | 22 bits |    | 6 bits     | 4 bits     |

W13-M

### **Block Size Considerations**

- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks ⇒ fewer of them
    - More competition ⇒ increased miss rate
- Larger miss penalty
  - Can override benefit of reduced miss rate

### Cache Misses

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

# Write-Through

- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1×100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

W13-M

### Write-Back

- Alternative: On data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

W13-M

## Example: Intrinsity FastMATH

- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

# Example: Intrinsity FastMATH



# Main Memory Supporting Caches

- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty =  $1 + 4 \times 15 + 4 \times 1 = 65$  bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

\_W\_13-M⁻

30

# Increasing Memory Bandwidth



- 4-word wide memory
  - Miss penalty = 1 + 15 + 1 = 17 bus cycles
  - Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
- 4-bank interleaved memory
  - Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  - Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle

 a. One-word-wide memory organization

Memory

## Advanced DRAM Organization

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs