

#### -C-SE==2021== -C-O-MFUTE-R==-O-R-GANIZATIO-N

D R

#### Agenda

#### Topics:

1. Memory, Caches

Patterson: 5.1, 5.2

#### Memory Technologies

- Static RAM (SRAM)
  - flip flops (transistors)
  - 0.5ns 2.5ns, \$2000 \$5000 per GB
- Dynamic RAM (DRAM)
  - Capacitor transistor pairs requires refresh
  - 50ns 70ns, \$20 \$75 per GB
- Magnetic disk
  - Nonvolatile data stored in magnetic field
  - 5ms 20ms, \$0.20 \$2 per GB
- Ideal memory
  - Access time of SRAM
  - Capacity and cost/GB of disk

#### Memory Technologies (2)



#### Memory Heirarchy

- In order to reduce access time, while keeping costs down computers employ a mixture of memory types
- Memory is arranged in an heirarchy
  - Small amount of fast, expensive memory
  - Larger amount of slower, cheaper memory
- Utilizes the "principle of locality" cache memory
  - Temporal memory accessed recently tends to be accessed again
  - Spatial adjacent memory locations to ones accessed recently are also likely to be needed

#### Memory Heirarchy

- In order to reduce access time, while keeping costs down computers employ a mixture of memory types
- Memory is arranged in an heirarchy
  - Small amount of fast, expensive memory
  - Larger amount of slower, cheaper memory
- Utilizes the "principle of locality" cache memory
  - Temporal memory accessed recently tends to be accessed again
  - Spatial adjacent memory locations to ones accessed recently are also likely to be needed

## Taking Advantage of Locality

- Memory hierarchy
- Store everything on disk
- Copy recently accessed (and nearby) items from disk to smaller DRAM memory
  - Main memory
- Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory
  - Cache memory attached to CPU

#### Memory Hierarchy Levels



- Block (aka line): unit of copying
  - May be multiple words
- If accessed data is present in upper level
  - Hit: access satisfied by upper level
    - Hit ratio: hits/accesses
  - If accessed data is absent
    - Miss: block copied from lower level
      - Time taken: miss penalty
      - Miss ratio: misses/accesses
        = 1 hit ratio
    - Then accessed data supplied from upper level

#### Cache Memory

- Cache memory
  - The level of the memory hierarchy closest to the CPU
- Given accesses  $X_1, ..., X_{n-1}, X_n$



a. Before the reference to  $X_n$ 



b. After the reference to  $X_n$ 

- How do we know if the data is present?
- Where do we look?



#### **Direct Mapped Cache**

- Location determined by address
- Direct mapped: only one choice
  - (Block address) modulo (#Blocks in cache)



- # of Blocks is a power of 2
- Use low-order address bits

#### Tags and Valid Bits

- How do we know which particular block is stored in a cache location?
  - Store block address as well as the data
  - Actually, only need the high-order bits
  - Called the tag
- What if there is no data in a location?
  - Valid bit: 1 = present, 0 = not present
  - Initially 0

## Cache Example (1)

- 8-blocks, 1 word/block, direct mapped
- Initial state

| Index | V | Тад | Data |
|-------|---|-----|------|
| 000   | Ν |     |      |
| 001   | Ν |     |      |
| 010   | Ν |     |      |
| 011   | Ν |     |      |
| 100   | Ν |     |      |
| 101   | Ν |     |      |
| 110   | Ν |     |      |
| 111   | Ν |     |      |
|       |   |     |      |

#### Cache Example (2)

| <b>,</b>                  | e block<br>10 |
|---------------------------|---------------|
| Index V Tag Data<br>000 N |               |
| 001 N                     |               |
| 010 N                     |               |
| 011 N                     |               |
| 100 N                     |               |
| 101 N                     |               |
| 110 Y 10 Mem[10110]       |               |
| 111 N                     |               |

-

#### Cache Example (3)

| Word<br>20               |             | Binary addr<br>11 010 |      | Hit/miss<br>Miss | Cache block<br>010 |
|--------------------------|-------------|-----------------------|------|------------------|--------------------|
| Index<br>000<br>001      | V<br>N<br>N | Tag                   | Data | 9                |                    |
| <b>010</b><br>011<br>100 | Y<br>N<br>N | 11                    | Men  | n[11010]         |                    |
| 101<br>110<br>111        | N<br>Y<br>N | 10                    | Men  | n[10110]         |                    |

-

#### Cache Example (4)

| Word addr |   | Binary addr |      | Hit/miss | Cache block |
|-----------|---|-------------|------|----------|-------------|
| 22        |   | 10 110      |      | Hit      | 110         |
| 26        |   | 11 010      |      | Hit      | 010         |
| Index     | V | Tag         | Data | 1        |             |
| 000       | Ν |             |      |          |             |
| 001       | Ν |             |      |          |             |
| 010       | Y | 11          | Men  | n[11010] |             |
| 011       | Ν |             |      |          |             |
| 100       | Ν |             |      |          |             |
| 101       | Ν |             |      |          |             |
| 110       | Y | 10          | Mem  | n[10110] |             |
| 111       | Ν |             |      |          |             |

• 5

#### Cache Example (5)

| Word addr Binary a |           | addr Hit/miss | Cache block |     |
|--------------------|-----------|---------------|-------------|-----|
|                    | 16 10 000 |               | 0 Miss      | 000 |
| 3                  |           | 00 01         | 1 Miss      | 011 |
| 16                 |           | 10 00         | 0 Hit       | 000 |
| Index              | V         | Tag           | Data        |     |
| 000                | Υ         | 10            | Mem[10000]  |     |
| 001                | Ν         |               |             |     |
| 010                | Y         | 11            | Mem[11010]  |     |
| 011                | Υ         | 00            | Mem[00011]  |     |
| 100                | Ν         |               |             |     |
| 101                | Ν         |               |             |     |
| 110                | Y         | 10            | Mem[10110]  |     |
| 111                | Ν         |               |             |     |
|                    |           |               |             |     |

• 6

#### Cache Example (6)

| Word<br>18                      |             | Binary<br>10 (  |             | Hit/miss<br>Miss            | Cache block<br>010 |
|---------------------------------|-------------|-----------------|-------------|-----------------------------|--------------------|
| Index<br>000<br>001             | V<br>Y<br>N | Tag<br>10       | Data<br>Mer | a<br>n[10000]               |                    |
| <b>010</b><br>011<br>100<br>101 | Y<br>Y<br>N | <b>10</b><br>00 |             | <b>n[10010]</b><br>n[00011] |                    |
| 110<br>111                      | Y<br>N      | 10              | Mer         | n[10110]                    |                    |

-

#### Address Subdivision



٦,

#### **Example: Larger Block Size**

- 64 blocks, 16 bytes/block
  - To what block number does address 1207 map?
- Block address = [1207/16] = 75
- Block number = 75 modulo 64 = 11



#### **Block Size Considerations**

- Larger blocks should reduce miss rate
  - Due to spatial locality
- But in a fixed-sized cache
  - Larger blocks  $\Rightarrow$  fewer of them
    - More competition  $\Rightarrow$  increased miss rate
- Larger miss penalty
  - Can override benefit of reduced miss rate

#### Cache Misses

- On cache hit, CPU proceeds normally
- On cache miss
  - Stall the CPU pipeline
  - Fetch block from next level of hierarchy
  - Instruction cache miss
    - Restart instruction fetch
  - Data cache miss
    - Complete data access

#### Write-Through

- On data-write hit, could just update the block in cache
  - But then cache and memory would be inconsistent
- Write through: also update memory
- But makes writes take longer
  - e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles
    - Effective CPI = 1 + 0.1×100 = 11
- Solution: write buffer
  - Holds data waiting to be written to memory
  - CPU continues immediately
    - Only stalls on write if write buffer is already full

#### Write-Back

- Alternative: On data-write hit, just update the block in cache
  - Keep track of whether each block is dirty
- When a dirty block is replaced
  - Write it back to memory
  - Can use a write buffer to allow replacing block to be read first

#### Example: Intrinsity FastMATH

- Embedded MIPS processor
  - 12-stage pipeline
  - Instruction and data access on each cycle
- Split cache: separate I-cache and D-cache
  - Each 16KB: 256 blocks × 16 words/block
  - D-cache: write-through or write-back
- SPEC2000 miss rates
  - I-cache: 0.4%
  - D-cache: 11.4%
  - Weighted average: 3.2%

#### Example: Intrinsity FastMATH



# Main Memory Supporting Caches

- Use DRAMs for main memory
  - Fixed width (e.g., 1 word)
  - Connected by fixed-width clocked bus
    - Bus clock is typically slower than CPU clock
- Example cache block read
  - 1 bus cycle for address transfer
  - 15 bus cycles per DRAM access
  - 1 bus cycle per data transfer
- For 4-word block, 1-word-wide DRAM
  - Miss penalty =  $1 + 4 \times 15 + 4 \times 1 = 65$  bus cycles
  - Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

# **Increasing Memory Bandwidth**



- Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle
- 4-bank interleaved memory

a. One-word-wide

memory organization

- Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
- Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle

# Advanced DRAM Organization

- Bits in a DRAM are organized as a rectangular array
  - DRAM accesses an entire row
  - Burst mode: supply successive words from a row with reduced latency
- Double data rate (DDR) DRAM
  - Transfer on rising and falling clock edges
- Quad data rate (QDR) DRAM
  - Separate DDR inputs and outputs