Lecture 14: Memory Hierarchy and Caching

Lectures on Computer Architecture

By Dr. Isuru Nawinne

14.1 Introduction

This lecture marks a crucial transition from CPU-centric topics to memory systems, introducing cache memory as the elegant solution to the fundamental processor-memory speed gap. We begin with historical context, tracing how the stored-program concept revolutionized computing, then explore the memory hierarchy that creates the illusion of large, fast memory through careful exploitation of temporal and spatial locality. The direct-mapped cache organization receives detailed treatment, establishing the foundational concepts of blocks, tags, indices, and valid bits that underpin all cache designs. Understanding cache memory proves as essential as understanding processor architecture, since memory system performance often determines overall system speed in practice.

14.2 Lecture Introduction and Historical Context

14.2.1 Lecture Transition

Previous Topics:

New Focus:

Motivation:

14.2.2 Historical Background

Early Computing Machines (1940s)

Examples:

Characteristics:

Programming Method:

14.2.3 Key Historical Figures

Alan Turing (1936)

John von Neumann (1940s)

14.2.4 First Stored Program Computers

EDVAC (1949)

Von Neumann Architecture

Key Concept:

EDSAC (Cambridge University)

Harvard Architecture (Contrasted)

Modern Computers:

14.3 Memory Technologies: Types and Characteristics

14.3.1 Commonly Used Memory Technologies Today

14.3.2 SRAM (Static RAM)

Property         Value/Description
Technology       Built using flip-flops
Volatility       Volatile (loses content when power is lost)
Access Time      Less than 1 nanosecond (< 1 ns)
Clock Frequency  More than 1 GHz
Cycle Time       Less than 1 nanosecond (< 1 ns)
Capacity         Kilobytes to megabytes range
Cost             ~$2000 per gigabyte (VERY EXPENSIVE)
Speed            Extremely fast
Usage            Cache memories (small amounts due to cost)

Note on Cycle Time:

14.3.3 DRAM (Dynamic RAM)

Property     Value/Description
Technology   Transistors + capacitors
Volatility   Volatile (requires power AND periodic refresh)
Access Time  ~25 nanoseconds (~50 ns in some contexts)
Cycle Time   ~50 nanoseconds (roughly double the access time)
Capacity     Gigabytes (8 GB, 16 GB, or more)
Cost         ~$10 per gigabyte
Usage        Main memory in computers

Key Characteristics:

14.3.4 Flash Memory

Property     Value/Description
Technology   NAND flash (floating-gate MOSFET, a transistor with two gates)
Volatility   Non-volatile (retains data without power)
Access Time  ~70 nanoseconds
Cycle Time   ~70 nanoseconds
Capacity     Gigabytes range
Cost         Less than $1 per gigabyte
Usage        Secondary storage (SSDs: Solid State Drives)

Limitation:

14.3.5 Magnetic Disk

Property     Value/Description
Technology   Magnetic (mechanical device)
Access Time  5 to 10 milliseconds (MUCH slower than electronic memory!)
Cycle Time   Similar to access time (~5-10 ms)
Capacity     Several terabytes
Cost         Fraction of a dollar per gigabyte (very cheap)

Usage:

Note: These are average figures; actual access time varies with where the data sits on the disk. The device is mechanical: spinning platters and moving read/write heads.

14.4 The Memory Performance Problem

14.4.1 The CPU-Memory Speed Gap

CPU Clock Cycle:

Main Memory (DRAM):

14.4.2 The Problem

Speed Discrepancy:

14.4.3 Impact on Pipelining

The Challenge:

The Contradiction:

14.5 Memory Hierarchy Concept

14.5.1 The Solution: Memory Hierarchy

Core Idea:

14.5.2 Memory Hierarchy Structure

Figure: Memory hierarchy with SRAM cache, DRAM main memory, and disk storage

Level 1 (Top): SRAM (Cache)

Level 2: DRAM (Main Memory)

Level 3 (Bottom): Disk

14.5.3 Key Principles

1. CPU Access Restriction

2. CPU's Perception

3. Data Organization

4. Hierarchy Characteristics

14.5.4 The Challenge

What if CPU asks for data NOT in the cache (top level)?

14.6 Analogy: Music Library

14.6.1 Understanding Memory Hierarchy Through Music

Three-Level Music System

1. Mobile Phone (analogous to SRAM/Cache):

2. Computer Hard Disk (analogous to DRAM/Main Memory):

3. Internet (analogous to Disk/Mass Storage):

14.6.2 Usage Scenarios

Scenario 1 (Hit)

Scenario 2 (Miss to Level 2)

Scenario 3 (Miss to Level 3)

14.6.3 Key Parallels

14.7 Memory Hierarchy Terminology

14.7.1 Essential Terms for Memory Access

HIT

Definition: Requested data IS available at the accessed level

MISS

Definition: Requested data is NOT available at the accessed level

HIT RATE

Definition: Ratio/percentage of accesses that result in hits

Formula:

Hit Rate = (Number of Hits) / (Total Accesses)

Example: 100 accesses, 90 hits → Hit Rate = 90% or 0.9

Indicates how often data is found at the accessed level. Higher hit rate = better performance.

MISS RATE

Definition: Ratio/percentage of accesses that result in misses

Formula:

Miss Rate = (Number of Misses) / (Total Accesses)
Miss Rate = 1 - Hit Rate

Example: 100 accesses, 10 misses → Miss Rate = 10% or 0.1

Lower miss rate = better performance.

HIT LATENCY

Definition: Time taken to determine if access is a hit AND serve the data

Components:

MISS PENALTY

Definition: EXTRA time required when access is a miss

Process:

  1. Determine it's a miss (hit latency spent)
  2. Go to next level (DRAM)
  3. Find the data
  4. Copy to cache
  5. Put in appropriate place
  6. Deliver to CPU
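
With the example figures used in Section 14.8 (1 ns hit latency, 100 ns miss penalty), a missed access therefore costs roughly 1 ns + 100 ns = 101 ns end to end, about 100 times the cost of a hit.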

Key Points:

14.8 Performance Impact and Requirements

14.8.1 Average Memory Access Time

Formula:

Average Access Time = Hit Latency + (Miss Rate × Miss Penalty)

Explanation:

14.8.2 Example Analysis

Given:

For Pipeline to Work

Required Hit Rate Calculation

If Hit Rate = 99.9% (Miss Rate = 0.1%):

Average Time = 1 ns + (0.001 × 100 ns)
             = 1 ns + 0.1 ns
             = 1.1 ns

Still close to 1 clock cycle!

If Hit Rate = 90% (Miss Rate = 10%):

Average Time = 1 ns + (0.10 × 100 ns)
             = 1 ns + 10 ns
             = 11 ns

Unacceptable! 11× slower than CPU clock!
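
The same arithmetic in a small C helper (a sketch using the example numbers above, not part of the lecture):

#include <stdio.h>

/* Average Access Time = Hit Latency + (Miss Rate x Miss Penalty) */
double average_access_time(double hit_latency_ns, double miss_rate,
                           double miss_penalty_ns) {
    return hit_latency_ns + miss_rate * miss_penalty_ns;
}

int main(void) {
    printf("99.9%% hit rate: %.1f ns\n", average_access_time(1.0, 0.001, 100.0));
    printf("90.0%% hit rate: %.1f ns\n", average_access_time(1.0, 0.100, 100.0));
    return 0;
}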

14.8.3 Critical Requirement

14.8.4 Performance Implications

With 99.9% Hit Rate

With Lower Hit Rate

Conclusion:

14.9 Principles of Locality

14.9.1 Foundation for Memory Hierarchy Success

Nature of Computer Programs:

14.9.2 Temporal Locality (Locality in Time)

Definition

"Recently accessed data are likely to be accessed again soon"

Explanation:

Common Examples in Programs

a) Loop Index Variables:

for (int i = 0; i < 100; i++) {
    // i is accessed every iteration
    // Same memory location for 'i' accessed repeatedly
}

b) Loop-Invariant Data:

for (int i = 0; i < n; i++) {
    result = result + array[i] * constant;
    // 'result' and 'constant' accessed every iteration
}

c) Function/Procedure Calls:
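
A frequently called function's instructions are fetched from the same memory locations on every call. A small illustration (square() is a hypothetical helper, not from the lecture):

int square(int x) { return x * x; }

for (int i = 0; i < n; i++) {
    total += square(data[i]);
    // square()'s instructions are fetched from the same
    // addresses on every iteration
}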

d) Instructions:

Music Analogy

Degree of Temporal Locality

14.9.3 Spatial Locality (Locality in Space)

Definition

"Data located close to recently accessed data are likely to be accessed soon"

Explanation:

Common Examples in Programs

a) Array Traversal:

for (int i = 0; i < 100; i++) {
    sum += array[i];
    // Access array[0], then array[1], then array[2], ...
    // Sequential memory addresses
}

b) Sequential Instruction Execution:

c) Data Structures:

struct Student {
    int id;
    char name[50];
    float gpa;
};
struct Student s;
// Accessing s.id, then s.name, then s.gpa
// Nearby memory locations

d) String Processing:

char str[] = "Hello";
for (int i = 0; str[i] != '\0'; i++) {
    // Access str[0], str[1], str[2], ...
    // Consecutive bytes in memory
}

Music Analogy

Degree of Spatial Locality

14.9.4 Universal Applicability

14.10 Cache Memory Concept and Block-Based Operation

14.10.1 Cache Memory Overview

Purpose:

14.10.2 Data Organization: BLOCKS

Key Concepts:

14.10.3 Why Blocks? (Spatial Locality)

Instead of Words

Using Blocks

Music Library Analogy

Block Benefits

14.10.4 Cache Management Decisions

1. What to Keep in Cache

2. What to Evict from Cache

14.10.5 Eviction Strategy (Ideal)

Least Recently Used (LRU):

Example:
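
A minimal sketch of the LRU idea, assuming a small cache whose entries carry logical-time stamps (the names and 4-entry size are assumptions; real caches approximate LRU in hardware):

#define NUM_ENTRIES 4

long last_used[NUM_ENTRIES];   /* logical time of each entry's last access */
long now = 0;

/* Stamp an entry whenever it is accessed (on a hit or a fill). */
void touch(int entry) { last_used[entry] = now++; }

/* On a miss with a full cache, evict the entry with the oldest stamp. */
int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < NUM_ENTRIES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}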

14.11 Memory Addressing: Bytes, Words, and Blocks

14.11.1 Byte Address

Definition: Address referring to individual byte in memory

Characteristics:

Example Address:

Address: 00000000000000000000000000001010 (binary)
       = 10 (decimal)
Points to: Byte at memory location 10

Memory Structure:

Address 0:  [byte 0]
Address 1:  [byte 1]
Address 2:  [byte 2]
...
Address 10: [byte 10]  ← This byte addressed by example
...

14.11.2 Word Address

Definition: Address referring to a word (multiple bytes) in memory

Typical Word Size: 4 bytes (32 bits)

Word Alignment

Word Address Format (32-bit)

[30-bit word identifier][2-bit byte offset]
                         └── Always "00" for word-aligned addresses

Example:

Address: ...00001000 (binary)
- Last 2 bits: 00 → Word-aligned
- Remaining bits: Identify which word
- This is address 8, start of word 2

Byte Within Word

Last 2 bits select the byte within the word: 00 → byte 0, 01 → byte 1, 10 → byte 2, 11 → byte 3.

Key Points:

14.11.3 Block Address

Definition: Address referring to a block (multiple words) in memory

Example Block Size: 8 bytes = 2 words

Block Alignment

Block Address Format (32-bit)

[Block Identifier][3-bit offset]
                   └── Last 3 bits for 8-byte blocks

Example:

Address: 00000000000000000000000000101101 (binary)
       = 45 (decimal)

Block Address Portion:
- Ignore last 3 bits: 101 (offset part)
- Block address: 00000000000000000000000000101 (identifies block)
- This identifies the block containing address 45

Offset Within Block (3 bits for 8-byte blocks)

BYTE OFFSET (all 3 bits):

Identifies the individual BYTE within the block: 000 → byte 0 up through 111 → byte 7.

WORD OFFSET (most significant bit of the offset):

Identifies the WORD within the block (when the block holds 2 words): 0 → first word, 1 → second word.

14.11.4 Address Components Summary

For address with 8-byte blocks, 4-byte words:

[Block Address][Word Offset][Byte in Word]
  ^              ^              ^
  |              |              └── 2 bits: Select byte within word
  |              └── 1 bit: Select word within block
  └── Remaining bits: Identify which block

Example Breakdown:

Address: ...00101101
- Last 2 bits (01): Byte offset within word → Byte 1 of word
- 3rd bit from right (1): Word offset → Second word of block
- Remaining bits (...00101): Block address → Block 5
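
These fields fall out with simple shifts and masks. A small C sketch for the layout above (8-byte blocks, 4-byte words; assumes <stdint.h>, and the variable names are illustrative):

uint32_t addr = 45;                           /* ...00101101 */

uint32_t byte_in_word  = addr & 0x3;          /* last 2 bits        -> byte 1  */
uint32_t word_in_block = (addr >> 2) & 0x1;   /* word offset bit    -> word 1 (second word) */
uint32_t block_address = addr >> 3;           /* drop 3 offset bits -> block 5 */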

All Bytes in Same Block:

Important Distinctions:

14.12 The Cache Addressing Problem

14.12.1 Problem Statement

In Main Memory

In Cache

14.12.2 The Challenge

14.12.3 Initial Solution Idea: Store Addresses with Data

Approach:

Problems with This Approach:

1. Space Overhead:

2. Search Time:

14.12.4 Need for Better Solution

Requirements for Practical Cache:

  1. Fast access (< 1 ns hit latency)
  2. Minimal storage overhead
  3. Direct or near-direct cache indexing
  4. Efficient tag comparison (if needed)

Solution Preview: Address Mapping Functions

14.13 Direct-Mapped Cache

14.13.1 Direct Mapping Concept

Definition:

Mapping Rule:

Cache Index = Block Address MOD (Number of Blocks in Cache)

Formula:

Cache Index = (Block Address) mod (Cache Size in Blocks)

Example:

14.13.2 Mathematical Properties

Mod Operation with Powers of 2

Hardware Implementation
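
Because the number of cache blocks is a power of two, the mod needs no division hardware: the cache index is simply the low-order bits of the block address. A small C sketch (names are illustrative):

#define CACHE_BLOCKS 8   /* must be a power of two */

/* block_addr mod 8 is the same as keeping the low 3 bits */
unsigned cache_index(unsigned block_addr) {
    return block_addr & (CACHE_BLOCKS - 1);
}

For example, cache_index(5) and cache_index(13) both return 5, which is exactly the conflict examined in Section 14.14.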

14.13.3 Direct Mapping Example

Given:

Cache Structure (Initial View):

Index  Data Block
0      [64 bits]
1      [64 bits]
2      [64 bits]
3      [64 bits]
4      [64 bits]
5      [64 bits]
6      [64 bits]
7      [64 bits]

Example Addresses:

Address 1:

Binary: ...00000001[011]
        └─ Block address = 1
        └─ Offset = 3 bytes
Cache index = 1 mod 8 = 1
Maps to cache index 1

Address 2 (block address in focus):

Binary: ...00000101[000]
        └─ Block address = 5
        └─ Offset = 0
Cache index = 5 mod 8 = 5
Maps to cache index 5

14.13.4 Address Structure for Direct-Mapped Cache

[Tag][Index][Offset]
  ^     ^       ^
  |     |       └── Identifies byte/word within block
  |     └── Identifies cache location (index)
  └── Remaining bits to differentiate blocks mapping to same index

Bit Allocation (for 8-block cache, 8-byte blocks, 32-bit address)

Offset: 3 bits (selects a byte within an 8-byte block)
Index:  3 bits (selects one of 8 cache blocks)
Tag:    remaining 26 bits

14.14 The Tag Problem in Direct-Mapped Cache

14.14.1 Conflict Issue

Multiple Blocks → Same Index:

Example Addresses Mapping to Index 5:

Address A:

Block address: ...00000101
Index bits (last 3): 101 → Index 5

Address B:

Block address: ...00001101
Index bits (last 3): 101 → Index 5

Both map to index 5, but different blocks!

14.14.2 The Problem

14.14.3 Solution: TAG FIELD

Tag Definition:

Tag = Block Address (excluding index bits)

Example Address Breakdown

Full Address:

[26-bit Tag][3-bit Index][3-bit Offset]

Address A

Binary representation:

00000000000000000000000000 101 000
Field   Bits (binary)                         Decimal
Tag     00000000000000000000000000 (26 bits)  0
Index   101                                   5
Offset  000                                   0

Address B

Binary representation:

00000000000000000000000001 101 000
Field   Bits (binary)                         Decimal
Tag     00000000000000000000000001 (26 bits)  1
Index   101                                   5
Offset  000                                   0

Both addresses map to cache index 5 but have different tag values (0 vs 1), so they refer to different memory blocks that conflict at the same cache index.

14.14.4 Cache Structure with Tags

Index  Valid  Tag   Data Block
0      V      Tag0  [64 bits]
1      V      Tag1  [64 bits]
2      V      Tag2  [64 bits]
3      V      Tag3  [64 bits]
4      V      Tag4  [64 bits]
5      V      Tag5  [64 bits]
6      V      Tag6  [64 bits]
7      V      Tag7  [64 bits]

Storage Requirements Per Cache Entry:

Storage Overhead:

Overhead = (Tag + Valid) / Total
         = (26 + 1) / (26 + 1 + 64)
         = 27 / 91
         ≈ 30% overhead in this small example

Note on Overhead

Example with Larger Cache:
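
A representative calculation with assumed parameters (not from the lecture): a 64 KB cache with 64-byte blocks has 1024 entries, so a 32-bit address splits into 6 offset bits, 10 index bits, and a 16-bit tag, and each entry stores 512 data bits:

Overhead = (16 + 1) / (16 + 1 + 512)
         = 17 / 529
         ≈ 3%

The relative overhead shrinks as blocks and caches grow.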

14.14.5 Valid Bit

Purpose:

Initial State:

After Data Loaded:

Uses Beyond Initialization:

14.15 Cache Read Access Operation

Direct-Mapped Cache Read Access Process

14.15.1 Read Access Process

CPU Provides:

  1. Address (word or byte address)
  2. Control Signal: Read/Write indicator (from control unit)

14.15.2 For Read Access

Step 1: ADDRESS BREAKDOWN

Process:

  1. Tag bits
  2. Index bits
  3. Offset bits

Example Address (32-bit):

[26-bit Tag][3-bit Index][3-bit Offset]

Step 2: INDEXING THE CACHE

Hardware:

Step 3: TAG COMPARISON

Comparator Circuit:

Example (4-bit tags):

Stored tag:   1 0 1 1
Address tag:  1 0 1 1
XNOR:         1 1 1 1  → AND = 1 (MATCH!)

Stored tag:   1 0 1 1
Address tag:  1 0 0 1
XNOR:         1 1 0 1  → AND = 0 (NO MATCH)

For N-bit tag:

Step 4: VALID BIT CHECK

Step 5: HIT/MISS DETERMINATION

Logic Circuit:

Tag Match Output ─┐
                   AND ─→ Hit/Miss Signal
Valid Bit ────────┘

Output:

Hit Latency:

Time for steps 2-5, dominated by:

Step 6: DATA EXTRACTION (Parallel with Tag Check)

Data Block:

Step 7: WORD SELECTION (Using Offset)

Multiplexer (MUX):

Example (2 words per block):

Example (4 words per block):

Timing:

Step 8: DECISION BASED ON HIT/MISS

If HIT (signal = 1):

If MISS (signal = 0):

Miss Handling:
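
Putting steps 1-8 together, a minimal C sketch of the read path, assuming the running example (8 entries, 8-byte blocks with 2 words each, 32-bit addresses; all names are illustrative, and miss handling is deferred to the next lecture):

#include <stdint.h>
#include <stdbool.h>

#define NUM_ENTRIES 8

struct cache_entry {
    bool     valid;     /* does the entry hold meaningful data?      */
    uint32_t tag;       /* upper 26 address bits of the stored block */
    uint32_t word[2];   /* 8-byte block = two 4-byte words           */
};

static struct cache_entry cache[NUM_ENTRIES];

/* Returns true on a hit and places the requested word in *out. */
bool cache_read(uint32_t addr, uint32_t *out) {
    uint32_t index    = (addr >> 3) & 0x7;  /* Steps 1-2: 3 index bits   */
    uint32_t tag      = addr >> 6;          /* Step 1: remaining 26 bits */
    uint32_t word_sel = (addr >> 2) & 0x1;  /* Step 7: word offset bit   */

    struct cache_entry *e = &cache[index];

    /* Steps 3-5: (tag match) AND (valid) -> hit/miss signal */
    bool hit = e->valid && (e->tag == tag);

    /* Steps 6-7 (parallel in hardware): read block, MUX selects word */
    if (hit)
        *out = e->word[word_sel];

    return hit;  /* Step 8: on a miss, fetch from main memory */
}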

14.16 Cache Circuit Components Summary

14.16.1 Key Circuit Elements

1. INDEXING CIRCUITRY

2. TAG COMPARATOR

3. VALID BIT CHECK

4. HIT/MISS LOGIC

5. DATA ARRAY ACCESS

6. WORD SELECTOR (Multiplexer)

7. CONTROL LOGIC (Cache Controller)

14.16.2 Hit Latency Components

Contributing Factors:

Dominant Delays:

Parallelism:

14.17 Next Lecture Preview

14.17.1 Topics to Cover

1. Cache Miss Handling

2. Cache Controller State Machine

3. Write Operations

4. Replacement Policies

5. Performance Analysis

6. Advanced Cache Concepts

14.18 Key Takeaways and Summary

14.18.1 Historical Foundations

14.18.2 Memory Technologies Hierarchy

Technology  Speed                     Size              Cost
SRAM        Fastest (< 1 ns)          Smallest (KB-MB)  Most expensive (~$2000/GB)
DRAM        Medium (~50 ns)           Medium (GB)       Moderate (~$10/GB)
Flash       Similar to DRAM (~70 ns)  Gigabytes         Cheap (< $1/GB)
Disk        Slowest (5-10 ms)         Largest (TB)      Cheapest (cents/GB)

14.18.3 The Performance Problem

14.18.4 Memory Hierarchy Solution

14.18.5 Principles of Locality

1. Temporal Locality: Recently accessed data likely accessed again soon

2. Spatial Locality: Data near recently accessed data likely accessed soon

14.18.6 Memory Addressing

14.18.7 Cache Terminology

14.18.8 Cache Organization (Direct-Mapped)

14.18.9 Direct-Mapped Cache Structure

14.18.10 Cache Read Access Process

  1. Extract index from address → Access cache entry
  2. Extract tag from cache entry → Compare with address tag
  3. Check valid bit from entry
  4. Determine hit/miss: (Tag Match) AND (Valid)
  5. In parallel: Extract data block, select word using offset
  6. If HIT: Send word to CPU (done in < 1 ns)
  7. If MISS: Must fetch from memory (will cover next lecture)

14.18.11 Critical Requirements

14.18.12 Average Access Time Formula

Average Access Time = Hit Latency + (Miss Rate × Miss Penalty)

14.18.13 Pending Topics (Next Lectures)

14.18.14 Music Library Analogy Summary

Key Takeaways

  1. Stored-program concept revolutionized computing—programs stored in memory like data, eliminating manual reconfiguration for each algorithm.
  2. Von Neumann architecture established fundamental computer organization—CPU, memory, and I/O with instructions and data sharing same memory.
  3. Processor-memory speed gap creates performance bottleneck—CPU operates at nanosecond scale while main memory requires tens of nanoseconds.
  4. Memory hierarchy provides illusion of large, fast memory—small fast cache near CPU, larger slower DRAM main memory, massive slow disk storage.
  5. Temporal locality: Recently accessed data likely accessed again soon—programs exhibit loops, function calls, and repeated variable access patterns.
  6. Spatial locality: Nearby data likely accessed soon—programs access arrays sequentially and instructions execute in order.
  7. Cache exploits locality to achieve high hit rates—keeping frequently accessed data in fast storage dramatically improves average access time.
  8. Cache organized in blocks, not individual words—exploiting spatial locality by fetching multiple words together.
  9. Direct-mapped cache: Each memory block maps to exactly one cache location—simplest cache organization using modulo arithmetic for mapping.
  10. Address breakdown: Tag + Index + Offset—index selects cache entry, tag identifies specific block, offset selects word within block.
  11. Valid bit indicates cache entry contains meaningful data—essential for distinguishing real data from uninitialized entries at startup.
  12. Cache hit occurs when requested data found in cache—CPU receives data in ~1 nanosecond, avoiding slow main memory access.
  13. Cache miss requires main memory fetch—takes ~100 nanoseconds, replacing cache entry with new block from memory.
  14. Hit rate determines cache effectiveness—even 1% miss rate significantly impacts average memory access time with 100× penalty.
  15. Block size affects performance—larger blocks exploit spatial locality better but reduce total number of blocks, potentially increasing conflicts.
  16. Cache size represents total data storage capacity—typical L1 caches 32-64 KB, L2 caches 256 KB-1 MB.
  17. Tag comparison happens in parallel with data access—enabling fast hit detection and maintaining single-cycle cache access.
  18. Music library analogy clarifies cache concept—phone (cache) holds favorites, computer (DRAM) has main collection, internet (disk) contains everything.
  19. Cache transparent to programmer—software sees uniform memory, hardware manages cache automatically for best performance.
  20. Memory hierarchy only works because programs exhibit locality—without temporal and spatial locality, caching would fail catastrophically.

Summary

The introduction to memory systems and cache memory reveals how the fundamental processor-memory speed gap—with CPUs operating 100× faster than main memory—drives sophisticated cache hierarchy designs that create the illusion of large, fast memory. Historical context from Alan Turing's theoretical foundations through Von Neumann's stored-program architecture establishes how modern computers execute instructions fetched from memory rather than requiring manual reconfiguration.

The memory hierarchy concept, with small fast SRAM caches near the CPU, larger slower DRAM main memory, and massive disk storage, exploits two fundamental program properties: temporal locality (recently accessed data likely accessed again soon) and spatial locality (nearby data likely accessed soon). Cache memory, organized in blocks rather than individual words, dramatically improves average access time by maintaining frequently accessed data in fast storage, achieving hit rates often exceeding 95% in practice.

Direct-mapped cache organization, the simplest mapping scheme, uses modulo arithmetic to assign each memory block to exactly one cache location, with address bits divided into tag (identifying the specific block), index (selecting the cache entry), and offset (choosing the word within the block). The valid bit distinguishes real cached data from uninitialized entries, essential at system startup when the cache contains random values. Cache hits deliver data in approximately 1 nanosecond while misses require ~100 nanosecond main memory access, making even small miss rates significant—a 1% miss rate doubles average access time from 1 ns to 2 ns.

The music library analogy effectively clarifies the concepts: phone storage represents cache (small, fast, always accessible), computer storage represents main memory (larger, slower, the main collection), and internet streaming represents disk (unlimited, very slow, backup). This cache transparency—the programmer sees uniform memory while hardware automatically manages caching—enables software compatibility across different cache configurations. The critical insight remains that memory hierarchy effectiveness depends entirely on programs exhibiting locality; without these natural access patterns inherent to how we write code, caching would provide no benefit. Understanding cache fundamentals proves essential for both hardware designers optimizing cache architectures and software developers writing cache-friendly code that maximizes hit rates.