17.1 Introduction
This lecture explores cache hierarchies in modern computer systems, examining how multiple levels of cache work together to optimize memory access performance through careful balance of hit latency versus hit rate. We analyze real-world implementations including Intel's Skylake architecture, understanding the design decisions behind multi-level cache organizations where L1 caches prioritize speed, L2 caches balance capacity and latency, and L3 caches provide large shared storage across processor cores. The examination of associativity tradeoffs—from direct-mapped through set-associative to fully associative designs—reveals how hardware complexity, power consumption, and performance interact in practical cache systems.
17.2 Recap: Associativity Comparison Results
From the previous lecture's example using a 4-block cache with three different organizations:
17.2.1 Direct Mapped Cache
- Result: 5 misses, 0 hits
- Cold misses: 3 (compulsory, unavoidable)
- Conflict misses: 2 (data evicted then accessed again)
- Utilization: Poor - only 2 of 4 slots used
- Hit rate: 0% in this example
17.2.2 2-Way Set Associative Cache
- Result: 4 misses, 1 hit
- Cold misses: 3
- Conflict misses: 1
- Utilization: 2 of 4 slots used
- Hit rate: 20% - better than direct mapped
17.2.3 Fully Associative Cache (4-way)
- Result: 3 misses, 2 hits
- Cold misses: 3 (only unavoidable misses)
- Conflict misses: 0
- Utilization: Best - 3 of 4 slots used
- Hit rate: 40% - best performance
17.2.4 Key Observations
- Higher associativity → better hit rate
- Higher associativity → reduced conflict misses
- Cold misses occur at program start and when new addresses are accessed
- System reaches "steady state" with mostly conflict misses after initial cold misses
- Performance improvement comes at the cost of complexity and power (see the simulation sketch below)
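The comparison above can be reproduced with a short simulation. Below is a minimal sketch in C, assuming a 4-block cache with LRU replacement and an illustrative block-address trace (0, 4, 0, 8, 4) in which blocks 0, 4, and 8 all conflict in the direct-mapped and 2-way configurations; the lecture's original access sequence is not reproduced here, but the hit/miss counts come out the same.

```c
#include <stdio.h>

#define NUM_BLOCKS 4   /* total cache capacity in blocks */

/* Simulate a 4-block cache organized as `ways`-way set associative
 * (ways = 1 is direct mapped, ways = NUM_BLOCKS is fully associative).
 * LRU replacement within each set. Returns the number of hits. */
static int simulate(const unsigned *trace, int n, int ways) {
    int sets = NUM_BLOCKS / ways;
    unsigned tag[NUM_BLOCKS];   /* block address stored in each way     */
    int valid[NUM_BLOCKS];      /* 1 if the way currently holds a block */
    int age[NUM_BLOCKS];        /* larger age = less recently used      */
    int hits = 0;

    for (int i = 0; i < NUM_BLOCKS; i++) { valid[i] = 0; age[i] = 0; }

    for (int t = 0; t < n; t++) {
        unsigned block = trace[t];                 /* block address      */
        int set  = (int)(block % (unsigned)sets);  /* index selects set  */
        int base = set * ways;                     /* first way of set   */
        int hit_way = -1;
        int victim = base;
        int victim_invalid = !valid[base];

        for (int w = base; w < base + ways; w++) {
            age[w]++;                              /* age the whole set  */
            if (valid[w] && tag[w] == block) hit_way = w;
            if (!valid[w]) {                       /* prefer empty ways  */
                if (!victim_invalid) { victim = w; victim_invalid = 1; }
            } else if (!victim_invalid && age[w] > age[victim]) {
                victim = w;                        /* oldest valid way   */
            }
        }
        if (hit_way >= 0) {
            hits++; age[hit_way] = 0;              /* refresh LRU on hit */
        } else {
            tag[victim] = block; valid[victim] = 1; age[victim] = 0;
        }
    }
    return hits;
}

int main(void) {
    /* Illustrative block-address trace: blocks 0, 4, and 8 conflict when
     * the cache is direct mapped or 2-way set associative. */
    unsigned trace[] = { 0, 4, 0, 8, 4 };
    int n = (int)(sizeof trace / sizeof trace[0]);

    for (int ways = 1; ways <= NUM_BLOCKS; ways *= 2) {
        int hits = simulate(trace, n, ways);
        printf("%d-way: %d hits, %d misses\n", ways, hits, n - hits);
    }
    return 0;
}
```

Running it prints 0 hits for 1-way (direct mapped), 1 hit for 2-way, and 2 hits for 4-way (fully associative), matching the counts in the recap above.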
17.3 Cache Configuration Parameters
17.3.1 Primary Parameters
1. Block Size
- Size of a single block in bytes
- Cache deals with memory in blocks
- CPU deals with cache in words/bytes
2. Set Size
- Number of sets in the cache
- Direct mapped: number of sets = number of entries
- Fully associative: only 1 set
- Can be confusing - refers to number of sets, not size of each set
3. Associativity
- Number of ways in a set
- Number of blocks that can be stored in one set
- 1-way = direct mapped
- 2-way = two-way set associative
- N-way = N blocks per set
17.3.2 Cache Size Calculation
Total Cache Size = Block Size × Set Size × Associativity
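As a worked example of this formula, the sketch below plugs in a hypothetical configuration (64-byte blocks, 64 sets, 8 ways); these numbers are illustrative rather than taken from the lecture, and happen to produce the 32 KB size of a typical L1 data cache.

```c
/* Minimal sketch: total cache size from the three primary parameters.
 * The parameter values are hypothetical (64 B blocks, 64 sets, 8 ways). */
#include <stdio.h>

int main(void) {
    unsigned block_size    = 64;   /* bytes per block       */
    unsigned num_sets      = 64;   /* sets in the cache     */
    unsigned associativity = 8;    /* blocks (ways) per set */

    unsigned total = block_size * num_sets * associativity;
    printf("Total cache size = %u bytes (%u KB)\n", total, total / 1024);
    return 0;   /* prints 32768 bytes (32 KB) */
}
```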
17.3.3 Secondary Parameters
4. Replacement Policy
- LRU (Least Recently Used)
- Pseudo-LRU (PLRU) - sketched after this list
- FIFO (First In First Out)
- Others
5. Write Policy
- Write-through
- Write-back
6. Other Optimization Techniques
- Prefetching mechanisms
- Write buffer size
- Communication protocols
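Since the replacement policies above are only named, here is a minimal sketch of one common tree-based pseudo-LRU (PLRU) scheme for a 4-way set; the bit convention is an assumption chosen for illustration, not the scheme of any particular processor.

```c
/* Minimal sketch of a tree-based pseudo-LRU (PLRU) policy for one 4-way
 * set: three bits approximate the LRU order instead of tracking it exactly. */
#include <stdio.h>

/* plru[0] chooses between the {way0,way1} and {way2,way3} halves;
 * plru[1] chooses within {way0,way1}; plru[2] within {way2,way3}.
 * A bit value of 0 means "the victim is on the lower-numbered side". */
static unsigned char plru[3];

/* Called on every hit or fill: point the tree bits away from `way`. */
void plru_touch(int way) {
    if (way < 2) { plru[0] = 1; plru[1] = (way == 0); }
    else         { plru[0] = 0; plru[2] = (way == 2); }
}

/* Called on a miss: follow the tree bits to pick the victim way. */
int plru_victim(void) {
    if (plru[0] == 0) return plru[1] == 0 ? 0 : 1;
    else              return plru[2] == 0 ? 2 : 3;
}

int main(void) {
    /* Touch ways 0, 1, 2 in order; way 3 is never touched. */
    plru_touch(0); plru_touch(1); plru_touch(2);
    printf("victim = way %d\n", plru_victim());   /* prints way 0 */
    return 0;
}
```

The demo shows the approximation at work: after touching ways 0, 1, and 2 (with way 3 untouched), exact LRU would pick way 3 as the victim, while this PLRU tree picks way 0 using only three bits of state per set.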
17.3.4 Configuration Definition
- Fixing values for all these parameters defines a specific cache configuration
- Performance and power consumption are determined by configuration
- External factors: memory access patterns from CPU/program
17.4 Improving Cache Performance
17.4.1 Average Access Time Equation
T_avg = Hit Latency + Miss Rate × Miss Penalty
Each of the three factors in this equation (hit latency, miss rate, and miss penalty) can be optimized, as discussed in the following sections.
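A minimal numerical sketch of the equation, using hypothetical values (1-cycle hit latency, 5% miss rate, 100-cycle miss penalty):

```c
/* Minimal sketch of the average access time equation with example numbers. */
#include <stdio.h>

int main(void) {
    double hit_latency  = 1.0;    /* cycles (assumed)              */
    double miss_rate    = 0.05;   /* fraction of accesses that miss */
    double miss_penalty = 100.0;  /* cycles to fetch from memory    */

    double t_avg = hit_latency + miss_rate * miss_penalty;
    printf("T_avg = %.1f cycles\n", t_avg);   /* 1 + 0.05 * 100 = 6.0 */
    return 0;
}
```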
17.5 Hit Rate Improvement
17.5.1 Method 1: Increase Cache Size
Approach:
- Most obvious and intuitive method
- More slots → can hold more data → more likely to get hits
Drawbacks:
- Very expensive (SRAM costs ~$2000/GB)
- SRAM uses cutting-edge technology, same as CPU
- Must be fast enough to work at CPU speed
- Usually located inside CPU core
- Practical limit on how much cache can be added
17.5.2 Method 2: Increase Associativity
Benefits:
- Higher associativity → better hit rate
- Reduces conflict misses
- Most popular technique for a given cache size
Drawbacks:
- Increases hit latency
- Increases power consumption
- Increases hardware cost
17.5.3 Method 3: Cache Prefetching
Concept:
- Fetch data before it's needed
- Similar to branch prediction in the CPU
- Reduces cold misses (compulsory misses)
- Can also reduce conflict misses
Types:
- Software prefetching (compiler-based; see the sketch after this list)
- Hardware prefetching
- Hybrid software-hardware approaches
Benefits:
- Can predict and fetch data before the CPU requests it
- Reduces effective miss rate
- Can significantly improve performance for predictable access patterns
Drawbacks:
- Not 100% accurate
- Wrong predictions waste power and bandwidth
- Requires additional hardware
- Increases complexity
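As a concrete illustration of the software flavor, here is a minimal sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is a hypothetical tuning choice, not a value from the lecture.

```c
/* Minimal sketch of software prefetching with the GCC/Clang intrinsic. */
#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Ask the cache to start fetching a block 16 elements ahead
         * (read access, high temporal locality). */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

For a simple sequential scan like this, the hardware prefetcher usually does the job on its own; software prefetching tends to pay off more for irregular or pointer-chasing access patterns.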
17.6 Hit Latency Optimization
17.6.1 Relationship with Hit Rate
Fundamental Trade-off:
- Hit rate and hit latency are tied together
- Improving hit rate often increases hit latency
- Improving hit latency often reduces hit rate
- Need to find optimal balance
- Higher associativity → better hit rate BUT higher hit latency
- Smaller, simpler cache → lower hit latency BUT worse hit rate
- Must balance these competing factors
- Depends on application requirements
- Different trade-offs for different use cases
17.7 Miss Penalty Improvement
17.7.1 Miss Penalty Definition
- Time spent servicing a cache miss
- Time to fetch missing block from memory
17.7.2 Method 1: Optimize Communication
- Improve bus technology between cache and memory
- Increase bus width
- Increase bus speed
- Optimize bus arbitration
- Better communication protocols
- For the remainder of this discussion, assume the best possible communication is already in place
17.7.3 Method 2: Cache Hierarchy (Main Focus)
- Use multiple levels of cache
- Each level optimized differently
- Most effective technique for reducing miss penalty
17.8 Cache Hierarchy (Multi-Level Caches)
17.8.1 Concept
Instead of a single cache between CPU and memory, use multiple cache levels: L1, L2, L3, etc., with each level serving as backup for the level above.
17.8.2 Terminology
- L1 (Level 1): Top-level cache, closest to CPU
- L2 (Level 2): Second-level cache
- L3 (Level 3): Third-level cache (in some systems)
- Top-level cache: Fastest, smallest
- Last-level cache: Slowest (but still fast), largest
17.8.3 Operation
- CPU requests data from L1
- L1 miss → request goes to L2 (not directly to memory)
- L2 miss → request goes to L3 (if exists)
- Last-level miss → request goes to main memory
17.8.4 Benefits
- Reduced effective miss penalty for L1
- Most L1 misses are served by L2 within a few cycles (2-4 cycles)
- Only L2 misses incur full memory penalty (100+ cycles)
- Overall average miss penalty greatly reduced
17.8.5 Effective Miss Penalty
For L1 cache:
Effective Miss Penalty = L2 Hit Latency + L2 Miss Rate × L2 Miss Penalty
If L2 has good hit rate:
- L2 miss rate is low
- Most L1 misses served quickly by L2
- Effective penalty much less than going to memory
17.8.6 Example Calculation
Given:
- L1 miss rate: 5%
- L2 hit rate: 99.9%
- L2 hit latency: 3 cycles
- Memory penalty: 100 cycles
L1 effective penalty = 3 + 0.001 × 100 = 3.1 cycles
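The same calculation in code, extended with the overall average access time; the 1-cycle L1 hit latency is an assumption, since the example above does not specify it.

```c
/* Minimal sketch of the two-level calculation from the example above. */
#include <stdio.h>

int main(void) {
    double l1_hit_latency = 1.0;    /* cycles (assumed, not given above) */
    double l1_miss_rate   = 0.05;   /* 5%                                */
    double l2_hit_latency = 3.0;    /* cycles                            */
    double l2_miss_rate   = 0.001;  /* 1 - 0.999                         */
    double mem_penalty    = 100.0;  /* cycles                            */

    /* Effective penalty seen by an L1 miss. */
    double l1_eff_penalty = l2_hit_latency + l2_miss_rate * mem_penalty;

    /* Overall average access time, using the T_avg equation from 17.4. */
    double t_avg = l1_hit_latency + l1_miss_rate * l1_eff_penalty;

    printf("L1 effective miss penalty = %.2f cycles\n", l1_eff_penalty); /* 3.10  */
    printf("T_avg = %.3f cycles\n", t_avg);                              /* 1.155 */
    return 0;
}
```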
17.9 Optimization Strategies for Multi-Level Caches
17.9.1 Why Not One Big Cache?
- Different levels can be optimized for different goals
- Splitting allows specialized optimization
- Better overall performance than single large cache
17.10 L1 Cache Optimization - Optimize for Hit Latency
17.10.1 Goal
Minimize hit latency
17.10.2 Rationale
- Critical for CPU clock cycle time
- Memory access is slowest pipeline stage
- Determines overall CPU clock period
- Lower L1 hit latency → shorter clock cycle → higher CPU frequency
17.10.3 Characteristics
- Small size
- Lower associativity (2-way, 4-way, sometimes 8-way)
- Fast response time
- Accept moderate hit rate (e.g., 95%)
17.10.4 Trade-off
- Sacrifice some hit rate for speed
- Slightly higher miss rate acceptable
- Misses handled by L2
17.11 L2 Cache Optimization - Optimize for Hit Rate
17.11.1 Goal
Maximize hit rate
17.11.2 Rationale
- Serve most L1 misses
- Minimize accesses to main memory
- Reduce effective L1 miss penalty
17.11.3 Characteristics
- Larger size
- Higher associativity (8-way, 16-way, or even fully associative)
- Very high hit rate (99.9% or better)
- Can tolerate higher hit latency
17.11.4 Trade-off
- Higher latency acceptable
- Not on critical path for most accesses
- Priority is catching L1 misses
17.12 Associativity Comparison
Question: Which level has higher associativity?
Answer: L2 (and L3, if present) have higher associativity.
17.12.1 Reasoning
- L2 optimized for hit rate
- Higher associativity → better hit rate
- L1 optimized for latency
- Lower associativity → faster access
17.12.2 Combined Effect
- L1: Fast but moderate hit rate (e.g., 95-98%)
- L2: Slower but excellent hit rate (e.g., 99-99.9%)
- Most accesses: L1 hit (fast path)
- Most L1 misses: L2 hit (medium path, few cycles)
- Very few accesses: Main memory (slow path, 100+ cycles)
17.13 Physical Implementation of Cache Hierarchy
17.13.1 L1 Cache
- Almost always on-chip (inside CPU die)
- Integrated within CPU core
- Smallest but fastest
- Typically split into:
- L1 instruction cache (L1-I)
- L1 data cache (L1-D)
17.13.2 L2 Cache
- Usually on-chip (same die as CPU)
- Can be off-chip in some designs
- Larger than L1
- May be unified (instruction + data) or split
- If multi-core: may be per-core or shared
17.13.3 L3 Cache
- Common in multi-processor/multi-core systems
- Usually on-chip in modern designs
- Can be off-chip in some architectures
- Typically unified and shared among all cores
- Largest cache level
17.13.4 Design Variations
Different implementations based on:
- Performance requirements
- Power budget
- Cost constraints
- Target application
- Number of cores
17.14 Real World Example: Intel Skylake Architecture
Source: wikichip.org
17.14.1 Architecture Overview
- Mainstream Intel architecture from ~2015
- Used in Core i3, i5, i7 processors
- Standard desktop/PC processors
17.14.2 Dual-Core Layout Analysis
Execution Units
- Two separate processor cores visible
- Integer ALUs (arithmetic logic units)
- Floating-point units
- Multipliers, dividers
- Other arithmetic hardware
Pipeline Support Hardware
- Takes up as much space as execution units
- Out-of-order scheduling logic
- Branch prediction units
- Multiple issue hardware
- Decoding logic
- Control logic
17.14.3 Cache Implementation
L1 Data Cache
- Separate for each core
- Located close to execution units and memory management
- 8-way set associative
- Smaller size (32KB typical)
- Close to where addresses are generated
L1 Instruction Cache
- Separate for each core
- Located close to instruction fetch and decode units
- Near out-of-order scheduling hardware
- 8-way set associative
- Smaller size (32KB typical)
L2 Cache
- Shared between instruction and data
- Larger than L1 (256KB in this example)
- 4-way set associative (in this design)
- Located between L1 and memory
- Serves both L1-I and L1-D misses
17.14.4 Memory Hierarchy
- Separate buffers for load and store instructions
- Buffers before and after cache
- Memory management unit
- Connection to L3 cache (if present) via bus
17.14.5 Design Observations
- Physical placement matches logical function
- Data cache near execution units
- Instruction cache near fetch/decode
- Shared L2 in middle position
- Significant die area for cache
- Even more area for pipeline optimization
17.14.6 Why Higher L1 Associativity Here?
- 8-way seems high for L1
- But size is small (32KB)
- Other pipeline stages may be bottleneck
- Clock period limited by other factors
- Can afford higher associativity without hurting cycle time
- Depends on overall CPU design
17.14.7 Multi-Core Configuration
- Each core has own L1-I and L1-D
- Each core has own L2
- All cores share L3
- L3 connects via bus system
17.14.8 Additional Features
- Physical register files (integer and vector)
- Store/load buffers
- Pre-decoding hardware
- Complex x86 instruction handling
- Many optimizations for real-world performance
17.15 Recommendations for Further Study
17.15.1 Resource: wikichip.org
Content Available:
- Detailed CPU architecture information
- Real implementation details
- Various processor families:
- Intel x86 architectures
- ARM implementations
- AMD processors
- Other architectures
- See concepts in real hardware
- Understand practical trade-offs
- Compare different design approaches
- Learn industry practices
Key Takeaways
- Cache hierarchies reduce effective miss penalty
- Different levels optimized for different goals:
- L1: Hit latency (speed)
- L2/L3: Hit rate (coverage)
- Multi-level caches balance competing requirements
- Real implementations show concepts in practice
- Design decisions depend on:
- Performance targets
- Power budget
- Cost constraints
- Application requirements
- Modern CPUs use sophisticated cache hierarchies
- Cache takes significant portion of CPU die area
- Pipeline optimizations also require substantial hardware
Summary
Cache hierarchies represent one of the most effective techniques for improving memory system performance. By using multiple levels of cache, each optimized for different objectives, modern processors achieve both low latency and high hit rates. The L1 cache prioritizes speed to minimize clock cycle time, while L2 and L3 caches prioritize capacity and hit rate to reduce memory access frequency. Real-world implementations, such as Intel's Skylake architecture, demonstrate these principles in practice, showing how careful cache design enables high-performance computing while managing the constraints of power, cost, and chip area.