
Lecture 19: Multiprocessors

Lectures on Computer Architecture


By Dr. Isuru Nawinne

19.1 Introduction

Multiprocessor systems represent a fundamental paradigm shift in computer architecture: placing multiple processors on the same chip to execute multiple programs or threads simultaneously, adopted when the traditional performance improvement techniques of clock frequency scaling and instruction-level parallelism reached their physical and practical limits. This lecture explores the evolution toward multiprocessor architectures driven by the power and parallelism walls, examines the cache coherence problem that arises when multiple processors keep private caches of shared memory, and analyzes solutions ranging from bus snooping protocols such as MESI to scalable directory-based coherence schemes. It also compares architectural organizations from uniform memory access (UMA) to non-uniform memory access (NUMA), showing how different designs balance simplicity, performance, and scalability in systems ranging from dual-core smartphones to thousand-processor supercomputers.

19.2 Introduction to Multiprocessors

Multiprocessor systems address performance limitations encountered with single processor systems by employing multiple processors on the same chip to execute multiple programs or threads simultaneously.

19.3 Performance Evolution Background

19.3.1 Historical Performance Improvements

Early Methods: Clock Frequency Scaling

Approach:

Limitations Encountered:

Instruction Level Parallelism (ILP)

Techniques:

Limitations Encountered:

19.3.2 Moore's Law Context

Observation: Transistor counts keep growing as Moore's Law predicts.

Question: How to use the abundant transistors?

Solution: Multiple processors on the same chip.

19.4 Multiprocessor Approach

19.4.1 Key Characteristics

19.4.2 Terminology

19.4.3 Key Problem: Communication Between Processors

19.5 Shared Memory Multiprocessors (SMM)


19.5.1 Most Common Approach

Architecture:

19.5.2 Operating System Role

Responsibilities:

19.5.3 Workload Balancing

Purpose:

19.6 Memory Contention Problem

19.6.1 Inherent Issue

Challenge:

19.6.2 Effect on Performance

Bottleneck:

19.7 Uniform Memory Access (UMA)


19.7.1 Definition

Characteristics:

19.7.2 Also Known As

19.7.3 Key Properties

19.8 Solution to Contention: Caches

19.8.1 Using Local Caches

Approach:

19.8.2 Benefits

19.8.3 New Problem: Cache Coherence

19.9 Cache Coherence Problem

19.9.1 The Issue

Scenario:

19.9.2 Example Sequence

  1. PE1 reads X (value = 1) → Cached in PE1
  2. PE2 reads X (value = 1) → Cached in PE2
  3. PE1 writes X = 0 → PE1 cache updated
  4. Memory may or may not be updated (depends on write policy)
  5. PE2 still sees X = 1 (stale data)
  6. Inconsistency: Same address, different values
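The sequence above can be traced with a toy simulation: two private write-back caches over one shared memory, with no coherence mechanism at all. (The `Cache` class and its `read`/`write` methods are illustrative, not from any real system.)

```python
# Toy model: two processors with private write-back caches and NO coherence.
class Cache:
    def __init__(self, memory):
        self.memory = memory   # shared backing store (a dict)
        self.lines = {}        # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # write-back: memory is NOT updated

memory = {"X": 1}
pe1, pe2 = Cache(memory), Cache(memory)

pe1.read("X")          # step 1: PE1 caches X = 1
pe2.read("X")          # step 2: PE2 caches X = 1
pe1.write("X", 0)      # step 3: PE1 updates only its own cache

print(pe1.read("X"))   # 0  (PE1 sees the new value)
print(pe2.read("X"))   # 1  (PE2 still sees stale data)
```

With write-through, memory would hold 0 after step 3, but PE2's cached copy would still be stale; the inconsistency exists under either policy.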

19.9.3 With Write-Through Policy

19.9.4 With Write-Back Policy

19.9.5 Requirement

19.10 Bus Snooping

Common technique for cache coherence in SMP systems.


19.10.1 What is Bus Snooping?

Mechanism:

19.10.2 How It Works

  1. Cache controller performs write to address
  2. Broadcasts address information on snoop bus
  3. All cache controllers listen to snoop bus
  4. Controllers check if they have same address cached
  5. If yes, take action based on protocol
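The five steps above can be sketched generically, leaving step 5's protocol-specific action abstract. (The class and method names here are illustrative; real controllers are hardware state machines.)

```python
# Generic snoop bus: step 2 broadcasts, steps 3-5 happen in every listener.
class SnoopBus:
    def __init__(self):
        self.controllers = []

    def broadcast(self, writer, addr):
        for ctrl in self.controllers:         # 3. all controllers listen
            if ctrl is not writer:
                ctrl.snoop(addr)

class CacheController:
    def __init__(self, bus):
        self.bus, self.cached = bus, set()
        self.snoop_hits = []                  # record blocks that matched a snoop
        bus.controllers.append(self)

    def write(self, addr):
        self.cached.add(addr)                 # 1. perform the write locally
        self.bus.broadcast(self, addr)        # 2. broadcast address on snoop bus

    def snoop(self, addr):
        if addr in self.cached:               # 4. check own cache for the address
            self.snoop_hits.append(addr)      # 5. act per protocol (invalidate/update)

bus = SnoopBus()
a, b = CacheController(bus), CacheController(bus)
b.cached.add(0x40)
a.write(0x40)
print(b.snoop_hits)   # [64]: B saw A's write to a block it also caches
```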

19.10.3 Key Feature

19.11 Write Invalidate Protocol

19.11.1 Approach

19.11.2 Mechanism

On Write by Processor

  1. Update own cache
  2. Broadcast write address on snoop bus

On Receiving Write Broadcast

  1. Check if same address in own cache
  2. If yes: Mark block as INVALID (clear valid bit)
  3. Next access will be miss
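The write-invalidate actions above can be sketched as a small simulation; write-through is assumed here for simplicity, so memory always holds the latest value. (All names are illustrative.)

```python
# Toy write-invalidate snooping: every cache listens on a shared "snoop bus".
class SnoopBus:
    def __init__(self):
        self.caches = []

    def broadcast_write(self, writer, addr):
        for cache in self.caches:             # all controllers snoop
            if cache is not writer:
                cache.snoop_invalidate(addr)

class Cache:
    def __init__(self, bus, memory):
        self.bus, self.memory, self.lines = bus, memory, {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:            # invalid/absent -> miss
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # 1. update own cache
        self.memory[addr] = value             # (write-through, for simplicity)
        self.bus.broadcast_write(self, addr)  # 2. broadcast write address

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)            # drop the block: next access misses

bus, memory = SnoopBus(), {"X": 1}
pe1, pe2 = Cache(bus, memory), Cache(bus, memory)
pe1.read("X"); pe2.read("X")   # both cache X = 1
pe1.write("X", 0)              # invalidates PE2's copy
print(pe2.read("X"))           # 0: PE2 misses and refetches the new value
```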

19.11.3 With Write-Through Policy

19.11.4 With Write-Back Policy

Challenge: With write-back, the most recent value of a block may exist only in the writer's cache, so memory itself can be stale.

Solution: Snoop Read: when another processor's read for that block appears on the bus, the cache holding the modified copy supplies the data (and memory can be updated at the same time).

19.11.5 Complexity

Trade-offs:

19.12 Write Update Protocol

19.12.1 Alternative Approach

Concept:

19.12.2 Mechanism

On Write by Processor

  1. Update own cache
  2. Broadcast BOTH address AND data on snoop bus

On Receiving Write Broadcast

  1. Check if same address in own cache
  2. If yes: Update own copy with new data
  3. Keep block VALID
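The write-update actions can be sketched the same way; the key difference from write-invalidate is that both the address and the data travel on the bus, and holders refresh their copies instead of dropping them. (Names are illustrative; write-through is assumed for simplicity.)

```python
# Toy write-update snooping: broadcasts carry the data, copies stay valid.
class SnoopBus:
    def __init__(self):
        self.caches = []

    def broadcast_write(self, writer, addr, value):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_update(addr, value)

class Cache:
    def __init__(self, bus, memory):
        self.bus, self.memory, self.lines = bus, memory, {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value                     # 1. update own cache
        self.memory[addr] = value                    # (write-through, for simplicity)
        self.bus.broadcast_write(self, addr, value)  # 2. broadcast address AND data

    def snoop_update(self, addr, value):
        if addr in self.lines:        # holder refreshes its copy, stays VALID
            self.lines[addr] = value

bus, memory = SnoopBus(), {"X": 1}
pe1, pe2 = Cache(bus, memory), Cache(bus, memory)
pe1.read("X"); pe2.read("X")
pe1.write("X", 0)
print("X" in pe2.lines, pe2.read("X"))  # True 0: no invalidation, no refetch miss
```

The cost is visible in `broadcast_write`: every write moves data across the bus, even to caches whose processors may never read the block again.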

19.12.3 Benefits

19.12.4 Costs

19.12.5 Comparison

19.13 Real Protocol Implementations

19.13.1 Historical Protocols

Write Once Protocol

Synapse N+1 Protocol

Berkeley Protocol

Illinois Protocol (MESI)

Firefly Protocol

19.13.2 Most Common Combination

19.14 MESI Protocol Details

Named after its four states: Modified, Exclusive, Shared, Invalid.

Most popular cache coherence protocol, used in Intel Pentium and IBM PowerPC processors.

19.14.1 Four Block States (Requires 2 Bits)

1. INVALID (I)

2. SHARED (S)

3. EXCLUSIVE (E)

4. MODIFIED (M)

19.15 MESI Protocol State Transitions

19.15.1 Example with PE1, PE2, PE3

Initial State: Variable X = 1 in memory, all cache entries invalid

Step 1: PE1 Reads X

Actions:

Result:

Step 2: PE3 Reads X

Actions:

State transitions:

Result:

Step 3: PE3 Writes X = 0

Actions:

Result:

Step 4: PE1 Reads X

Actions:

Result:
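The four steps above can be traced with a minimal state tracker. The transitions below follow the standard MESI rules (read with no other copy → Exclusive; read with existing copies → everyone Shared; write → writer Modified, all others Invalid); real controllers also move the data itself, which this sketch omits.

```python
# Minimal MESI state tracker for one block X across three caches (PE1..PE3).
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

state = {"PE1": I, "PE2": I, "PE3": I}

def read(pe):
    others = [p for p in state if p != pe and state[p] != I]
    if not others:
        state[pe] = E                  # only copy in any cache -> Exclusive
    else:
        for p in others:               # sharers (incl. a Modified owner,
            state[p] = S               #  which supplies the data) drop to Shared
        state[pe] = S

def write(pe):
    for p in state:                    # invalidate every other copy
        if p != pe:
            state[p] = I
    state[pe] = M                      # writer holds the only, dirty copy

read("PE1");  print(state["PE1"])                  # Exclusive
read("PE3");  print(state["PE1"], state["PE3"])    # Shared Shared
write("PE3"); print(state["PE1"], state["PE3"])    # Invalid Modified
read("PE1");  print(state["PE1"], state["PE3"])    # Shared Shared
```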

19.15.2 Key Points

19.16 Scalability of UMA Systems

19.16.1 Limitation

Challenges:

19.16.2 Practical Limit

19.16.3 Alternative Interconnects

Crossbar Switches

Multi-Stage Crossbar Switch Network

19.16.4 Improved Scalability

19.17 Non-Uniform Memory Access (NUMA)

19.17.1 Designed for Even Higher Scalability

Goals:

19.17.2 Key Difference from UMA

Non-Uniform Access Times:

19.17.3 Architecture

Structure:

19.17.4 Access Time Difference

19.17.5 Operating System Role

Optimization Responsibilities:

19.18 Two Types of NUMA

19.18.1 NC-NUMA (Non-Cached NUMA)

Bus Snooping Characteristics:

19.18.2 CC-NUMA (Cache-Coherent NUMA)

Bus Snooping Characteristics:

19.19 Directory-Based Cache Coherence

Used in CC-NUMA systems for scalable cache coherence.

19.19.1 What is Directory?

Definition:

19.19.2 Purpose

Functionality:

19.19.3 Organization

Distributed Structure:

19.19.4 Operation

Access Process:
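As a rough sketch of how a directory lookup might work, the following follows the common textbook design: one entry per memory block, recording a state and a presence set of the nodes holding a copy. The states, field names, and handler functions here are illustrative assumptions, not this lecture's exact scheme.

```python
# Toy directory entry: per-block state plus a presence set of caching nodes.
class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"   # "uncached" | "shared" | "exclusive" (assumed states)
        self.sharers = set()      # node IDs holding a copy (presence vector)

directory = {"X": DirectoryEntry()}

def handle_read(block, node):
    entry = directory[block]
    # if the block were "exclusive", the owner would first be asked to supply it
    entry.state = "shared"
    entry.sharers.add(node)

def handle_write(block, node):
    entry = directory[block]
    invalidations = entry.sharers - {node}   # nodes that must invalidate their copy
    entry.sharers = {node}
    entry.state = "exclusive"
    return invalidations                     # sent point-to-point: no bus broadcast

handle_read("X", 1)
handle_read("X", 2)
print(handle_write("X", 2))   # {1}: only node 1 receives an invalidation message
```

The returned set is the scalability point: unlike bus snooping, invalidations go only to the recorded sharers, so traffic does not grow with the total number of processors.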

19.19.5 Write Policy

Key Takeaways

  1. Multiprocessors overcome single-processor performance limitations
  2. Shared memory provides communication mechanism between processors
  3. Cache coherence is essential for correct parallel program execution
  4. Bus snooping works well for small-scale systems (up to ~32 processors)
  5. MESI protocol is widely adopted for cache coherence
  6. UMA systems provide uniform access but limited scalability
  7. NUMA systems enable thousands of processors with non-uniform access
  8. Directory-based coherence enables scalable cache coherence
  9. Operating system plays crucial role in workload balancing and optimization
  10. Trade-offs exist between simplicity, performance, and scalability

Summary

Multiprocessor systems have become the standard in modern computing, from smartphones to supercomputers, enabling the parallel processing power required for contemporary applications while managing the complex interactions between multiple processors sharing memory resources.