
Lecture 12: Pipelined Processors

Lectures on Computer Architecture


By Dr. Isuru Nawinne

12.1 Introduction

This lecture introduces pipelining as the primary performance enhancement technique in modern processor design, transforming the inefficient single-cycle architecture into a high-throughput execution engine. We explore how pipelining applies assembly-line principles to instruction execution, dramatically improving processor throughput while leaving individual instruction latency essentially unchanged. The lecture examines the three fundamental types of hazards that threaten pipeline efficiency (structural, data, and control) and discusses practical solutions, including forwarding, stalling, and branch prediction, that enable real-world pipelined processors to approach ideal performance.

12.2 Recap: Single-Cycle Performance Limitations

12.2.1 Critical Path Problem

Load Word as Bottleneck:

Performance Issue:

Design Principle Violated:

12.2.2 Multi-Cycle as First Improvement

Basic Concept:

Five Stages Identified:

  1. Instruction Fetch (IF)
  2. Register Reading
  3. ALU Operations
  4. Memory Access
  5. Register Writing

Variable Stage Usage:

Clock Period Determination:

Limitation:

12.3 Pipelining Concept: The Laundry Shop Analogy

12.3.1 Non-Pipelined Laundry Shop

Setup:

  1. Washing: 30 minutes
  2. Drying: 30 minutes
  3. Folding/Ironing: 30 minutes
  4. Packaging: 30 minutes

Sequential Processing:

Metric              Value
Total Time          8 hours (6 pm to 2 am)
Time per Customer   2 hours
Shop Closes         2 am

Problems:

12.3.2 Pipelined Laundry Shop

Key Idea:

Pipelined Schedule:

[Figure: pipelined laundry schedule, with the four loads overlapped across the washing, drying, folding, and packaging stages]

Timeline Analysis:

Steady State:

12.3.3 Performance Analysis

Time Comparison:

Speedup Calculation:

Speedup = Non-pipelined Time / Pipelined Time
        = 8 hours / 3.5 hours
        = 2.3×

Includes Pipeline Fill Time: the 2.3× figure includes the time spent filling the pipeline at the start of the evening; with only four loads, this fill overhead is significant.

Steady State Analysis (ignoring fill time):

Non-pipelined: 2n hours for n loads (2 hours per load)
Pipelined: 0.5n hours for n loads (0.5 hours per load)

Steady State Speedup = 2n / 0.5n = 4×

Theoretical Maximum Speedup: equal to the number of pipeline stages (4× here); it is approached as the number of loads grows and the fill time is amortized.
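To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not from the lecture) of the laundry-shop timing, assuming four 0.5-hour stages. It shows the speedup rising from 2.3× for four loads toward the 4× limit as the fill time is amortized.

STAGES = 4          # wash, dry, fold, package
STAGE_TIME = 0.5    # hours per stage

def sequential_time(n_loads):
    # one customer at a time: 2 hours each
    return n_loads * STAGES * STAGE_TIME

def pipelined_time(n_loads):
    # first load takes the full 2 hours; afterwards one load
    # finishes every 0.5 hours
    return STAGES * STAGE_TIME + (n_loads - 1) * STAGE_TIME

for n in (4, 100, 10_000):
    print(n, round(sequential_time(n) / pipelined_time(n), 2))
# 4 -> 2.29, 100 -> 3.88, 10000 -> 4.0 (approaches the 4x limit)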

12.3.4 Key Performance Terms

Latency: the time for one item (a laundry load, or an instruction) to pass through all stages, start to finish.

Throughput: the number of items completed per unit time.

Observation:

Analogy Summary:

12.4 MIPS Five-Stage Pipeline

12.4.1 Pipeline Stage Definitions

Stage 1: Instruction Fetch (IF): read the instruction from instruction memory and increment the PC.

Stage 2: Instruction Decode / Register Read (ID): decode the opcode and read the source registers from the register file.

Stage 3: Execution (EX): perform the ALU operation or compute the memory address.

Stage 4: Memory Access (MEM): read or write data memory (used by loads and stores).

Stage 5: Write Back (WB): write the result into the destination register.

Workload Distribution Goal: divide the work so that every stage takes roughly the same time, since the slowest stage sets the clock period.

12.4.2 Stage Timing Example

Assumed Component Delays:

Component             Delay
Instruction Fetch     200 ps
Register Read/Write   100 ps
ALU Operation         200 ps
Data Memory Access    200 ps
Sign Extension        negligible
Multiplexers          negligible

Single-Cycle Instruction Times:

Instruction Type     Stages Used        Total Time
Load Word (LW)       IF+ID+EX+MEM+WB    800 ps
Store Word (SW)      IF+ID+EX+MEM       700 ps
R-type (ADD, etc.)   IF+ID+EX+WB        600 ps
Branch (BEQ)         IF+ID+EX           500 ps

Load Word critical path: the 800 ps LW time determines the single-cycle clock period.
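The per-instruction totals above are just sums of component delays; a small Python sketch (illustrative, using the lecture's numbers) reproduces them:

DELAYS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}  # ps

STAGES_USED = {
    "LW":     ("IF", "ID", "EX", "MEM", "WB"),
    "SW":     ("IF", "ID", "EX", "MEM"),
    "R-type": ("IF", "ID", "EX", "WB"),
    "BEQ":    ("IF", "ID", "EX"),
}

for instr, stages in STAGES_USED.items():
    print(f"{instr:8s} {sum(DELAYS[s] for s in stages)} ps")

# The largest total (LW, 800 ps) sets the single-cycle clock period.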

12.4.3 Pipeline Implementation Details

Clock Cycle Determination:

Register Read/Write Timing:

Stage Alignment to Clock Cycles:

Stage   Work                       Time     Cycle Time
IF      Instruction Memory read    200 ps   200 ps ✓
ID      Decode + Register Read     100 ps   200 ps (space left)
EX      ALU operation              200 ps   200 ps ✓
MEM     Data Memory access         200 ps   200 ps ✓
WB      Register write             100 ps   200 ps (first half only)

Space in ID Stage:

Space in WB Stage:

12.4.4 Load Word Pipeline Example

Instruction Stream: All Load Word instructions

LW $1, 0($10)
LW $2, 4($10)
LW $3, 8($10)
LW $4, 12($10)
...

Pipeline Timing Diagram:

Time (ps):  0-200  200-400  400-600  600-800  800-1000  1000-1200
LW $1:      IF     ID       EX       MEM      WB
LW $2:             IF       ID       EX       MEM       WB
LW $3:                      IF       ID       EX        MEM
LW $4:                               IF       ID        EX

Single-Cycle Comparison:

Throughput Improvement:

Non-pipelined: 1 instruction every 800 ps
Pipelined: 1 instruction every 200 ps

Speedup = 800 / 200 = 4×

Absolute Time per Instruction: each LW now takes 1000 ps (five 200 ps cycles) from fetch to write-back, slightly longer than the 800 ps single-cycle latency; the gain is throughput, not latency.
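The timing diagram above can be generated mechanically. The sketch below (a hypothetical helper, not lecture code) prints the stage occupied by each instruction in each 200 ps cycle; in steady state one instruction completes per cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]
CYCLE_PS = 200

def chart(names):
    n_cycles = len(names) + len(STAGES) - 1
    print("Time (ps): " + "".join(
        f"{c * CYCLE_PS}-{(c + 1) * CYCLE_PS}".ljust(12) for c in range(n_cycles)))
    for i, name in enumerate(names):
        cells = ["" for _ in range(n_cycles)]
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage   # instruction i is in stage s during cycle i+s
        print(f"{name:11s}" + "".join(cell.ljust(12) for cell in cells))

chart(["LW $1", "LW $2", "LW $3", "LW $4"])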

12.4.5 Ideal vs Actual Speedup

Ideal Case (balanced stages):

Time between instructions (pipelined) = Time per instruction (non-pipelined) / Number of stages

Maximum Speedup = Number of Stages

Actual Implementation:

Reasons for Less Than Ideal:

  1. Unbalanced stage delays
  2. Pipeline fill time overhead
  3. Hazards (discussed later)
  4. Added synchronization logic

12.5 MIPS ISA Design for Pipelining

12.5.1 Fixed Instruction Length

MIPS Characteristic:

Benefits for Pipelining:

Alternative (Variable-Length):

12.5.2 Fewer Regular Instruction Formats

MIPS Formats:

Benefits:

Register Field Consistency:

Decoding Simplification:

12.5.3 Separate ALU Operation Field

Function Field (funct):

Design Rationale:

Benefit:

Alternative Design:

12.5.4 Load/Store Addressing Mode

MIPS Addressing:

Pipeline Fit:

Design Philosophy:

MIPS vs Other ISAs:

12.6 Instruction-Level Parallelism (ILP)

12.6.1 Parallel Execution Concept

Definition:

Example at Steady State:

Time Window: 800-1000 ps

Instruction A: WB stage (writing result)
Instruction B: MEM stage (memory access)
Instruction C: EX stage (ALU operation)
Instruction D: ID stage (decode, register read)
Instruction E: IF stage (fetch)

Five instructions active simultaneously!

Instruction-Level Parallelism (ILP):

12.6.2 Levels of Parallelism

Instruction-Level Parallelism:

Thread-Level Parallelism:

Program-Level Parallelism:

Application-Level Parallelism:

ILP Focus:

12.7 Pipeline Hazards: Structural Hazards

12.7.1 Hazard Definition

General Concept:

Three Categories:

  1. Structural Hazards: Hardware resource busy
  2. Data Hazards: Need data from previous instruction
  3. Control Hazards: Decision depends on previous result

12.7.2 Structural Hazard: Single Memory

Scenario:

Conflict Example:

Time:    0-200   200-400  400-600  600-800
LW $1:   IF      ID       EX       MEM
LW $2:           IF       ID       EX
LW $3:                    IF       ID
LW $4:                             IF  ← CONFLICT!

At 600-800 ps:
• LW $1 needs data memory (MEM stage)
• LW $4 needs instruction memory (IF stage)
• Same physical memory device!
• Cannot access simultaneously

Problem:

12.7.3 Pipeline Stall (Bubble)

Solution: Insert Bubble

Time:    0-200   200-400  400-600  600-800   800-1000  1000-1200
LW $1:   IF      ID       EX       MEM       WB
LW $2:           IF       ID       EX        MEM       WB
LW $3:                    IF       ID        EX        MEM
LW $4:                             [BUBBLE]  IF        ID

LW $4's fetch is delayed by one cycle so that it no longer overlaps LW $1's data-memory access.

Bubble Characteristics: an empty slot that travels down the pipeline; the stage holding it does no useful work that cycle.

Impact: each bubble adds one cycle to total execution time, pulling throughput below one instruction per cycle.

Bubble Analogy: like an empty basket moving along the laundry line with nothing in it.
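A minimal sketch of the stalling rule (my own model, assuming in-order fetch and a single unified memory): an instruction's IF is delayed, one bubble per cycle, while an older instruction occupies the memory in its MEM stage.

MEM_OFFSET = 3   # MEM is the 4th stage, 3 cycles after IF

def issue_cycles(stream):
    """stream: list of (name, uses_mem). Returns the IF cycle of each
    instruction; gaps between consecutive IF cycles are bubbles."""
    mem_busy = set()                 # cycles in which the memory does MEM
    schedule, cycle = [], 0
    for name, uses_mem in stream:
        while cycle in mem_busy:     # would collide with an older MEM access
            cycle += 1               # insert a bubble
        schedule.append((name, cycle))
        if uses_mem:
            mem_busy.add(cycle + MEM_OFFSET)
        cycle += 1
    return schedule

# Only the first instruction touches data memory: the 4th fetch slips by one.
print(issue_cycles([("LW $1", True), ("ADD", False), ("ADD", False), ("ADD", False)]))
# [('LW $1', 0), ('ADD', 1), ('ADD', 2), ('ADD', 4)]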

12.7.4 Solutions to Structural Hazards

Solution 1: Separate Memories: give the pipeline a dedicated instruction memory and a dedicated data memory so IF and MEM can never collide.

Solution 2: Separate Caches: keep one main memory but front it with split instruction and data caches; this is how real processors avoid the conflict.

Design Recommendation: use split instruction/data caches; the MIPS pipeline assumes separate instruction and data access paths.

12.8 Data Hazards

12.8.1 Data Hazard Definition

Concept:

Example:

ADD $s0, $t0, $t1      # $s0 = $t0 + $t1
SUB $t2, $s0, $t3      # $t2 = $s0 - $t3 (uses $s0 from ADD)

Problem:

12.8.2 Data Hazard Example Analysis

Instruction Sequence:

ADD $s0, $t0, $t1
SUB $t2, $s0, $t3

Pipeline Without Stalls:

Time:    0-200   200-400  400-600  600-800  800-1000
ADD:     IF      ID       EX       MEM      WB  ← ADD writes $s0 here
SUB:             IF       ID       EX       MEM
                          ↑
                 Reads $s0 here (old value!)

Problem Timeline:

SUB reads $s0 at 400-600, but correct value not available until 800-1000!

12.8.3 Solution 1: Pipeline Stalls

Insert Two Bubbles:

Time:    0-200   200-400  400-600  600-800  800-1000  1000-1200  1200-1400
ADD:     IF      ID       EX       MEM      WB
SUB:             IF       [BUBBLE] [BUBBLE] ID        EX         MEM

Result: SUB's ID is pushed back to 800-1000 ps, the same cycle in which ADD writes $s0.

Cost: two wasted cycles for every such dependent pair.

Critical Timing: this works because the register file writes in the first half of each cycle and reads in the second half, so SUB can pick up the new $s0 during ADD's WB cycle.

12.8.4 Solution 2: Forwarding (Bypassing)

Key Observation: ADD's result already exists at the ALU output at the end of its EX stage (400-600 ps); it simply has not reached the register file yet.

Forwarding Logic:

Time:    0-200   200-400  400-600  600-800  800-1000
ADD:     IF      ID       EX       MEM      WB
SUB:             IF       ID       EX       MEM
                          ↑        ↑
                 Read regs   Use forwarded value!

Implementation: an extra multiplexer at each ALU input selects between:

- Register file output (normal path)
- Forwarded value from previous ALU output

Benefit: the dependent instruction proceeds without stalling; both bubbles are eliminated.

Hardware Required:

- Forwarding multiplexers
- Forwarding detection logic
- Forwarding paths (wires)
- Pipeline registers to hold values

Complexity: moderate; extra comparators and multiplexers around the EX stage, but no change to the basic five-stage structure.

Result: back-to-back ALU dependences execute with zero stall cycles.
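The detection logic follows the classic textbook conditions. Below is a sketch for one ALU operand (field names such as reg_write and rd are illustrative, not from the lecture): prefer the newest in-flight result, and fall back to the register file when no producer matches.

def forward_source(ex_mem, mem_wb, rs):
    """Choose where the EX stage gets its operand for source register rs."""
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == rs:
        return "EX/MEM"      # ALU result computed in the previous cycle
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == rs:
        return "MEM/WB"      # value about to be written back
    return "REGFILE"         # no hazard: use the value read during ID

# ADD $s0,... is now in MEM, SUB ...,$s0,... is in EX ($s0 is register 16):
print(forward_source({"reg_write": True, "rd": 16},
                     {"reg_write": False, "rd": 0}, 16))   # EX/MEM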

12.8.5 Load-Use Data Hazard

Special Case:

LW  $s0, 0($t0)        # Load from memory into $s0
SUB $t2, $s0, $t3      # Use $s0 immediately

Problem:

Timeline:

Time:    0-200   200-400  400-600  600-800  800-1000
LW:      IF      ID       EX       MEM      WB
SUB:             IF       ID       EX       MEM
                          ↑        ↑
                 Need value   Value first available here!

LW result available at 600-800, but SUB's EX at 600-800 (simultaneous!)

Unavoidable Stall:

Time:    0-200   200-400  400-600  600-800  800-1000  1000-1200
LW:      IF      ID       EX       MEM      WB
SUB:             IF       ID       [BUBBLE] EX        MEM

One stall bubble required: even with forwarding, SUB's EX must wait one cycle, because the loaded value appears only at the end of LW's MEM stage.
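The stall decision itself is a one-line check made while the consumer sits in ID (again textbook-style; the field names are illustrative):

def load_use_stall(id_ex, if_id):
    """True when the instruction in EX is a load whose destination register
    is a source of the instruction currently in ID."""
    return id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

# LW $s0,0($t0) in EX; SUB $t2,$s0,$t3 in ID ($s0=16, $t3=11): stall.
print(load_use_stall({"mem_read": True, "rt": 16}, {"rs": 16, "rt": 11}))  # True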

12.8.6 Compiler Solution: Code Reordering

C Code Example:

a = b + e;
c = b + f;

Naive Assembly (Load-Use Hazards):

LW   $t1, 0($t0)    # Load b into $t1
LW   $t2, 4($t0)    # Load e into $t2
ADD  $t3, $t1, $t2  # a = b + e ← HAZARD: uses $t2 immediately after LW
SW   $t3, 8($t0)    # Store a

LW   $t4, 12($t0)   # Load f into $t4
ADD  $t5, $t1, $t4  # c = b + f ← HAZARD: uses $t4 immediately after LW
SW   $t5, 16($t0)   # Store c

Total: 7 instructions + 2 stalls = 9 clock cycles

Optimized Assembly (Reordered):

LW   $t1, 0($t0)    # Load b into $t1
LW   $t2, 4($t0)    # Load e into $t2
LW   $t4, 12($t0)   # Load f into $t4 ← Moved here!
ADD  $t3, $t1, $t2  # a = b + e ← No hazard! $t2 available
SW   $t3, 8($t0)    # Store a ← Moved here!
ADD  $t5, $t1, $t4  # c = b + f ← No hazard! $t4 available
SW   $t5, 16($t0)   # Store c

Total: 7 instructions + 0 stalls = 7 clock cycles

Technique: move the independent load of f up, between the earlier loads and their consumers, so each loaded value has a spare cycle before it is used.

Savings: 2 clock cycles (22% improvement)

Compiler Responsibility: modern compilers schedule instructions to avoid load-use stalls automatically when generating code for pipelined targets.

Programmer Awareness: hand-written assembly must be scheduled by hand; high-level-language programmers can usually rely on the compiler.
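A sketch of what the scheduler is optimizing (my own model of the one-bubble rule): count how often a load's destination is consumed by the very next instruction.

def count_load_use_stalls(seq):
    """seq: list of (dest, sources, is_load); one stall per load whose
    result is used by the instruction immediately after it."""
    return sum(1 for prev, cur in zip(seq, seq[1:])
               if prev[2] and prev[0] in cur[1])

naive = [("t1", (), True), ("t2", (), True), ("t3", ("t1", "t2"), False),
         ("a", ("t3",), False), ("t4", (), True), ("t5", ("t1", "t4"), False),
         ("c", ("t5",), False)]
reordered = [naive[0], naive[1], naive[4], naive[2], naive[3], naive[5], naive[6]]
print(count_load_use_stalls(naive), count_load_use_stalls(reordered))   # 2 0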

12.9 Control Hazards

12.9.1 Control Hazard Definition

Concept:

Example:

BEQ $1, $2, target     # Branch if $1 == $2
ADD $3, $4, $5         # Next sequential instruction
...
target: SUB $6, $7, $8 # Branch target

Which instruction to fetch after BEQ?

12.9.2 Branch Execution in Pipeline

Branch Instruction:

BEQ $1, $2, 40         # Branch 40 instructions ahead if equal

Pipeline Stages:

  1. IF: Fetch BEQ instruction
  2. ID: Read $1, $2 from register file
  3. EX: ALU compares (subtract $2 from $1, check zero flag)
  4. Result available after EX stage

Problem:

Without Optimization:

Time:    0-200   200-400  400-600  600-800
BEQ:     IF      ID       EX       MEM
???:             IF       ???

Two bubbles are required if we simply wait for the outcome: the branch direction is known only after EX (400-600 ps), so the two fetch slots before that cannot be used with confidence.

12.9.3 Solution 1: Early Branch Resolution

Add Hardware in ID Stage:

Modified Pipeline:

Time:    0-200   200-400  400-600
BEQ:     IF      ID       EX
                 ↑
         Decision here!
Next:            IF

Benefit: the branch penalty drops from two bubbles to at most one (the instruction fetched during BEQ's ID).

Cost: a dedicated equality comparator in the ID stage, since the branch can no longer wait for the main ALU.

Limitation: the registers being compared may themselves still be in flight from earlier instructions, which requires forwarding into ID or an extra stall.

12.9.4 Solution 2: Branch Prediction

Static Branch Prediction:

Strategy: Predict Not Taken

Example (Prediction Correct):

ADD  $3, $4, $5
BEQ  $1, $2, 14        # Actually NOT taken
LW   $8, 0($9)         # Fetch this (prediction: not taken)

Timeline:

Time:    0-200   200-400  400-600  600-800
ADD:     IF      ID       EX       MEM
BEQ:             IF       ID       EX
LW:                       IF       ID
                          ↑ Fetched based on prediction

At 400-600 (after BEQ's ID): the branch is resolved as not taken, so the prediction was correct and LW continues with no penalty.

Example (Prediction Incorrect):

ADD  $3, $4, $5
BEQ  $1, $2, 14        # Actually IS taken
LW   $8, 0($9)         # Fetched (but shouldn't execute)
...
target: SUB $6, $7, $8 # Should execute this instead

Timeline:

Time:    0-200   200-400  400-600  600-800
ADD:     IF      ID       EX       MEM
BEQ:             IF       ID       EX
LW:                       IF       [DISCARD]
SUB:                               IF


At 400-600 (after BEQ's ID):

  • Determine branch IS taken
  • Prediction wrong!
  • Discard LW (clear pipeline stage)
  • Fetch SUB from branch target
  • One bubble inserted

Result Analysis:

  • Correct prediction: Save one cycle
  • Incorrect prediction: Same as no prediction (one stall)
  • Net benefit if prediction often correct
  • No additional penalty for wrong guess

12.9.5 Static Branch Prediction Strategies

Simple Static: Always Predict Not Taken

  • Fixed prediction
  • Ignore branch type
  • Ignore branch history
  • Simple hardware

Program Behavior-Based Static:

  • Analyze typical branch patterns
  • Make predictions based on code structure

Backward Branches:

  • Usually taken
  • Example: loops, where the branch at the bottom jumps back to the top:

    loop:
        ...
        BEQ $t0, $zero, loop   # Backward branch

  • Loop iterations: Branch taken many times
  • Loop exit: Branch not taken once
  • Prediction: Taken → Correct most of the time

Forward Branches:

  • Usually not taken
  • Example: if statements, where the branch skips the true case:

    BEQ $t0, $zero, skip
    ...                      # True case
    skip:
    ...                      # After if

  • True case: Branch not taken
  • False case: Branch taken
  • Prediction depends on code style

Strategy: Backward Taken, Forward Not Taken

  • 90%+ accuracy possible
  • Based on empirical program analysis
  • Requires code analysis

12.9.6 Dynamic Branch Prediction

Concept:

  • Hardware learns branch behavior
  • Predicts based on history
  • Adapts to current code execution
  • Not fixed prediction

Branch History Table:

  • Hardware table storing recent branch outcomes
  • Indexed by branch instruction address
  • Each entry: Branch taken or not taken recently
  • Predicts based on recent behavior

Simple 1-Bit Predictor:

  • One bit per branch: Last outcome
  • Predict same as last time
  • Updates after each execution

Example:

Loop iteration 1: Taken → Predict taken next
Loop iteration 2: Taken → Predict taken next
...
Loop iteration 100: Taken → Predict taken next
Loop exit: Not taken → Predict not taken next (wrong for next loop!)

Problem: Wrong twice per loop (entry and exit)

2-Bit Saturating Counter:

  • Two bits per branch: State machine
  • Four states:
- 00: Strongly not taken
- 01: Weakly not taken
- 10: Weakly taken
- 11: Strongly taken
  • Change prediction after two consecutive wrong predictions
  • More stable
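A sketch comparing the two schemes above on a loop branch that is taken nine times and then falls through, over many visits to the loop. The 1-bit predictor mispredicts twice per visit (entry and exit); the 2-bit counter mispredicts only once.

def run_1bit(outcomes):
    last, correct = True, 0
    for taken in outcomes:
        correct += (last == taken)       # predict: same as last time
        last = taken
    return correct / len(outcomes)

def run_2bit(outcomes):
    state, correct = 3, 0                # 0..3; start at "strongly taken"
    for taken in outcomes:
        correct += ((state >= 2) == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

pattern = ([True] * 9 + [False]) * 100   # 10-iteration loop, visited 100 times
print(run_1bit(pattern), run_2bit(pattern))   # ~0.80 vs 0.90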

Advanced Predictors:

  • Correlating predictors (look at multiple branches)
  • Two-level adaptive predictors
  • Tournament predictors (combine multiple algorithms)
  • Very high accuracy (>95%)

Hardware Cost:

  • Branch history table (memory)
  • Prediction logic (comparators, counters)
  • Update logic
  • Worthwhile for performance gain

12.10 Summary and Key Concepts

12.10.1 Pipelining Benefits

Performance Improvement:

  • Throughput increased by number of stages
  • 5-stage pipeline → 4-5× speedup
  • Latency unchanged or slightly worse
  • Overlapping execution key

Hardware Utilization:

  • All stages active in steady state
  • Parallel processing
  • Maximum efficiency

12.10.2 Pipeline Challenges

Hazards:

  1. Structural: Hardware resource conflicts
  2. Data: Instruction dependencies
  3. Control: Branch/jump decisions

Solutions:

  • Structural: Separate memories/caches
  • Data: Forwarding, stalls, code reordering
  • Control: Early resolution, branch prediction

12.10.3 MIPS Design Philosophy

ISA Designed for Pipelining:

  • Fixed 32-bit instruction length
  • Regular instruction formats
  • Separate funct field
  • Simple addressing modes
  • Balanced pipeline stages

Performance Through Hardware:

  • Pipelining fundamental to MIPS
  • Not optimized for single-cycle
  • Hardware complexity for software simplicity

12.10.4 Key Takeaways

  1. Pipelining improves throughput, not latency
  2. Steady state determines peak performance
  3. Pipeline fill time overhead for small programs
  4. Hazards reduce pipelining efficiency
  5. Forwarding eliminates many data hazards
  6. Load-use hazard always requires one stall
  7. Branch prediction crucial for control flow
  8. Compiler optimization reduces stalls
  9. ISA design significantly impacts pipeline efficiency
  10. ILP fundamental to modern processor performance

12.11 Important Formulas and Metrics

Speedup Calculation

Speedup = Non-pipelined Time / Pipelined Time

Ideal Speedup = Number of Pipeline Stages

Actual Speedup = Number of Stages / (1 + Hazard Impact)

Throughput

Throughput = 1 instruction / Clock Period

Throughput Improvement = Clock Period (non-pipelined) / Clock Period (pipelined)

Pipeline Performance

Time = (Number of Instructions + Stages - 1) × Clock Period

CPI (Cycles Per Instruction) = 1 + Stall Cycles per Instruction

Effective CPI = 1 + (Structural Stalls + Data Stalls + Control Stalls)

Branch Prediction Accuracy

Accuracy = Correct Predictions / Total Branches

Stall Reduction = Accuracy × Cycles Saved per Correct Prediction
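A worked example with the formulas above, using illustrative numbers (the frequencies are assumptions, not measurements from the lecture): 25% loads of which 40% are immediately used, and 15% branches with a 10% misprediction rate under early (one-bubble) resolution.

base_cpi       = 1.0
data_stalls    = 0.25 * 0.40 * 1     # load-use: one bubble each
control_stalls = 0.15 * 0.10 * 1     # mispredicted branch: one bubble each

effective_cpi = base_cpi + data_stalls + control_stalls
print(effective_cpi)                  # 1.115
print(round(5 / effective_cpi, 2))    # ~4.48x of the ideal 5x speedup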

Key Takeaways

  1. Pipelining improves throughput, not latency—individual instructions take same or longer time, but more instructions complete per unit time.
  2. Five-stage MIPS pipeline: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write-Back (WB).
  3. Ideal speedup equals number of stages—five-stage pipeline theoretically achieves 5× speedup over single-cycle design.
  4. Assembly line analogy clarifies concept—like manufacturing, each stage works on different item simultaneously for maximum efficiency.
  5. Pipeline registers store intermediate results between stages, enabling independent operation and preventing data corruption.
  6. Three hazard types threaten pipeline efficiency: Structural (resource conflicts), Data (register dependencies), Control (branch/jump delays).
  7. Structural hazards resolved by hardware duplication—separate instruction and data caches eliminate memory access conflicts.
  8. Data hazards occur when instructions depend on previous results—forwarding (bypassing) allows ALU results to skip write-back stage.
  9. Forwarding paths connect pipeline stages directly, enabling result use before register file write completes.
  10. Load-use hazard requires one-cycle stall—memory data unavailable in time for immediate ALU use even with forwarding.
  11. Compiler code reordering can eliminate some stalls—moving independent instructions into load delay slots maintains pipeline flow.
  12. Control hazards arise from branch/jump instructions—don't know next PC until branch resolves in third cycle.
  13. Branch delay of 2 cycles in the basic pipeline: the branch is fetched, decoded, and executed before its outcome is known, wasting the 2 instruction slots fetched in the meantime.
  14. Early branch resolution reduces penalty—dedicated comparison hardware in ID stage cuts delay to 1 cycle.
  15. Static branch prediction assumes direction (e.g., always not-taken)—simple but limited effectiveness.
  16. Dynamic branch prediction learns patterns from history—branch target buffer with 2-bit saturating counters achieves >90% accuracy.
  17. Two-bit counters prevent single misprediction disruption—requires two wrong predictions to change direction, handling loop patterns well.
  18. Pipeline CPI = 1 + structural stalls + data stalls + control stalls per instruction; minimizing hazards approaches the ideal throughput of one instruction per cycle.
  19. Modern processors use sophisticated prediction—multi-level predictors, pattern history tables, and return address stacks minimize control hazards.
  20. Pipeline complexity trades off with performance—deeper pipelines increase throughput but amplify hazard penalties and design difficulty.

Summary

Pipelining revolutionizes processor performance by applying manufacturing assembly-line principles to instruction execution, allowing multiple instructions to occupy different pipeline stages simultaneously. The five-stage MIPS pipeline (IF, ID, EX, MEM, WB) theoretically achieves 5× speedup by keeping all hardware components busy every cycle, transforming the inefficient single-cycle design where most hardware sat idle most of the time. However, three hazard types threaten this ideal performance: structural hazards from resource conflicts (solved by hardware duplication like separate instruction and data caches), data hazards from register dependencies (addressed by forwarding paths that bypass results directly between stages, though load-use cases still require one-cycle stalls), and control hazards from branches that don't resolve until the third cycle (mitigated by early branch resolution hardware, static prediction strategies, and sophisticated dynamic branch predictors using two-bit saturating counters that achieve over 90% accuracy).

The effectiveness of forwarding demonstrates how careful hardware design can eliminate most data hazard stalls, while compiler optimizations like instruction reordering can fill remaining delay slots with useful work. Branch prediction evolution from simple static schemes to complex dynamic predictors with branch target buffers reflects the critical importance of minimizing control hazards in modern high-performance processors. Pipeline registers between stages serve as the crucial mechanism enabling independent stage operation, storing intermediate results and control signals while preventing data corruption across instruction overlaps.

While pipelining introduces significant design complexity compared to single-cycle implementations, the dramatic performance improvements—approaching 5× speedup in practice—justify this added sophistication, making pipelining universal in modern processor architectures from embedded systems to supercomputers. Understanding these hazards and their solutions provides essential foundation for comprehending real-world processor implementations and the tradeoffs between pipeline depth, clock frequency, and hazard penalties that define contemporary computer architecture.