
Lecture 13: Pipeline Operation and Timing

Lectures on Computer Architecture


By Dr. Isuru Nawinne

13.1 Introduction

This lecture provides a comprehensive, cycle-by-cycle analysis of the MIPS five-stage pipeline, examining how instructions flow through the stages with detailed attention to the pipeline registers that store intermediate results between them. We explore the critical role of these registers in enabling independent stage operation, trace complete execution sequences for load and store instructions, analyze timing constraints and delay contributions, and work through practical exercises calculating clock frequencies and optimizing pipeline performance. This detailed examination reveals the hardware mechanisms that transform the conceptual pipeline model into functioning silicon.

13.2 Lecture Introduction and Recap

13.2.1 Previous Topics Review

Pipelining Concept:

Performance Metric:

Hazards Covered:

  1. Structural: Hardware resource conflicts
  2. Data: Register/memory dependencies
  3. Control: Branch/jump decision delays

Solutions Discussed:

13.2.2 Today's Focus

Detailed Pipeline Analysis:

13.3 Five-Stage MIPS Pipeline Review

Five-Stage MIPS Pipeline Architecture

13.3.1 Stage 1: Instruction Fetch (IF)

Operations:

Hardware Elements:

Key Point:

13.3.2 Stage 2: Instruction Decode / Register Read (ID)

Operations:

Hardware Elements:

Workload Balancing:

Control Signal Generation:

13.3.3 Stage 3: Execution (EX)

Operations:

Hardware Elements:

Key Characteristics:

13.3.4 Stage 4: Memory Access (MEM)

Operations:

Hardware Elements:

Timing Consideration:

13.3.5 Stage 5: Write Back (WB)

Operations:

Hardware Elements:

Minimal Hardware:

13.4 Pipeline Registers: Necessity and Function

13.4.1 Problem Without Pipeline Registers

Scenario:

Example Issues:

  1. Register file: ID stage reads while WB stage writes
  2. Control signals: Generated in ID, needed in later stages
  3. Data values: Computed in EX, needed in MEM
  4. Overwriting: New instruction data overwrites previous instruction data

Result Without Pipeline Registers:

13.4.2 Pipeline Register Purpose

Pipeline Registers Between Pipeline Stages

Key Function:

Placement:

Exception:

13.4.3 Pipeline Register Contents

IF/ID Pipeline Register:

ID/EX Pipeline Register:

EX/MEM Pipeline Register:

MEM/WB Pipeline Register:

13.4.4 Timing: Writing and Reading Pipeline Registers

At Rising Clock Edge:

  1. Pipeline register write begins
  2. Small hold time delay (~10-30 ps)
  3. Data captured and stored
  4. Writing delay consumed

After Writing:

  1. Reading delay begins
  2. Data propagates to output (~10-30 ps)
  3. Outputs stabilize at new values
  4. Next stage begins operations

Combined Overhead:
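
The write-then-read sequence can be modeled in a few lines. A minimal Python sketch, assuming the ~30 ps write and ~30 ps read delays used throughout this lecture (the class and field names are illustrative, not hardware signals):

  # Minimal model of an edge-triggered pipeline register.
  # Assumed figures from this lecture: ~30 ps write, ~30 ps read.

  class PipelineRegister:
      WRITE_DELAY_PS = 30   # time to capture inputs after the rising edge
      READ_DELAY_PS = 30    # time for outputs to stabilize after capture

      def __init__(self):
          self.stored = None    # value visible to the next stage
          self.pending = None   # value driven by the previous stage

      def present(self, value):
          self.pending = value  # previous stage drives the inputs

      def rising_edge(self):
          # Capture on the clock edge; only after write + read delay
          # (~60 ps combined) can the next stage start its work.
          self.stored = self.pending
          return self.WRITE_DELAY_PS + self.READ_DELAY_PS

  if_id = PipelineRegister()
  if_id.present({"instruction": 0x8D280020, "pc_plus_4": 0x00400004})
  overhead_ps = if_id.rising_edge()
  print(if_id.stored, f"usable after ~{overhead_ps} ps")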

Critical Observation:

13.5 Load Word Instruction: Detailed Cycle-by-Cycle Analysis

13.5.1 Load Word Instruction Format

Encoding:

LW $rt, offset($rs)

Opcode: 100011 (bits 31-26)
RS:     Base register (bits 25-21)
RT:     Destination register (bits 20-16)
Offset: 16-bit immediate (bits 15-0)

Operation: $rt = Memory[$rs + offset]

Example: LW $8, 32($9)
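
As a quick sanity check of the field layout, the example can be assembled by hand; a short Python sketch:

  # Assemble LW $8, 32($9) from its I-format fields.
  opcode, rs, rt, offset = 0b100011, 9, 8, 32

  word = (opcode << 26) | (rs << 21) | (rt << 16) | (offset & 0xFFFF)
  print(hex(word))  # 0x8d280020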

13.5.2 Clock Cycle 1: Instruction Fetch (IF)

Start of Cycle:

Operations:

  1. Update PC register (rising edge)
  2. Read PC value (small delay)
  3. Access instruction memory with PC address
  4. Instruction memory read delay: ~200 ps (dominant)
  5. Compute PC + 4 in parallel: ~70 ps

End of Cycle:

Hardware Shading Convention:

Total Stage Time: ~200 ps (instruction memory dominant)

13.5.3 Clock Cycle 2: Instruction Decode / Register Read (ID)

Start of Cycle (Rising Edge):

  1. IF/ID register write: ~30 ps
  2. IF/ID register read: ~30 ps
  3. Combined delay: ~60 ps

After Pipeline Register:

  1. Instruction word available
  2. Extract fields:

Parallel Operations:

End of Cycle:

Why Read Both Registers:

Total Stage Time: ~60 + 90 = ~150 ps (register read dominant)

13.5.4 Clock Cycle 3: Execution (EX)

Start of Cycle:

  1. ID/EX register write: ~30 ps
  2. ID/EX register read: ~30 ps

ALU Input Preparation:

  1. Input A: Base address (from $9) directly from pipeline register
  2. Input B: Multiplexer selects immediate OR register

ALU Operation:

  1. Add base address + offset
  2. ALU delay: ~90 ps (dominant)
  3. Result: Memory address = $9 + 32

Parallel Operations (for branches, not used here):

End of Cycle:

Total Stage Time: ~30 + 30 + 20 + 90 = ~170 ps (ALU dominant)

13.5.5 Clock Cycle 4: Memory Access (MEM)

Start of Cycle:

  1. EX/MEM register write: ~30 ps
  2. EX/MEM register read: ~30 ps

Memory Access:

  1. ALU result (address) → Data memory address input
  2. MemRead control signal = 1 (enable read)
  3. MemWrite control signal = 0 (disable write)
  4. Data memory read delay: ~250 ps (DOMINANT - slowest operation!)

Parallel Operations (unused for LW):

End of Cycle:

Critical Path:

Total Stage Time: ~30 + 30 + 250 = ~310 ps (memory READ dominant!)

13.5.6 Clock Cycle 5: Write Back (WB)

Start of Cycle:

  1. MEM/WB register write: ~30 ps
  2. MEM/WB register read: ~30 ps

Data Selection:

  1. MemtoReg multiplexer:

Register Write Preparation:

  1. Write data: Memory data from multiplexer
  2. Write address: RT ($8) from pipeline register
  3. RegWrite control signal = 1 (enable write)

CRITICAL ERROR IN TEXTBOOK DIAGRAM: the write register number must come from the MEM/WB pipeline register, not directly from the IF/ID register; by this cycle the IF/ID register holds a different instruction (see Section 13.7.1).

At Rising Edge (End of Cycle / Start of Next):

  1. Register $8 written with loaded data
  2. Write occurs in first half of cycle
  3. Subsequent ID stage can read in second half (same cycle!)

Register File Timing Trick:
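
A minimal sketch of the trick, assuming writes land in the first half of the cycle and reads happen in the second half:

  # Register file with write-before-read inside one clock cycle.
  regs = [0] * 32

  def cycle(write_reg=None, write_data=None, read_regs=()):
      if write_reg is not None:
          regs[write_reg] = write_data          # first half: WB writes
      return [regs[r] for r in read_regs]       # second half: ID reads

  # WB writes $8 while a later instruction's ID reads $8 in the same cycle:
  print(hex(cycle(write_reg=8, write_data=0xABCD, read_regs=(8,))[0]))  # 0xabcd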

Total Stage Time: ~30 + 30 + 20 = ~80 ps before the register file write, which overlaps the first half of the cycle (shortest stage!)

13.5.7 Load Word Complete Pipeline Summary

Cycle  Stage  Operations                       Dominant Delay  Time
1      IF     Fetch instruction, PC+4          Inst Memory     200 ps
2      ID     Decode, read regs, control       Reg Read        150 ps
3      EX     ALU: base + offset               ALU             170 ps
4      MEM    Read data memory                 Memory Read     310 ps ← CRITICAL!
5      WB     Select memory, write register    Multiplexer      80 ps

Minimum Clock Period: 310 ps (limited by MEM stage)

Maximum Clock Frequency: 1 / 310ps ≈ 3.2 GHz
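
The same arithmetic in a few lines of Python, using the stage totals from the table above:

  # Clock period is set by the slowest stage.
  stage_ps = {"IF": 200, "ID": 150, "EX": 170, "MEM": 310, "WB": 80}

  t_clock = max(stage_ps.values())    # 310 ps (MEM)
  f_max_ghz = 1000 / t_clock          # period in ps gives frequency in GHz
  print(f"T_clock = {t_clock} ps, f_max = {f_max_ghz:.2f} GHz")  # 3.23 GHz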

Pipeline Overhead:

Comparison to Single-Cycle:

13.6 Store Word Instruction: Key Differences

13.6.1 Store Word Instruction Format

Encoding:

SW $rt, offset($rs)

Opcode: 101011 (bits 31-26)
RS:     Base register (bits 25-21)
RT:     Source data register (bits 20-16)
Offset: 16-bit immediate (bits 15-0)

Operation: Memory[$rs + offset] = $rt

Example: SW $8, 32($9)

Key Difference from Load:

Stages IF, ID, EX: Same as Load Word

Instruction Fetch: Identical to LW

Instruction Decode: Identical to LW

Execution: Identical to LW

13.6.2 Memory Access Stage: KEY DIFFERENCE

Start of Cycle:

- Memory address (from ALU)
- RT data value (from register file, preserved through pipeline)

Memory Access:

Operation:

End of Cycle:

Control Signal Critical:

Control Signal        Load  Store
MemRead                1     0
MemWrite               0     1
RegWrite (WB stage)    1     0  ← CRITICAL!
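
These settings are just a decode table. A sketch in Python, limited to the three signals shown here (other control bits omitted):

  # Control bits that differ between load and store.
  CONTROL = {
      "lw": {"MemRead": 1, "MemWrite": 0, "RegWrite": 1},
      "sw": {"MemRead": 0, "MemWrite": 1, "RegWrite": 0},
  }

  # ID generates the bits once; they then ride through ID/EX, EX/MEM and
  # MEM/WB alongside the instruction's data values.
  print(CONTROL["sw"]["RegWrite"])  # 0: a store must never write a register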

13.6.3 Write Back Stage: NO OPERATION

Store Word WB Stage:

Why RegWrite MUST Be 0:

- Random data written to random register
- Data corruption
- Program failure

Hardware Still Operates:

Lesson: Control Signals Essential

Store Word Pipeline Summary:

Cycle  Stage  Operations             Notes
1      IF     Fetch SW instruction   Same as LW
2      ID     Decode, read RS, RT    RT value USED (not discarded)
3      EX     Compute address        Same as LW
4      MEM    Write RT to memory     WRITE instead of read
5      WB     Nothing (bubble)       RegWrite = 0, stage idle

13.7 Common Pipeline Diagram Errors

13.7.1 Error 1: Write Register Address Source

Incorrect Diagram Shows:

Why This Is Wrong:

Example:

Cycle 1: LW $8, 0($10) fetched  (IF)
Cycle 2: LW $9, 4($10) fetched  (IF), LW $8 in ID
Cycle 3: LW $10, 8($10) fetched (IF), LW $8 in EX
Cycle 4: ADD $11, $12, $13 fetched (IF), LW $8 in MEM
Cycle 5: SUB $14, $15, $16 fetched (IF), LW $8 in WB

At Cycle 5:
• IF/ID contains SUB (writes $14)
• WB should write $8 (from LW)
• If using IF/ID: Would write to $14 instead of $8!
• WRONG REGISTER!

Correct Implementation:
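
The destination register number must ride through every pipeline register with its own instruction. A minimal sketch of the correct behavior (the dictionary fields are hypothetical names, not actual hardware signals):

  # The destination register number travels with its instruction.
  id_ex  = {"instr": "LW $8",        "dest": 8}
  ex_mem = {}
  mem_wb = {}
  if_id  = {"instr": "SUB $14, ...", "dest": 14}   # fetched five cycles later

  # Each clock edge copies the field one register downstream:
  ex_mem["dest"] = id_ex["dest"]      # end of EX
  mem_wb["dest"] = ex_mem["dest"]     # end of MEM

  # WB must take its write address from MEM/WB, never from IF/ID:
  print(mem_wb["dest"])   # 8: correct
  print(if_id["dest"])    # 14: the wrong register the faulty diagram implies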

Additional Lines Required:

13.7.2 Error 2: Incorrect Memory Access Indication

Diagram Error from Textbook:

Correct Resource Usage:

Instruction  Data Memory (MEM)         Register Write (WB)
LW           ✓ Read                    ✓ Write Reg
SW           ✓ Write                   ✗ No action
ADD          ✗ No access               ✓ Write Reg
BEQ          ✗ No access (PC update)   ✗ No write

Shading Convention:

ADD Instruction Correct:

LW Instruction Correct:

13.7.3 Error 3: Store Word Memory Read

Another Common Error:

Why Wrong:

Correct:

13.8 Multi-Clock-Cycle Pipeline Diagrams

13.8.1 Single-Clock vs Multi-Clock Diagrams

Single-Clock-Cycle Diagram:

Multi-Clock-Cycle Diagram:

13.8.2 Traditional Multi-Cycle Diagram

Format:

Cycle:     1    2    3    4    5    6    7    8    9
Instr 1:   IF   ID   EX   MEM  WB
Instr 2:        IF   ID   EX   MEM  WB
Instr 3:             IF   ID   EX   MEM  WB
Instr 4:                  IF   ID   EX   MEM  WB
Instr 5:                       IF   ID   EX   MEM  WB
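
Charts like this are easy to generate programmatically; a small Python sketch:

  # Print a multi-clock-cycle pipeline chart for n back-to-back instructions.
  STAGES = ["IF", "ID", "EX", "MEM", "WB"]

  def pipeline_chart(n):
      cycles = n + len(STAGES) - 1
      print("Cycle:    " + "".join(f"{c:<5}" for c in range(1, cycles + 1)))
      for i in range(n):
          row = "     " * i + "".join(f"{s:<5}" for s in STAGES)
          print(f"Instr {i + 1}:  " + row)

  pipeline_chart(5)  # reproduces the chart above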

Shows:

Does NOT Show:

13.8.3 Enhanced Multi-Cycle Diagram with Resources

Format:

Cycle:     1     2     3     4     5     6     7
Instr 1:  [IM]  [RF]  [ALU] [DM]  [WB]
Instr 2:        [IM]  [RF]  [ALU] [DM]  [WB]
Instr 3:              [IM]  [RF]  [ALU] [DM]  [WB]

Legend:

• IM: Instruction Memory (IF)
• RF: Register File (ID)
• ALU: ALU operation (EX)
• DM: Data Memory (MEM)
• WB: Write Back (WB)

Shows:

Benefits:

Textbook Error Example:

13.9 Timing and Clock Frequency Analysis

13.9.1 Component Delays (Typical Values)

Component                  Delay (ps)
Instruction Memory         200
Register File Read          90
Register File Write         90
ALU Operation               90
Data Memory Read           250
Data Memory Write          250
Sign Extension              10 (negligible)
Multiplexer                 20
Adder (PC+4, branch)        70
Shift Left 2                10 (wire routing)
Pipeline Register Write     30
Pipeline Register Read      30

Key Observations:

13.9.2 Stage Timing Calculation

Stage 1: Instruction Fetch (IF)

Pipeline Register Write:   N/A (PC register)
Pipeline Register Read:    N/A
Instruction Memory:        200 ps
PC + 4 Adder:              70 ps (parallel)

Total: 200 ps (memory dominant)

Stage 2: Instruction Decode (ID)

IF/ID Write + Read:        60 ps
Register File Read:        90 ps (dominant)
Control Unit Decode:       50 ps (parallel)
Sign Extension:            10 ps (parallel)

Total: 60 + 90 = 150 ps

Stage 3: Execution (EX)

ID/EX Write + Read:        60 ps
Multiplexer:               20 ps
ALU Operation:             90 ps
Branch Adder:              70 ps (parallel)
Shift Left 2:              10 ps (parallel)

Total: 60 + 20 + 90 = 170 ps

Stage 4: Memory Access (MEM)

EX/MEM Write + Read:       60 ps
Data Memory Access:        250 ps (DOMINANT)

Total: 60 + 250 = 310 ps ← CRITICAL PATH!

Stage 5: Write Back (WB)

MEM/WB Write + Read:       60 ps
MemtoReg Multiplexer:      20 ps

Register File Write: 30 ps (first half of cycle)

Total: 60 + 20 + 30 = 110 ps

13.9.3 Clock Frequency Determination

Minimum Clock Period:

Maximum Clock Frequency:

f_max = 1 / T_min
      = 1 / 310 ps
      = 1 / (310 × 10^-12 s)
      = 3.226 GHz
      ≈ 3.2 GHz

Efficiency Analysis:

Stage  Time (ps)  Utilization  Wasted Time
IF     200         65%         110 ps
ID     150         48%         160 ps
EX     170         55%         140 ps
MEM    310        100%           0 ps
WB     110         35%         200 ps

Average utilization: ~60%

Wasted time: ~40% average
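
The utilization figures follow directly from the stage times; a quick check in Python:

  # Per-stage utilization against the 310 ps clock period.
  stage_ps = {"IF": 200, "ID": 150, "EX": 170, "MEM": 310, "WB": 110}
  t_clock = max(stage_ps.values())

  for stage, t in stage_ps.items():
      print(f"{stage:>3}: {t / t_clock:4.0%} used, {t_clock - t:3d} ps wasted")

  avg = sum(stage_ps.values()) / len(stage_ps) / t_clock
  print(f"average utilization = {avg:.0%}")   # 61%, which the lecture rounds to ~60%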

13.9.4 Performance Improvement Strategies

Strategy 1: Pipeline Balancing

- Faster memory technology
- Separate instruction/data caches
- Smaller, faster cache
- Multi-ported memory

Strategy 2: Increase ALU Time

Strategy 3: Additional Pipeline Stages

Strategy 4: Cache Memory

Real-World Example:

13.10 Practical Exercises and Solutions

13.10.1 Exercise: Maximum Clock Frequency Calculation

Given Component Delays:

Instruction Memory:      200 ps
Register File (read):    90 ps
Register File (write):   90 ps
ALU:                     90 ps
Data Memory (read):      250 ps
Data Memory (write):     250 ps
Sign Extend:             ~0 ps
Multiplexer:             20 ps
Adder:                   70 ps
Shift Left 2:            10 ps
Pipeline Register:       30 ps (write), 30 ps (read)

Step 1: Calculate each stage timing

Step 2: Identify critical path

Step 3: Calculate maximum frequency

f_max = 1 / 310 ps
      = 3.226 GHz
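
All three steps in one short script, building each stage time from the given component delays (serial delays add; parallel ones take the maximum):

  # Exercise: stage times from component delays (all values in ps).
  REG = 30 + 30                       # pipeline register write + read

  stage_ps = {
      "IF":  max(200, 70),            # instruction memory; PC+4 adder in parallel
      "ID":  REG + max(90, 10),       # register read; sign extend in parallel
      "EX":  REG + 20 + 90,           # mux then ALU in series
      "MEM": REG + 250,               # data memory access
      "WB":  REG + 20 + 30,           # mux, then register write (first half)
  }

  critical = max(stage_ps, key=stage_ps.get)
  print(stage_ps)                                        # MEM = 310 ps dominates
  print(f"f_max = {1000 / stage_ps[critical]:.3f} GHz")  # 3.226 GHz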

13.10.2 Exercise: Improving Clock Frequency

Question: Suggest mechanisms to increase clock frequency. Discuss negative impacts.

Suggestion 1: Faster Memory Technology

Benefits:
- Significantly reduces critical path
- New critical path: IF at 260 ps
- Frequency increase: 310 ps → 260 ps (~1.2× improvement)

Drawbacks:
- SRAM very expensive
- Much larger area
- Higher power consumption
- Limited capacity

Suggestion 2: Cache Memory (BEST)

Benefits:
- Cost-effective
- Good performance
- Scalable
- Industry standard

Drawbacks:
- Cache misses still slow
- Complex cache management
- Additional hardware

Suggestion 3: Split Memory Stage

Benefits:
- More balanced pipeline
- Higher frequency possible

Drawbacks:
- More pipeline registers (overhead)
- Increased latency
- More complex control

Suggestion 4: Eliminate Pipeline Register Overhead

Benefits:
- Removes 60 ps overhead per stage
- Significant improvement

Drawbacks:
- Timing more complex
- Clock skew issues
- Less reliable

13.10.3 Exercise: ALU Optimization Impact

Question: ALU time shortened by 25%. Does it affect speedup?

Analysis:

Scenario 1: MEM is Critical Path (Typical)

Conclusion: No improvement when not on critical path

Scenario 2: EX is Critical Path (Hypothetical)

Conclusion: Significant improvement when on critical path
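
Checking both scenarios numerically (25% off the 90 ps ALU):

  # Does a 25% faster ALU change the clock period?
  alu_fast = 90 * 0.75                 # 67.5 ps
  ex_fast = 60 + 20 + alu_fast         # EX stage: 147.5 ps instead of 170 ps

  # Scenario 1: MEM (310 ps) is the critical path.
  print(max(310, ex_fast))             # 310: clock period unchanged, no speedup

  # Scenario 2 (hypothetical): EX is the critical path.
  print(f"{170 / ex_fast:.2f}x")       # 1.15x frequency improvement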

General Principle:

13.10.4 Exercise: Pipeline Speedup Calculation

Given:

Part A: Non-pipelined execution time

Time = Instructions × Time per instruction
     = 10^7 × 100 ps
     = 10^9 ps
     = 1 ms (0.001 seconds)

Part B: Speedup from 20-stage perfect pipeline

Ideal Speedup = Number of stages = 20×

Part C: Time with perfect pipeline

Time = (10^7 × 100 ps) / 20
     = 10^9 / 20 ps
     = 5 × 10^7 ps
     = 0.05 ms
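
Parts A-C as a quick computation:

  # Speedup exercise: 10^7 instructions, 100 ps each, 20 ideal stages.
  n_instr, t_instr_ps, n_stages = 10**7, 100, 20

  t_serial_ps = n_instr * t_instr_ps       # Part A: 10^9 ps
  t_pipe_ps = t_serial_ps / n_stages       # Part C: ideal pipeline
  print(t_serial_ps / 1e9, "ms")           # 1.0 ms  (1 ms = 10^9 ps)
  print(t_pipe_ps / 1e9, "ms")             # 0.05 ms
  print(t_serial_ps / t_pipe_ps, "x")      # Part B: 20.0x ideal speedup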

Part D: Real pipeline overhead impact

Answer: BOTH latency and throughput are affected

Latency Impact:

Throughput Impact:

13.11 Summary and Key Takeaways

13.11.1 Pipeline Operation Fundamentals

Pipeline Registers Essential:

Timing Critical:

13.11.2 Design Principles

Make Common Case Fast:

Balance Pipeline Stages:

Control Signals Matter:

13.11.3 Common Mistakes to Avoid

Write Register Address:

Control Signal Errors:

Diagram Interpretation:

13.11.4 Performance Considerations

Critical Path Analysis:

Speedup Limitations:

- Pipeline register overhead
- Unbalanced stages
- Hazards and stalls
- Pipeline fill/drain time

13.11.5 Looking Ahead

Memory Hierarchy (Next Topics):

Real-World Pipelines:

13.12 Important Formulas

Clock Period

T_clock = max(T_IF, T_ID, T_EX, T_MEM, T_WB)

Where each T_stage includes:
• Pipeline register write delay
• Pipeline register read delay
• Dominant component delay

Maximum Frequency

f_max = 1 / T_clock

Pipeline Speedup

Speedup = T_non-pipelined / T_pipelined_steady_state
        ≈ Number of stages (ideal)
        < Number of stages (actual)

Stage Timing General Formula

T_stage = T_pipe_write + T_pipe_read + T_longest_serial_path

Where parallel components don't add: only the slowest parallel path counts.
For example, EX: 30 + 30 + (20 mux + 90 ALU) = 170 ps, while the branch
adder and shifter run in parallel and are hidden.

Throughput

Throughput = 1 instruction / T_clock (steady state)

Latency

Latency = (Number of stages) × T_clock + Pipeline overhead
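
Plugging in this lecture's numbers as a quick check:

Latency    ≈ 5 stages × 310 ps = 1550 ps ≈ 1.6 ns
Throughput = 1 / 310 ps ≈ 3.2 × 10^9 instructions per second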

Key Takeaways

  1. Four pipeline registers separate five stages: IF/ID, ID/EX, EX/MEM, MEM/WB store all information needed by subsequent stages.
  2. Pipeline registers capture data and control signals—instruction fields, register values, ALU results, memory data, and control bits all propagate through pipeline.
  3. Each register updates on clock edge—enabling clean separation between pipeline stages and preventing data corruption from simultaneous operations.
  4. Load instruction takes 5 cycles to complete—IF (fetch), ID (decode/read), EX (address calc), MEM (read memory), WB (write register).
  5. Store instruction uses 4 active stages—skips WB stage since no register write occurs, but occupies pipeline for 5 cycles.
  6. Instruction and data must travel together—control signals propagate alongside data through pipeline to ensure correct operations at later stages.
  7. Register file has two read ports and one write port—enabling simultaneous reads in the ID stage and a write in the WB stage.
  8. Forwarding paths bypass pipeline registers—directly connecting EX/MEM and MEM/WB outputs to ALU inputs for data hazard resolution.
  9. Load-use hazard requires pipeline stall—memory data not available until MEM/WB register, too late for immediate ALU use even with forwarding.
  10. Clock frequency = 1 / (Register Delay + Maximum Stage Delay)—pipeline register overhead reduces frequency below ideal calculation.
  11. Pipeline registers introduce roughly 60 ps of overhead per stage (about 30 ps write plus 30 ps read)—setup/hold times and propagation delays must be accounted for in timing analysis.
  12. Stage delays must balance for optimal performance—uneven stages waste time as clock period determined by slowest stage.
  13. Separate instruction and data caches essential—prevent structural hazards from simultaneous IF and MEM stage memory access.
  14. Pipeline depth tradeoff: Deeper pipelines increase clock frequency but amplify hazard penalties and register overhead.
  15. Write-back stage coincides with fetch of fifth instruction—demonstrating true parallelism with five instructions in pipeline simultaneously.
  16. Control signals generated in ID stage propagate through pipeline with instruction—EX/MEM/WB stages use stored control bits.
  17. ALU result available in EX stage can forward to dependent instruction in EX stage—eliminating most RAW hazard stalls.
  18. Memory data available in MEM stage can forward to dependent instruction in EX stage—but not soon enough for load-use case.
  19. Throughput approaches 1 instruction per cycle in steady state—achieving near 5× speedup over single-cycle design.
  20. Pipeline timing analysis critical for clock frequency determination—must consider all delay components including registers, logic, and wire delays.

Summary

The detailed examination of MIPS pipeline operation reveals the hardware mechanisms that enable efficient instruction-level parallelism through careful staging and register design. Four pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) serve as the critical infrastructure separating the five pipeline stages, capturing and propagating not only instruction data but also all control signals needed by downstream stages.

The cycle-by-cycle analysis of load and store instructions demonstrates how each pipeline stage performs its designated function while simultaneously handling different instructions—instruction fetch occurring for instruction N while instruction N-1 decodes, N-2 executes, N-3 accesses memory, and N-4 writes back results. This true parallelism, with five instructions simultaneously occupying different pipeline stages, achieves the throughput improvement that justifies the pipeline's complexity.

The timing analysis introduces crucial practical considerations: pipeline registers add roughly 60 picoseconds of overhead per stage (about 30 ps to write plus 30 ps to read), stage delays must balance to avoid wasting clock cycles, and clock frequency equals the reciprocal of the slowest stage's total delay. Forwarding paths that bypass pipeline registers—connecting EX/MEM and MEM/WB outputs directly to ALU inputs—eliminate most data hazard stalls by making results available before register write-back completes, though load-use hazards still require a one-cycle stall since memory data arrives too late even with forwarding. The register file's separate read and write ports enable simultaneous reading in the ID stage and writing in the WB stage, essential for maintaining pipeline flow.

Practical exercises in clock frequency calculation reinforce understanding of how component delays, register overhead, and stage balancing determine ultimate processor performance. The separation of instruction and data caches emerges as a non-negotiable requirement, preventing structural hazards from simultaneous memory access in the IF and MEM stages. This comprehensive pipeline view—from register-level mechanisms through timing analysis to performance optimization—provides an essential foundation for understanding real processor implementations and the engineering tradeoffs between pipeline depth, clock frequency, hazard penalties, and design complexity that characterize modern computer architecture.