Lecture 13: Pipeline Operation and Timing - Lectures on Computer Architecture

Click the thumbnail above to watch the video lecture on YouTube

By Dr. Isuru Nawinne

13.1 Introduction

This lecture provides comprehensive, cycle-by-cycle analysis of MIPS five-stage pipeline operation, examining how instructions flow through pipeline stages with detailed attention to the pipeline registers that store intermediate results between stages. We explore the critical role of these registers in enabling independent stage operation, trace complete execution sequences for load and store instructions, analyze timing constraints and delay contributions, and work through practical exercises calculating clock frequencies and optimizing pipeline performance. This detailed examination reveals the hardware mechanisms that transform the conceptual pipeline model into functioning silicon.

13.2 Lecture Introduction and Recap

13.2.1 Previous Topics Review

Pipelining Concept:

Instruction-level parallelism exploitation
Five-stage MIPS pipeline: IF, ID, EX, MEM, WB
Staggered instruction execution
All hardware utilized simultaneously

Performance Metric:

Throughput improved (not latency)
Instructions/unit time increased
Individual instruction latency same or worse
Overall system performance dramatically better

Hazards Covered:

Structural: Hardware resource conflicts
Data: Register/memory dependencies
Control: Branch/jump decision delays

Solutions Discussed:

Structural: Separate I-cache and D-cache
Data: Forwarding, code reordering
Control: Early branch resolution, prediction

13.2.2 Today's Focus

Detailed Pipeline Analysis:

Cycle-by-cycle operation walkthrough
Pipeline register requirements
Timing and delay analysis
Load/Store instruction examples
Common implementation errors
Practical exercises

13.3 Five-Stage MIPS Pipeline Review

Five-Stage MIPS Pipeline Architecture

13.3.1 Stage 1: Instruction Fetch (IF)

Operations:

PC value determines instruction address
Access instruction memory
Fetch 32-bit instruction word
Calculate PC + 4 for next sequential instruction

Hardware Elements:

Program Counter register
Instruction Memory
PC + 4 Adder

Key Point:

Both operations (memory read, PC+4 calculation) occur in parallel

13.3.2 Stage 2: Instruction Decode / Register Read (ID)

Operations:

Decode opcode (6 bits)
Determine instruction type
Identify register fields
Read register file (RS, RT)
Sign-extend immediate value (16→32 bits)
Generate control signals

Hardware Elements:

Instruction decoder (combinational logic)
Register file (read ports)
Sign extension unit
Control unit

Workload Balancing:

Decode + register read fit in one cycle
Even distribution of work
Register read dominates timing

Control Signal Generation:

9-10 control signal bits generated
Based on opcode
Used by subsequent stages
Must be preserved through pipeline

13.3.3 Stage 3: Execution (EX)

Operations:

ALU performs computation OR address calculation
Multiplexer selects second operand (register vs immediate)
Branch: Compare registers, compute target address

Hardware Elements:

ALU (Arithmetic Logic Unit)
Input multiplexer (register/immediate selection)
Branch target adder (parallel to ALU)
Shift left 2 unit (for branch offset)

Key Characteristics:

ALU operation dominates timing
Branch hardware operates in parallel
Multiple functions depending on instruction type

13.3.4 Stage 4: Memory Access (MEM)

Operations:

Load: Read from data memory
Store: Write to data memory
Other instructions: Skip (no memory access)
Branch: PC update decision

Hardware Elements:

Data Memory
PC source multiplexer (for branches)

Timing Consideration:

Memory access slowest operation
Dominates stage timing
Critical path component

13.3.5 Stage 5: Write Back (WB)

Operations:

Select data source (ALU result OR memory data)
Write to destination register
Load: Memory data → register
Arithmetic: ALU result → register

Hardware Elements:

MemtoReg multiplexer
Register file (write port)

Minimal Hardware:

Mostly multiplexer visible
Register file shared with ID stage
Shortest stage conceptually

13.4 Pipeline Registers: Necessity and Function

13.4.1 Problem Without Pipeline Registers

Scenario:

Multiple instructions in different stages
All sharing same hardware components
Data from different instructions

Example Issues:

Register file: ID stage reads while WB stage writes
Control signals: Generated in ID, needed in later stages
Data values: Computed in EX, needed in MEM
Overwriting: New instruction data overwrites previous instruction data

Result Without Pipeline Registers:

Data hazards everywhere
Control hazards from signal conflicts
Structural hazards from resource contention
Pipeline cannot function correctly

13.4.2 Pipeline Register Purpose

Pipeline Registers Between Pipeline Stages

Key Function:

Store information from previous stage
Make data available to next stage
Synchronize operations across clock cycles
Prevent interference between instructions

Placement:

One between each pair of consecutive stages
IF/ID: Between instruction fetch and decode
ID/EX: Between decode and execution
EX/MEM: Between execution and memory
MEM/WB: Between memory and write back

Exception:

No register between WB and IF
PC register serves this purpose
Register file contains storage
No additional register needed

13.4.3 Pipeline Register Contents

IF/ID Pipeline Register:

32-bit instruction word
32-bit PC+4 value
Total: 64 bits

ID/EX Pipeline Register:

Two 32-bit register values (from register file)
32-bit sign-extended immediate
32-bit PC+4 value (for branches)
5-bit write register address
9-10 control signal bits
Total: ~140+ bits (largest pipeline register)

EX/MEM Pipeline Register:

32-bit ALU result
32-bit register value (for stores)
32-bit branch target address
1-bit ALU zero flag
5-bit write register address
Control signals for MEM/WB stages
Total: ~105+ bits

MEM/WB Pipeline Register:

32-bit memory read data
32-bit ALU result
5-bit write register address
Control signals for WB stage
Total: ~75+ bits

13.4.4 Timing: Writing and Reading Pipeline Registers

At Rising Clock Edge:

Pipeline register write begins
Small hold time delay (~10-30 ps)
Data captured and stored
Writing delay consumed

After Writing:

Reading delay begins
Data propagates to output (~10-30 ps)
Outputs stabilize at new values
Next stage begins operations

Combined Overhead:

Write delay + read delay = ~20-60 ps
Occurs at start of every stage
Reduces time available for actual computation
Pipelining overhead cost

Critical Observation:

These delays don't exist in single-cycle
Pipelining adds latency overhead
But throughput gain outweighs latency cost

13.5 Load Word Instruction: Detailed Cycle-by-Cycle Analysis

13.5.1 Load Word Instruction Format

Encoding:

LW $rt, offset($rs)

Opcode: 100011 (bits 26-31)
RS:     Base register (bits 21-25)
RT:     Destination register (bits 16-20)
Offset: 16-bit immediate (bits 0-15)

Operation: $rt = Memory[$rs + offset]

Example: LW $8, 32($9)

Base address in $9
Add offset 32
Load from memory into $8

13.5.2 Clock Cycle 1: Instruction Fetch (IF)

Start of Cycle:

New PC value available (from previous cycle)
PC write delay: ~10-30 ps
PC read delay: ~10-30 ps

Operations:

Update PC register (rising edge)
Read PC value (small delay)
Access instruction memory with PC address
Instruction memory read delay: ~200 ps (dominant)
Compute PC + 4 in parallel: ~70 ps

End of Cycle:

32-bit LW instruction available
PC + 4 value available
Both ready to write to IF/ID register

Hardware Shading Convention:

Right side shaded: Device READ
Left side shaded: Device WRITTEN
Example: Instruction Memory right-side shaded (read)
IF/ID register left-side shaded (written to)

Total Stage Time: ~200+ ps (instruction memory dominant)

13.5.3 Clock Cycle 2: Instruction Decode / Register Read (ID)

Start of Cycle (Rising Edge):

IF/ID register write: ~30 ps
IF/ID register read: ~30 ps
Combined delay: ~60 ps

After Pipeline Register:

Instruction word available
Extract fields:

Opcode: bits 26-31 → Control Unit
RS (bits 21-25) → Register file address 1
RT (bits 16-20) → Register file address 2 AND write address
Offset (bits 0-15) → Sign extender

Parallel Operations:

Control Unit: Decode opcode → Generate control signals (~50 ps)
Register File: Read RS ($9) and RT ($8 address, value not needed)
- Read delay: ~90 ps (dominant)
Sign Extender: Extend 32 to 32 bits (~10 ps, negligible)

End of Cycle:

Base address value (from $9) available
RT address read (discarded for LW)
Sign-extended offset (32) available
PC + 4 value forwarded
Control signals generated
All ready for ID/EX register

Why Read Both Registers:

Hardware simplicity: Always read both
Multiplexer decides usage later
Store would need RT value
Simpler than conditional reading

Total Stage Time: ~60 + 90 = ~150 ps (register read dominant)

13.5.4 Clock Cycle 3: Execution (EX)

Start of Cycle:

ID/EX register write: ~30 ps
ID/EX register read: ~30 ps

ALU Input Preparation:

Input A: Base address (from $9) directly from pipeline register
Input B: Multiplexer selects immediate OR register

Control signal ALUSrc = 1 (select immediate)
Multiplexer delay: ~20 ps
Sign-extended offset (32) selected

ALU Operation:

Add base address + offset
ALU delay: ~90 ps (dominant)
Result: Memory address = $9 + 32

Parallel Operations (for branches, not used here):

Shift left 2: Offset × 4 (~10 ps)
Branch target adder: PC+4 + (offset×4) (~70 ps)
Zero flag generation

End of Cycle:

Memory address available at ALU output
Branch target available (unused)
Zero flag available (unused)
RT value forwarded (unused for LW)
Control signals forwarded
Write register address (RT) forwarded
All ready for EX/MEM register

Total Stage Time: ~30 + 30 + 20 + 90 = ~170 ps (ALU dominant)

13.5.5 Clock Cycle 4: Memory Access (MEM)

Start of Cycle:

EX/MEM register write: ~30 ps
EX/MEM register read: ~30 ps

Memory Access:

ALU result (address) → Data memory address input
MemRead control signal = 1 (enable read)
MemWrite control signal = 0 (disable write)
Data memory read delay: ~250 ps (DOMINANT - slowest operation!)

Parallel Operations (unused for LW):

Zero flag + Branch → PCSrc decision
Branch target → PC multiplexer

End of Cycle:

Loaded data available from memory
ALU result (address) forwarded for R-type instructions
Write register address (RT) forwarded
Control signals forwarded
All ready for MEM/WB register

Critical Path:

Load Word determines minimum clock period
Memory access slowest component
All other instructions wait for this

Total Stage Time: ~30 + 30 + 250 = ~310 ps (memory READ dominant!)

13.5.6 Clock Cycle 5: Write Back (WB)

Start of Cycle:

MEM/WB register write: ~30 ps
MEM/WB register read: ~30 ps

Data Selection:

MemtoReg multiplexer:

Control signal MemtoReg = 1 (select memory data)
Input 0: ALU result (not used for LW)
Input 1: Memory read data (SELECTED)
Multiplexer delay: ~20 ps

Register Write Preparation:

Write data: Memory data from multiplexer
Write address: RT ($8) from pipeline register
RegWrite control signal = 1 (enable write)

CRITICAL ERROR IN TEXTBOOK DIAGRAM:

Many diagrams show write address from IF/ID register
WRONG! IF/ID has current instruction (4 cycles later!)
Correct: Write address propagated through ALL pipeline registers
Must use write address from MEM/WB register

At Rising Edge (End of Cycle / Start of Next):

Register $8 written with loaded data
Write occurs in first half of cycle
Subsequent ID stage can read in second half (same cycle!)

Register File Timing Trick:

Write: First half of clock cycle
Read: Second half of clock cycle
Enables read-after-write in adjacent cycles
Critical for data forwarding

Total Stage Time: ~30 + 30 + 20 = ~80 ps (shortest stage!)

13.5.7 Load Word Complete Pipeline Summary

Cycle	Stage	Operations	Dominant Delay	Time
1	IF	Fetch instruction, PC+4	Inst Memory	200ps
2	ID	Decode, read regs, control	Reg Read	150ps
3	EX	ALU: base + offset	ALU	170ps
4	MEM	Read data memory	Memory Read	310ps ← CRITICAL!
5	WB	Select memory, write register	Multiplexer	80ps

Minimum Clock Period: 310 ps (limited by MEM stage)

Maximum Clock Frequency: 1 / 310ps ≈ 3.2 GHz

Pipeline Overhead:

Pipeline register delays: ~(30+30) × 5 stages = 300ps
Actual useful work: ~(200+90+90+250) = 630ps
Total latency: ~930ps
Overhead: ~32% of execution time

Comparison to Single-Cycle:

Single-cycle latency: ~800 ps (no pipeline register overhead)
Pipelined latency: ~930 ps (with overhead)
But pipelined throughput: 5× better (ideally)

13.6 Store Word Instruction: Key Differences

13.6.1 Store Word Instruction Format

Encoding:

SW $rt, offset($rs)

Opcode: 101011 (bits 26-31)
RS:     Base register (bits 21-25)
RT:     Source data register (bits 16-20)
Offset: 16-bit immediate (bits 0-15)

Operation: Memory[$rs + offset] = $rt

Example: SW $8, 32($9)

Base address in $9
Add offset 32
Store $8 value to memory

Key Difference from Load:

RT is SOURCE (not destination)
RT value needed for memory write

Stages IF, ID, EX: Same as Load Word

Instruction Fetch: Identical to LW

Instruction Decode: Identical to LW

Read both RS and RT
RT value NOW IMPORTANT (not discarded)
Sign-extend offset

Execution: Identical to LW

Compute memory address: base + offset
ALU operation same

13.6.2 Memory Access Stage: KEY DIFFERENCE

Start of Cycle:

EX/MEM register contains:

- Memory address (from ALU)
- RT data value (from register file, preserved through pipeline)

Memory Access:

Address → Data memory address input
RT value → Data memory write data input
MemWrite = 1 (ENABLE write)
MemRead = 0 (DISABLE read)

Operation:

Write RT value to computed address
Memory write delay: ~250 ps

End of Cycle:

Data written to memory
Memory read output INVALID (MemRead=0)
Not used by subsequent stage

Control Signal Critical:

Control Signal	Load	Store
MemRead	1	0
MemWrite	0	1
RegWrite (WB stage)	1	0 ← CRITICAL!

13.6.3 Write Back Stage: NO OPERATION

Store Word WB Stage:

NO register write needed
Store wrote to MEMORY (not register)
RegWrite = 0 (DISABLE)

Why RegWrite MUST Be 0:

Pipeline registers still contain data
MemtoReg multiplexer produces output
If RegWrite = 1: DISASTER!

- Random data written to random register
- Data corruption
- Program failure

Hardware Still Operates:

Multiplexer produces output (garbage)
Write address present (RT from pipeline)
Write data present (memory output = invalid, or ALU result)
But RegWrite = 0 prevents write

Lesson: Control Signals Essential

Must prevent unwanted operations
Hardware runs in parallel
Only control signals prevent corruption

Store Word Pipeline Summary:

Cycle	Stage	Operations	Notes
1	IF	Fetch SW instruction	Same as LW
2	ID	Decode, read RS, RT	RT value USED (not discarded)
3	EX	Compute address	Same as LW
4	MEM	Write RT to memory	WRITE instead of read
5	WB	Nothing (bubble)	RegWrite=0, stage idle

13.7 Common Pipeline Diagram Errors

13.7.1 Error 1: Write Register Address Source

Incorrect Diagram Shows:

Write register address from IF/ID pipeline register
Connected directly to register file write port

Why This Is Wrong:

IF/ID contains CURRENT instruction (just fetched)
Write back for instruction 4 cycles ago
Wrong register would be written!

Example:

Cycle 1: LW $8, 0($10) fetched  (IF)
Cycle 2: LW $9, 4($10) fetched  (IF), LW $8 in ID
Cycle 3: LW $10, 8($10) fetched (IF), LW $8 in EX
Cycle 4: ADD $11, $12, $13 fetched (IF), LW $8 in MEM
Cycle 5: SUB $14, $15, $16 fetched (IF), LW $8 in WB

At Cycle 5:
• IF/ID contains SUB (writes $14)
• WB should write $8 (from LW)
• If using IF/ID: Would write to $14 instead of $8!
• WRONG REGISTER!

Correct Implementation:

Propagate write address through ALL pipeline registers
ID/EX stores it
EX/MEM stores it
MEM/WB stores it
WB uses address from MEM/WB register

Additional Lines Required:

5-bit write address bus through each pipeline register
Increases pipeline register size
Essential for correctness

13.7.2 Error 2: Incorrect Memory Access Indication

Diagram Error from Textbook:

ADD instruction shown accessing data memory (wrong!)
LW instruction shown NOT accessing data memory (wrong!)

Correct Resource Usage:

Instruction	IF	ID	EX	MEM	WB
LW	✓	✓	✓	✓ Read	✓ Write Reg
SW	✓	✓	✓	✓ Write	No action
ADD	✓	✓	✓	✗ No access	✓ Write Reg
BEQ	✓	✓	✓	✗ PC update	✗ No write

Shading Convention:

Shaded box: Resource USED
Unshaded box: Resource NOT USED (idle)

ADD Instruction Correct:

MEM stage: No memory access, stage mostly idle
Just forwards ALU result

LW Instruction Correct:

MEM stage: Read from data memory
Memory data forwarded to WB

13.7.3 Error 3: Store Word Memory Read

Another Common Error:

Store instruction shown with MemRead = 1
Memory output shown as valid

Why Wrong:

Store WRITES to memory (MemWrite = 1)
Should NOT read (MemRead = 0)
Memory read output undefined/invalid

Correct:

MemWrite = 1, MemRead = 0
Memory input: Address and write data
Memory output: Ignored (invalid)

13.8 Multi-Clock-Cycle Pipeline Diagrams

13.8.1 Single-Clock vs Multi-Clock Diagrams

Single-Clock-Cycle Diagram:

Shows ONE stage at ONE clock cycle
Detailed resource usage
Specific delays visible
Good for understanding individual stage

Multi-Clock-Cycle Diagram:

Shows MULTIPLE instructions at MULTIPLE cycles
Cross-sectional view of pipeline
Parallel execution visible
Good for understanding overall flow

13.8.2 Traditional Multi-Cycle Diagram

Format:

Cycle:     1    2    3    4    5    6    7    8    9
Instr 1:   IF   ID   EX   MEM  WB
Instr 2:        IF   ID   EX   MEM  WB
Instr 3:             IF   ID   EX   MEM  WB
Instr 4:                  IF   ID   EX   MEM  WB
Instr 5:                       IF   ID   EX   MEM  WB

Shows:

Staggered execution
Steady state (cycle 5: all stages busy)
Pipeline fill time (cycles 1-4)
Pipeline drain time (cycles 7-9)

Does NOT Show:

Resource usage details
Hardware components used
Delays and timing

13.8.3 Enhanced Multi-Cycle Diagram with Resources

Format:

Cycle	Instr 1	Instr 2	Instr 3
1	[IM][RF][ ][ ][ ]	—	—
2	[ ][IM][RF][ ][ ]	[IM][RF][ ][ ][ ]	—
3	[ ][ ][IM][RF][ ]	[ ][IM][RF][ ][ ]	[IM][RF][ ][ ][ ]

Legend:

• IM: Instruction Memory (IF)
• RF: Register File (ID)
• ALU: ALU operation (EX)
• DM: Data Memory (MEM)
• WB: Write Back (WB)

Shows:

Which resources used when
Parallel resource usage
Resource conflicts (if any)
Detailed pipeline state

Benefits:

Visualize structural hazards
Understand resource contention
See idle hardware
Verify correctness

Textbook Error Example:

ADD instruction marked with DM (wrong!)
LW instruction NOT marked with DM (wrong!)
Always verify diagrams carefully

13.9 Timing and Clock Frequency Analysis

13.9.1 Component Delays (Typical Values)

Component	Delay (picoseconds)
Instruction Memory	200
Register File Read	90
Register File Write	90
ALU Operation	90
Data Memory Read	250
Data Memory Write	250
Sign Extension	10 (negligible)
Multiplexer	20
Adder (PC+4, branch)	70
Shift Left 2	10 (wire routing)
Pipeline Register Write	30
Pipeline Register Read	30

Key Observations:

Memory operations slowest (200-250 ps)
ALU and register file moderate (90 ps)
Small combinational logic fast (10-20 ps)
Pipeline register overhead (60 ps per stage)

13.9.2 Stage Timing Calculation

Stage 1: Instruction Fetch (IF)

Pipeline Register Write:   N/A (PC register)
Pipeline Register Read:    N/A
Instruction Memory:        200 ps
PC + 4 Adder:              70 ps (parallel)

Total: 200 ps (memory dominant)

Stage 2: Instruction Decode (ID)

IF/ID Write + Read:        60 ps
Register File Read:        90 ps (dominant)
Control Unit Decode:       50 ps (parallel)
Sign Extension:            10 ps (parallel)

Total: 60 + 90 = 150 ps

Stage 3: Execution (EX)

ID/EX Write + Read:        60 ps
Multiplexer:               20 ps
ALU Operation:             90 ps
Branch Adder:              70 ps (parallel)
Shift Left 2:              10 ps (parallel)

Total: 60 + 20 + 90 = 170 ps

Stage 4: Memory Access (MEM)

EX/MEM Write + Read:       60 ps
Data Memory Access:        250 ps (DOMINANT)

Total: 60 + 250 = 310 ps ← CRITICAL PATH!

Stage 5: Write Back (WB)

MEM/WB Write + Read:       60 ps
MemtoReg Multiplexer:      20 ps

Register File Write: 30 ps (first half of cycle)

Total: 60 + 20 + 30 = 110 ps

13.9.3 Clock Frequency Determination

Minimum Clock Period:

Determined by SLOWEST stage
MEM stage: 310 ps
All stages must use this period

Maximum Clock Frequency:

f_max = 1 / T_min
      = 1 / 310 ps
      = 1 / (310 × 10^-12 s)
      = 3.226 GHz
      ≈ 3.2 GHz

Efficiency Analysis:

Stage	Time	Utilization	Wasted Time
IF	200	65%	110 ps
ID	150	48%	160 ps
EX	170	55%	140 ps
MEM	310	100%	0 ps
WB	110	35%	200 ps

Average utilization: ~60%

Wasted time: ~40% average

13.9.4 Performance Improvement Strategies

Strategy 1: Pipeline Balancing

Reduce MEM stage delay (dominant)
Options:

- Faster memory technology
- Separate instruction/data caches
- Smaller, faster cache
- Multi-ported memory

Strategy 2: Increase ALU Time

Question: If ALU shortened by 25%, does it help?
Answer: Depends on critical path
If MEM is critical (usual case): NO improvement
If EX is critical (rare): YES, improves throughput

Strategy 3: Additional Pipeline Stages

Subdivide long stages (especially MEM)
Memory access in 2-3 sub-stages
Shorter clock period possible
More stages = more overhead
Diminishing returns beyond certain point

Strategy 4: Cache Memory

Fast cache between CPU and main memory
Cache hit: Fast access (~10-20 ps)
Cache miss: Slow access (~250 ps)
High hit rate → effective fast memory
(Covered in next lectures)

Real-World Example:

Intel Atom processors: ~30 pipeline stages
Achieved by extreme subdivision
Very short clock period
High frequency possible
But diminishing returns and hazard complexity

13.10 Practical Exercises and Solutions

13.10.1 Exercise: Maximum Clock Frequency Calculation

Given Component Delays:

Instruction Memory:      200 ps
Register File (read):    90 ps
Register File (write):   90 ps
ALU:                     90 ps
Data Memory (read):      250 ps
Data Memory (write):     250 ps
Sign Extend:             ~0 ps
Multiplexer:             20 ps
Adder:                   70 ps
Shift Left 2:            10 ps
Pipeline Register:       30 ps (write), 30 ps (read)

Step 1: Calculate each stage timing

IF: 200 + 60 (pipeline reg) = 260 ps
ID: 60 + 90 = 150 ps
EX: 60 + 20 + 90 = 170 ps
MEM: 60 + 250 = 310 ps ← CRITICAL
WB: 60 + 30 = 90 ps

Step 2: Identify critical path

Longest stage: MEM at 310 ps

Step 3: Calculate maximum frequency

f_max = 1 / 310 ps
      = 3.226 GHz

13.10.2 Exercise: Improving Clock Frequency

Question: Suggest mechanisms to increase clock frequency. Discuss negative impacts.

Suggestion 1: Faster Memory Technology

Use SRAM instead of DRAM
Reduce memory access time to ~100 ps
Pros:

- Significantly reduces critical path
- New critical path: IF at 260 ps
- Frequency increase: 310→260 (1.2× improvement)

Cons:

- SRAM very expensive
- Much larger area
- Higher power consumption
- Limited capacity

Suggestion 2: Cache Memory (BEST)

Add small, fast cache
Cache access: ~50-100 ps
Most accesses hit cache
Pros:

- Cost-effective
- Good performance
- Scalable
- Industry standard

Cons:

- Cache misses still slow
- Complex cache management
- Additional hardware

Suggestion 3: Split Memory Stage

Divide MEM into MEM1 and MEM2
Each sub-stage: 185 ps
Total stages: 6
Pros:

- More balanced pipeline
- Higher frequency possible

Cons:

- More pipeline registers (overhead)
- Increased latency
- More complex control

Suggestion 4: Eliminate Pipeline Register Overhead

Use transparent latches
Reduce write+read delay
Pros:

- Removes 60 ps overhead per stage
- Significant improvement

Cons:

- Timing more complex
- Clock skew issues
- Less reliable

13.10.3 Exercise: ALU Optimization Impact

Question: ALU time shortened by 25%. Does it affect speedup?

Analysis:

Current ALU delay: 90 ps
Reduced ALU delay: 90 × 0.75 = 67.5 ps
Savings: 22.5 ps

Scenario 1: MEM is Critical Path (Typical)

Current EX stage: 60 + 20 + 90 = 170 ps
Optimized EX stage: 60 + 20 + 67.5 = 147.5 ps
Current critical path: MEM at 310 ps
New critical path: Still MEM at 310 ps
Clock period: Still 310 ps
Speedup: NONE

Conclusion: No improvement when not on critical path

Scenario 2: EX is Critical Path (Hypothetical)

Assume faster memory: MEM = 200 ps
Current EX stage: 170 ps (critical)
Optimized EX stage: 147.5 ps
New critical path: EX at 147.5 ps
Clock period: 170 → 147.5 ps
Improvement: 1.15× faster

Conclusion: Significant improvement when on critical path

General Principle:

Only optimizing critical path improves throughput
Non-critical optimizations: No throughput benefit
May reduce latency slightly (instruction-by-instruction)

13.10.4 Exercise: Pipeline Speedup Calculation

Given:

10^7 instructions (10 million)
Non-pipelined: 100 ps per instruction
Perfect 20-stage pipeline

Part A: Non-pipelined execution time

Time = Instructions × Time per instruction
     = 10^7 × 100 ps
     = 10^9 ps
     = 1 ms (0.001 seconds)

Part B: Speedup from 20-stage perfect pipeline

Ideal Speedup = Number of stages = 20×

Part C: Time with perfect pipeline

Time = (10^7 × 100 ps) / 20
     = 10^9 / 20 ps
     = 5 × 10^7 ps
     = 0.05 ms

Part D: Real pipeline overhead impact

Overheads affect both latency AND throughput
Pipeline register delays: Add to latency
Unbalanced stages: Reduce throughput
Hazards and stalls: Reduce throughput further

Answer: BOTH latency and throughput affected

Latency Impact:

Pipeline register overhead adds to per-instruction time
100 ps → ~130 ps per instruction (with overhead)

Throughput Impact:

Unbalanced stages reduce effective speedup
Perfect 20× becomes ~15-17× in reality
Critical path limits clock speed

13.11 Summary and Key Takeaways

13.11.1 Pipeline Operation Fundamentals

Pipeline Registers Essential:

Synchronize operations across stages
Store intermediate values
Prevent data interference
Enable parallel execution

Timing Critical:

Write delay + read delay at every stage
Pipeline register overhead significant
Critical path determines clock period
Throughput limited by slowest stage

13.11.2 Design Principles

Make Common Case Fast:

Memory accesses most critical
Optimize memory access time first
Cache memory industry solution

Balance Pipeline Stages:

Even workload distribution
Minimize wasted time
Maximize efficiency

Control Signals Matter:

Prevent unwanted operations
Propagate through pipeline
Essential for correctness

13.11.3 Common Mistakes to Avoid

Write Register Address:

Must propagate through ALL pipeline registers
Cannot use current instruction's address
4-cycle delay between fetch and write back

Control Signal Errors:

RegWrite must be 0 for store/branch
MemRead/MemWrite must be mutually exclusive
Incorrect signals cause data corruption

Diagram Interpretation:

Verify resource usage carefully
Textbooks contain errors
Understand shading conventions

13.11.4 Performance Considerations

Critical Path Analysis:

Identify slowest stage
Optimize critical path components
Non-critical optimizations don't help throughput

Speedup Limitations:

Ideal speedup = number of stages
Actual speedup < ideal
Reasons:

- Pipeline register overhead
- Unbalanced stages
- Hazards and stalls
- Pipeline fill/drain time

13.11.5 Looking Ahead

Memory Hierarchy (Next Topics):

Cache memory introduction
Memory performance optimization
Cache design and organization
Virtual memory
Performance bottleneck solutions

Real-World Pipelines:

10-30 stages common
Superscalar (multiple issue)
Out-of-order execution
Speculative execution
Branch prediction sophistication

13.12 Important Formulas

Clock Period

T_clock = max(T_IF, T_ID, T_EX, T_MEM, T_WB)

Where each T_stage includes:
• Pipeline register write delay
• Pipeline register read delay
• Dominant component delay

Maximum Frequency

f_max = 1 / T_clock

Pipeline Speedup

Speedup = T_non-pipelined / T_pipelined_steady_state
        ≈ Number of stages (ideal)
        < Number of stages (actual)

Stage Timing General Formula

T_stage = T_pipe_write + T_pipe_read + T_dominant_component + T_other_parallel

Where parallel components don't add (take maximum)

Throughput

Throughput = 1 instruction / T_clock (steady state)

Latency

Latency = (Number of stages) × T_clock + Pipeline overhead

Key Takeaways

Four pipeline registers separate five stages: IF/ID, ID/EX, EX/MEM, MEM/WB store all information needed by subsequent stages.
Pipeline registers capture data and control signals—instruction fields, register values, ALU results, memory data, and control bits all propagate through pipeline.
Each register updates on clock edge—enabling clean separation between pipeline stages and preventing data corruption from simultaneous operations.
Load instruction takes 5 cycles to complete—IF (fetch), ID (decode/read), EX (address calc), MEM (read memory), WB (write register).
Store instruction uses 4 active stages—skips WB stage since no register write occurs, but occupies pipeline for 5 cycles.
Instruction and data must travel together—control signals propagate alongside data through pipeline to ensure correct operations at later stages.
Register file has two write ports and three read ports in practice—enabling simultaneous read in ID and write in WB stages.
Forwarding paths bypass pipeline registers—directly connecting EX/MEM and MEM/WB outputs to ALU inputs for data hazard resolution.
Load-use hazard requires pipeline stall—memory data not available until MEM/WB register, too late for immediate ALU use even with forwarding.
Clock frequency = 1 / (Register Delay + Maximum Stage Delay)—pipeline register overhead reduces frequency below ideal calculation.
Pipeline registers introduce 20-50 ps overhead per stage—must account for setup/hold times and propagation delays in timing analysis.
Stage delays must balance for optimal performance—uneven stages waste time as clock period determined by slowest stage.
Separate instruction and data caches essential—prevent structural hazards from simultaneous IF and MEM stage memory access.
Pipeline depth tradeoff: Deeper pipelines increase clock frequency but amplify hazard penalties and register overhead.
Write-back stage coincides with fetch of fifth instruction—demonstrating true parallelism with five instructions in pipeline simultaneously.
Control signals generated in ID stage propagate through pipeline with instruction—EX/MEM/WB stages use stored control bits.
ALU result available in EX stage can forward to dependent instruction in EX stage—eliminating most RAW hazard stalls.
Memory data available in MEM stage can forward to dependent instruction in EX stage—but not soon enough for load-use case.
Throughput approaches 1 instruction per cycle in steady state—achieving near 5× speedup over single-cycle design.
Pipeline timing analysis critical for clock frequency determination—must consider all delay components including registers, logic, and wire delays.

Summary

The detailed examination of MIPS pipeline operation reveals the sophisticated hardware mechanisms that enable efficient instruction-level parallelism through careful staging and register design. Four pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) serve as the critical infrastructure separating five pipeline stages, capturing and propagating not only instruction data but also all control signals needed by downstream stages. The cycle-by-cycle analysis of load and store instructions demonstrates how each pipeline stage performs its designated function while simultaneously handling different instructions—instruction fetch occurring for instruction N while instruction N-1 decodes, N-2 executes, N-3 accesses memory, and N-4 writes back results. This true parallelism, with five instructions simultaneously occupying different pipeline stages, achieves the dramatic throughput improvement that justifies pipeline complexity. The timing analysis introduces crucial practical considerations: pipeline registers add 20-50 picoseconds overhead per stage, stage delays must balance to avoid wasting clock cycles, and clock frequency equals the reciprocal of register delay plus maximum stage delay. Forwarding paths that bypass pipeline registers—connecting EX/MEM and MEM/WB outputs directly to ALU inputs—eliminate most data hazard stalls by making results available before register write-back completes, though load-use hazards still require one-cycle stalls since memory data arrives too late even with forwarding. The register file's dual-port design enables simultaneous reading in ID stage and writing in WB stage, essential for maintaining pipeline flow. Practical exercises in clock frequency calculation reinforce understanding of how component delays, register overhead, and stage balancing determine ultimate processor performance. The separation of instruction and data caches emerges as non-negotiable requirement, preventing structural hazards from simultaneous memory access in IF and MEM stages. This comprehensive pipeline view—from register-level mechanisms through timing analysis to performance optimization—provides essential foundation for understanding real processor implementations and the engineering tradeoffs between pipeline depth, clock frequency, hazard penalties, and design complexity that characterize modern computer architecture.

← Previous Lecture Next Lecture →