Lecture 11: Single-Cycle Execution

Lectures on Computer Architecture

By Dr. Isuru Nawinne

11.1 Introduction

This lecture completes the single-cycle MIPS processor design by providing comprehensive analysis of control signals for all instruction types (R-type, Branch, Load, Store, Jump), introducing detailed timing analysis with concrete delay values, and demonstrating the fundamental performance limitations that motivate the evolution toward multi-cycle and pipelined implementations. We build upon previous datapath and control unit knowledge to create a functioning processor while understanding why single-cycle design, though conceptually simple, proves inefficient in practice.

11.2 Lecture Overview and Context

11.2.1 Recap from Previous Lectures

The foundational work completed in previous lectures includes:

Completed Topics:

Current Focus:

11.2.2 Instruction Subset Review

Selected Instructions for Study:

Coverage:

11.3 Control Unit Inputs and Outputs

11.3.1 Control Unit Inputs

Total Input Bits: 12 bits

Primary Input - Opcode (6 bits):

Secondary Input - Funct Field (6 bits):

Usage Pattern:

11.3.2 Control Unit Outputs

Total Output Bits: 9 bits (8 signals, one is 2-bit)

Control Signals Generated:

  1. RegDst (1 bit): Select register write address
  2. Branch (1 bit): Instruction is branch type
  3. MemRead (1 bit): Enable memory read
  4. MemtoReg (1 bit): Select register write data source
  5. MemWrite (1 bit): Enable memory write
  6. ALUSrc (1 bit): Select ALU second operand source
  7. RegWrite (1 bit): Enable register file write
  8. ALUOp (2 bits): ALU operation category

Additional Signal for Jump:

  1. Jump (1 bit): Select jump target for PC

Implementation:

11.4 R-Type Instruction Detailed Analysis

11.4.1 Instruction Format

Encoding Structure (32 bits):

Example: ADD $1, $2, $3

Encoding: 000000 00010 00011 00001 00000 100000
         |Opcode| RS | RT | RD |SHAMT| Funct |
         |   0  |  2 |  3 |  1 |  0  |  32   |

Operation: $1 = $2 + $3
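
The field layout above can be checked with a few shifts and masks. The following is a minimal sketch; the helper names `encode_r_type` and `decode_r_type` are hypothetical and chosen only for illustration.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack R-type fields into a 32-bit instruction word (opcode is 0)."""
    fields = (("rs", rs, 5), ("rt", rt, 5), ("rd", rd, 5),
              ("shamt", shamt, 5), ("funct", funct, 6))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    opcode = 0
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


def decode_r_type(word: int) -> dict:
    """Split a 32-bit word back into its R-type fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


# ADD $1, $2, $3: RS = 2, RT = 3, RD = 1, SHAMT = 0, Funct = 32
word = encode_r_type(rs=2, rt=3, rd=1, shamt=0, funct=32)
print(f"{word:032b}")  # 00000000010000110000100000100000
print(decode_r_type(word))
```

Note that the destination register RD sits in the middle of the word, after both source registers, even though it is written first in the assembly syntax.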

11.4.2 Datapath Elements Used

R-Type Instruction Datapath

Active Elements (shown in black):

Inactive Elements (grayed out):

11.4.3 Control Signal Values for R-Type

Exercise Example: ADD $1, $2, $3

Signal | Value | Reason
RegDst | 1 | Write to RD (bits 11-15), not RT
Branch | 0 | Not a branch instruction
MemRead | 0 | Not reading from memory
MemtoReg | 0 | Write ALU result (not memory data)
ALUOp | 10 | R-type: Consult funct field
MemWrite | 0 | Not writing to memory
ALUSrc | 0 | Second operand from register RT (not immediate)
RegWrite | 1 | Write result to destination register
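
ALUOp = 10 tells the ALU control logic to consult the funct field, while the other ALUOp values used in this lecture (00 for address calculation, 01 for branch comparison) fix the operation directly. A minimal sketch of that second-level decode follows. The function name `alu_control` and the string operation names are illustrative assumptions; real hardware would emit encoded ALU select lines. The funct codes are the standard MIPS values for ADD, SUB, AND and OR.

```python
# Standard MIPS funct codes for the four arithmetic/logic instructions
FUNCT_TO_OPERATION = {
    0b100000: "add",  # funct 32
    0b100010: "sub",  # funct 34
    0b100100: "and",  # funct 36
    0b100101: "or",   # funct 37
}


def alu_control(alu_op: int, funct: int) -> str:
    """Choose the ALU operation from the 2-bit ALUOp and the 6-bit funct."""
    if alu_op == 0b00:   # load/store: always add base and offset
        return "add"
    if alu_op == 0b01:   # branch: subtract to compare the two registers
        return "sub"
    if alu_op == 0b10:   # R-type: the funct field decides
        if funct not in FUNCT_TO_OPERATION:
            raise ValueError(f"unsupported funct {funct:06b}")
        return FUNCT_TO_OPERATION[funct]
    raise ValueError(f"unsupported ALUOp {alu_op:02b}")


print(alu_control(0b10, 0b100000))  # add  (ADD $1, $2, $3)
print(alu_control(0b01, 0b000000))  # sub  (funct is ignored for branches)
```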

Detailed Explanations:

RegDst = 1:

Branch = 0:

MemRead = 0, MemWrite = 0:

MemtoReg = 0:

ALUOp = 10 (binary):

ALUSrc = 0:

RegWrite = 1:

11.4.4 Execution Steps for R-Type

Step 1: Instruction Fetch

Step 2: Control Signal Generation

Step 3: Register Read

Step 4: ALU Operation

Step 5: Register Write Preparation

Step 6: Clock Edge Actions

11.5 Branch If Equal Instruction Detailed Analysis

11.5.1 Instruction Format

Encoding Structure (32 bits):

Example: BEQ $1, $2, 100

Encoding: 000100 00001 00010 0000000001100100
         |Opcode|  RS  |  RT  |    Immediate    |
         |   4  |   1  |   2  |       100       |

Operation: If ($1 == $2) then PC = PC + 4 + (100 × 4)

11.5.2 Datapath Elements Used

Branch If Equal Instruction Datapath

Active Elements:

Inactive Elements:

11.5.3 Control Signal Values for BEQ

Exercise Example: BEQ $1, $2, 100

Signal | Value | Reason
RegDst | X | Don't care (not writing to register)
Branch | 1 | This IS a branch instruction
MemRead | 0 | Not reading from memory
MemtoReg | X | Don't care (not writing to register)
ALUOp | 01 | Perform SUBTRACT for comparison
MemWrite | 0 | Not writing to memory
ALUSrc | 0 | Compare two register values (not immediate)
RegWrite | 0 | Not writing to register file

Detailed Explanations:

RegDst = X (Don't Care):

Branch = 1:

MemRead = 0, MemWrite = 0:

MemtoReg = X (Don't Care):

ALUOp = 01:

ALUSrc = 0:

RegWrite = 0:

11.5.4 Branch Target Calculation

Word Offset to Byte Offset:

Branch Target Address:

Example:

PCSrc Selection:

PCSrc = Branch AND Zero
      = 1 AND (RS == RT ? 1 : 0)

If PCSrc = 1: PC ← Branch Target (1404)
If PCSrc = 0: PC ← PC + 4 (1004)
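
The next-PC selection for BEQ can be sketched as below. The starting value PC = 1000 is an assumption inferred from the 1004 and 1404 values above, and the function names are hypothetical.

```python
def sign_extend_16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(pc: int, rs_value: int, rt_value: int, imm16: int) -> int:
    """Return the next PC for BEQ, mirroring PCSrc = Branch AND Zero."""
    branch = 1                                      # control signal for BEQ
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    byte_offset = sign_extend_16(imm16) * 4         # word offset to bytes
    branch_target = (pc_plus_4 + byte_offset) & 0xFFFFFFFF
    zero = 1 if (rs_value - rt_value) & 0xFFFFFFFF == 0 else 0  # ALU flag
    pc_src = branch & zero
    return branch_target if pc_src else pc_plus_4


print(next_pc_beq(1000, 7, 7, 100))  # 1404 (registers equal, branch taken)
print(next_pc_beq(1000, 7, 8, 100))  # 1004 (registers differ, fall through)
```

Because the immediate is sign-extended before the shift, a negative offset branches backward, which is how loops are encoded.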

11.6 Load Word Instruction Detailed Analysis

11.6.1 Instruction Format

Encoding Structure (32 bits):

Example: LW $8, 32($9)

Encoding: 100011 01001 01000 0000000000100000
         |Opcode|  RS  |  RT  |    Immediate    |
         |  35  |   9  |   8  |       32        |

Operation: $8 = Memory[$9 + 32]
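
As a rough behavioral sketch of this operation, the code below decodes the example word, sign-extends the offset, forms the address and copies the memory word into RT. The register contents, the memory contents and the name `execute_lw` are made up for illustration; memory is modeled as a dictionary keyed by byte address.

```python
def sign_extend_16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def execute_lw(word: int, regs: list, memory: dict) -> int:
    """Execute one LW instruction and return the effective address."""
    opcode = (word >> 26) & 0x3F
    if opcode != 35:
        raise ValueError("not a LW instruction")
    rs = (word >> 21) & 0x1F                  # base register
    rt = (word >> 16) & 0x1F                  # destination register
    offset = sign_extend_16(word & 0xFFFF)
    address = (regs[rs] + offset) & 0xFFFFFFFF   # ALU adds base + offset
    regs[rt] = memory[address]                   # memory data goes to RT
    return address


regs = [0] * 32
regs[9] = 0x1000               # assumed base address in $9
memory = {0x1020: 42}          # assumed word stored at 0x1000 + 32
address = execute_lw(0x8D280020, regs, memory)   # LW $8, 32($9)
print(hex(address), regs[8])   # 0x1020 42
```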

11.6.2 Datapath Elements Used

Load Word Instruction Datapath

Active Elements:

Inactive Elements:

11.6.3 Control Signal Values for LW

Exercise Example: LW $8, 32($9)

Signal | Value | Reason
RegDst | 0 | Write to RT (bits 16-20), not RD
Branch | 0 | Not a branch instruction
MemRead | 1 | Reading from data memory
MemtoReg | 1 | Write memory data (not ALU result)
ALUOp | 00 | Perform ADD for address calculation
MemWrite | 0 | Not writing to memory (reading only)
ALUSrc | 1 | Add immediate offset (not register)
RegWrite | 1 | Write loaded data to destination register

Detailed Explanations:

RegDst = 0:

Branch = 0:

MemRead = 1:

MemtoReg = 1:

ALUOp = 00:

MemWrite = 0:

ALUSrc = 1:

RegWrite = 1:

11.6.4 Critical Path for Load Word

Longest Delay in Single-Cycle:

  1. Instruction Memory read
  2. Register File read (base address)
  3. Sign Extension
  4. ALU address calculation
  5. Data Memory read
  6. Register write setup

Load Word is the slowest instruction!

11.7 Store Word Instruction Detailed Analysis

11.7.1 Instruction Format

Encoding Structure (32 bits):

Example: SW $8, 32($9)

Encoding: 101011 01001 01000 0000000000100000
         |Opcode|  RS  |  RT  |    Immediate    |
         |  43  |   9  |   8  |       32        |

Operation: Memory[$9 + 32] = $8

Note: Fixed error in lecture (was "$32", should be "32")

11.7.2 Datapath Elements Used

Active Elements:

Inactive Elements:

Key Difference from Load:

11.7.3 Control Signal Values for SW

Exercise Example: SW $8, 32($9)

Signal | Value | Reason
RegDst | X | Don't care (not writing to register)
Branch | 0 | Not a branch instruction
MemRead | 0 | Not reading from memory (writing)
MemtoReg | X | Don't care (not writing to register)
ALUOp | 00 | Perform ADD for address calculation
MemWrite | 1 | Writing to data memory
ALUSrc | 1 | Add immediate offset
RegWrite | 0 | Not writing to register file

Detailed Explanations:

RegDst = X (Don't Care):

CRITICAL: RegWrite = 0:

Why It Matters:

MemRead = 0, MemWrite = 1:

MemtoReg = X (Don't Care):

ALUOp = 00:

ALUSrc = 1:

11.7.4 Important Lesson: Don't Care vs Zero

Student Confusion:

"RegDst = 0 is not wrong, but best answer is X"

Clarification:

However:

11.8 Jump Instruction Integration

11.8.1 Instruction Format

Encoding Structure (32 bits):

Alternative: JAL (Jump and Link)

Example: J 100

Encoding: 000010 00000000000000000001100100
         |Opcode|     Target Address       |
         |   2  |          100             |

Operation: PC = {PC+4[31:28], Address, 2'b00}

11.8.2 Jump Target Address Calculation

Word Address to Byte Address:

Upper 4 Bits:

Concatenation:

PC+4:        [31:28]  [27:2]  [1:0]
                ↓                (ignored)
Jump Target: [31:28] [Target×4] [00]
                ↑         ↑        ↑
              From     From    Append
              PC+4  instruction zeros
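
The concatenation amounts to one mask, one shift and one OR. A minimal sketch, with a hypothetical function name and example PC values chosen only for illustration:

```python
def jump_target(pc: int, target26: int) -> int:
    """Form the 32-bit jump address {PC+4[31:28], target, 2'b00}."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper_bits = pc_plus_4 & 0xF0000000          # bits 31:28 of PC+4
    byte_offset = (target26 & 0x03FFFFFF) << 2   # word address to bytes
    return upper_bits | byte_offset


print(hex(jump_target(0x00400000, 100)))  # 0x190
print(hex(jump_target(0x10000000, 100)))  # 0x10000190
```

The same target field lands in a different 256 MB region depending on where the jump instruction itself sits, because the upper four bits always come from PC+4.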

Example:

Limitation:

11.8.3 Additional Datapath Hardware

Jump Instruction Datapath with Additional Hardware

New Components:

Shift Left 2 (for jump):

Concatenation Logic:

New Multiplexer:

Original PC Source Mux:

New Jump Mux (outer):

11.8.4 Jump Control Signal

Jump Signal:

Values:

Other Control Signals for Jump:

Signal | Value | Reason
RegDst | X | Don't care
Branch | 0 | Not a branch (different mechanism)
MemRead | 0 | Not accessing memory
MemtoReg | X | Don't care
ALUOp | XX | Don't care (ALU not used)
MemWrite | 0 | Not writing memory
ALUSrc | X | Don't care
RegWrite | 0 | Not writing register (J instruction)
Jump | 1 | This IS a jump instruction
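
Collecting the per-instruction tables gives the complete main-control truth table. The sketch below stores it as a lookup keyed by opcode, with `None` standing in for a don't-care (X). Two things here are assumptions rather than statements from the lecture: Jump = 0 for every non-jump instruction, and the names `CONTROL_TABLE` and `main_control`.

```python
X = None  # don't care

SIGNALS = ("RegDst", "Branch", "MemRead", "MemtoReg", "ALUOp",
           "MemWrite", "ALUSrc", "RegWrite", "Jump")

# Rows follow the order of SIGNALS above.
CONTROL_TABLE = {
    0b000000: (1, 0, 0, 0, 0b10, 0, 0, 1, 0),  # R-type
    0b000100: (X, 1, 0, X, 0b01, 0, 0, 0, 0),  # BEQ
    0b100011: (0, 0, 1, 1, 0b00, 0, 1, 1, 0),  # LW
    0b101011: (X, 0, 0, X, 0b00, 1, 1, 0, 0),  # SW
    0b000010: (X, 0, 0, X, X,    0, X, 0, 1),  # J
}


def main_control(opcode: int) -> dict:
    """Return the control signals for a 6-bit opcode."""
    if opcode not in CONTROL_TABLE:
        raise ValueError(f"unsupported opcode {opcode:06b}")
    return dict(zip(SIGNALS, CONTROL_TABLE[opcode]))


# Signals that change state or redirect the PC must never be don't-care.
for opcode in CONTROL_TABLE:
    signals = main_control(opcode)
    assert all(signals[name] is not X
               for name in ("RegWrite", "MemWrite", "Branch", "Jump"))

print(main_control(0b101011))  # SW: RegWrite = 0, MemWrite = 1
```

The loop at the end captures the rule behind the don't-care entries: X is acceptable only where either value leaves registers, memory and the PC unaffected.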

Note: JAL (Jump and Link) is different:

11.8.5 Complete Datapath with Jump

All Instruction Types Supported:

Coverage:

Datapath Completeness:

11.9 Timing Analysis with Concrete Delays

11.9.1 Assumed Component Delays

Delay Values (in nanoseconds):

Component | Delay | Notes
Instruction Memory | 2 ns | Read instruction at PC address
Register File (Read) | 1 ns | Output data after address change
Register File (Write) | 1 ns | At clock edge (next cycle)
Sign Extender | ~0 ns | Negligible (wire replication)
Multiplexers | ~0 ns | Negligible compared to other delays
ALU Operation | 2 ns | Arithmetic/logic/comparison
Data Memory (Read) | 2 ns | Output data after address provided
Data Memory (Write) | 2 ns | At clock edge (next cycle)
PC+4 Adder | 2 ns | Simple addition
Branch Target Adder | 2 ns | Addition with offset

Assumptions:

11.9.2 Critical Path Analysis

Definition:

Single-Cycle Constraint:

11.9.3 Load Word Instruction Timing

Step-by-Step Delay Calculation:

Step 1: Instruction Fetch (2 ns)

Step 2: Register Read (1 ns)

Step 3: Sign Extension (~0 ns)

Step 4: ALU Address Calculation (2 ns)

Step 5: Memory Read (2 ns)

Step 6: Register Write Setup (~0 ns)

Clock Edge: Register Write (next cycle)

Minimum Clock Period: 7 nanoseconds

Maximum Clock Frequency: 1 / (7 ns) ≈ 143 MHz

Load Word is Critical Path!

11.9.4 Store Word Instruction Timing

Step-by-Step Delay:

  1. Instruction Fetch: 2 ns (total: 2 ns)
  2. Register Read: 1 ns (total: 3 ns)
    • Read RS (base) AND RT (data)
  3. Sign Extension: ~0 ns (total: 3 ns)
  4. ALU Address Calculation: 2 ns (total: 5 ns)
  5. Memory Write Setup: ~0 ns (total: 5 ns)
    • Address and data ready at memory inputs

Clock Edge: Memory Write (end of cycle)

Minimum Time Required: 5 nanoseconds

Note:

11.9.5 Arithmetic Instruction Timing (ADD, SUB, AND, OR)

Step-by-Step Delay:

  1. Instruction Fetch: 2 ns (total: 2 ns)
  2. Register Read: 1 ns (total: 3 ns)
    • Read RS and RT
  3. ALU Operation: 2 ns (total: 5 ns)
    • Perform arithmetic/logic operation
  4. Register Write Setup: ~0 ns (total: 5 ns)
    • ALU result ready at register write data input

Clock Edge: Register Write

Minimum Time Required: 5 nanoseconds

Efficiency Loss:

11.9.6 Branch Instruction Timing

Step-by-Step Delay:

  1. Instruction Fetch: 2 ns (total: 2 ns)
  2. Register Read: 1 ns (total: 3 ns)
    • Read RS and RT for comparison
  3. ALU Comparison: 2 ns (total: 5 ns)
    • Subtract RS - RT
    • Generate Zero flag
  4. Branch Target Calculation: 2 ns (parallel with ALU)
    • Sign extend offset: ~0 ns
    • Shift left 2: ~0 ns (wire routing)
    • Add to PC+4: 2 ns
    • Can happen in parallel with ALU operation!
  5. PC Update Setup: ~0 ns (total: 5 ns)
    • Zero flag + Branch → PCSrc
    • Multiplexer selects next PC
    • Ready for clock edge

Minimum Time Required: 5 nanoseconds

Key Insight:

11.9.7 Jump Instruction Timing

Step-by-Step Delay:

  1. Instruction Fetch: 2 ns (total: 2 ns)
    • Also calculates PC+4 in parallel
  2. Jump Target Calculation: ~0 ns
    • Extract 26-bit target
    • Shift left 2: Wire routing, ~0 ns
    • Concatenate with PC+4[31:28]: Wire connection, ~0 ns
    • No ALU, no memory, no registers!
  3. PC Update Setup: ~0 ns (total: 2 ns)

Minimum Time Required: 2 nanoseconds

Fastest Instruction:

11.9.8 Timing Summary Table

Instruction Type | Time Required | Wasted Time | Efficiency
Load Word (LW) | 7 ns | 0 ns | 100%
Store Word (SW) | 5 ns | 2 ns | 71.4%
R-type (ADD, etc.) | 5 ns | 2 ns | 71.4%
Branch (BEQ) | 5 ns | 2 ns | 71.4%
Jump (J) | 2 ns | 5 ns | 28.6%

Clock Period (Single-Cycle): 7 ns (determined by LW)

Clock Frequency: ~143 MHz
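
The summary figures follow directly from the assumed component delays. In the sketch below each instruction's time is the sum of the delays along its longest path, with the branch taking the slower of its two parallel paths. The dictionary keys are illustrative names, and the ~0 ns components (sign extender, multiplexers, shifts) are left out.

```python
# Component delays in nanoseconds, as assumed in this lecture
DELAY_NS = {
    "instr_mem": 2,
    "reg_read": 1,
    "alu": 2,
    "data_mem_read": 2,
    "branch_adder": 2,
}
d = DELAY_NS

time_ns = {
    "lw": d["instr_mem"] + d["reg_read"] + d["alu"] + d["data_mem_read"],
    "sw": d["instr_mem"] + d["reg_read"] + d["alu"],
    "r-type": d["instr_mem"] + d["reg_read"] + d["alu"],
    # Register read + ALU compare runs in parallel with the target adder.
    "beq": d["instr_mem"] + max(d["reg_read"] + d["alu"], d["branch_adder"]),
    "j": d["instr_mem"],
}

clock_period_ns = max(time_ns.values())      # the slowest instruction wins
max_frequency_mhz = 1000 / clock_period_ns   # 1 / (period in ns), in MHz

for name, t in time_ns.items():
    wasted = clock_period_ns - t
    efficiency = 100 * t / clock_period_ns
    print(f"{name:7s} {t} ns  wasted {wasted} ns  efficiency {efficiency:.1f}%")
print(f"clock period {clock_period_ns} ns, about {max_frequency_mhz:.0f} MHz")
```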

Performance Impact:

11.10 Performance Analysis

11.10.1 Program Composition Example

Typical MIPS Program Profile:

Instruction Type | Percentage | Time if Variable | Time (Fixed 7 ns)
Arithmetic | 48% | 5 ns | 7 ns
Load Word | 22% | 7 ns | 7 ns
Store Word | 11% | 5 ns | 7 ns
Branch | 19% | 5 ns | 7 ns

11.10.2 Average Time Calculation

Variable Time (Ideal):

Average = (0.48 × 5) + (0.22 × 7) + (0.11 × 5) + (0.19 × 5)
        = 2.40 + 1.54 + 0.55 + 0.95
        = 5.44 ns per instruction

Single-Cycle (Actual):

Average = 7 ns per instruction (all instructions)

Performance Loss:

Overhead = 7 - 5.44 = 1.56 ns per instruction
Efficiency = 5.44 / 7 = 77.7%
Waste = 22.3% of time
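
The same arithmetic as a short script, which makes it easy to try other instruction mixes. The variable names are illustrative; the percentages and times are the ones from the program profile above.

```python
# Instruction mix and per-instruction times from the program profile
mix = {"arithmetic": 0.48, "load": 0.22, "store": 0.11, "branch": 0.19}
time_ns = {"arithmetic": 5, "load": 7, "store": 5, "branch": 5}
CLOCK_PERIOD_NS = 7

# Ideal case: every instruction takes only as long as it needs.
ideal_average_ns = sum(mix[kind] * time_ns[kind] for kind in mix)

overhead_ns = CLOCK_PERIOD_NS - ideal_average_ns
efficiency = ideal_average_ns / CLOCK_PERIOD_NS

print(f"ideal average: {ideal_average_ns:.2f} ns")  # 5.44 ns
print(f"overhead:      {overhead_ns:.2f} ns")       # 1.56 ns
print(f"efficiency:    {efficiency:.1%}")           # 77.7%
```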

11.10.3 Critical Path Problem

Critical Path Determination:

  1. Instruction Memory
  2. Register File
  3. ALU
  4. Data Memory
  5. (Register Write in next cycle)

Design Principle Violation:

11.10.4 Clock Period Inflexibility

Single-Cycle Constraint:

Implications:

Efficiency by Instruction:

Instruction | Efficiency | Waste
Jump | 28.6% | 71.4%
Arithmetic | 71.4% | 28.6%
Store | 71.4% | 28.6%
Branch | 71.4% | 28.6%
Load | 100.0% | 0%

11.11 Path to Better Performance: Multi-Cycle Design

11.11.1 Multi-Cycle Concept

Basic Idea:

Advantages:

11.11.2 Stage Division

Typical Stages:

Stage 1: Instruction Fetch (IF)

Stage 2: Instruction Decode (ID)

Stage 3: Execute (EX)

Stage 4: Memory Access (MEM)

Stage 5: Write-Back (WB)

Not All Instructions Use All Stages:

11.11.3 Clock Period in Multi-Cycle

Determining Clock Period:

Example Stage Delays:

Stage | Delay
IF (Instr Memory) | 2 ns
ID (Register Read) | 1 ns
EX (ALU) | 2 ns
MEM (Data Memory) | 2 ns
WB (Register Write) | 1 ns

Longest Stage: 2 ns

Clock Period: 2 ns (vs 7 ns single-cycle)

Clock Frequency: 500 MHz (vs 143 MHz single-cycle)
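
A small sketch of how the multi-cycle clock period falls out of the stage delays. It also reports the idle time in each stage, which is what stage balancing tries to reduce. Variable names are illustrative.

```python
# Example stage delays in nanoseconds
stage_delay_ns = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}

# Every stage gets one clock cycle, so the longest stage sets the period.
clock_period_ns = max(stage_delay_ns.values())
frequency_mhz = 1000 / clock_period_ns

# Idle time per cycle in each stage
slack_ns = {stage: clock_period_ns - delay
            for stage, delay in stage_delay_ns.items()}

print(clock_period_ns, frequency_mhz)  # 2 500.0
print(slack_ns)  # {'IF': 0, 'ID': 1, 'EX': 0, 'MEM': 0, 'WB': 1}
```

With these delays the ID and WB stages sit idle for half of their cycle.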

11.11.4 Performance Comparison

Single-Cycle:

All instructions: 1 cycle × 7 ns = 7 ns

Multi-Cycle (with 2 ns clock):

Instruction | Cycles | Time
Arithmetic | 4 | 8 ns
Load | 5 | 10 ns
Store | 4 | 8 ns
Branch | 3 | 6 ns
Jump | 2 | 4 ns

Weighted Average (same program profile):

Average = (0.48 × 8) + (0.22 × 10) + (0.11 × 8) + (0.19 × 6)
        = 3.84 + 2.20 + 0.88 + 1.14
        = 8.06 ns per instruction

Wait, That's Worse!

Resolution:

Ideal Multi-Cycle (balanced 1.4 ns stages):

Instruction | Cycles | Time
Arithmetic | 4 | 5.6 ns
Load | 5 | 7.0 ns
Store | 4 | 5.6 ns
Branch | 3 | 4.2 ns

Weighted Average (same program profile):

Average = (0.48 × 5.6) + (0.22 × 7.0) + (0.11 × 5.6) + (0.19 × 4.2)
        = 2.688 + 1.540 + 0.616 + 0.798
        = 5.642 ≈ 5.64 ns per instruction

Speedup = 7 / 5.64 ≈ 1.24× faster
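
Both multi-cycle averages are the same average cycle count multiplied by a different clock period, which the sketch below makes explicit. It also derives the break-even clock period below which the multi-cycle design beats the 7 ns single-cycle design for this program profile. The break-even figure does not appear in the lecture; it simply follows from the numbers above. Variable and function names are illustrative.

```python
# Program profile and cycles per instruction type from the tables above
mix = {"arithmetic": 0.48, "load": 0.22, "store": 0.11, "branch": 0.19}
cycles = {"arithmetic": 4, "load": 5, "store": 4, "branch": 3}
SINGLE_CYCLE_NS = 7.0

# Average number of cycles per instruction for this mix
average_cpi = sum(mix[kind] * cycles[kind] for kind in mix)


def average_time_ns(clock_ns: float) -> float:
    """Average time per instruction for a given multi-cycle clock period."""
    return average_cpi * clock_ns


for clock_ns in (2.0, 1.4):
    avg = average_time_ns(clock_ns)
    speedup = SINGLE_CYCLE_NS / avg
    print(f"clock {clock_ns} ns: {avg:.2f} ns per instruction, "
          f"speedup {speedup:.2f}x")

# Multi-cycle is faster only when the clock period is below this value.
break_even_clock_ns = SINGLE_CYCLE_NS / average_cpi
print(f"average CPI {average_cpi:.2f}, "
      f"break-even clock {break_even_clock_ns:.2f} ns")
```

With an average of about 4.03 cycles per instruction, the 2 ns clock is too slow to win, while the balanced 1.4 ns clock is comfortably under the break-even point of roughly 1.74 ns.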

11.11.5 Design Challenge

Stage Balancing:

Resource Reuse:

11.12 Preview: Pipelining

11.12.1 Next Step Beyond Multi-Cycle

Pipelining Concept:

Benefits:

Challenges (Covered Next Lecture):

11.12.2 Coming Next

Topics:

Key Takeaways

  1. Single-cycle design executes each instruction in one clock cycle, with clock period determined by the slowest instruction (Load Word at 7 ns).
  2. Control unit generates signals based on opcode, orchestrating datapath operations for R-type, Load, Store, Branch, and Jump instructions.
  3. Load Word is the critical path (Instruction Fetch → Register Read → ALU → Memory Read → Register Write), determining minimum clock period.
  4. Jump instruction concatenates PC+4[31:28] with the 26-bit target field shifted left by 2 to form the 32-bit target address, enabling jumps anywhere within the current 256 MB region.
  5. Control signals must prevent data corruption, with RegWrite=0 for Store and Branch to avoid unintended register modifications.
  6. "Don't care" values (X) simplify control logic, allowing optimization when signals don't affect instruction outcome.
  7. Hardware operates concurrently, not sequentially—multiple operations happen simultaneously within each clock cycle.
  8. Performance inefficiency drives design evolution, as most instructions finish early but must wait for full clock period.
  9. Resource utilization varies dramatically, with Jump using ~29% of the clock period and arithmetic instructions ~71%, while Load uses 100%.
  10. Timing analysis reveals optimization opportunities, showing that memory access dominates critical path (4 ns of 7 ns total).
  11. Write operations occur at clock edge, ensuring data stability and preventing race conditions in sequential logic.
  12. Branch target calculation happens in parallel with ALU comparison, optimizing branch instruction timing.
  13. Sign extension is effectively instantaneous (wire replication of the sign bit), adding negligible delay to critical path.
  14. Clock period sets maximum frequency (~143 MHz for 7 ns period), directly impacting overall processor performance.
  15. Common case (arithmetic) runs slowly, violating fundamental design principle of making common case fast.
  16. Stage division concept emerges from timing analysis, suggesting multi-cycle implementation could improve efficiency.
  17. Control signal truth tables systematically define behavior, mapping each instruction to specific control patterns.
  18. PC update mechanisms vary by instruction type, using PC+4, branch target, or jump target based on control signals.
  19. Data memory access only for Load/Store, with MemRead and MemWrite controlling when memory participates in execution.
  20. Performance analysis quantifies inefficiency, providing concrete motivation for pipelined processor designs in subsequent lectures.

Summary

The single-cycle MIPS processor is a complete, functioning implementation in which each instruction executes in exactly one clock cycle. The design is conceptually straightforward, but it exposes fundamental performance limitations that drive the evolution of processor architecture. The critical path analysis shows Load Word requiring 7 nanoseconds, while arithmetic, store, and branch instructions complete in 5 nanoseconds and Jump in just 2 nanoseconds, yet every instruction must wait for the slowest one. With the common arithmetic instructions idle for about 29% of each clock period, and the example program mix averaging only 77.7% efficiency, the design violates the crucial principle of "making the common case fast."

The systematic control signal analysis demonstrates how the control unit orchestrates datapath operations for different instruction types (R-type, Load, Store, Branch, Jump), with careful attention to preventing data corruption through proper RegWrite and MemWrite signals. The jump instruction introduces pseudo-direct addressing, concatenating the upper four bits of PC+4 with the 26-bit target field shifted left by two, which reaches any word within the current 256 MB region.

While the single-cycle design provides the essential conceptual foundation for understanding processor operation, the detailed timing analysis and resource utilization metrics clearly motivate more sophisticated approaches: multi-cycle processors that divide execution into short stages so that each instruction takes only as many cycles as it needs, and pipelined processors that overlap instruction execution for much higher throughput. These performance limitations are not flaws but inevitable consequences of the single-cycle constraint, which is why modern processors almost universally adopt pipelining despite the additional complexity it introduces.