
Lecture 3: Understanding Performance

Lectures on Computer Architecture


By Dr. Isuru Nawinne

3.1 Introduction

Understanding computer performance is fundamental to computer architecture and system design. This lecture explores how performance is measured, the factors that influence it, and the principles that guide performance optimization. We examine the metrics used to evaluate systems, the mathematical relationships between performance factors, and Amdahl's Law—a critical principle for understanding the limits of performance improvements.

3.2 Defining and Measuring Performance

3.2.1 Response Time vs. Throughput

Response Time (Execution Time)

The time between the start and the completion of a single task. This is the metric an individual user cares about.

Throughput (Bandwidth)

The total amount of work completed per unit time (e.g., tasks per hour). This is the metric a server or datacenter operator cares about.

Relationship Between Metrics

Reducing response time usually improves throughput, but the reverse does not hold: adding processors can raise throughput without making any single task finish sooner.

3.2.2 Performance Definition

Mathematical Definition

$$ \text{Performance} = \frac{1}{\text{Execution Time}} $$

Performance Comparison

Relative Performance

$$ \frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution Time}_B}{\text{Execution Time}_A} $$

Example: If System A is 2× faster than System B, then Performance_A / Performance_B = 2, so Execution Time_B = 2 × Execution Time_A: System B takes twice as long to run the same program.
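The comparison can be sketched in a few lines of Python (a minimal illustration with hypothetical timings, not from the lecture):

```python
def relative_performance(time_a, time_b):
    """Speedup of system A over system B: the ratio of B's execution
    time to A's, since performance is 1 / execution time."""
    return time_b / time_a

# Hypothetical timings: A finishes in 5 s, B in 10 s, so A is 2x faster.
print(relative_performance(5.0, 10.0))  # 2.0
```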

3.3 CPU Time and Performance Factors

3.3.1 Components of Execution Time

Total Execution Time

The elapsed wall-clock time for a program, which includes CPU execution, I/O waits, operating-system overhead, and time spent running other programs.

CPU Time Focus

This lecture focuses on CPU time: the time the processor actually spends computing for the program, excluding I/O and time shared with other tasks.

3.3.2 The CPU Time Equation

Basic Formula

$$ \text{CPU Time} = \text{Clock Cycles} \times \text{Clock Period} $$

Or equivalently:

$$ \text{CPU Time} = \frac{\text{Clock Cycles}}{\text{Clock Rate}} $$

Key Relationships

Clock period and clock rate are reciprocals (a 4 GHz clock has a 0.25 ns period), so CPU time can be reduced either by lowering the cycle count or by raising the clock rate.

Example Calculation

Program requires 10 billion cycles
Processor runs at 4 GHz (4 × 10⁹ Hz)

$$ \begin{align*} \text{CPU Time} &= \frac{10 \times 10^9 \text{ cycles}}{4 \times 10^9 \text{ cycles/sec}} \\ &= 2.5 \text{ seconds} \end{align*} $$
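The same arithmetic can be checked with a short Python sketch, using the values from the example above:

```python
def cpu_time(clock_cycles, clock_rate_hz):
    """CPU Time = Clock Cycles / Clock Rate."""
    return clock_cycles / clock_rate_hz

# 10 billion cycles on a 4 GHz processor.
print(cpu_time(10e9, 4e9))  # 2.5 seconds
```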

3.3.3 Instruction Count and CPI

Cycles Per Instruction (CPI)

The average number of clock cycles each instruction takes to execute. CPI depends on the microarchitecture and on the mix of instructions the program runs.

Extended CPU Time Equation

$$ \text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Period} $$

Or:

$$ \text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} $$

Three Performance Factors

  1. Instruction Count: Number of instructions executed
  2. CPI: Average cycles per instruction
  3. Clock Rate: Speed of the processor clock

Factor Dependencies

The three factors are not independent: the algorithm, compiler, and ISA determine instruction count; the ISA and microarchitecture determine CPI; and the fabrication technology and hardware organization determine clock rate.

3.4 Understanding CPI in Detail

3.4.1 CPI Variability

Different Instructions, Different CPIs

A simple ALU operation may complete in one cycle, while a load or a floating-point divide may take several, so the average CPI depends on the program's instruction mix.

Calculating Average CPI

$$ \text{Average CPI} = \frac{\sum (\text{CPI}_i \times \text{Instruction Count}_i)}{\text{Total Instruction Count}} $$

Where:

  • CPI_i is the CPI of instruction class i
  • Instruction Count_i is the number of executed instructions in class i

3.4.2 CPI Example Calculation

Given:

  • Class A: 50,000 instructions at CPI = 1
  • Class B: 30,000 instructions at CPI = 3
  • Class C: 20,000 instructions at CPI = 2
  • Total: 100,000 instructions

Calculation:

$$ \begin{align*} \text{Total Cycles} &= (50{,}000 \times 1) + (30{,}000 \times 3) + (20{,}000 \times 2) \\ &= 50{,}000 + 90{,}000 + 40{,}000 \\ &= 180{,}000 \text{ cycles} \end{align*} $$
$$ \text{Average CPI} = \frac{180{,}000}{100{,}000} = 1.8 $$
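The weighted-average computation generalizes to any instruction mix; here is a small Python sketch using the numbers from the example above:

```python
def average_cpi(instruction_mix):
    """Weighted-average CPI from a list of (count, cpi) pairs,
    one pair per instruction class."""
    total_cycles = sum(count * cpi for count, cpi in instruction_mix)
    total_instructions = sum(count for count, _ in instruction_mix)
    return total_cycles / total_instructions

# Three classes: 50,000 @ CPI 1, 30,000 @ CPI 3, 20,000 @ CPI 2.
mix = [(50_000, 1), (30_000, 3), (20_000, 2)]
print(average_cpi(mix))  # 1.8
```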

3.4.3 Instruction Classes

Common Instruction Categories

  1. Integer arithmetic: ADD, SUB, AND, OR
  2. Data transfer: LOAD, STORE
  3. Control flow: BRANCH, JUMP, CALL
  4. Floating-point: FADD, FMUL, FDIV

CPI Characteristics by Class

Integer arithmetic instructions typically have the lowest CPI, data transfers cost more because they access memory, control-flow instructions add pipeline penalties on mispredicted branches, and floating-point operations (especially divide) are usually the most expensive.

3.5 Performance Optimization Principles

3.5.1 Make the Common Case Fast

Core Principle

Spend design effort on the operations that occur most frequently; improving a rarely used operation has little effect on overall execution time.

Examples

Because simple arithmetic and data-transfer instructions dominate typical programs, processors optimize them aggressively, while rare operations such as floating-point divide may remain comparatively slow.

Application in Design

This principle guides where to invest transistors and design time, and it connects directly to Amdahl's Law: the payoff of an optimization is proportional to how often the optimized case occurs.

3.5.2 Amdahl's Law

The Fundamental Principle

The speedup that can be achieved by improving a particular part of a system is limited by the fraction of time that part is used.

Mathematical Formula

$$ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - P) + \frac{P}{S}} $$

Where:

  • P is the fraction of execution time affected by the improvement
  • S is the speedup of the improved portion

Alternative Formulation

$$ \text{Execution Time}_{\text{new}} = \text{Execution Time}_{\text{old}} \times \left[(1 - P) + \frac{P}{S}\right] $$
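Amdahl's Law is easy to encode directly; this sketch mirrors the formula above and is handy for exploring how quickly returns diminish:

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of execution time
    is sped up by a factor s (Amdahl's Law)."""
    return 1.0 / ((1.0 - p) + p / s)

# Even an infinite speedup of 80% of the work caps out at 1 / (1 - 0.8) = 5x.
print(round(amdahl_speedup(0.80, 10), 2))   # 3.57
print(round(amdahl_speedup(0.30, 100), 2))  # 1.42
```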

3.5.3 Amdahl's Law Examples

Example 1: Multiply Operation Speedup

Given:

  • Multiply operations account for 80% of execution time
  • An improved multiplier makes them 10× faster

Calculation:

P = 0.80 (80% can be improved)

S = 10 (10× speedup)

Speedup_overall = 1 / [(1 - 0.80) + (0.80 / 10)]

= 1 / [0.20 + 0.08]

= 1 / 0.28

= 3.57×

Key Insight: Despite 10× improvement in multiplies, overall speedup is only 3.57× because 20% of time is unaffected.

Example 2: Limited Improvement Fraction

Given:

  • The improvable portion accounts for 30% of execution time
  • That portion is made 100× faster

Calculation:

P = 0.30

S = 100

Speedup_overall = 1 / [(1 - 0.30) + (0.30 / 100)]

= 1 / [0.70 + 0.003]

= 1 / 0.703

= 1.42×

Key Insight: Even with 100× improvement, overall speedup is only 1.42× because only 30% of execution benefits.

3.5.4 Implications of Amdahl's Law

Limitations of Parallelization

Any serial (non-parallelizable) portion of a program caps the speedup achievable by adding processors, no matter how many cores are available.

Optimization Strategy

Measure where execution time is actually spent and attack the largest fractions first; improving a component that accounts for little of the time yields little overall benefit.

Example: Multicore Scaling

If 90% of program parallelizes perfectly:

  Cores    Speedup
  2        1.82×
  4        3.08×
  8        4.71×
  16       6.40×
  ∞        10.00× (maximum possible)

The 10% serial portion ultimately limits speedup to 10×.
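The table can be regenerated from Amdahl's Law, with the core count acting as the speedup factor applied to the parallel fraction:

```python
def parallel_speedup(parallel_fraction, cores):
    """Amdahl's Law for multicore scaling: the serial fraction runs
    unchanged while the parallel fraction is split across cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# 90% parallel fraction, as in the table above.
for n in (2, 4, 8, 16):
    print(n, round(parallel_speedup(0.90, n), 2))
```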

3.6 Complete Performance Analysis

3.6.1 The Complete Performance Equation

Bringing It All Together

$$ \text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Period} $$

Expanded:

$$ \text{CPU Time} = (\text{Instructions}) \times \left(\frac{\text{Cycles}}{\text{Instruction}}\right) \times \left(\frac{\text{Seconds}}{\text{Cycle}}\right) $$

What Affects Each Factor

Instruction Count:

Determined by the algorithm, the compiler, and the instruction set architecture (ISA).

CPI:

Determined by the microarchitecture (pipeline design, memory system) and by the program's instruction mix.

Clock Period (or Clock Rate):

Determined by the fabrication technology and by the hardware organization, such as how much logic sits in each pipeline stage.

3.6.2 Performance Comparison Example

Scenario:

Compare two implementations of the same ISA

System A (1 × 10⁶ instructions, CPI = 2.0, clock rate = 2 GHz):

CPU Time_A = (1 × 10^6 instructions) × (2.0 cycles/instruction) / (2 × 10^9 cycles/sec)

= 2 × 10^6 cycles / (2 × 10^9 cycles/sec)

= 0.001 seconds = 1 millisecond

System B (1 × 10⁶ instructions, CPI = 3.0, clock rate = 3 GHz):

CPU Time_B = (1 × 10^6 instructions) × (3.0 cycles/instruction) / (3 × 10^9 cycles/sec)

= 3 × 10^6 cycles / (3 × 10^9 cycles/sec)

= 0.001 seconds = 1 millisecond

Result: Both systems have identical performance despite different clock rates and CPIs.
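The comparison is quick to verify with the full CPU-time equation; this sketch reproduces the numbers above:

```python
def cpu_time_from_factors(instructions, cpi, clock_rate_hz):
    """CPU Time = Instruction Count x CPI / Clock Rate."""
    return instructions * cpi / clock_rate_hz

time_a = cpu_time_from_factors(1e6, 2.0, 2e9)  # System A: CPI 2.0 at 2 GHz
time_b = cpu_time_from_factors(1e6, 3.0, 3e9)  # System B: CPI 3.0 at 3 GHz
print(time_a, time_b)  # both 0.001 s: identical performance
```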

3.6.3 Trade-offs in Design

Clock Rate vs. CPI Trade-off

A deeper pipeline can raise the clock rate but often increases CPI through added hazards and branch penalties, so a higher clock rate does not by itself guarantee lower CPU time.

Instruction Count vs. CPI Trade-off

A richer ISA does more work per instruction (fewer instructions, higher CPI), while a RISC-style ISA uses simpler instructions (more instructions, lower CPI); only the product of the factors determines performance.

Power vs. Performance

Higher clock rates and supply voltages increase dynamic power, so performance gains must be weighed against the power and cooling budget.

3.7 Practical Performance Considerations

3.7.1 Benchmarking

Purpose of Benchmarks

Benchmarks are standard programs used to measure and compare system performance under realistic, reproducible workloads.

Types of Benchmarks

These range from real applications, to kernels (small extracted code fragments), to industry suites such as SPEC CPU that aggregate results across many real programs.

Benchmark Pitfalls

Small synthetic benchmarks can be gamed by compilers or hardware tuned specifically for them, and results on one workload may not transfer to another.

3.7.2 Performance Metrics in Practice

MIPS (Million Instructions Per Second)

$$ \text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6} $$
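A small sketch (with hypothetical machines and counts) shows why MIPS can mislead: two machines that finish the same program in the same time have identical performance, yet the one executing more, simpler instructions reports a higher MIPS rating:

```python
def mips(instruction_count, execution_time_s):
    """Million instructions per second."""
    return instruction_count / (execution_time_s * 1e6)

# Both hypothetical machines run the same program in 1 ms.
print(mips(2e6, 1e-3))  # 2000.0 MIPS (simple ISA, more instructions)
print(mips(1e6, 1e-3))  # 1000.0 MIPS (richer ISA, fewer instructions)
```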

Limitations of MIPS:

MIPS ignores how much work each instruction accomplishes, so it cannot fairly compare machines with different ISAs, and it can vary between programs on the same machine.

Better Metrics:

Execution time on real workloads remains the only reliable measure of performance; aggregate benchmark scores summarize it across many programs.

3.7.3 Power and Energy Considerations

Power Wall

Clock rates stopped scaling in the mid-2000s because dynamic power, and the heat it produces, grew faster than it could be dissipated; this "power wall" pushed the industry toward multicore designs.

Dynamic Power Equation

$$ \text{Power} = \text{Capacitance} \times \text{Voltage}^2 \times \text{Frequency} $$
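The quadratic dependence on voltage makes voltage scaling a powerful lever; a sketch with made-up normalized values:

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical: reduce both voltage and frequency by 15%.
old = dynamic_power(1.0, 1.00, 1.00)
new = dynamic_power(1.0, 0.85, 0.85)
print(round(new / old, 2))  # 0.61, i.e. roughly 39% less power
```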

Energy Equation

$$ \text{Energy} = \text{Power} \times \text{Time} $$

Implications:

Lowering voltage reduces power quadratically, which is why voltage scaling was the industry's main power lever for decades; and because energy is power integrated over time, finishing a task faster and idling can reduce total energy even at higher instantaneous power.

Key Takeaways

  1. Performance is the inverse of execution time - faster systems have shorter execution times and higher performance values.
  2. Three key factors determine CPU performance:
    • Instruction Count (algorithm, compiler, ISA)
    • CPI (microarchitecture, instruction mix)
    • Clock Rate (technology, organization)
  3. Amdahl's Law limits speedup - the potential speedup from improving any part of a system is limited by how much time that part is used.
  4. "Make the common case fast" - optimize frequently executed operations for maximum impact on overall performance.
  5. CPI varies by instruction type - average CPI depends on the mix of instructions and their individual costs.
  6. Trade-offs are fundamental - improvements in one area (e.g., clock rate) may harm another (e.g., CPI or power consumption).
  7. Benchmarking is essential - real workloads provide the most meaningful performance measurements.
  8. Power is a critical constraint - modern performance optimization must consider power and energy efficiency, not just speed.
  9. Multiple factors must be optimized together - focusing on only one aspect (like clock rate) can be counterproductive.
  10. Understanding performance equations enables rational design decisions and accurate performance predictions.

Summary

Performance analysis is central to computer architecture, providing the foundation for making informed design decisions. By understanding the relationship between instruction count, CPI, and clock rate, architects can identify optimization opportunities and predict the impact of changes. Amdahl's Law reminds us that the benefit of any improvement is constrained by what fraction of execution time it affects, emphasizing the importance of focusing on the common case. As we design systems, we must balance competing factors—clock rate, CPI, power consumption, and cost—to achieve the best overall performance for target applications. The principles covered in this lecture provide the analytical framework for evaluating processor designs and optimization strategies throughout the study of computer architecture.