3.1 Introduction
Understanding computer performance is fundamental to computer architecture and system design. This lecture explores how performance is measured, the factors that influence it, and the principles that guide performance optimization. We examine the metrics used to evaluate systems, the mathematical relationships between performance factors, and Amdahl's Law—a critical principle for understanding the limits of performance improvements.
3.2 Defining and Measuring Performance
3.2.1 Response Time vs. Throughput
Response Time (Execution Time)
- Time to complete a single task
- Includes all overhead and waiting time
- User-perceived performance metric
- Example: Time for a program to run from start to finish
Throughput (Bandwidth)
- Number of tasks completed per unit time
- Measures system capacity
- Important for servers and data centers
- Example: Number of transactions processed per second
Relationship Between Metrics
- Improving response time often improves throughput
- Improving throughput doesn't always improve response time
- Different optimization strategies for each metric
- System design must balance both considerations
3.2.2 Performance Definition
Mathematical Definition
Performance Comparison
- If System A is faster than System B:
- Execution TimeA < Execution TimeB
- PerformanceA > PerformanceB
Relative Performance
Example: If System A is 2× faster than System B:
- PerformanceA / PerformanceB = 2
- Execution TimeB / Execution TimeA = 2
- System A takes half the time of System B
3.3 CPU Time and Performance Factors
3.3.1 Components of Execution Time
Total Execution Time
- CPU time: Time CPU spends computing the task
- I/O time: Time waiting for input/output operations
- Other system activities: OS overhead, other programs
CPU Time Focus
- Primary metric for processor performance
- Excludes I/O and system effects
- Directly reflects processor and memory system performance
- Most relevant for comparing processor architectures
3.3.2 The CPU Time Equation
Basic Formula
Or equivalently:
Key Relationships
- Clock Period = 1 / Clock Rate
- Clock Rate measured in Hz (cycles/second)
- Clock Cycles = total cycles to execute program
- Higher clock rate → shorter clock period → faster execution
Example Calculation
Program requires 10 billion cycles
Processor runs at 4 GHz (4 × 109 Hz)
3.3.3 Instruction Count and CPI
Cycles Per Instruction (CPI)
- Average number of clock cycles per instruction
- Varies by instruction type and implementation
- Key microarchitecture metric
Extended CPU Time Equation
Or:
Three Performance Factors
- Instruction Count: Number of instructions executed
- CPI: Average cycles per instruction
- Clock Rate: Speed of the processor clock
Factor Dependencies
- Instruction Count: Determined by algorithm, compiler, ISA
- CPI: Determined by processor implementation (microarchitecture)
- Clock Rate: Determined by hardware technology and organization
3.4 Understanding CPI in Detail
3.4.1 CPI Variability
Different Instructions, Different CPIs
- Simple operations: May complete in 1 cycle (ADD, AND)
- Memory operations: May take multiple cycles (LOAD, STORE)
- Branch instructions: Variable cycles (depends on prediction)
- Multiply/Divide: Often take many cycles
Calculating Average CPI
Where:
- CPIi = cycles per instruction for instruction type i
- Instruction Counti = number of times instruction i executed
3.4.2 CPI Example Calculation
Given:
- Program executes 100,000 instructions
- 50,000 ALU operations (CPI = 1)
- 30,000 load instructions (CPI = 3)
- 20,000 branch instructions (CPI = 2)
Calculation:
3.4.3 Instruction Classes
Common Instruction Categories
- Integer arithmetic: ADD, SUB, AND, OR
- Data transfer: LOAD, STORE
- Control flow: BRANCH, JUMP, CALL
- Floating-point: FADD, FMUL, FDIV
CPI Characteristics by Class
- Integer arithmetic: Usually 1 cycle
- Data transfer: 1-3 cycles (cache hit) or more (cache miss)
- Control flow: 1-2 cycles (correct prediction) or more (misprediction)
- Floating-point: 2-20+ cycles depending on operation
3.5 Performance Optimization Principles
3.5.1 Make the Common Case Fast
Core Principle
- Optimize frequent operations rather than rare ones
- Greater impact on overall performance
- Focus resources where they matter most
Examples
- Optimize ALU operations (common) over division (rare)
- Fast cache for recent data (commonly accessed)
- Branch prediction for likely paths
- Simple instructions execute quickly
Application in Design
- Identify common operations through profiling
- Allocate hardware resources accordingly
- Accept slower performance for rare cases
- Trade-offs guided by usage patterns
3.5.2 Amdahl's Law
The Fundamental Principle
The speedup that can be achieved by improving a particular part of a system is limited by the fraction of time that part is used.
Mathematical Formula
Where:
- P = Proportion of execution time that can be improved
- S = Speedup of the improved portion
- (1 - P) = Proportion that cannot be improved
Alternative Formulation
3.5.3 Amdahl's Law Examples
Example 1: Multiply Operation Speedup
Given:
- Multiply operations take 80% of execution time
- New hardware makes multiplies 10× faster
Calculation:
P = 0.80 (80% can be improved)
S = 10 (10× speedup)
Speedup_overall = 1 / [(1 - 0.80) + (0.80 / 10)]
= 1 / [0.20 + 0.08]
= 1 / 0.28
= 3.57×
Key Insight: Despite 10× improvement in multiplies, overall speedup is only 3.57× because 20% of time is unaffected.Example 2: Limited Improvement Fraction
Given:
- Only 30% of execution can be improved
- Improvement is 100× faster
Calculation:
P = 0.30
S = 100
Speedup_overall = 1 / [(1 - 0.30) + (0.30 / 100)]
= 1 / [0.70 + 0.003]
= 1 / 0.703
= 1.42×
Key Insight: Even with 100× improvement, overall speedup is only 1.42× because only 30% of execution benefits.3.5.4 Implications of Amdahl's Law
Limitations of Parallelization
- Serial portions limit parallel speedup
- As parallelism increases, serial portion dominates
- Cannot achieve infinite speedup regardless of cores
Optimization Strategy
- Focus on largest contributors to execution time
- Consider what fraction can realistically be improved
- Multiple small improvements may beat one large improvement
- Balance improvements across components
Example: Multicore Scaling
If 90% of program parallelizes perfectly:
| Cores | Speedup |
|---|---|
| 2 cores | 1.82× |
| 4 cores | 3.08× |
| 8 cores | 4.71× |
| 16 cores | 6.40× |
| ∞ cores | 10.00× (maximum possible) |
The 10% serial portion ultimately limits speedup to 10×.
3.6 Complete Performance Analysis
3.6.1 The Complete Performance Equation
Bringing It All Together
Expanded:
What Affects Each Factor
Instruction Count:
- Algorithm: Efficient algorithms execute fewer instructions
- Programming language: High-level vs low-level
- Compiler: Optimization quality
- ISA: Instruction complexity and capabilities
CPI:
- ISA: Instruction complexity
- Microarchitecture: Pipeline depth, branch prediction
- Cache performance: Hit rates affect memory access CPI
- Instruction mix: Distribution of instruction types
Clock Period (or Clock Rate):
- Technology: Transistor speed (nm process)
- Organization: Pipeline depth, critical path length
- Power constraints: Higher frequency requires more power
- Cooling limitations: Heat dissipation capacity
3.6.2 Performance Comparison Example
Scenario:
Compare two implementations of the same ISA
- System A: Clock Rate = 2 GHz, CPI = 2.0
- System B: Clock Rate = 3 GHz, CPI = 3.0
- Same program with 1 million instructions
System A:
CPU Time_A = (1 × 10^6 instructions) × (2.0 cycles/instruction) / (2 × 10^9 cycles/sec)
= 2 × 10^6 cycles / (2 × 10^9 cycles/sec)
= 0.001 seconds = 1 millisecond
System B:
CPU Time_B = (1 × 10^6 instructions) × (3.0 cycles/instruction) / (3 × 10^9 cycles/sec)
= 3 × 10^6 cycles / (3 × 10^9 cycles/sec)
= 0.001 seconds = 1 millisecond
Result: Both systems have identical performance despite different clock rates and CPIs.3.6.3 Trade-offs in Design
Clock Rate vs. CPI Trade-off
- Higher clock rate may require deeper pipeline
- Deeper pipeline often increases CPI (more stalls)
- Must balance frequency gains against CPI losses
Instruction Count vs. CPI Trade-off
- Complex instructions reduce instruction count
- But complex instructions may increase CPI
- CISC vs RISC architecture debate
Power vs. Performance
- Higher clock rate increases power consumption
- Power = Capacitance × Voltage² × Frequency
- Mobile systems prioritize power over peak performance
3.7 Practical Performance Considerations
3.7.1 Benchmarking
Purpose of Benchmarks
- Measure real-world performance
- Compare different systems objectively
- Standard workloads for reproducibility
Types of Benchmarks
- Synthetic: Artificial programs (e.g., Dhrystone, Whetstone)
- Application: Real programs (e.g., SPEC CPU, databases)
- Workload: Representative task mixes
Benchmark Pitfalls
- May not represent your workload
- Can be optimized for unfairly
- Need multiple benchmarks for complete picture
3.7.2 Performance Metrics in Practice
MIPS (Million Instructions Per Second)
Limitations of MIPS:
- Doesn't account for instruction complexity
- Different ISAs have different instruction capabilities
- Higher MIPS doesn't guarantee better performance
- "Meaningless Indication of Processor Speed"
Better Metrics:
- Execution time for specific workloads
- Throughput for server applications
- Energy efficiency (performance per watt)
- Performance per dollar
3.7.3 Power and Energy Considerations
Power Wall
- Cannot increase clock rate indefinitely
- Power consumption limits frequency scaling
- Led to multi-core era
Dynamic Power Equation
Energy Equation
Implications:
- Lowering voltage reduces power dramatically (squared effect)
- Higher frequency increases power linearly
- Faster execution may save energy overall (less time)
- Energy efficiency increasingly important metric
Key Takeaways
- Performance is the inverse of execution time - faster systems have shorter execution times and higher performance values.
- Three key factors determine CPU performance:
- Instruction Count (algorithm, compiler, ISA)
- CPI (microarchitecture, instruction mix)
- Clock Rate (technology, organization)
- Amdahl's Law limits speedup - the potential speedup from improving any part of a system is limited by how much time that part is used.
- "Make the common case fast" - optimize frequently executed operations for maximum impact on overall performance.
- CPI varies by instruction type - average CPI depends on the mix of instructions and their individual costs.
- Trade-offs are fundamental - improvements in one area (e.g., clock rate) may harm another (e.g., CPI or power consumption).
- Benchmarking is essential - real workloads provide the most meaningful performance measurements.
- Power is a critical constraint - modern performance optimization must consider power and energy efficiency, not just speed.
- Multiple factors must be optimized together - focusing on only one aspect (like clock rate) can be counterproductive.
- Understanding performance equations enables rational design decisions and accurate performance predictions.
Summary
Performance analysis is central to computer architecture, providing the foundation for making informed design decisions. By understanding the relationship between instruction count, CPI, and clock rate, architects can identify optimization opportunities and predict the impact of changes. Amdahl's Law reminds us that the benefit of any improvement is constrained by what fraction of execution time it affects, emphasizing the importance of focusing on the common case. As we design systems, we must balance competing factors—clock rate, CPI, power consumption, and cost—to achieve the best overall performance for target applications. The principles covered in this lecture provide the analytical framework for evaluating processor designs and optimization strategies throughout the study of computer architecture.