On-Chip Offline Neuromorphic Computing

Team

Supervisors

Table of Contents

  1. Abstract
  2. Related Works
  3. Methodology
  4. Experiment Setup and Implementation
  5. Results and Analysis
  6. Conclusion
  7. Publications
  8. Links

Abstract

Neuromorphic computing offers a brain-inspired paradigm for energy-efficient machine intelligence, using sparse, event-driven spike signals instead of dense floating-point activations. However, most existing neuromorphic systems rely on off-chip training or static, pre-trained weight deployment. This project presents a fully on-chip, offline-capable Spiking Neural Network (SNN) training system implemented on an FPGA SoC.

The system integrates a custom hardware inference accelerator based on Leaky Integrate-and-Fire (LIF) neurons, an on-chip surrogate gradient lookup table, and a RISC-V processor extended with six custom backpropagation instructions. Together, these components execute the complete learning loop — forward inference, surrogate gradient computation, and weight updates via backpropagation — entirely on-chip, without any host-side involvement.

The system is validated on MNIST digit classification using a 784 → 200 → 10 SNN architecture with Q16.16 fixed-point arithmetic. The custom hardware extensions achieve a 6.69× speedup over a pure-software baseline (89 cycles vs 595 cycles per weight update) and reach a peak of 86.50% classification accuracy within five training epochs on the 60,000-sample MNIST training set.


Related Works

Spiking Neural Networks have been studied extensively as biological neural models and as energy-efficient alternatives to rate-coded artificial neural networks. Key related areas include:


Methodology

Neuron Model

The system uses the Leaky Integrate-and-Fire (LIF) neuron model. At each discrete timestep:

V_mem[t] = β × V_mem[t-1] + Σ(w_i × spike_i[t])
if V_mem[t] ≥ V_threshold:
    spike_out = 1,  V_mem = 0  (reset)
else:
    spike_out = 0

where β = 192/256 ≈ 0.75 is the leak factor and all values are Q16.16 fixed-point.
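As a reference model, the C sketch below implements one LIF timestep in Q16.16 arithmetic. The firing threshold of 1.0 and the exact rounding of the leak multiply are assumptions, since only β is specified above.

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits. */
typedef int32_t q16_16;

#define Q16_ONE      (1 << 16)   /* 1.0 in Q16.16 */
#define BETA_NUM     192         /* leak factor beta = 192/256 = 0.75 */
#define V_THRESHOLD  Q16_ONE     /* assumed firing threshold of 1.0 */

/* One LIF timestep for a single neuron: leak, integrate, fire, hard reset. */
static int lif_step(q16_16 *v_mem, const q16_16 *w,
                    const uint8_t *spike_in, int n_in)
{
    /* Leak: V *= 192/256, computed as a multiply and >>8 as in hardware. */
    int64_t v = ((int64_t)*v_mem * BETA_NUM) >> 8;

    /* Integrate: binary input spikes gate the Q16.16 synaptic weights. */
    for (int i = 0; i < n_in; i++)
        if (spike_in[i])
            v += w[i];

    /* Fire and reset to zero when the threshold is reached. */
    if (v >= V_THRESHOLD) {
        *v_mem = 0;
        return 1;
    }
    *v_mem = (q16_16)v;
    return 0;
}
```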

Surrogate Gradient

Because the Heaviside spiking function is non-differentiable, a 256-entry Q16.16 ROM LUT stores pre-computed surrogate gradient values (smooth approximation of the sigmoid derivative). The LUT index is derived from the neuron’s membrane potential V_mem.
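One possible way to generate the ROM contents offline is sketched below; the sampled V_mem range of [−4, 4) and the linear index mapping are illustrative assumptions, not the project's documented mapping.

```c
#include <math.h>
#include <stdint.h>

#define LUT_SIZE 256

/* Populate a ROM image with sigma'(x) = sigma(x) * (1 - sigma(x)) in Q16.16.
 * The assumed V_mem range is [-4, 4); the real hardware derives the LUT
 * index from the membrane potential directly. */
void build_surrogate_lut(int32_t lut[LUT_SIZE])
{
    for (int i = 0; i < LUT_SIZE; i++) {
        double x = -4.0 + 8.0 * i / LUT_SIZE;        /* assumed input range */
        double s = 1.0 / (1.0 + exp(-x));            /* sigmoid */
        lut[i] = (int32_t)(s * (1.0 - s) * 65536.0); /* scale to Q16.16 */
    }
}
```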

Backpropagation

Weight updates use a surrogate-gradient backpropagation rule with momentum:

δ = ((error + 0.95 × δ_prev) × surrogate_grad × spike_status) >> 8
W' = W − (LR × δ) >> 8      (LR = 150/256 ≈ 0.586)

This computation is accelerated by the custom RISC-V backprop unit, which includes two PISO LIFO buffers (one for spike status, one for gradients) and a DMA-style Memory-to-LIFO Loader FSM.
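A software model of the update rule is sketched below. It assumes the 0.95 momentum factor is encoded as 243/256; the exact rounding behaviour of the hardware datapath is also an assumption.

```c
#include <stdint.h>

typedef int32_t q16_16;

#define MOMENTUM_NUM 243   /* ~0.95 in 1/256 steps (assumed encoding) */
#define LR_NUM       150   /* learning rate LR = 150/256 ~ 0.586 */

/* One momentum-backprop weight update, mirroring the delta and weight
 * rules above. spike_status is the binary spike flag. */
static q16_16 backprop_update(q16_16 w, q16_16 error, q16_16 *delta_prev,
                              q16_16 surrogate_grad, int spike_status)
{
    int64_t acc   = error + (((int64_t)MOMENTUM_NUM * *delta_prev) >> 8);
    int64_t delta = spike_status ? ((acc * surrogate_grad) >> 8) : 0;
    *delta_prev = (q16_16)delta;
    return w - (q16_16)(((int64_t)LR_NUM * delta) >> 8);
}
```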

Three-State On-Chip Pipeline

The full on-chip learning loop operates as three sequential states per training sample, summarised in the state table following the SoC architecture description below.

SoC Architecture

The SoC connects all major blocks over a shared Wishbone bus: the RISC-V control CPU with its Custom Backprop Unit, the Neuron Accelerator (LIF inference hardware), the Surrogate Gradient LUT (256-entry ROM), and Shared Memory (dual-port BRAM). An Encode/Decode block bridges the internal bus to external I/O peripherals (I2C, GPIO, SPI).

| State | Executor | Function |
|-------|----------|----------|
| 1 — Inference | Hardware accelerator (RTL) | LIF neuron forward pass; spikes and V_mem written to shared BRAM |
| 2 — Surrogate Substitution | RISC-V CPU | Reads V_mem from BRAM, queries the LUT, writes surrogate gradients back |
| 3 — Learning | RISC-V CPU + custom ISA | Backpropagation weight updates via the six custom hardware instructions |

Custom RISC-V ISA Extensions

Six new instructions (opcode 7'b0001011) were added to the RV32IM pipeline:

| Instruction | Purpose |
|-------------|---------|
| LIFOPUSH | Push spike/gradient data into the hardware LIFO buffers |
| LIFOPOP | Pop the LIFOs, load weight and error, and start computation |
| BKPROP | Trigger a backprop computation without reloading the weight |
| LOADWT | Load a new weight mid-computation |
| LIFOPUSHM | DMA-style BRAM-to-LIFO data transfer |
| LIFOWB | Write the computed weight from the hardware unit back to the register file |
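For illustration, such custom-0 instructions can be emitted from C via the GNU assembler's `.insn` directive. The funct3/funct7 encodings below are placeholders, not the project's actual values.

```c
#include <stdint.h>

/* Hypothetical C wrapper for LIFOPUSH. 0x0B is the RISC-V custom-0 opcode
 * (7'b0001011); funct3 = 0x0 and funct7 = 0x00 are placeholder encodings. */
static inline void lifopush(uint32_t spike_status, uint32_t gradient)
{
    __asm__ volatile(".insn r 0x0B, 0x0, 0x00, x0, %0, %1"
                     :
                     : "r"(spike_status), "r"(gradient));
}
```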

Experiment Setup and Implementation

Hardware Platform

SNN Network Configuration

| Parameter | Hardware Trainer | Python Trainer |
|-----------|------------------|----------------|
| Architecture | 784 → 200 → 10 | 784 → 16 → 10 |
| Neuron model | LIF24 (β ≈ 0.75) | LIF2 (β = 0.5) |
| Timesteps | 16–25 | 16 |
| Epochs | 5 | 10 |
| Training samples | 60,000 (MNIST) | 60,000 (MNIST) |

Hardware Resource Estimate

| Component | Estimated Resources |
|-----------|---------------------|
| LIFO Buffers | ~2 KB |
| Custom Backprop Unit | ~500 LUTs |
| Memory Loader FSM | ~200 LUTs |
| Surrogate LUT ROM | 256 × 32-bit |
| Total overhead over base RV32IM | ~10% additional |

RTL Test Suite

| Level | Test | Result |
|-------|------|--------|
| L2a | Neuron cluster spike + V_mem (8 checks) | PASS |
| L2b | Cluster v_pre_spike port wiring (4 checks) | PASS |
| L4 | Accelerator known-value dump (6/7 checks) | PASS |
| L5 | SNN inter-cluster propagation + dump (8 checks) | PASS |
| L6 | Accelerator + real Wishbone BRAM (10 checks) | PASS |
| L7 | STATE 2 surrogate substitution (8 checks) | PASS |
| L8 | Full pipeline: CPU + accelerator + BRAM + LUT | In progress |

Results and Analysis

Benchmark: Cycle Count Comparison

The custom hardware backprop unit was benchmarked against an equivalent pure-software RV32I implementation for a 16-timestep weight update:

[Chart: benchmark comparison of clock cycle counts]

Execution Breakdown

| Implementation | Total Cycles | Time @ 100 MHz |
|----------------|--------------|----------------|
| Custom Accelerator | 89 | 890 ns |
| Standard RV32I | 595 | 5,950 ns |
| Speedup | 6.69× | |

Phase-by-Phase Breakdown

| Phase | Custom Accelerator | Standard RV32I |
|-------|--------------------|----------------|
| CPU Initialization | 12 cycles | 15 cycles |
| Data Load (DMA / Gradient Loads) | 62 cycles | 48 cycles |
| Delta Computation | 13 cycles | 224 cycles |
| Weight Update | 2 cycles | 228 cycles |
| Loop Control | — | 80 cycles |
| Total | 89 cycles | 595 cycles |

The primary speedup comes from the hardware delta computation and weight update stages, which are reduced by 17× and 114× respectively.

MNIST Classification Accuracy

Hardware-Matched C Trainer (784 → 200 → 10)

| Epoch | Accuracy |
|-------|----------|
| 1 | 67.69% |
| 2 | ~83.0% |
| 5 | 85.74% |
| Peak | 86.50% |

Python SNN Trainer (784 → 16 → 10)

| Epoch | Accuracy |
|-------|----------|
| 1 | ~33.0% |
| 10 | 78.50% |
| Peak | 80.90% |

On-Chip Output-Layer Learning: Softmax vs Perceptron

After the SNN’s hidden layers are trained, two lightweight output-layer learning strategies are evaluated for on-chip adaptation:

Softmax Learning converts the 10 output neuron responses (accumulated spike counts over all timesteps) into a probability distribution using the softmax function, then minimises cross-entropy loss via gradient descent. This provides smooth, calibrated class probability estimates and is mathematically equivalent to training a soft linear classifier on top of the SNN’s spike representations.
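A floating-point sketch of this update is shown below; the learning rate, data layout, and function name are illustrative, and an on-chip version would use Q16.16 fixed point.

```c
#include <math.h>

#define N_HIDDEN 200   /* hidden-layer size of the hardware trainer */
#define N_CLASS  10

/* One cross-entropy gradient step on the output weights, treating the
 * accumulated spike counts as inputs to a linear classifier. */
void softmax_step(float w2[N_CLASS][N_HIDDEN],
                  const float spikes[N_HIDDEN],
                  const float logits[N_CLASS], int label, float lr)
{
    /* Numerically stable softmax over the 10 output responses. */
    float m = logits[0], p[N_CLASS], sum = 0.0f;
    for (int k = 1; k < N_CLASS; k++) if (logits[k] > m) m = logits[k];
    for (int k = 0; k < N_CLASS; k++) { p[k] = expf(logits[k] - m); sum += p[k]; }

    /* Gradient of cross-entropy w.r.t. logit k is p_k - 1{k == label}. */
    for (int k = 0; k < N_CLASS; k++) {
        float g = p[k] / sum - (k == label ? 1.0f : 0.0f);
        for (int j = 0; j < N_HIDDEN; j++)
            w2[k][j] -= lr * g * spikes[j];
    }
}
```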

Perceptron Learning applies a hard, error-driven weight update directly to the output weights: if the predicted class is wrong, weights for the correct class are nudged upwards and weights for the wrong class are nudged downwards in proportion to the input spike rate. No probability computation is required — the rule needs only comparisons and additions, making it extremely compact and well suited for on-chip execution with minimal hardware overhead.
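A minimal sketch of this rule, assuming a unit step size and integer spike counts, illustrates why it is so hardware-friendly: the update needs only one comparison and two additions per weight.

```c
#include <stdint.h>

#define N_HIDDEN 200
#define N_CLASS  10

/* Error-driven perceptron update: on a misclassification, reinforce the
 * true class and penalise the predicted one in proportion to the input
 * spike counts. */
void perceptron_step(int32_t w2[N_CLASS][N_HIDDEN],
                     const uint16_t spike_count[N_HIDDEN],
                     int predicted, int label)
{
    if (predicted == label)
        return;                               /* correct: no update */
    for (int j = 0; j < N_HIDDEN; j++) {
        w2[label][j]     += spike_count[j];   /* nudge true class up */
        w2[predicted][j] -= spike_count[j];   /* nudge wrong class down */
    }
}
```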

Accuracy Comparison: Softmax vs Perceptron Learning

Starting from a pre-trained baseline of 72.7% holdout accuracy, Softmax Learning improves to 74.5% (+1.8 pp) while Perceptron Learning reaches 78.8% (+6.1 pp), demonstrating that the simpler hardware-friendly rule achieves greater adaptation benefit.

Perceptron Learning Progress Across Rounds

The Perceptron learning curve shows rapid convergence: accuracy rises sharply from 72.2% at Round 0 to 81.56% by Round 3, then plateaus — indicating that only a small number of on-chip replay passes are needed to fully exploit the learned spike representations.

Weight Update Stability

A 32-sample replay learning step on the converged model showed:

| Weight Matrix | Weights Changed | Max Absolute Diff | Notes |
|---------------|-----------------|-------------------|-------|
| W1 (156,800 weights) | 0 | 0 | Stable / frozen |
| W2 (2,000 weights) | 450 | 1 | Fine-grained adaptation |

This confirms the system converges stably and that the custom weight update hardware produces correct, bounded updates.


Conclusion

This project demonstrates that a complete SNN training loop — inference, surrogate gradient computation, and backpropagation — can be executed fully on-chip on an FPGA SoC with no host-side involvement. Key achievements include the six custom RISC-V backpropagation instructions, a 6.69× per-update speedup over the pure-software baseline, and a peak of 86.50% MNIST classification accuracy from fully on-chip training.

Future work includes completing the L8 full-pipeline integration test, scaling to larger SNN architectures, and targeting an actual FPGA deployment.


Publications