Optimizing Communication Efficiency and Latency in Blockchain-Enabled Federated Learning Systems

Team

E/20/148, Hewawasam A.K.L, e20148@eng.pdn.ac.lk
E/20/285, Perera B.B.M.R, e20285@eng.pdn.ac.lk
E/20/316, Ranasinghe R.A.W.L, e20316@eng.pdn.ac.lk

Supervisors

Dr. Suneth Namal Karunarathna, namal@eng.pdn.ac.lk
Dr. Upul Jayasinghe, upuljm@eng.pdn.ac.lk

Abstract
Research Problem
Objectives
Methodology
System Architecture
Experimental Setup and Implementation
Results and Analysis
Impact and Limitations
Conclusion
Publications
Links

1. Abstract

Credit card fraud detection (CCFD) in the banking sector faces a fundamental tension between the need for collaborative model training and strict data privacy regulations. Existing federated learning (FL) approaches that integrate blockchain for auditability treat all participants equally and do not address the hierarchical nature of real-world banking organisations, leading to suboptimal communication efficiency and round latency. This research presents the Hierarchical Clustered Federated Learning (HCFL) framework — a two-tier privacy-preserving FL system for credit card fraud detection that integrates Hyperledger Fabric blockchain, IPFS-based off-chain model storage, Differential Privacy (DP), and a novel Consensus-Based Federated Trust (CBFT) consensus protocol.

The HCFL framework achieves an average PR-AUC of 0.778, ROC-AUC of 0.974, and F1 Score of 0.830 over 10 federated rounds on the European Credit Card Fraud dataset (284,807 transactions, 0.17% fraud rate). Communication overhead is held constant at 0.490 MB/round, and end-to-end latency averages 3.55 seconds/round in simulation. The system demonstrably reduces inter-bank communication overhead relative to flat FL baselines while delivering competitive detection performance and providing a cryptographically verifiable audit trail suitable for regulatory compliance. Comparative evaluation against Baabdullah et al. (2024) and Aljunaid et al. (2025) confirms that HCFL achieves superior precision (0.905 average) and significantly lower communication-per-round while maintaining comparable recall.

Keywords: Federated Learning, Credit Card Fraud Detection, Blockchain, Differential Privacy, Hierarchical Aggregation, Communication Efficiency, Hyperledger Fabric, IPFS.

2. Research Problem

Background Credit card fraud imposes billions of dollars in losses annually on financial institutions worldwide. Accurate fraud detection requires machine learning models trained on large volumes of transaction data. However, banking transaction data is among the most sensitive personal data categories — sharing raw records across organisations violates customer privacy agreements and breaches data protection regulations such as the General Data Protection Regulation (GDPR) and the Payment Card Industry Data Security Standard (PCI-DSS).

Federated Learning (FL) emerged as a solution to this dilemma: instead of sharing raw data, banks train models locally and share only model gradients or weight updates. A central aggregator combines these updates into a shared global model without ever seeing individual transaction records. Yet vanilla FL in the financial domain introduces its own set of challenges:

Trust and Integrity: When banks compete commercially, there is no guarantee that a submitted model update is honest or accurate. A malicious participant could submit deliberately degraded or backdoored updates — a model poisoning attack. Communication Overhead: As the number of participating branches grows, the volume of model updates transmitted per round increases linearly. For large banking networks with dozens of branches per institution, this becomes a bottleneck. Latency: Synchronous FL rounds stall when even a single slow (“straggler”) participant fails to submit on time, rendering the system unsuitable for time-sensitive fraud alerting pipelines. Auditability: Regulators require tamper-proof audit logs of model training decisions. Standard FL systems provide no mechanism to record, verify, or dispute how the global model was produced. Gap in Existing Literature Existing blockchain-integrated FL systems for fraud detection (e.g., Baabdullah et al., 2024; Aljunaid et al., 2025) address either privacy or auditability in isolation but do not systematically co-optimise communication efficiency, round latency, and Byzantine fault tolerance within a unified hierarchical framework. Specifically:

Baabdullah et al. (2024) propose an FL + blockchain system using an LSTM model and FedAvg aggregation, achieving strong ROC-AUC performance. However, their flat FL topology does not reduce inter-bank communication, and they do not apply differential privacy or cross-bank model verification. Communication overhead grows with participant count. Aljunaid et al. (2025) propose an Explainable FL (XFL) framework integrating SHAP and LIME for decision transparency. While commendable for interpretability, their best-model-selection aggregation strategy introduces growing communication overhead (model size increases monotonically per round due to GBM warm-start), and average per-round latency exceeds 19 seconds. Neither baseline addresses the hierarchical structure inherent to banking organisations, where regional branches report to headquarters before engaging with an inter-bank consortium. Neither applies a mathematically rigorous privacy mechanism (DP) to the training step. Neither provides a Byzantine fault-tolerant consensus mechanism that prevents rogue banks from influencing the global model.

Research Questions This project addresses three core research questions:

RQ1: Can a two-tier hierarchical FL architecture reduce inter-bank communication overhead compared to flat FL baselines while maintaining competitive fraud detection performance?

RQ2: Can blockchain-enforced Clustered Byzantine Fault Tolerance (CBFT) provide model integrity guarantees that prevent poisoning attacks in a multi-bank consortium setting without incurring prohibitive latency?

RQ3: Is it possible to integrate differential privacy into the local training step of a hierarchical FL system without unacceptable degradation of fraud detection metrics on a highly class-imbalanced dataset?

3. Objectives

To design, implement, and evaluate a Hierarchical Clustered Federated Learning (HCFL) framework for credit card fraud detection that simultaneously optimises:

Inter-bank communication efficiency (minimise bytes exchanged per federated round)
End-to-end latency per round (minimise wall-clock time from training start to global model update)
Detection performance (maximise PR-AUC and F1 Score on imbalanced fraud data)
Privacy guarantees (provide Differential Privacy for local training updates)
Security and auditability (provide Byzantine fault tolerance and immutable audit records via blockchain)

#	Objective	Success Criterion
O1	Design a two-tier FL hierarchy that mirrors real banking organisational structure	Architecture functionally demonstrated with 3 banks × n branches
O2	Integrate Hyperledger Fabric as the consortium blockchain layer	Blockchain functional with Go chaincode for trust tracking and model CID recording
O3	Implement IPFS-based off-chain model storage to decouple large binary payloads from ledger	All model weights stored on IPFS; only compact CID + hash recorded on-chain
O4	Implement Differential Privacy in local training (DP-SGD)	DP gradient clipping + Gaussian noise applied per mini-batch; no utility collapse
O5	Design and implement three-phase CBFT consensus protocol	CBFT functional with quorum-2-of-3 voting; prevents self-verification
O6	Achieve PR-AUC ≥ 0.75 on the European Credit Card Fraud dataset within 10 FL rounds	Demonstrated with reproduced experimental results
O7	Achieve communication overhead ≤ 0.50 MB/round	Demonstrated with precise per-transfer accounting
O8	Achieve end-to-end latency ≤ 120 s/round (SLA bound) in live mode	Demonstrated in simulation (3.55 s) and live mode (~33–37 s)
O9	Compare HCFL performance against two published baseline frameworks	Quantitative comparison with Baabdullah et al. (2024) and Aljunaid et al. (2025)

4. Methodology

4.1 Dataset

All experiments use the European Credit Card Fraud Detection dataset (Kaggle), the de-facto benchmark for tabular fraud detection research:

Property	Value
Total transactions	284,807
Features	30 (V1–V28 PCA-anonymised + Amount + Time; Time dropped in training)
Fraud rate (global)	~0.17% (highly class-imbalanced)
Training partitions	BankA: 151,642 · BankB: 80,980 · BankC: 7,896
Global test set	43,208 rows · fraud rate 0.407%

The dataset’s PCA-anonymised features (V1–V28) simulate the kind of pre-processed, anonymised data that banks would realistically contribute in a privacy-preserving consortium — raw transaction identifiers are not present, aligning with GDPR Article 89 anonymisation provisions.

Data Partitioning: A non-IID (non-independent, non-identically distributed) heterogeneous split is applied. BankC holds only 7,896 samples with a 2.33% local fraud rate; BankB holds 80,980 samples with 0.13% fraud; BankA is the largest partition with 0.005% local fraud. This replicates real-world distribution skew — different bank branches serve demographically distinct customer bases with different spending patterns and fraud exposure profiles.

4.2 Model Architecture

The fraud detector at all levels of the hierarchy is a LSTMTabular — a single-layer Long Short-Term Memory network adapted for tabular transaction data.

Loss function: BCEWithLogitsLoss with positive class weighting (pos_weight = n_negatives / n_positives) to address the severe class imbalance without oversampling. Model size: ~50,000 parameters (~0.0489 MB serialised), deliberately minimal to minimise communication overhead per FL round. Optimizer: Adam, lr = 1×10⁻³, batch_size = 256, epochs = 1 per local round. LSTM was chosen over simpler feedforward networks because it naturally captures sequential dependencies within transaction streams. Unlike tree-based methods, LSTM is differentiable and compatible with both FedAvg weight averaging and gradient-based Differential Privacy.

4.3 Evaluation Metrics

Given the severe class imbalance (0.17% fraud), standard accuracy is uninformative. The following metrics are computed at the optimal threshold (sweep of the precision-recall curve to maximise F1):

Metric	Definition	Why used
PR-AUC	Area under Precision-Recall curve	Primary: threshold-independent; optimal for severe imbalance
ROC-AUC	Area under ROC curve	Secondary: overall discrimination ability
F1 Score	Harmonic mean of Precision and Recall	Key decision metric at optimal threshold
Precision	TP / (TP + FP)	Rate of true fraud among flagged transactions
Recall	TP / (TP + FN)	Rate of actual fraud that is caught
Comm (MB)	Total bytes transferred per round across all nodes	Communication efficiency
E2E (sec)	Wall-clock time from training start to global model stored	Latency efficiency

5. System Architecture

The proposed architecture follows a two-tier hierarchical design.

Tier 1 – Institutional / Intra-Cluster Layer

Each participating financial institution is treated as an independent cluster. A cluster contains:

multiple branch nodes,
one HQ node,
and one Backup node.

Branch nodes perform local model training on private transaction datasets. The HQ collects branch updates and generates a cluster-level aggregated model. The Backup node remains synchronized with the HQ and can take over in the event of HQ failure, reducing the risk of a single point of failure.

Tier 2 – Blockchain / Inter-Cluster Layer

The second tier consists of a permissioned blockchain network connecting only the HQ nodes of participating institutions. This layer is responsible for:

validating cluster-level model updates,
executing consensus,
maintaining immutable audit records,
managing trust scores,
and coordinating global aggregation.

By limiting blockchain participation to HQ nodes instead of all branch nodes, the architecture significantly reduces consensus complexity, communication overhead, and latency.

Off-Chain Storage

To avoid storing large model files directly on-chain, model artifacts are uploaded to IPFS, and only compact metadata such as CID, hash, validation status, and trust information are stored on the blockchain ledger.

6. Experimental Setup and Implementation

The framework is implemented using the following technologies:

PyTorch for model training and federated learning logic
Hyperledger Fabric for permissioned blockchain coordination
Go chaincode for trust score management and update verification
IPFS (Kubo) for off-chain model storage
FastAPI for the middleware API layer
Docker for reproducible multi-service deployment

Experiments use the Credit Card Fraud Detection dataset, partitioned across multiple simulated institutions and branches. The training setup reflects a realistic non-IID multi-institution environment, where each branch has access only to its local transaction subset.

The experimental setup consists of:

3 financial institutions
3 branches per institution
HQ aggregation per institution
permissioned blockchain coordination across HQs

Key Implemented Components

Local branch training with DP-inspired clipping and Gaussian noise
LSTM-based fraud detection model
Intra-institute FedAvg aggregation
Fast validation gate
Backup recovery and failover support
Trust-aware weighted global aggregation
Fabric-based CBFT-style verification
IPFS-based off-chain storage
REST API integration layer

7. Results and Analysis

Proposed HCFL Framework — Per-Round Results

Training over 10 federated rounds on 240,000 training samples across 3 banks, evaluated on a held-out global test set of 43,208 transactions:

Round	PR-AUC	ROC-AUC	F1	Precision	Recall	Comm (MB)	E2E (sec)
1	0.6621	0.9642	0.6798	0.6722	0.6875	0.4896	4.50
2	0.7632	0.9633	0.8365	0.9366	0.7557	0.4896	3.50
3	0.7766	0.9714	0.8428	0.9437	0.7614	0.4896	3.48
4	0.7885	0.9740	0.8447	0.9315	0.7727	0.4896	3.42
5	0.7912	0.9745	0.8464	0.9441	0.7670	0.4896	3.44
6	0.7938	0.9758	0.8464	0.9441	0.7670	0.4896	3.39
7	0.8003	0.9779	0.8483	0.9320	0.7784	0.4896	3.45
8	0.8015	0.9784	0.8485	0.9091	0.7955	0.4896	3.38
9	0.8029	0.9784	0.8519	0.9324	0.7841	0.4896	3.47
10	0.8002	0.9791	0.8494	0.9038	0.8011	0.4896	3.38

PR-AUC jumps from 0.662 (Round 1) to 0.763 (Round 2, +15.3%) once global model blending activates, then stabilises above 0.800 from Round 7 onwards. Communication overhead is constant at 0.4896 MB/round. Total evaluation time: 35.5 seconds.

Baseline Results Summary

Baabdullah et al. (2024) — Flat FL + LSTM, 40% dataset sample (~113,922 training / ~34,177 test)

Metric	Round 1	Round 10	Average
PR-AUC	0.8919	0.8569	0.8699
ROC-AUC	0.9995	0.9991	0.9992
F1 Score	0.5534	0.7534	0.7261
Precision	0.3878	0.6322	0.6095
Recall	0.9661	0.9322	0.9436
Comm (MB)	0.1863	0.1863	0.1863
E2E (sec)	7.51	7.55	7.54

Blockchain integrity: VERIFIED (10 blocks).

Aljunaid et al. (2025) — XFL + GBM, best-model selection; model size grows +20 trees/round

Metric	Round 1	Round 10	Average
PR-AUC	0.7806	0.8791	0.8633
ROC-AUC	0.9925	0.9806	0.9871
F1 Score	0.2572	0.5187	0.4092
Precision	0.1491	0.3604	0.2696
Recall	0.9333	0.9250	0.9278
Comm (MB)	0.8298	2.1559	1.497
E2E (sec)	69.28	14.79	19.93

Blockchain integrity: VERIFIED (10 blocks).

Cross-Framework Comparison

Detection Performance

Framework	Model	PR-AUC	ROC-AUC	F1	Precision	Recall
HCFL (Proposed)	LSTM + DP-FedAvg	0.778	0.974	0.830	0.905	0.767
Baabdullah (2024)	LSTM + FedAvg	0.870*	0.999	0.726	0.610	0.944
Aljunaid (2025)	GBM + Best-Select	0.863*	0.987	0.409	0.270	0.928

*Baselines evaluated at the default 0.5 threshold on smaller test sets; HCFL uses the optimal threshold (PR-curve sweep) on the full dataset. Higher baseline Recall reflects threshold-driven over-flagging rather than superior detection.

Communication Overhead and Latency

Framework	Avg Comm/round	Growth	Avg E2E/round
HCFL (Proposed)	0.490 MB	None (constant)	3.55 s
Baabdullah (2024)	0.186 MB	None (constant)	7.54 s
Aljunaid (2025)	1.497 MB	+0.147 MB/round	19.93 s

HCFL’s extra 0.304 MB vs Baabdullah is entirely the CBFT cross-verification cost — the price of Byzantine fault tolerance absent in the baseline.

HCFL delivers a 3.1× lower average communication overhead and 5.6× lower latency than Aljunaid, with no growth over rounds. At 50 rounds, Aljunaid would need ~11 MB/round vs HCFL’s constant 0.490 MB (22× gap).

In live mode (Fabric + IPFS), HCFL measured ~33–37 s/round, within the 120-second SLA bound.

Comprehensive Feature Comparison

Feature	HCFL (Proposed)	Baabdullah (2024)	Aljunaid (2025)
FL Topology	Two-tier hierarchical	Flat	Flat
Aggregation	Trust-weighted FedAvg	FedAvg	Best-model selection
Differential Privacy	Yes (DP-SGD)	No	No
Byzantine Fault Tolerance	Yes (CBFT 3-phase)	No	No
Cross-bank Verification	Yes	No	No
Trust Score System	Yes (on-chain)	No	No
Explainability (XAI)	No	No	Yes (SHAP + LIME)
Live Blockchain	Hyperledger Fabric	Hash-chain (sim)	Hash-chain (sim)
Replay Attack Protection	Yes	No	No
Comm/round	0.490 MB (fixed)	0.186 MB (fixed)	1.497 MB (growing)
E2E Latency	3.55 s sim / ~35 s live	7.54 s	19.93 s

8. Impact and Limitations

Research Impact

Communication Efficiency: The HCFL two-tier hierarchy is the primary architectural contribution to communication efficiency. By ensuring that only one consolidated cluster model per bank reaches the global tier (rather than all branch-level models), inter-bank communication is reduced by a factor of the number of branches per bank. As banks scale to dozens of regional branches, this compression benefit grows proportionally.

Security and Auditability: The CBFT consensus protocol, implemented as on-chain chaincode, provides the first blockchain-enforced cross-bank model verification in the FL-for-CCFD domain (among the compared works). This directly addresses the model poisoning threat that is absent from both baseline frameworks. The immutable audit trail produced by Hyperledger Fabric provides regulators with cryptographically verifiable evidence of model provenance.

Privacy Compliance: Differential Privacy at the branch level provides an (ε, δ)-DP guarantee for individual transaction records — a regulatory requirement under GDPR. Neither Baabdullah nor Aljunaid implement DP. The results demonstrate (F1 = 0.830, PR-AUC = 0.778) that DP does not prohibitively degrade detection quality at the chosen noise multiplier (σ = 0.05).

Scalability in Communication: HCFL’s constant 0.490 MB/round communication cost — independent of the number of training rounds — makes it suitable for long-horizon training and production deployment. The growing communication cost inherent to Aljunaid’s GBM warm-start approach limits its practical scalability.

Precision for Banking Operations: HCFL’s average Precision of 0.905 means that 90.5% of transactions flagged as fraudulent are genuinely fraudulent. This directly translates to operational value: fewer false alarms, reduced analyst burden, and maintained customer experience for legitimate cardholders.

Limitations

Simulation vs. Live Mode Gap: The standalone evaluation (3.55 s/round) significantly underestimates live mode latency (~33–37 s/round), which is dominated by the 30-second Raft ordering settlement wait in Hyperledger Fabric. In production banking, this round latency would be further influenced by network topology, endorsement policy complexity, and peer count. The system meets the 120-second SLA bound, but real-time fraud alerting (sub-second decisions) requires a parallel scoring service, not a federated training loop.
Fixed Number of Banks: The current CBFT protocol is calibrated for exactly 3 banks (quorum = 2). Adding a fourth bank requires reconfiguring chaincode constants and re-deploying the network. Dynamic consortium membership — where banks can join or leave between rounds — is not currently supported.
Non-IID Skew: While the non-IID data split realistically models heterogeneous fraud distributions, BankC’s shard (7,896 samples, 2.33% fraud rate) is substantially smaller than BankA (151,642 samples). This extreme size disparity means BankA dominates trust-weighted aggregation by data volume, potentially over-representing its fraud patterns in the global model. More sophisticated data heterogeneity strategies (e.g., Dirichlet partitioning) could address this.
Limited DP Accounting: The current implementation applies DP noise per mini-batch but does not perform formal (ε, δ)-budget accounting across all rounds (privacy loss accumulates across rounds — the composition theorem means 10 rounds of DP training provides weaker privacy than 1 round). Tools like the RDP (Rényi Differential Privacy) accountant (as in Opacus) would provide rigorous budget tracking.
No Explainability: Unlike Aljunaid’s XFL framework, HCFL provides no mechanisms for explaining individual fraud predictions to end-users or auditors. While the system provides model provenance via the blockchain, per-sample decision transparency (why was this transaction flagged?) is absent. Integration of SHAP or LIME post-hoc explanations at the inference stage is a natural extension.
Simulated IPFS Network: The IPFS node in the experimental setup is a local daemon, not a distributed network. In a real banking deployment, IPFS would operate over an organisation-internal CDN or a permissioned IPFS cluster. Performance characteristics (upload/download times) would differ from the simulation.
Threshold Comparison Inequality: The results comparison is complicated by the fact that HCFL reports metrics at the optimal threshold while baselines are at 0.5. A complete comparison would require rerunning all baselines with optimal-threshold evaluation, which was not possible within the scope of this project given the baseline implementations.

9. Conclusion

This paper presented the Hierarchical Clustered Federated Learning (HCFL) framework — a two-tier, privacy-preserving, blockchain-enforced federated learning system for credit card fraud detection in banking consortia. The framework addresses three research questions posed at the outset:

RQ1 (Communication Efficiency): The two-tier hierarchy reduces inter-bank communication to a constant 0.490 MB/round, independent of round count, compared to Aljunaid et al.’s growing overhead (1.497 MB average, 2.156 MB at Round 10). The communication growth problem is structurally solved by using a fixed-size LSTM model rather than accumulating GBM trees. The per-round overhead is approximately 3× that of the flat Baabdullah baseline, an acceptable trade-off given that HCFL’s 0.294 MB excess is attributable entirely to the CBFT cross-verification transfers that provide Byzantine fault tolerance absent in the baseline.

RQ2 (Blockchain-Enforced CBFT): The three-phase CBFT protocol (Propose → Verify → Commit) provides quorum-based Byzantine fault tolerance with O(N) message complexity (vs. O(N²) for classical PBFT), implemented as auditable on-chain chaincode. The protocol prevents self-certification, detects model tampering via SHA-256 double-hash verification (CID + independent ledger hash), and provides replay attack protection via CID state tracking. All of this runs within the live-mode latency budget of <120 s/round.

RQ3 (DP + Hierarchical FL): Differential Privacy (clip=1.0, noise_multiplier=0.05) is successfully integrated into the local training step without collapsing model utility. The system achieves an average F1 Score of 0.830 and PR-AUC of 0.778 over 10 rounds — competitive with or superior to both baselines in precision-critical metrics — demonstrating that DP and hierarchical FL are compatible at the chosen noise scale.

The primary differentiating contribution of HCFL over existing work is the simultaneous and co-optimised treatment of communication efficiency, latency, Byzantine security, differential privacy, and blockchain auditability in a unified, production-oriented framework. No existing published work in the FL-for-CCFD domain combines all five of these properties in a single system.