Bridging the gap between encrypted traffic analysis, explainable AI, and automated zero-trust policy enforcement.
CO425 — Final Year Project II | Department of Computer Engineering | University of Peradeniya
Modern cybersecurity is shifting toward encryption to protect data privacy, but this often blinds traditional Intrusion Detection Systems (IDS) that rely on payload inspection. Concurrently, the rise of cloud computing and remote work has made perimeter-based security obsolete, leading to the adoption of Zero-Trust Architecture (ZTA), which requires continuous verification of every entity.
While Deep Learning models can detect anomalies in encrypted traffic without decryption by analyzing metadata, their "black-box" nature creates a trust deficit that hinders automated policy enforcement. This project proposes a framework integrating Encrypted Traffic Analysis (ETA) with Explainable AI (XAI) using SHAP and Deep Dictionary Learning to provide real-time, human-readable rationales for security decisions.
The proposed framework utilizes a multi-stage zero-trust pipeline:
Raw PCAP streams are processed through an AES-256-GCM encryption simulator. A hybrid DPKT + NFStream extractor yields 15 CIC-IDS-2017 behavioral features — all from unencrypted metadata (no payload inspection).
A lightweight Decision Tree classifier (entropy criterion, class weight 50:1 attack bias) performs an initial fast check. Flows classified as Normal are forwarded immediately; flagged flows proceed to deep analysis.
A two-layer DDL model with ISTA sparse coding learns atomic dictionary representations of normal traffic. Flows with high reconstruction error (exceeding the learned threshold) are classified as anomalous.
For every anomaly, two explanation methods fire: DDL-native per-feature reconstruction error decomposition, and SHAP KernelExplainer perturbation-based attributions — producing human-readable rationales for SOC analysts.
Anomalous flows are held in an OpenFlow-style SDN buffer while the explanation is computed. After analysis: DROP confirmed threats, FORWARD cleared flows. Decisions feed back into the ZTA Policy Engine.
A Streamlit dashboard provides real-time visibility into detection events, explanation summaries, and pipeline statistics (TP/TN/FP/FN, latency, throughput).
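The staged routing logic described above can be sketched as follows. This is a minimal illustration with hypothetical stand-in models (the real orchestrator lives in ZeroTrustPipeline/ and is fail-closed):

```python
class StubDT:
    """Stand-in for the trained Decision Tree pre-filter (hypothetical toy rule)."""
    def predict(self, X):
        # flag any flow whose first feature exceeds 10 (illustration only)
        return [1 if x[0] > 10 else 0 for x in X]

class StubDDL:
    """Stand-in for the two-layer DDL detector (hypothetical toy error)."""
    def reconstruction_error(self, x):
        return sum(v * v for v in x) ** 0.5

def route_flow(features, dt, ddl, threshold):
    """Stage 1: fast DT check. Stage 2: DDL reconstruction-error test on flagged flows."""
    if dt.predict([features])[0] == 0:
        return "FORWARD"                    # classified Normal, forwarded immediately
    # flagged flow: held in the SDN buffer while deep analysis runs
    error = ddl.reconstruction_error(features)
    return "DROP" if error > threshold else "FORWARD"
```

In the real pipeline the Stage 2 verdict also triggers the XAI explanation and feeds back into the ZTA Policy Engine.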
BCC v2: 28 features (Stage 1 gatekeeper) | DDL: 40 features (Stage 2, superset of BCC's 28)
| Directory | Purpose | Status |
|---|---|---|
| BaseCheckClassifier/ | Decision Tree classifier, feature extraction, encryption sim, dashboard | Active |
| DDLModel/ | Deep Dictionary Learning anomaly detector (two-layer ISTA) | Active |
| XAIExplainer/ | SHAP + DDL-native reconstruction explanation | Active |
| SDNBuffer/ | OpenFlow-style SDN flow buffer simulation | Active |
| ZeroTrustPipeline/ | Pipeline orchestrator — ties all modules together | Active |
| tests/ | Integration tests (27 sub-tests) | Active |
| ObsoleteExperiments/ | Archived early experiments (see below) | Archived |
| docs/ | This GitHub Pages site | Active |
The current DDL-based pipeline is the result of several iterative experiments. Each attempt taught us something that shaped the final design. Below is the complete timeline of approaches tried, their results, and why we moved on.
Approach: Ensemble of Isolation Forest + Autoencoder for unsupervised anomaly detection on the BCCC Darknet dataset. Pseudo-labels were generated via agreement voting — confidence = 1.0 when both models agreed a sample was anomalous. Top 50 features selected by variance from 467 numeric features.
| Metric | Value |
|---|---|
| Accuracy | 0.9949 |
| Precision | 0.7451 |
| Recall | 1.0000 |
| F1-Score | 0.8539 |
| ROC-AUC | 0.9996 |
| 5-Fold CV (ROC-AUC) | 0.9991 ± 0.0003 |
Confusion Matrix: TN=5,006 FP=26 FN=0 TP=76
Dataset: 25,588 samples — only 379 high-confidence anomalies (1.48%), class imbalance ratio 1:66.4
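The agreement-voting scheme behind the pseudo-labels can be sketched as follows (hypothetical function and threshold names; the archived notebooks contain the actual implementation):

```python
import numpy as np

def agreement_pseudo_labels(if_scores, ae_errors, if_thresh, ae_thresh):
    """Pseudo-label a sample anomalous only when BOTH detectors agree."""
    if_flag = np.asarray(if_scores) > if_thresh   # Isolation Forest verdict
    ae_flag = np.asarray(ae_errors) > ae_thresh   # Autoencoder verdict
    labels = (if_flag & ae_flag).astype(int)      # 1 = high-confidence anomaly
    # agreement -> confidence 1.0 (per the scheme above); the 0.5 value
    # for disagreement is a hypothetical placeholder
    confidence = np.where(if_flag == ae_flag, 1.0, 0.5)
    return labels, confidence
```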
Approach: 3-stage notebook pipeline — (1) IF+AE pseudo-labeling, (2) Random Forest classifier (300 trees, max_depth=15, balanced class weights), (3) SHAP TreeExplainer for explainability. Same BCCC Darknet dataset and 50 anonymous features.
| Metric | Value |
|---|---|
| Accuracy | 0.9945 |
| Precision | 0.7300 |
| Recall | 0.9865 |
| F1-Score | 0.8391 |
| ROC-AUC | 0.9994 |
| 5-Fold CV (ROC-AUC) | 0.9991 ± 0.0003 |
Confusion Matrix: TN=5,007 FP=27 FN=1 TP=73
SHAP outputs: bar, beeswarm, waterfall, and dependence plots on anonymous features
Approach: Switched datasets to CIC-IDS-2017 with 15 hand-selected behavioral features. Trained a Decision Tree (entropy criterion, max_depth=15, 50:1 attack class weight) as a "zero-leak" classifier. Integrated into a full simulation pipeline with encryption, topology simulation, and a Streamlit dashboard.
Approach: Two-layer Deep Dictionary Learning with ISTA sparse coding as the anomaly detection backbone, combined with DDL-native reconstruction decomposition + SHAP KernelExplainer for dual-mode explainability. The Decision Tree serves as a lightweight pre-filter, and an SDN buffer holds suspicious flows during analysis.
| Aspect | Exp 1: IF+AE | Exp 2: RF+SHAP | Exp 3: DT (CIC) | Current: DDL+XAI |
|---|---|---|---|---|
| Dataset | BCCC Darknet | BCCC Darknet | CIC-IDS-2017 | CIC-IDS-2017 |
| Features | 50 anonymous | 50 anonymous | 15 named | 15 named |
| Labels | Pseudo (unsupervised) | Pseudo (unsupervised) | Ground truth | Ground truth |
| Core Model | Isolation Forest + AE | Random Forest | Decision Tree | DDL (ISTA) |
| Explainability | None | SHAP (anonymous) | Inherent (DT rules) | DDL-native + SHAP |
| Zero-Trust Integration | No | No | Yes (DT only) | Yes (full pipeline) |
| SDN Buffer | No | No | No | Yes |
| Precision | 0.7451 | 0.7300 | 0.875 (BCC) / 0.699 (DDL) | 0.9363 (full pipeline) |
| Recall | 1.0000 | 0.9865 | 0.999 (BCC / Sandaru data) | 0.4537 (DDL standalone) |
| FPR | — | — | 0.78% | 0.25% (full pipeline) |
| Status | Archived | Archived | Pre-filter (Stage 1) | Active ✓ Tested |
The current zero-trust pipeline is fully implemented as modular Python packages, with 27/27 integration tests passing. Below are the technical specifics of each component.
| Component | Specification |
|---|---|
| Input | 40 CIC-IDS-2017 features — superset of BCC's 28 (Z-score normalised) |
| Layer 1 Dictionary | D₁ ∈ ℝ^(40×64) — captures coarse flow patterns |
| Layer 2 Dictionary | D₂ ∈ ℝ^(64×128) — captures subtle micro-patterns |
| Sparse Coding | ISTA (Iterative Shrinkage-Thresholding Algorithm), 50 iterations per layer |
| Sparsity Penalty (λ) | 0.1 (L1 regularisation on sparse codes) |
| Dictionary Update | Mini-batch gradient descent with column-wise unit-norm projection |
| Training Epochs | 150 (batch size = 512, GPU: RTX 6000 Ada, ~1h 45min) |
| Training samples | 1,682,457 normal flows (CIC-IDS-2017 TRAIN) |
| Anomaly Threshold | 0.7597 (95th percentile of training reconstruction error) |
The model trains only on benign traffic — an unsupervised approach that avoids the need for labelled attack samples. During training, dictionaries D₁ and D₂ learn to efficiently represent normal flow patterns via alternating sparse coding (ISTA) and dictionary update (gradient descent with column normalisation). After training, an anomaly threshold is set at the 95th percentile of reconstruction errors on the training set. At inference time, any flow whose reconstruction error exceeds this threshold is flagged as anomalous.
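A single-layer sketch of this machinery, assuming the conventions above (λ = 0.1, 50 ISTA iterations, 95th-percentile threshold). The production model stacks two such layers and updates the dictionaries by mini-batch gradient descent; the names here are illustrative:

```python
import numpy as np

def ista(X, D, lam=0.1, n_iter=50):
    """ISTA sparse coding: minimise ||X - Z D||² + λ||Z||₁.
    X: (n, d) flows; D: (k, d) dictionary with atoms as rows; returns codes Z: (n, k)."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth term
    Z = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(n_iter):
        grad = (Z @ D - X) @ D.T             # gradient of the reconstruction term
        Z = Z - grad / L                     # gradient step
        Z = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft threshold
    return Z

def fit_threshold(X_benign, D, percentile=95):
    """Anomaly threshold = 95th percentile of reconstruction error on benign flows."""
    Z = ista(X_benign, D)
    errors = np.linalg.norm(X_benign - Z @ D, axis=1)
    return np.percentile(errors, percentile)
```

At inference time a flow is flagged when its reconstruction error exceeds the value returned by `fit_threshold`.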
Every anomaly decision is accompanied by a detailed explanation through two complementary strategies:
Both strategies produce a composite report suitable for SOC analyst dashboards and automated policy audit trails. The human-readable interpretation includes feature rankings, deviation magnitudes, and a recommended action (DROP / FORWARD).
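The DDL-native strategy's per-feature decomposition can be sketched as follows (a hypothetical helper that ranks features by their contribution to the squared reconstruction error):

```python
import numpy as np

def per_feature_error(x, x_hat, feature_names, top_k=3):
    """Rank features by their share of the squared reconstruction error."""
    contrib = (np.asarray(x) - np.asarray(x_hat)) ** 2   # per-feature error
    order = np.argsort(contrib)[::-1][:top_k]            # largest contributors first
    return [(feature_names[i], float(contrib[i])) for i in order]
```

The top-ranked features and their deviation magnitudes feed directly into the human-readable report shown to SOC analysts.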
| Parameter | Value | Description |
|---|---|---|
| Max Buffer Size | 1,000 streams | Max concurrent flows held for analysis |
| Timeout | 5,000 ms | Auto-release if DDL analysis takes too long |
| Actions | BUFFER → RELEASE / DROP | Mirrors OpenFlow OFPT_PACKET_IN / OFPT_FLOW_MOD |
| Expiry Policy | Auto-release with warning | Fail-open on timeout to prevent denial of service |
In a real deployment, this module would be replaced by actual SDN controller commands (e.g., OpenDaylight or ONOS). The simulation faithfully tracks buffer state, hold times, and capacity — providing realistic latency measurements for the pipeline evaluation.
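A minimal sketch of the buffer semantics described above (hypothetical class and method names; the actual module is in SDNBuffer/):

```python
import time

class SDNBuffer:
    """OpenFlow-style buffer sketch: hold flagged flows, fail-open on timeout."""
    def __init__(self, max_size=1000, timeout_ms=5000):
        self.max_size, self.timeout_ms = max_size, timeout_ms
        self.held = {}                                   # flow_id -> hold start time

    def buffer(self, flow_id):
        """Mirror of OFPT_PACKET_IN: start holding a flagged flow."""
        if len(self.held) >= self.max_size:
            return False                                 # capacity exceeded
        self.held[flow_id] = time.monotonic()
        return True

    def resolve(self, flow_id, verdict):
        """Mirror of OFPT_FLOW_MOD: RELEASE or DROP; timed-out flows auto-release."""
        held_ms = (time.monotonic() - self.held.pop(flow_id)) * 1000
        if held_ms > self.timeout_ms:
            return "RELEASE"                             # fail-open, logged with warning
        return "DROP" if verdict == "anomalous" else "RELEASE"
```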
The ZeroTrustPipeline ties all components together with a fail-closed design:
- Parallel flow analysis via ThreadPoolExecutor
- run_batch() processes multiple pcaps and exports a full JSON log for analysis

Features were selected from CIC-IDS-2017's 78 columns using three criteria: (1) not null across all 5 days, (2) high ANOVA F-score between Normal vs Attack classes, (3) computable from raw network metadata (no payload inspection — works on encrypted traffic). The two feature sets are cumulative — DDL's 40 is a superset of BCC's 28.
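The selection criteria can be sketched with scikit-learn's ANOVA F-test (a simplified illustration, not the project's actual selection script):

```python
import numpy as np
from sklearn.feature_selection import f_classif

def select_features(X, y, names, k=15):
    """Rank candidate features by ANOVA F-score between Normal and Attack classes."""
    keep = ~np.isnan(X).any(axis=0)          # criterion (1): drop columns with nulls
    F, _ = f_classif(X[:, keep], y)          # criterion (2): class separability
    kept_names = [n for n, m in zip(names, keep) if m]
    order = np.argsort(F)[::-1][:k]          # highest F-score first
    return [kept_names[i] for i in order]
```

Criterion (3), computability from unencrypted metadata, is a manual check on each candidate column rather than something the code can decide.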
| # | Feature | Category | Used in | Rationale |
|---|---|---|---|---|
| 1 | Packet Length Variance | Packet Size | BCC + DDL | High variance → unusual payload distribution (DDoS uses uniform sizes) |
| 2 | Fwd Packet Length Max | Packet Size | BCC + DDL | Abnormally large fwd packets → exfiltration |
| 3 | Fwd Header Length | Header | BCC + DDL | Header padding/manipulation is a common evasion technique |
| 4 | Init_Win_bytes_forward | TCP Window | BCC + DDL | Unusual initial window sizes → scanning tools (e.g. nmap) |
| 5 | Bwd Header Length | Header | BCC + DDL | Asymmetric header sizes → protocol misuse / C2 |
| 6 | Total Length of Fwd Packets | Volume | BCC + DDL | Abnormal volume → flooding or data exfiltration |
| 7 | Init_Win_bytes_backward | TCP Window | BCC + DDL | Mismatched backward window → C2 traffic signature |
| 8 | Bwd Packets/s | Rate | BCC + DDL | High backward rate → DDoS or amplification attack |
| 9 | Flow IAT Min | Timing | BCC + DDL | Machine-generated traffic has unnaturally regular timing |
| 10 | Fwd IAT Min | Timing | BCC + DDL | Forward inter-arrival times reveal automated scanning |
| 11 | Flow Bytes/s | Throughput | BCC + DDL | Sudden throughput spikes → data exfiltration or flooding |
| 12 | Active Min | Activity | BCC + DDL | Short active bursts → bot behaviour / beaconing |
| 13 | Bwd IAT Total | Timing | BCC + DDL | Total backward inter-arrival → response pattern analysis |
| 14 | Flow IAT Max | Timing | BCC + DDL | Long idle gaps between bursts → C2 beaconing |
| 15 | Flow Duration | Duration | BCC + DDL | Abnormally short or long flows → scanning or tunnelling |
| 16 | Total Fwd Packets | Volume | BCC + DDL | Packet count asymmetry reveals scanning patterns |
| 17 | Total Backward Packets | Volume | BCC + DDL | Low backward count with high forward → one-way flooding |
| 18 | Fwd Packet Length Mean | Packet Size | BCC + DDL | Average size compared to variance reveals payload consistency |
| 19 | Bwd Packet Length Mean | Packet Size | BCC + DDL | Backward size profile distinguishes scan responses |
| 20 | Fwd Packet Length Std | Packet Size | BCC + DDL | Low Std + high rate = tool-generated uniform traffic |
| 21 | Bwd Packet Length Max | Packet Size | BCC + DDL | Oversized backward packets → data theft response |
| 22 | Flow IAT Mean | Timing | BCC + DDL | Average timing between packets reveals automation |
| 23 | Flow IAT Std | Timing | BCC + DDL | Very low Std = machine-generated, very high = irregular |
| 24 | Fwd IAT Total | Timing | BCC + DDL | Total forward idle time — long = slow-rate attacks |
| 25 | Fwd Packets/s | Rate | BCC + DDL | High fwd rate without proportional payload = flooding |
| 26 | Down/Up Ratio | Asymmetry | BCC + DDL | Unusual download:upload ratio → exfiltration or scanning |
| 27 | SYN Flag Count | TCP Flags | BCC + DDL | Flood of SYN packets = SYN flood DDoS |
| 28 | RST Flag Count | TCP Flags | BCC + DDL | High RST count = port scanning resets |
| 29 | Bwd Packet Length Min | Packet Size | DDL only | Minimum backward pkt size — DDoS sends identical tiny ACKs |
| 30 | Bwd Packet Length Max | Packet Size | DDL only | Full backward size profile for dictionary reconstruction quality |
| 31 | Flow IAT Mean | Timing | DDL only | Provides mean for DDL's pattern reconstruction |
| 32 | Flow IAT Std | Timing | DDL only | IAT variability is key for DDL to encode normal timing patterns |
| 33 | Fwd IAT Total | Timing | DDL only | Forward idle time integral — slow-rate attacks show elevated values |
| 34 | Bwd IAT Min | Timing | DDL only | Fast backward bursts = scanning or amplification signatures |
| 35 | Fwd Packets/s | Rate | DDL only | Forward packet rate for DDL's flow-speed dictionary |
| 36 | Bwd Packets/s | Rate | DDL only | Backward packet rate — amplification attacks show extreme ratio |
| 37 | Fwd Header Length.1 | Header | DDL only | Cumulative header overhead — anomalous for tunnelling |
| 38 | Active Min | Activity | DDL only | Shortest active period — bots have very short minimum active windows |
| 39 | ACK Flag Count | TCP Flags | DDL only | ACK flood is a common DDoS variant — DDL needs this for flag profiling |
| 40 | URG Flag Count | TCP Flags | DDL only | Urgent flags rarely appear in normal traffic — clear anomaly indicator |
All 40 features are metadata-only — no payload inspection, fully compatible with TLS-encrypted traffic. Features 1–28 are shared by BCC and DDL; features 29–40 are DDL-only additions that improve reconstruction fidelity.
| Component | Tests | Status |
|---|---|---|
| DDL model training (150 epochs, GPU) | — | ✓ Completed |
| DDL + IF standalone CSV inference | 531K rows | ✓ Completed |
| BCC on Sandaru's test_raw.csv | 52K rows | ✓ Completed — 99.89% recall |
| Full two-stage pipeline (CSV) | 50K rows | ✓ Completed — 93.6% precision |
| PCAP pipeline evaluation | 128 labeled flows | ✓ Completed — see Results |
| XAI (LIME + SHAP) explanations | 5 anomaly flows | ✓ Completed |
| Live switch / physical hardware | — | Planned |
Comprehensive evaluation was conducted on the CIC-IDS-2017 dataset — both CSV (bulk inference) and real labeled PCAP flows. All tests were run on the ada.ce.pdn.ac.lk server.
| Model | Accuracy | Precision | Recall | F1 | FPR | Latency/flow |
|---|---|---|---|---|---|---|
| BCC v2 (raw CSV) | 83.24% | 87.50% | 21.25% | 34.19% | 0.78% | 0.05 µs |
| BCC v2 (Sandaru's test data) | 98.65% | 96.31% | 99.89% | 98.07% | 2.00% | 0.05 µs |
| DDL-40 (standalone) | 84.81% | 69.94% | 45.37% | 55.04% | 5.03% | 133 µs |
| Isolation Forest (standalone) | 82.06% | 62.59% | 31.00% | 41.46% | 4.77% | 2.83 µs |
| Full Pipeline (BCC → DDL+IF) | 82.19% | 93.63% | 14.05% | 24.44% | 0.25% | ~8 µs avg |
Note: BCC recall is 99.89% on Sandaru's preprocessed data format (the format it was trained on). On raw CIC-IDS-2017 CSV, recall drops to 21.25% due to different feature scaling — the model itself is correct. The full pipeline achieves 93.6% precision with only 0.25% FPR — every DROP is very likely a real attack.
- Only 20 attacks leaked through BCC
- 98 false blocks / 39,754 normal flows = 0.25% FPR
| Stage | CSV Mode (µs) | PCAP Mode (µs) | Applies To | Notes |
|---|---|---|---|---|
| Feature Extraction | ~0 (pre-extracted) | 3,257 µs (3.26 ms) | All flows | PCAP parsing overhead (dpkt) |
| BCC Inference | 0.05 µs | 122 µs | All flows | Decision Tree predict_proba |
| DDL Inference | 133 µs | 4,317 µs (4.3 ms) | Flagged only (~5%) | 2-layer ISTA reconstruction |
| IF Inference | 2.83 µs | 5,616 µs (5.6 ms) | Flagged only (~5%) | Isolation Forest scoring |
| Total Pipeline | ~8 µs avg | ~3,853 µs avg (3.9 ms) | All flows | PCAP mode dominated by parsing |
Key insight: In a real SDN deployment, features would be extracted directly from the OpenFlow PACKET_IN event (not from a PCAP file) — bringing feature extraction time much closer to the CSV-mode values. The PCAP evaluation therefore shows worst-case latency when reading stored captures.
Both LIME and SHAP independently explain the same flow. The convergence of explanations provides high confidence in the detection.
Cross-validation: DDL and IF both point to fwd_iat_total as the top suspicious feature — this convergence is the rationale for dual-XAI verification.
XAI timing: DDL-LIME = 44ms | IF-LIME = 20ms per flow.
| Parameter | Plan |
|---|---|
| Training Data | CIC-IDS-2017 "Monday — WorkingHours" (benign-only, ~529K flows) |
| Validation Split | 80/20 train/validation on benign data |
| Feature Normalisation | Z-score from training set (stored in model) |
| Convergence Criterion | Validation reconstruction error plateau (< 0.1% improvement over 20 epochs) |
| Threshold Tuning | Sweep 90th, 95th, 97th, 99th percentile on validation set |
Command: `python -m DDLModel.train_ddl --csv data/Monday-WorkingHours.csv --output models/ddl_cic.pkl`
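The planned threshold sweep could be implemented along these lines (a hypothetical helper operating on precomputed reconstruction errors):

```python
import numpy as np

def sweep_thresholds(val_errors, test_errors, test_labels, percentiles=(90, 95, 97, 99)):
    """Evaluate precision/recall at each candidate percentile threshold."""
    results = {}
    for p in percentiles:
        t = np.percentile(val_errors, p)          # threshold from benign validation errors
        pred = np.asarray(test_errors) > t        # anomaly = error above threshold
        tp = int(np.sum(pred & (test_labels == 1)))
        fp = int(np.sum(pred & (test_labels == 0)))
        fn = int(np.sum(~pred & (test_labels == 1)))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        results[p] = (t, prec, rec)
    return results
```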
Test the trained DDL against all CIC-IDS-2017 attack categories:
| Attack Category | CIC-IDS-2017 Day | Expected Signature |
|---|---|---|
| Brute Force (FTP, SSH) | Tuesday | High Fwd IAT Min, abnormal Init_Win_bytes |
| DoS / DDoS (Hulk, Slowloris, GoldenEye) | Wednesday | Extreme Bwd Packets/s, Flow Bytes/s spikes |
| Web Attacks (XSS, SQL Injection) | Thursday AM | Unusual Fwd Packet Length Max, Header anomalies |
| Infiltration | Thursday PM | Long Flow Duration, irregular IAT patterns |
| Botnet (ARES) | Friday AM | Short Active Min bursts, C2 beaconing in Flow IAT Max |
| Port Scan | Friday PM | Very short flows, low Packet Length Variance |
| DDoS (LOIT) | Friday PM | Extreme volume in Total Length of Fwd Packets |
For each attack type, we will compute per-class detection rate and verify that the XAI explanations correctly identify the distinguishing features listed above.
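Per-class detection rate is simply recall restricted to one attack category; a minimal sketch:

```python
def per_class_detection(preds, labels, categories):
    """Detection rate (recall) per attack category; labels: 1 = attack."""
    rates = {}
    for cat in set(categories):
        idx = [i for i, c in enumerate(categories) if c == cat]
        attacks = [i for i in idx if labels[i] == 1]
        if not attacks:
            continue                              # no attacks of this type in the split
        detected = sum(1 for i in attacks if preds[i] == 1)
        rates[cat] = detected / len(attacks)
    return rates
```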
Primary evaluation metrics with target values:
| Metric | Formula | Target | Rationale |
|---|---|---|---|
| Accuracy | (TP + TN) / N | ≥ 0.95 | Overall correctness |
| Precision | TP / (TP + FP) | ≥ 0.85 | Minimise false alarms for SOC analysts |
| Recall | TP / (TP + FN) | ≥ 0.95 | Zero-trust: never miss an attack (critical) |
| F1-Score | 2 · P · R / (P + R) | ≥ 0.90 | Balance between precision and recall |
| ROC-AUC | Area under ROC curve | ≥ 0.97 | Threshold-independent discrimination |
| False Positive Rate | FP / (FP + TN) | ≤ 0.05 | Usability in production SDN |
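These formulas translate directly into code; plugging in Experiment 1's confusion matrix (TN=5,006, FP=26, FN=0, TP=76) reproduces its reported precision of 0.7451 and recall of 1.0:

```python
def metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics above from confusion-matrix counts."""
    n = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }
```

(ROC-AUC is threshold-independent and needs the full score distribution, not just the counts, so it is omitted here.)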
| Experiment | Variable | Range |
|---|---|---|
| Dictionary Size | n_atoms_l1 / n_atoms_l2 | {32, 64, 128} × {64, 128, 256} |
| Sparsity Weight (λ) | sparsity_weight | {0.01, 0.05, 0.1, 0.5} |
| Threshold Percentile | threshold_percentile | {90, 95, 97, 99} |
| ISTA Iterations | n_iter | {20, 50, 100} |
| Training Epochs | n_epochs | {50, 100, 200} |
| DT Pre-filter Impact | With vs. without DT | Binary comparison |
Each combination will be evaluated on the Phase 3 metrics. Results will be presented as heatmaps showing the precision-recall trade-off across parameter settings.
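The grid expands as a Cartesian product; a small sketch of how the sweep configurations could be enumerated (parameter names mirror the table):

```python
from itertools import product

# sub-grid matching the ablation table above
grid = {
    "n_atoms_l1": [32, 64, 128],
    "n_atoms_l2": [64, 128, 256],
    "sparsity_weight": [0.01, 0.05, 0.1, 0.5],
    "threshold_percentile": [90, 95, 97, 99],
}

def configs(grid):
    """Yield every hyperparameter combination for the ablation sweep."""
    keys = list(grid)
    for values in product(*grid.values()):
        yield dict(zip(keys, values))

# 3 x 3 x 4 x 4 = 144 runs for this sub-grid (ISTA iterations and
# epochs would multiply this further)
```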
| Measurement | Target | Method |
|---|---|---|
| DT pre-check latency | < 1 ms | Average over 10K samples |
| DDL inference latency | < 50 ms | Per-sample, including both layers + ISTA |
| SHAP explanation latency | < 500 ms | KernelExplainer on single sample (15 features) |
| End-to-end pipeline latency | < 600 ms | pcap → feature extraction → DT → DDL + SHAP → policy |
| SDN buffer hold time | < 1,000 ms | Time from BUFFER to RELEASE/DROP |
| Throughput | ≥ 100 flows/sec | Batch processing rate on CIC-IDS-2017 |
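A simple timing harness for the per-flow latency measurements (a sketch; the reported numbers come from the pipeline's own instrumentation):

```python
import time

def benchmark(fn, samples, warmup=100):
    """Average per-sample latency in microseconds over a batch of inputs."""
    for x in samples[:warmup]:
        fn(x)                                  # warm caches before timing
    start = time.perf_counter()
    for x in samples:
        fn(x)
    elapsed = time.perf_counter() - start
    return elapsed / len(samples) * 1e6        # microseconds per flow
```

Throughput follows as `1e6 / latency_us` flows per second for a single-threaded stage.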
| Phase | Task | Target Date | Deliverable |
|---|---|---|---|
| 1 | Train DDL on CIC-IDS-2017 benign data | Week 1 | Trained model + convergence curves |
| 2 | Attack detection per category | Week 2 | Per-class detection rates + confusion matrix |
| 3 | Full metrics computation | Week 2 | Accuracy, Precision, Recall, F1, ROC-AUC |
| 4 | Ablation studies | Weeks 3–4 | Hyperparameter sensitivity heatmaps |
| 5 | Explainability assessment | Week 4 | Faithfulness report + rank correlation |
| 6 | Latency benchmarks | Week 5 | Performance report + throughput analysis |
| Test Suite | Tests | Status |
|---|---|---|
| DDL Model — training, prediction, save/load | 7 | ✓ Pass |
| Intermediate Representations | 2 | ✓ Pass |
| XAI Explainer — DDL-native | 6 | ✓ Pass |
| SHAP Integration — KernelExplainer | 4 | ✓ Pass |
| Pipeline Flow — end-to-end with PCAPs | 4 | ✓ Pass |
| SDN Buffer — add/release/drop | 4 | ✓ Pass |
| Metric | Value |
|---|---|
| True Positives (attacks correctly dropped) | 2 |
| True Negatives (normal correctly forwarded) | 0 |
| False Positives (normal incorrectly dropped) | 2 |
| False Negatives (attacks missed) | 0 |
| Recall | 1.000 |
| F1-Score | 0.667 |
Note: The DDL model was trained on synthetic data in the demo. FP rate is expected to improve significantly when trained on real CIC-IDS-2017 benign traffic. The zero false-negative rate aligns with the zero-trust "never miss an attack" philosophy.
Decision: Anomaly
The DDL reconstruction error is 22,965,194x the normal threshold.
Primary anomalous features: Active Min,
Init_Win_bytes_backward,
Total Length of Fwd Packets
Recommendation: DROP stream and alert SOC analyst.
This project identifies Explainable AI as the missing piece needed to make AI-based detection usable in automated Zero-Trust systems. Through iterative experimentation — from pseudo-labeled BCCC Darknet data with anonymous features, through Random Forest + SHAP, to the current Deep Dictionary Learning architecture — we arrived at a design that provides:
Every anomaly comes with a per-feature reconstruction error breakdown and SHAP attributions, making the detection rationale auditable by security teams.
The DT pre-filter handles normal traffic instantly; only flagged flows undergo DDL + XAI analysis, keeping the pipeline feasible for high-throughput networks.
SDN buffer decisions feed directly into ZTA policy enforcement — enabling automated block, throttle, or step-up authentication without human intervention.
📝 Perera, C., Wanasinghe, J., Wijewardhana, S. et al. "Explainable AI-Driven Zero Trust Anomaly Detection for Encrypted Traffic" (2025/26). In preparation.