🧬 Final Year Project, University of Peradeniya

Data-Driven Methods for Comparative Metagenomics

Systematic benchmarking of dimensionality reduction techniques and GNN-based clustering specifically optimised for high-dimensional, sparse, compositional metagenomic data.

Project Abstract

Bridging the gap in metagenomics analysis through data-driven methods

Metagenomic studies generate high-dimensional, sparse, and compositional datasets that challenge traditional analytical methods. This project systematically benchmarks 8 dimensionality reduction (DR) methods across 3 diverse metagenomic datasets (Human Gut, Ocean, Potato Soil) and validates Graph Neural Network (GNN) architectures for unsupervised microbial community clustering. Our goal is to provide evidence-based recommendations for the metagenomics community and develop a unified preprocessing-to-evaluation pipeline.

⚡ Extreme Sparsity (90–99% zeros) 🔗 Compositionality Constraints 🌿 Phylogenetic Structure

8 DR Methods Compared
3 Diverse Datasets
24+ Visualizations Generated

Research Phases

A two-phase approach to comparative metagenomics analysis

Phase 1

📊 DR Benchmarking

✓ Completed
  • Datasets: Human Gut (3.6K samples), Ocean (139 samples), Potato Soil (885 features)
  • Methods: PCA, UniFrac PCoA, MDS, t-SNE, UMAP, PaCMAP, PHATE, SONG
  • Metrics: Trustworthiness, Continuity, UniFrac/Bray-Curtis/Aitchison correlations
  • Finding: No universally best method; performance depends on the dataset and the metric
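
As an illustration of the structure-preservation metrics listed above, here is a minimal sketch using scikit-learn's `trustworthiness`. The random data, the choice of PCA, and the neighbourhood size `k = 10` are assumptions made for the example, not the project's actual settings. Continuity is obtained here by swapping the roles of the original and embedded spaces.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
# Synthetic stand-in for a transformed abundance table (samples x features).
X = rng.normal(size=(60, 20))

# Any DR method could be used here; PCA is chosen only because it is deterministic.
embedding = PCA(n_components=2, random_state=0).fit_transform(X)

k = 10  # neighbourhood size (assumed value)

# Trustworthiness penalises points that are neighbours in the embedding
# but were not neighbours in the original space.
trust = trustworthiness(X, embedding, n_neighbors=k)

# Continuity is the mirror image: swap the two spaces so that neighbours
# lost from the original space are penalised instead.
cont = trustworthiness(embedding, X, n_neighbors=k)

print(f"trustworthiness={trust:.3f} continuity={cont:.3f}")
```

Both scores lie in [0, 1], with higher values indicating better neighbourhood preservation.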
Phase 2

🔬 GNN Clustering

◉ In Progress
  • Graphs: KNN cosine graph built on nzCLR-transformed features
  • Architectures: DMoN, MinCutPool + K-Means across 8 DR embeddings
  • Evaluation: NMI, ARI, Silhouette score, Stability analysis
  • Goal: Validate GNN clustering vs. traditional methods on 3 real metagenomic datasets
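
The evaluation metrics named above can be sketched with scikit-learn as follows. The synthetic 2-D blobs stand in for a DR embedding with known group labels, and K-Means stands in for any clustering method. The stability measure shown (mean ARI between runs with different seeds) is one plausible reading of "stability analysis", assumed here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

rng = np.random.default_rng(0)
# Three well-separated synthetic groups standing in for a 2-D embedding.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_labels = np.repeat(np.arange(3), 30)
Z = centers[true_labels] + rng.normal(scale=0.5, size=(90, 2))

def cluster(points, seed):
    """Cluster the embedding with K-Means (hypothetical helper for this example)."""
    return KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(points)

pred = cluster(Z, seed=0)

# External metrics compare predicted clusters with known labels.
nmi = normalized_mutual_info_score(true_labels, pred)
ari = adjusted_rand_score(true_labels, pred)
# The silhouette score is internal: it needs only the data and the clustering.
sil = silhouette_score(Z, pred)

# Stability (assumed definition): mean agreement with reruns using other seeds.
reruns = [cluster(Z, seed=s) for s in range(1, 4)]
stability = float(np.mean([adjusted_rand_score(pred, r) for r in reruns]))

print(f"NMI={nmi:.3f} ARI={ari:.3f} silhouette={sil:.3f} stability={stability:.3f}")
```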

Research Pipeline

End-to-end workflow from raw data to actionable insights

🧹

Preprocessing

nzCLR transformation & Jaccard/Bray-Curtis distance computation to handle compositionality
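
A minimal NumPy sketch of this preprocessing step follows. It assumes nzCLR means a centred log-ratio computed over the non-zero entries of each sample, with zeros left at zero; the project's exact definition may differ. The helper names `nz_clr` and `bray_curtis` and the toy count table are hypothetical.

```python
import numpy as np

def nz_clr(counts):
    """CLR over the non-zero entries of each sample (row); zeros stay zero."""
    counts = np.asarray(counts, dtype=float)
    mask = counts > 0
    logs = np.zeros_like(counts)
    np.log(counts, out=logs, where=mask)
    n_nonzero = mask.sum(axis=1, keepdims=True)
    # Mean log over non-zero entries; the maximum guards all-zero samples.
    mean_log = logs.sum(axis=1, keepdims=True) / np.maximum(n_nonzero, 1)
    return np.where(mask, logs - mean_log, 0.0)

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between samples (rows)."""
    counts = np.asarray(counts, dtype=float)
    diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
    total = (counts[:, None, :] + counts[None, :, :]).sum(axis=2)
    return np.divide(diff, total, out=np.zeros_like(diff), where=total > 0)

# Toy count table: 3 samples x 4 taxa, with many zeros.
counts = np.array([
    [10, 0, 5, 0],
    [0, 3, 0, 7],
    [4, 4, 0, 0],
])
X = nz_clr(counts)       # transformed features; non-zero entries in each row sum to 0
D = bray_curtis(counts)  # 3 x 3 dissimilarity matrix with a zero diagonal
```

Samples 0 and 1 share no taxa, so their Bray-Curtis dissimilarity is exactly 1.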

Dimensionality Reduction

8 methods: PCA, PCoA, MDS, t-SNE, UMAP, PaCMAP, PHATE, SONG
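
Of the methods listed, PCoA is compact enough to sketch directly: it is classical multidimensional scaling applied to a precomputed distance matrix. The version below is an illustrative NumPy implementation, not the project's code. It uses Euclidean distances between random points as a stand-in for UniFrac or Bray-Curtis distances, and the function name `pcoa` is hypothetical.

```python
import numpy as np

def pcoa(D, n_components=2):
    """Classical PCoA: double-centre the squared distances, then eigendecompose."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gower-centred matrix
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Non-Euclidean distances can give negative eigenvalues; clip them to zero.
    top = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top)

rng = np.random.default_rng(0)
points = rng.normal(size=(15, 2))
D = np.linalg.norm(points[:, None] - points[None, :], axis=2)

coords = pcoa(D, n_components=2)
D_embedded = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
```

Because the input distances here come from 2-D points, a 2-component PCoA reproduces them up to rotation and reflection. Real ecological distances are generally not Euclidean, so some distortion is expected.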

🕸️

GNN Clustering

KNN cosine graph construction & GNN architectures (DMoN, MinCutPool) + K-Means
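
The graph construction step might look like the following NumPy sketch. The value `k = 3`, the symmetrisation rule (keep an edge if either endpoint selects it), and the helper name `knn_cosine_edges` are assumptions for illustration. The returned 2 x E array follows the `edge_index` layout used by PyTorch Geometric.

```python
import numpy as np

def knn_cosine_edges(X, k):
    """Symmetric edge list (2 x E) linking each row to its k most cosine-similar rows."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, 1e-12)  # guard against all-zero rows
    sim = unit @ unit.T                  # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)       # exclude self-loops
    neighbours = np.argsort(-sim, axis=1)[:, :k]

    n = X.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), neighbours.ravel()] = True
    adj |= adj.T  # symmetrise: keep an edge if either endpoint picked it
    return np.vstack(np.nonzero(adj))

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 6))  # stand-in for nzCLR-transformed features
edge_index = knn_cosine_edges(X, k=3)
print(edge_index.shape)
```

After symmetrisation every node has at least `k` neighbours, and some have more.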

📈

Evaluation

Trustworthiness, Continuity, NMI, ARI, Silhouette & cross-dataset correlation analysis

Key Results

Highlights from Phase 1 DR benchmarking across 3 datasets

  • Trustworthiness: 0.927 (best method: SONG)
  • UniFrac correlation: 0.999 (best method: PCoA)
  • Visualizations: 24+ (3 datasets × 8 methods)

💡 Key Insights

  • No single DR method universally outperforms all others across every dataset and metric
  • UniFrac PCoA excels at preserving phylogenetic distance structure (correlation up to 0.999)
  • PaCMAP and SONG achieve the best balance between local and global structure preservation
  • GNN-based clustering (KNN+DMoN) captures microbial groupings beyond what flat K-Means achieves
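
A distance-preservation correlation such as the UniFrac figure above can be computed by comparing pairwise distances before and after embedding. The sketch below uses Pearson correlation over the upper triangle of the two distance matrices. Whether the project used Pearson, Spearman, or a Mantel test is not stated, so that choice is an assumption, as are the function name `distance_preservation` and the Euclidean stand-in for UniFrac.

```python
import numpy as np

def distance_preservation(D_ref, embedding):
    """Pearson correlation between reference distances and embedding distances."""
    D_ref = np.asarray(D_ref, dtype=float)
    Z = np.asarray(embedding, dtype=float)
    D_emb = np.linalg.norm(Z[:, None] - Z[None, :], axis=2)
    iu = np.triu_indices(D_ref.shape[0], k=1)  # each pair once, diagonal excluded
    return float(np.corrcoef(D_ref[iu], D_emb[iu])[0, 1])

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
D = np.linalg.norm(points[:, None] - points[None, :], axis=2)

# Uniform scaling preserves all distance ratios, so the correlation stays at 1.
r_scaled = distance_preservation(D, 2.0 * points)
# Dropping one axis discards some structure, so the correlation falls below 1.
r_projected = distance_preservation(D, points[:, :2])
```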

Research Highlights

Conferences, expositions, and collaborative work

πŸ†

ICIPROB 2025 β€” International Conference

International Conference on Image Processing & Robotics β€” Research paper presentation on metagenomics clustering using GNN-based methods

🧬

Extensive Work: Bio Fusion

Collaborative research initiative integrating biological data science and computational biology methods for comprehensive metagenomics analysis

📋

ICIPROB: Research Poster

Peradeniya University International Research Sessions & Exposition: poster presentation on data-driven metagenomics

Tech Stack

Tools and frameworks powering our research

🧹 Preprocessing

  • nzCLR Transformation
  • Jaccard / Bray-Curtis Distances
  • Aitchison Geometry

πŸ“ Dim. Reduction

  • scikit-learn
  • umap-learn
  • PaCMAP / PHATE / SONG

πŸ•ΈοΈ GNN Frameworks

  • PyTorch Geometric
  • DMoN / MinCutPool
  • NetworkX

📊 Visualization

  • Matplotlib
  • Seaborn
  • Plotly

Team & Supervisors

Department of Computer Engineering, University of Peradeniya

Project Team

Team with Dr. Damayanthi Herath

👨‍💻 Team Members

Jananga T.G.C.
E/20/158

Prasadinie H.A.M.T.
E/20/300

Malshan P.G.P.
E/20/244

🎓 Research Supervisors

Dr. Damayanthi Herath
University of Peradeniya
damayanthiherath@eng.pdn.ac.lk

Dr. Rajith Vidanaarachchi
University of Melbourne
rajith.v@unimelb.edu.au

Dr. Vijini Mallawaarachchi
Flinders University
vijini.mallawaarachchi@flinders.edu.au

Reports & Papers

Academic deliverables and research outputs

📊

Final Year Project Presentation

Full project presentation covering research motivation, GNN pipeline, evaluation results and conclusions

📥 Download Presentation (PDF)

🏛️

Research Poster

Poster presented at International Conference on Image Processing & Robotics

πŸ–ΌοΈ View Poster
πŸ†

ICIPROB 2025: Conference Paper

International Conference on Image Processing & Robotics: paper on GNN-based metagenomics clustering