- ML interpretability:
- Interpretable ML for image data
- Fairness in AI
- Large Language Models and their applications:
- Parameter efficient fine tuning
- Intelligent Tutor based on LLMs
DEAR
DATA ENGINEERING AND RESEARCH GROUP
UNIVERSITY OF PERADENIYA
DEPARTMENT OF COMPUTER ENGINEERING, FACULTY OF ENGINEERING, UNIVERSITY OF PERADENIYA, SRI LANKA.
We are a research group consisting of faculty members, students, and external collaborators working to push the boundaries of data engineering and research. It’s a friendly space for all enthusiasts to share and learn about data engineering. We operate similarly to a reading group with cake!
Contact: damayanthiherath@eng.pdn.ac.lk
RECENT UPDATES
RESEARCH AREAS
- Machine learning for comparative analysis of metagenomic samples
- Biomarker identification for Alzheimer's disease using DNA sequencing data [1,2]
- Automated Protein function prediction using Machine Learning techniques
- Interpretable ML in inference from sequencing data [8]
- Recovering more sequences from long-reads sequencing experiments
- Although recent long-read sequencers produce longer reads than Illumina sequencing, their reads are less accurate, and more reads are discarded because of errors. In this work, we will develop a method to improve the yield of long reads from sequencing metagenomic samples.
- Comparative analysis of downstream analysis results when using reads thresholded at different quality levels: Q=20, Q=10, etc.
- Deep Learning based approach for error correction in quality reads (with Q20)
- Machine learning based approach for recovering more reads including the reads with lower quality threshold.
- Visualisation techniques for comparative analysis of metagenomic samples
- More powerful dimensionality reduction techniques than PCA exist for comparing metagenomic samples. In this work, we will analyze metagenomic samples gathered from multiple environments, using such methods for visualisation. Example datasets include soil samples, coral samples, and cancer-related samples.
- From relative abundance to absolute abundance
- Knowing the absolute abundance of species in a metagenomic sample is more useful than knowing only their relative abundance, particularly for comparative analysis across samples. This work aims to develop a method to estimate the absolute abundance of species in a metagenomic sample using DNA sequencing data.
- Work on effective segmentation techniques for cancer detection:
https://www.cancerimagingarchive.net/
Skills: Analytical skill, Programming, Machine Learning
- Computer Vision and ABM for handling human-elephant conflict
- Assisting crop management using Unmanned Aerial Vehicles (UAVs)
- Computer vision for structural health monitoring
- Computer Vision for minerals
- Use of computer vision techniques in precision agriculture
- Plant disease identification in a Sri Lankan context [3]
- Detecting subclinical mastitis in dairy cows from somatic cell count data using Machine Learning
- Machine Learning techniques for weather prediction [4]
- Effective use of ML for wetland monitoring.
- Inferring learner behaviors from LMS data
- Interpretable machine learning on learner behavior data [5]
- Use/effect of Large Language Models (LLMs) on education
- Machine learning for handling cache timing based attacks
- Agent based modeling for transportation
- Species diversity estimation from metagenomics data [6,7]
- Improving the resolution of DNA sequencing experiments
- Applications of Gene Expression Programming as an alternative to interpretable ML methods [4]
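The quality thresholds named in the long-read project above (Q=20, Q=10) are Phred scores, where Q = -10·log10(p) and p is the estimated per-base error probability, so Q=20 corresponds to a 1% error rate and Q=10 to a 10% error rate. The following is a minimal sketch of thresholding reads at different quality levels. The function names, the toy reads, and the choice to average error probabilities (rather than raw Q values) per read are assumptions for illustration, not the group's method.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_read_quality(phred_scores):
    """Mean read quality: average the per-base error probabilities,
    then convert the mean back to a Phred score (an assumed convention)."""
    mean_p = sum(phred_to_error_prob(q) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_p)


def filter_reads(reads, min_q):
    """Keep reads whose mean quality is at least min_q.

    reads: dict mapping a read name to its list of per-base Phred scores.
    """
    return {name: qs for name, qs in reads.items() if mean_read_quality(qs) >= min_q}


# Hypothetical toy reads, each with a uniform per-base quality.
reads = {
    "read_a": [30, 30, 30, 30],
    "read_b": [12, 12, 12, 12],
    "read_c": [5, 5, 5, 5],
}

kept_q20 = filter_reads(reads, 20)  # stricter threshold
kept_q10 = filter_reads(reads, 10)  # looser threshold recovers more reads

print(sorted(kept_q20))
print(sorted(kept_q10))
```

Lowering the threshold from Q=20 to Q=10 keeps more reads at the cost of a higher expected error rate, which is the trade-off the comparative analysis is meant to quantify.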
FEATURED RESEARCH

Forecasting electricity power generation of Pawan Danavi Wind Farm, Sri Lanka, using gene expression programming
This research introduces a forecasting model developed using Gene Expression Programming (GEP) to predict wind power generation at the Pawan Danavi wind farm in Sri Lanka. The model uses on-site wind speed and ambient temperature as inputs and demonstrates high accuracy (R² = 0.92). Unlike conventional machine learning methods, GEP provides a clear mathematical relationship between climatic factors and power output, making it especially useful for future projections under changing weather conditions. This is the first known application of GEP for wind power forecasting in Sri Lanka.

Chronic Kidney Disease Prediction Using Machine Learning Methods
This research presents a machine learning workflow for the early prediction of Chronic Kidney Disease (CKD) using clinical data. The proposed approach includes data preprocessing, missing value handling through collaborative filtering, and attribute selection. Among 11 evaluated models, the Extra Trees and Random Forest classifiers achieved the highest accuracy with minimal attribute bias. The study also emphasizes the importance of domain knowledge and practical data collection considerations in building reliable CKD prediction systems.

DeepSelectNet: deep neural network based selective sequencing for Oxford Nanopore sequencing
This research introduces DeepSelectNet, a deep learning-based method for accurate species classification using nanopore sequencing signals. Unlike existing approaches with variable accuracy across datasets (77–97%), DeepSelectNet directly classifies raw current signals using innovative preprocessing techniques and a regularized neural network architecture. The method significantly improves the reliability of selective sequencing, with promising applications in genomics and biodiversity studies.

Leveraging deep learning techniques for condition assessment of stormwater pipe network
This research presents a semi-automated, deep learning-based approach to improve the inspection of stormwater pipe networks, addressing the inefficiencies of traditional CCTV inspections. By integrating computer vision with YOLOv8 instance segmentation, the model accurately detects six defect types based on the WSA Code of Australia, achieving an mAP@0.5 of 0.92 (bounding boxes) and 0.90 (masks). Tested on footage from Banyule City Council, the system significantly reduces long-term costs and enables faster, more consistent assessments, giving local councils an effective tool for infrastructure condition monitoring.

Forecasting renewable energy for microgrids using machine learning
This study presents a 1D Convolutional Neural Network (CNN) model for forecasting solar and wind energy generation in microgrids. By using data from the UC San Diego microgrid and San Diego Airport weather records, the model addresses key operational challenges such as voltage and frequency stability. Compared to traditional statistical methods, the proposed model achieved up to 229× lower MSE and 24× lower MAE, demonstrating the effectiveness of deep learning in improving short-term renewable energy forecasting and microgrid management.
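The fold improvements quoted for the microgrid study (up to 229× in MSE, 24× in MAE) are ratios of a baseline's error to the model's error. The sketch below shows how such ratios are computed, using made-up numbers rather than the study's data. The function names and values are assumptions for illustration.

```python
def mse(y_true, y_pred):
    """Mean Squared Error: the average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: the average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fold_improvement(baseline_error, model_error):
    """How many times smaller the model's error is than the baseline's."""
    return baseline_error / model_error


# Hypothetical generation values (arbitrary units), not data from the study.
y_true = [10.0, 12.0, 14.0, 16.0]
baseline_pred = [14.0, 8.0, 18.0, 12.0]  # every forecast is off by 4
model_pred = [11.0, 11.0, 15.0, 15.0]    # every forecast is off by 1

mse_fold = fold_improvement(mse(y_true, baseline_pred), mse(y_true, model_pred))
mae_fold = fold_improvement(mae(y_true, baseline_pred), mae(y_true, model_pred))

print(mse_fold)  # 16.0
print(mae_fold)  # 4.0
```

Because MSE squares each residual, the same reduction in error shows up as a much larger fold improvement in MSE than in MAE, which is consistent with the MSE figure reported above being far larger than the MAE figure.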
PEOPLE
FACULTY LECTURERS
REFERENCES
Perera S. , Hewage K. , Gunarathne C. , Navarathna R. , Herath D. and Ragel R.G.
In 2020 Moratuwa Engineering Research Conference (MERCon)(pp. 1-6). IEEE. 2020, July.
Link: https://ieeexplore.ieee.org/abstract/document/9185336
Abstract
It is well recognized that the most common form of dementia is Alzheimer's disease, and a successful cure or medication has not been discovered. A plethora of research has been conducted to understand the underlying mechanism and the pathogenesis of Alzheimer's disease. To explore the underlying genetic structure of the disease, gene expression data is being used by many researchers, and computational and statistical approaches have been used to identify possible risk genes. In this paper, we propose a machine learning framework that can be used to identify possible biomarker genes. Our experiments discover a possible set of 14 genes, some of which are validated by biological sources. We also present a critical analysis of the proposed machine learning framework using the GSE5281 gene dataset.
Dinuwanthi, Imalsha, et al.
2021 10th International Conference on Information and Automation for Sustainability (ICIAfS). IEEE. 2021, August.
Link: https://ieeexplore.ieee.org/abstract/document/9606093
Abstract
Alzheimer’s disease is recognized as one of the common diseases found among elders, which still has no successful cure. Different technologies such as microarray technology, Sanger sequencing, and Next Generation Sequencing have been used by various researchers for gathering samples. Out of these, Next Generation Sequencing has become more common nowadays, as it is a powerful platform which enables sequencing of thousands or millions of DNA molecules simultaneously. A set of samples collected using Next Generation Sequencing technology is used in this study. The initial data set includes 70 samples and 2652 miRNAs. In this study, our goal is to determine the best set of miRNA biomarkers which are highly differentially expressed in Alzheimer’s disease. Initially, the data set is preprocessed with the aid of the Galaxy tool and the Python programming language. Significance value, fold change and area under curve analysis are the statistical methods which are used in this study. Random Forest algorithm and Principal Component Analysis are used for selecting the best set of biomarkers out of the data set obtained at the end of statistical analysis. Using the statistical methods, followed by machine learning techniques, we establish 25 microRNAs as biomarkers for Alzheimer’s disease. Furthermore, we provide an analysis of the selected 25 microRNAs with area under the receiver operating curve and classification algorithms.
Deshan, LA Chamli, MK Hans Thisanke, and Damayanthi Herath.
2021 IEEE 16th International Conference on Industrial and Information Systems (ICIIS). IEEE. 2021, December.
Link: https://ieeexplore.ieee.org/abstract/document/9660681
Abstract
Plant leaf diseases cause great damage to crops, resulting in significant yield losses. Traditionally, identification of plant leaf diseases depends on human annotation by visual inspection. Transfer learning has enabled use of existing solutions in one domain to problems from another domain, resulting in more robust and efficient solutions. This work presents a method to identify tomato plant diseases based on leaf images using transfer learning. We used a publicly available dataset which contains tomato plant leaf images for 10 different classes. We considered only five classes and data was split in ratio 8:1:1 for train, validation and test sets respectively. In this work, six different pre trained models were used with fine-tuning methods where we introduced some layers and removed some layers in the network architectures while enhancing the accuracy of models. Accuracies of all the models were above 97% except one model which got 95% accuracy on the testing. Precision, Recall, F1 Score, Confusion matrix and Classification reports were used for evaluations and finally a novel convolutional neural network is proposed for plant disease classification focusing on a real environment. The mentioned model achieved an accuracy of 99.98% on training and an accuracy of 99% on testing. In this work, a good generalization performance could be achieved without data augmentation. The experimental results show that the proposed fine-tuned architecture is effective in identifying tomato leaf diseases and it could be generalized to identify leaf diseases in other plants.
Herath D. , Jayasinghe J. , Premarathne U.K.
Applied Computational Intelligence and Soft Computing. WILEY. 2022, May.
Link: https://onlinelibrary.wiley.com/doi/pdf/10.1155/2022/7081444
Abstract
This paper presents the development of a wind power forecasting model based on gene expression programming (GEP) for one of the major wind farms in Sri Lanka, Pawan Danavi. With the ever-increasing demand for renewable power generation, Sri Lanka has started harnessing electricity from wind power. Though the initial establishment cost of wind farms is high, the analyses clearly showcased the economic sustainability of wind power generation in the long term. In this context, forecasting the wind power generation at Sri Lankan wind farms is important in many ways. However, limited research has been carried out in Sri Lanka to predict the wind power generation against the changing climate. Therefore, to overcome this research gap, a model was developed to forecast wind power generation against two climatic factors, viz. on-site wind speed and ambient temperature. The results showcased the robustness and accuracy of the proposed GEP-based forecasting model (with R² = 0.92, index of agreement = 0.98, and RMSE = 259 kW). Moreover, the results of the study were compared against three different forecasting models and found comparable in terms of the model accuracy. The GEP-based model is advantageous over machine learning techniques due to its capability in deriving a mathematical expression. As an acceptable relationship was found between wind power generation and climatic factors, the proposed model facilitates the future projection of wind power generation with forecasted climatic factors. Though the application of GEP in the field of wind power generation is reported in a few research publications, this is the first research in which GEP is employed to model the power generation with respect to weather indices. The proposed prediction model has an advantage over machine learning models, as the relationship between the wind power and the weather indices can be expressed.
Jayasundara, Shyaman, Amila Indika, and Damayanthi Herath.
2022 2nd International Conference on Advanced Research in Computing (ICARC). IEEE. 2022, February.
Link: https://ieeexplore.ieee.org/abstract/document/9753867
Abstract
Students’ performance prediction can have many uses in the education sector. It helps to take measures to support struggling students and to improve course delivery. However, having meaningful explanations along each prediction is essential for the reliability of the predictions and hence is desirable. In this work, we propose a method for predicting student performance while generating explanations of the predictions made. An Explainable Boosting Machine is implemented to suit multi-class classification to achieve the mentioned objective. The classification performance of the proposed approach is compared with similar supervised learning models, namely a linear model, a decision tree, and a decision rule-based approach for accuracy, precision, recall, and F1 Score. Results show that the Explainable Boosting Machine ranks second in classification performance. At the same time, it provides global and local explanations of the predictions, which are further shown to be consistent with the observations made in feature selection. The proposed approach and its extensions can help in predicting student performance while enabling the interpretation of the predictions made. It will enable educators to devise strategies to improve students’ performance.
Herath D. , Jayasundara D. , Ackland D. , Saeed I. , Tang S.L. and Halgamuge S.
Computational and structural biotechnology journal. Elsevier. 2017, January.
Link: https://www.sciencedirect.com/science/article/pii/S2001037017300223
Abstract
Assessing biodiversity is an important step in the study of microbial ecology associated with a given environment. Multiple indices have been used to quantify species diversity, which is a key biodiversity measure. Measuring species diversity of viruses in different environments remains a challenge relative to measuring the diversity of other microbial communities. Metagenomics has played an important role in elucidating viral diversity by conducting metavirome studies; however, metavirome data are of high complexity requiring robust data preprocessing and analysis methods. In this review, existing bioinformatics methods for measuring species diversity using metavirome data are categorised broadly as either sequence similarity-dependent methods or sequence similarity-independent methods. The former includes a comparison of DNA fragments or assemblies generated in the experiment against reference databases for quantifying species diversity, whereas estimates from the latter are independent of the knowledge of existing sequence data. Current methods and tools are discussed in detail, including their applications and limitations. Drawbacks of the state-of-the-art method are demonstrated through results from a simulation. In addition, alternative approaches are proposed to overcome the challenges in estimating species diversity measures using metavirome data.
N Ranasinghe, A Ramanan, S Fernando, PN Hameed, D Herath, T Malepathirana, P Suganthan, M Niranjan, S Halgamuge.
Journal of the National Science Foundation of Sri Lanka. National Science Foundation of Sri Lanka. 2022, November.
Link: https://jnsfsl.sljol.info/articles/10.4038/jnsfsr.v50i0.11249
Abstract
Artificial Intelligence (AI) and its data-centric branch of machine learning (ML) have greatly evolved over the last few decades. However, as AI is used increasingly in real world use cases, the importance of the interpretability of and accessibility to AI systems have become major research areas. The lack of interpretability of ML based systems is a major hindrance to widespread adoption of these powerful algorithms. This is due to many reasons including ethical and regulatory concerns, which have resulted in poorer adoption of ML in some areas. The recent past has seen a surge in research on interpretable ML. Generally, designing a ML system requires good domain understanding combined with expert knowledge. New techniques are emerging to improve ML accessibility through automated model design. This paper provides a review of the work done to improve interpretability and accessibility of machine learning in the context of global problems while also being relevant to developing countries. We review work under multiple levels of interpretability including scientific and mathematical interpretation, statistical interpretation and partial semantic interpretation. This review includes applications in three areas, namely food processing, agriculture and health.
Jayasundara D. , Herath D. , Senanayake D. , Saeed I. , Yang C.Y. , Sun Y. , Chang B.C. , Tang S.L. and Halgamuge S.K.
BMC bioinformatics. Springer Nature. 2019, February.
Link: https://link.springer.com/article/10.1186/s12859-018-2398-5
Abstract
Background
Estimating the parameters that describe the ecology of viruses, particularly those that are novel, can be made possible using metagenomic approaches. However, the best-performing existing methods require databases to first estimate an average genome length of a viral community before being able to estimate other parameters, such as viral richness. Although this approach has been widely used, it can adversely skew results since the majority of viruses are yet to be catalogued in databases.
Results
In this paper, we present ENVirT, a method for estimating the richness of novel viral mixtures, and for the first time we also show that it is possible to simultaneously estimate the average genome length without a priori information. This is shown to be a significant improvement over database-dependent methods, since we can now robustly analyze samples that may include novel viral types under-represented in current databases. We demonstrate that the viral richness estimates produced by ENVirT are several orders of magnitude higher in accuracy than the estimates produced by existing methods named PHACCS and CatchAll when benchmarked against simulated data. We repeated the analysis of 20 metavirome samples using ENVirT, which produced results in close agreement with complementary in vitro analyses.
Conclusions
These insights were previously not captured by existing computational methods. As such, ENVirT is shown to be an essential tool for enhancing our understanding of novel viral populations.
P Sudasinghe, D Herath, I Karunarathne, H Weeratunge, L Jayasuriya
Discover Applied Sciences. Springer Nature. 2025, May.
Link: https://link.springer.com/article/10.1007/s42452-025-06895-5
Abstract
Microgrids, comprised of interconnected loads and distributed energy resources, function as single controllable entities with respect to the main grid. However, the inherent variability of distributed wind and solar generation within microgrids presents operational stability challenges concerning voltage regulation and frequency stability. Accurate forecasting of renewable generation is crucial for mitigating these challenges. This work proposes a one-dimensional Convolutional Neural Network (1-D CNN) based approach to forecast photovoltaic (PV) generation and wind energy, using data from the University of California, San Diego microgrid and San Diego Airport weather records. The proposed method is evaluated against various forecasting methods using key metrics: Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared value. Results show that the 1-D CNN model achieves an improvement of up to 229.8 times in MSE and a 24.47 fold improvement in MAE compared to baseline models that use traditional statistical methods in forecasting. This demonstrates the potential of machine learning for enhancing microgrid management, particularly in short-term forecasting of renewable generation.
AN Yussuf, NP Weerasinghe, H Chen, L Hou, D Herath, M Rashid, G Zhang, S Setunge
Journal of Civil Structural Health Monitoring. Springer Nature. 2024, August.
Link: https://link.springer.com/article/10.1007/s13349-024-00841-6
Abstract
Inspections and condition monitoring of the stormwater pipe networks have become increasingly crucial due to their vast geographical span and complex structure. Unmanaged pipelines present significant risks, such as water leakage and flooding, posing threats to urban infrastructure. However, only a small percentage of pipelines undergo annual inspections. The current practice of CCTV inspections is labor-intensive, time-consuming, and lacks consistency in judgment. Therefore, this study aims to propose a cost-effective and efficient semi-automated approach that integrates computer vision technology with Deep Learning (DL) algorithms. A DL model is developed using YOLOv8 with instance segmentation to identify six types of defects as described in Water Services Association (WSA) Code of Australia. CCTV footage from Banyule City Council was incorporated into the model, achieving a mean average precision (mAP@0.5) of 0.92 for bounding boxes and 0.90 for masks. A cost–benefit analysis is conducted to assess the economic viability of the proposed approach. Despite the high initial development costs, it was observed that the ongoing annual costs decreased by 50%. This model allowed for faster, more accurate, and consistent results, enabling the inspection of additional pipelines each year. This model serves as a tool for every local council to conduct condition monitoring assessments for stormwater pipeline work in Australia, ultimately enhancing resilient and safe infrastructure asset management.
A Senanayake, H Gamaarachchi, D Herath, R Ragel
BMC bioinformatics. Springer Nature. 2023, January.
Link: https://link.springer.com/article/10.1186/s12859-023-05151-0
Abstract
Nanopore sequencing allows selective sequencing, the ability to programmatically reject unwanted reads in a sample. Selective sequencing has many present and future applications in genomics research and the classification of species from a pool of species is an example. Existing methods for selective sequencing for species classification are still immature and the accuracy highly varies depending on the datasets. For the five datasets we tested, the accuracy of existing methods varied in the range of 77 to 97% (average accuracy < 89%). Here we present DeepSelectNet, an accurate deep-learning-based method that can directly classify nanopore current signals belonging to a particular species. DeepSelectNet utilizes novel data preprocessing techniques and improved neural network architecture for regularization.
IU Ekanayake, D Herath
2020 Moratuwa Engineering Research Conference (MERCon). IEEE. 2020, July.
Link: https://ieeexplore.ieee.org/abstract/document/9185249
Abstract
Chronic Kidney Disease (CKD) or chronic renal disease has become a major issue with a steady growth rate. A person can only survive without kidneys for an average time of 18 days, which creates a huge demand for kidney transplants and dialysis. It is important to have effective methods for early prediction of CKD. Machine learning methods are effective in CKD prediction. This work proposes a workflow to predict CKD status based on clinical data, incorporating data preprocessing, a missing value handling method with collaborative filtering, and attribute selection. Out of the 11 machine learning methods considered, the extra tree classifier and random forest classifier are shown to result in the highest accuracy and minimal bias to the attributes. The research also considers the practical aspects of data collection and highlights the importance of incorporating domain knowledge when using machine learning for CKD status prediction.
JOIN US
Do you enjoy exploring and sharing knowledge in this ever-evolving field? If so, the DEAR (Data Engineering And Research) group at the University of Peradeniya invites you to join us!
Why Join Us?
- Diverse Projects: Engage in a variety of projects ranging from theoretical advancements in Machine Learning to practical applications in computational biology, computer vision, AI for agriculture, education, transportation, and security.
- Collaborative Environment: Work alongside esteemed professors and industry experts, including Dr. Damayanthi Herath, Mr. Sampath Deegalla, and Prof. Roshan Ragel, as well as our collaborators from various institutions.
- Innovative Research: Contribute to cutting-edge research that aims to make significant impacts in multiple domains.
- Community and Learning: Be part of a vibrant community that encourages learning, collaboration, and innovation.
How to Join?
- Send Us an Email: If you are interested in joining our group, please send an email to Dr. Damayanthi Herath at damayanthiherath@eng.pdn.ac.lk.
- Join Our WhatsApp Group: Stay updated and connected with our community by joining our WhatsApp group.
OUR MOTTO
“Fall in love with the problem, not the solution.”
Come, let’s solve problems together and push the boundaries of Machine Learning!
We look forward to welcoming you to our team.
Web Masters
This website is maintained by VITHUSHAN E.T.L. AND ENIYAVAN T. [Department of Computer Engineering, University of Peradeniya]
Last Update : 06-08-2025
If you have any concerns or issues with this website, please contact Dr. Damayanthi Herath.
This website is monitored by the DEAR group.