SinhSafe: An Iterative Deep Learning & Ensemble Approach to Sinhala Harassment Detection

Team

Supervisors


Project Summary

SinhSafe is a high-precision content moderation framework designed for the linguistic complexities of Sinhala and code-mixed Singlish. Traditional moderation tools often fail on local languages due to the “Semantic Gap”—the difficulty in distinguishing between general vulgarity (Offensive) and targeted, malicious attacks (Harassment).

This project addresses these challenges through a dual-phase iterative approach. We established a rigorous ground truth of ~4,000 manually annotated documents using Inter-Annotator Agreement (IAA). Finding that traditional ML baselines plateaued at roughly a 65% F1-score, we engineered an ensemble of deep learning architectures: XLM-RoBERTa (Large), SinBERT, and SinLLaMA. By deploying these models in a 3-Model Ensemble Pseudo-Labeling Engine, we nearly tripled our dataset to a perfectly balanced V2 corpus of 16,545 documents. Our final production system uses a soft-voting ensemble of the two encoder models, achieving a peak F1-score of 90.7% while maintaining real-time inference efficiency.


Methodology & The Data Engine

1. The Data Pipeline

The SinhSafe pipeline begins with raw social media ingestion followed by a hybrid preprocessing engine:
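While the full preprocessing engine is not reproduced here, the sketch below illustrates the kind of normalization such a hybrid pass typically performs on Sinhala / code-mixed Singlish text (URL and mention removal, elongation collapsing, and restricting characters to Sinhala Unicode plus basic Latin). The function name and exact rules are illustrative assumptions, not the production code:

```python
import re

def preprocess(text: str) -> str:
    """Illustrative cleaning pass for Sinhala / code-mixed Singlish text."""
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)           # drop mentions / hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # collapse "suuuuper" -> "suuper"
    # keep Sinhala (U+0D80-U+0DFF), Latin letters, digits and whitespace
    text = re.sub(r"[^\u0D80-\u0DFFa-zA-Z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Meka suuuuper!!! @user https://t.co/x මේක හොඳයි"))
```

A pass like this keeps the native Sinhala script intact while stripping the noise (emoji, links, repeated characters) that otherwise inflates the tokenizer vocabulary.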

2. Baseline Comparison

Before moving to Deep Learning, we evaluated our V1 dataset against traditional algorithms:
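A representative baseline from this family is a character n-gram TF-IDF vectorizer feeding a linear classifier; char n-grams are comparatively robust to Sinhala/Singlish spelling variation. The harness below is a hypothetical stand-in (toy data, illustrative labels), not our exact baseline script:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

baseline = Pipeline([
    # char n-grams tolerate transliteration and spelling variation
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# toy stand-in corpus; the real V1 dataset has ~4,000 annotated documents
texts = ["nice video", "you are trash", "great song", "go away idiot"] * 10
labels = ["Neutral", "Harassment", "Neutral", "Offensive"] * 10
baseline.fit(texts, labels)
print(baseline.predict(["you are trash"]))
```

Pipelines of this shape were the ones that plateaued around the ~65% F1 mark on the V1 data, motivating the move to transformer encoders.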


Experiment Setup and Implementation

1. Model Architectures

We engineered three distinct architectures, adding custom layers to prevent overfitting:
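As one illustration, the SinBERT variant pairs the encoder with a BiLSTM head plus dropout (the LSTM-head configuration referenced below). The sketch shows only the head operating on encoder hidden states; hidden sizes and the three-label output are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LSTMClassifierHead(nn.Module):
    """Sketch of a BiLSTM head on top of SinBERT's token states.
    Dropout (p=0.3, matching the winning SinBERT config) limits overfitting."""
    def __init__(self, hidden_size=768, lstm_size=256, num_labels=3, dropout_p=0.3):
        super().__init__()
        self.lstm = nn.LSTM(hidden_size, lstm_size,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout_p)
        self.classifier = nn.Linear(2 * lstm_size, num_labels)

    def forward(self, encoder_states):       # (batch, seq_len, hidden)
        lstm_out, _ = self.lstm(encoder_states)
        pooled = lstm_out[:, -1, :]          # last-timestep summary vector
        return self.classifier(self.dropout(pooled))

head = LSTMClassifierHead()
logits = head(torch.randn(2, 128, 768))     # 2 comments, max_len=128
print(logits.shape)
```

The extra recurrent layer and dropout sit between the pretrained encoder and the label projection, which is where the anti-overfitting capacity is added.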

2. Training Strategies, Hyperparameter Search & Loss Analysis

To find the optimal training arguments (learning rate, batch size, weight decay) and prevent overfitting, we tested 12 distinct configurations across our three architectures. We tracked training vs. evaluation loss for every version and compared their overall metrics to select the “Best-Fit” version of each architecture for the final pipeline.

A. SinBERT (LSTM-Head) - 5 Versions Tested

We utilized 5-Fold Stratified Cross-Validation to isolate the best-performing epoch and evaluate stability across different data splits.
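Stratified folds keep the class ratios of the full dataset in every train/validation split, so a rare class like Harassment is represented proportionally in each fold. A minimal sketch of the splitting step (toy labels standing in for the V1 annotations):

```python
from sklearn.model_selection import StratifiedKFold

# toy label distribution standing in for the real V1 annotations
labels = ["Neutral"] * 50 + ["Offensive"] * 30 + ["Harassment"] * 20
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(labels, labels), 1):
    val = [labels[i] for i in val_idx]
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val, "
          f"{val.count('Harassment')} Harassment in val")
```

Each of the five folds then trains a fresh model, and per-epoch evaluation loss across folds reveals both the best epoch and the stability of that choice.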

Version 1
SinBERT V1
Version 2
SinBERT V2
Version 3
SinBERT V3
Version 4
SinBERT V4
Version 5
SinBERT V5

SinBERT Performance Comparison

SinBERT Metric Comparison


Rationale for Selecting Version 4

Based on the comprehensive metric comparison and loss-curve analysis, we selected Version 4 as our optimal SinBERT model. While Versions 2 and 3 showed marginally higher raw accuracy, Version 4 demonstrated the strongest Weighted Precision, which is critical in moderation systems to minimize false positives (unfairly penalizing normal users).

Furthermore, the V4 loss curve presented a clear early-stopping point: the evaluation loss reached a sharp minimum at exactly Epoch 2. By halting training at this checkpoint, we captured the model near its peak generalization and avoided the severe overfitting observed in the later epochs of the other versions.


🏆 Winning Parameters for Production Model (SinBERT Best Version):

max_len       : 128
batch_size    : 16
epochs        : 2
learning_rate : 2e-05
dropout_p     : 0.3

B. XLM-RoBERTa (Large) - 3 Versions Tested

Similar to SinBERT, XLM-R was evaluated using Stratified 5-Fold Cross-Validation to guard against data leakage, with the winning version selected on the lowest evaluation loss and highest weighted precision.

Version 1
XLM-R V1
Version 2
XLM-R V2
Version 3
XLM-R V3

XLM-RoBERTa Performance Comparison

XLM-R Metric Comparison


Rationale for Selecting Version 2

Based on the multi-version benchmark, we selected Version 2 as the production-ready model for the XLM-RoBERTa architecture. Version 2 achieved the highest F1-Score (80.41%) and Accuracy (80.46%) across all tested iterations.

While the loss curves indicate that Version 2 eventually began to overfit as training progressed, our implementation of Early Stopping allowed us to capture the model weights at the optimal convergence point (Epoch 3). This balanced peak performance with sufficient generalization to handle the linguistic variance in our large-scale unlabeled dataset.


🏆 Winning Parameters for Production Model (XLM-R Best Version):

num_train_epochs : 10
batch_size       : 32
learning_rate    : 2e-05
warmup_steps     : 500
weight_decay     : 0.01
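These parameters map directly onto Hugging Face's `TrainingArguments`. The fragment below is an assumed wiring (output directory and strategy fields are illustrative, not our verbatim training script), shown with checkpoint selection on evaluation loss as described above:

```python
from transformers import TrainingArguments

# Assumed mapping of the winning XLM-R configuration onto the Trainer API.
# Note: "evaluation_strategy" is renamed "eval_strategy" in newer
# transformers releases.
xlmr_args = TrainingArguments(
    output_dir="xlmr-sinhsafe",      # illustrative path
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # recover the Epoch-3 checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```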

C. SinLLaMA (8B) - 4 Versions Tested

For the Generative LLM, we evaluated via an 80/10/10 Train/Val/Test split. To prevent the “Testing Collapse” caused by memorization, we implemented strict Early Stopping: training halted if evaluation loss increased for 3 consecutive intervals (every 50 steps), capturing the checkpoint with the absolute lowest evaluation loss.
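The stopping rule above can be replayed in isolation: scan one evaluation-loss reading per 50-step interval, halt after three consecutive increases, and keep the lowest-loss checkpoint. The helper below is a simplified illustration of that logic, not the training-loop code itself:

```python
def early_stop_scan(eval_losses, patience=3):
    """Replay the SinLLaMA stopping rule on a list of eval losses
    (one reading per 50-step interval): halt after `patience` consecutive
    increases, return the lowest-loss checkpoint (approx. step, loss)."""
    best_idx, best_loss, rising = 0, float("inf"), 0
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        rising = rising + 1 if i > 0 and loss > eval_losses[i - 1] else 0
        if rising >= patience:
            break                          # stop training early
    return best_idx * 50, best_loss

# memorization pattern: loss bottoms out, then climbs as the model overfits
print(early_stop_scan([1.10, 0.82, 0.74, 0.79, 0.85, 0.93, 1.05]))
# -> (100, 0.74): training halts, keeping the interval-2 checkpoint
```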

Version 1
SinLLaMA V1
Version 2
SinLLaMA V2
Version 3
SinLLaMA V3
Version 4
SinLLaMA V4

SinLLaMA Performance Comparison

SinLLaMA Metric Comparison


Rationale for Selecting Version 3

For the SinLLaMA architecture, Version 3 was selected as the optimal configuration. This version achieved the highest overall performance metrics, specifically reaching a peak F1-Score of 65.66%.

While the 8B parameter model exhibited a high tendency to overfit (as seen in the diverging loss curves of other versions), Version 3 maintained a more stable evaluation loss across training steps. By leveraging the specific hyperparameters of this iteration, we were able to maximize the generative potential of the model for our pseudo-labeling engine while mitigating the “Memorization Trap” common in large-scale instruction tuning.

🏆 Winning Parameters for Production Model (SinLLaMA Best Version):

max_length        : 512
batch_size        : 16
num_train_epochs  : 1
learning_rate     : 5e-05
weight_decay      : 0.05
bf16              : True

3. Synthesizing V1 Production Models

After identifying the optimal hyperparameters and the exact “best-fit” epoch for each architecture, we moved out of the cross-validation phase. We retrained XLM-RoBERTa, SinBERT, and SinLLaMA on 100% of the V1 Dataset (6,075 documents) using these winning parameters. This maximized the models’ knowledge retention, resulting in three highly robust, inference-ready “V1 Production Models.”


The Ensemble Pseudo-Labeling Engine (V1 to V2)

To overcome data scarcity, we deployed these three V1 Production Models on 145,000 unlabelled social media comments. We applied a Strict Extraction Logic to build our final V2 Dataset:

  1. Direct Extraction: Any label where at least one model had >90% confidence.
  2. Consensus Extraction: Confidence between 80-90% where XLM-R and SinBERT agreed.
  3. Manual Review: Confidence between 40-80% where all three models agreed; these were manually verified before inclusion.
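The three rules above can be sketched as a routing function applied per comment. Model keys and label strings are illustrative assumptions; the real engine operates on batched model outputs:

```python
def route_pseudo_label(preds):
    """Sketch of the Strict Extraction Logic for one unlabelled comment.
    `preds` maps model name -> (label, confidence); key names are illustrative."""
    confs = [c for _, c in preds.values()]
    labels = {lbl for lbl, _ in preds.values()}
    # Rule 1: direct extraction if any single model exceeds 90% confidence
    if any(c > 0.90 for c in confs):
        return "direct"
    # Rule 2: 80-90% confidence band where the two encoders agree
    (xl_lbl, xl_c), (sb_lbl, sb_c) = preds["xlmr"], preds["sinbert"]
    if xl_lbl == sb_lbl and all(0.80 <= c <= 0.90 for c in (xl_c, sb_c)):
        return "consensus"
    # Rule 3: 40-80% band where all three models agree -> human verification
    if len(labels) == 1 and all(0.40 <= c <= 0.80 for c in confs):
        return "manual_review"
    return "discard"

print(route_pseudo_label({
    "xlmr":     ("Harassment", 0.95),
    "sinbert":  ("Harassment", 0.88),
    "sinllama": ("Offensive",  0.61),
}))   # rule 1 fires: XLM-R is above 90% confidence
```

Comments matching none of the rules are discarded, which is what keeps the pseudo-labels high-precision rather than merely high-volume.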

This process allowed us to extend the Harassment class to 5,515 documents, creating a perfectly balanced V2 dataset (16,545 documents total) for final production training.


Results and Analysis

The transition to the V2 dataset resulted in a massive performance leap across all architectures.

| Model | Parameter Size | V1 F1-Score | V2 F1-Score |
|---|---|---|---|
| SinBERT | ~110 Million | 77.9% | 90.7% |
| XLM-R | ~550 Million | 80.4% | 86.9% |
| SinLLaMA | ~8 Billion | 55.7% | 64.9% |
V1 to V2 Performance Leap
F1 Score Leap
Production Models: Final Eval Loss
Final Evaluation Loss

Optimal Epoch & Loss Curves

By tracking training and evaluation loss, we successfully identified the best epoch to run our 100% data training without overfitting or underfitting.

SinBERT Production Model
SinBERT Best Production Curve
XLM-RoBERTa Production Model
XLM-R Best Production Curve

The “LLM Memorization Trap”

A critical discovery was the failure of SinLLaMA to generalize. Despite its 8B parameters, it exhibited severe overfitting, reaching only 64.9% on unseen test data, whereas the lightweight encoders (SinBERT/XLM-R) learned general linguistic rules far more effectively.

SinLLaMA Memorization Trap Graph
Figure: Visualizing the divergence between training and evaluation loss, indicating a collapse in generalization.


Conclusion

The final SinhSafe Production Ensemble utilizes Soft-Voting (Probability Averaging) between XLM-RoBERTa and SinBERT. This configuration provides a culturally aware, real-time moderation solution that outperforms traditional baselines while avoiding the massive computational overhead of generative LLMs.
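Soft voting itself is a small operation: average the two encoders' class-probability vectors and take the argmax. A minimal sketch (label order and the equal weighting are assumptions for illustration):

```python
def soft_vote(p_xlmr, p_sinbert, labels=("Neutral", "Offensive", "Harassment")):
    """Average two models' class probabilities and return the argmax label."""
    avg = [(a + b) / 2 for a, b in zip(p_xlmr, p_sinbert)]
    return labels[avg.index(max(avg))], avg

# XLM-R is uncertain, SinBERT is confident the comment is harassment;
# averaging lets the confident, well-calibrated signal carry the decision.
label, avg = soft_vote([0.20, 0.45, 0.35], [0.05, 0.10, 0.85])
print(label)   # Harassment
```

Because only two encoder forward passes are needed per comment, the ensemble stays within real-time inference budgets, unlike the 8B-parameter generative model.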

Project Demo