Automatically categorizing Sinhala news items from selected news sources
Team
Table of Contents
- Introduction
- Problem
- Aim
- Objectives
- Proposed solution
- Solution Architecture
- Tools and Technologies
- Plan of Work
- Links
Introduction
- In the modern world, with the abundance of information available on the internet, it has become challenging to filter out relevant information from a vast amount of data.
- Sinhala is the primary language spoken in Sri Lanka, and many online news sources publish news articles in Sinhala.
- With the increasing demand for relevant news, it is necessary to categorize the Sinhala news items from different news sources automatically.
- Automated text classification is well suited to this task. It uses machine learning algorithms and natural language processing techniques to assign text documents to predefined categories or classes.
Problem
The primary problem addressed by this project is the lack of tools for automatically categorizing Sinhala news articles. Without such tools, it is difficult for readers to find relevant articles quickly and for news organizations to manage their content effectively.
Aim
The aim of this project is to develop an automated system that categorizes Sinhala news items based on their content, making it easier for readers to find relevant articles quickly and for news organizations to manage their content effectively.
Objectives
- To collect a dataset of Sinhala news articles from selected news sources.
- To preprocess and clean the data to prepare it for analysis.
- To develop a machine learning model that can accurately categorize news articles.
- To evaluate the performance of the model using various metrics.
- To deploy the model as a web application to make it accessible to users.
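The preprocessing objective above could start from something like the following minimal sketch. It assumes the collected articles arrive as raw strings that may still contain HTML tags, Latin text, and digits. The function names `clean_text` and `tokenize` are hypothetical, and the decision to keep only Sinhala-script tokens is an illustrative assumption rather than a decision made in this proposal.

```python
import re
import unicodedata

# Any run of characters from the Sinhala Unicode block (U+0D80-U+0DFF),
# plus the zero-width non-joiner/joiner that can appear inside Sinhala words.
SINHALA_TOKEN = re.compile(r"[\u0D80-\u0DFF\u200C\u200D]+")


def clean_text(raw):
    """Normalize Unicode, drop HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFC", raw)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(raw):
    """Return the Sinhala-script tokens of a raw article string, in order."""
    return SINHALA_TOKEN.findall(clean_text(raw))


if __name__ == "__main__":
    sample = "<p>මැතිවරණය  2024: ආණ්ඩුව!</p>"
    print(tokenize(sample))
```

An explicit character range is used instead of `\w+` because Python's `\w` does not match combining vowel signs, so a plain word-character pattern would split Sinhala words at those signs.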
Proposed solution
- The proposed solution is to use machine learning algorithms to categorize Sinhala news items.
- The system will be trained using a dataset of manually categorized Sinhala news items.
- The system will use natural language processing techniques to extract the relevant features from the news items.
- The trained model will then use these features to assign each news item to a category.
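The train-then-categorize flow described above can be sketched with a small multinomial Naive Bayes classifier over bag-of-words features. This is only an illustration. The proposal does not fix a particular algorithm, so the class name `NaiveBayesClassifier`, the category labels, and the toy training sentences are all assumptions. Scikit-learn, listed under Tools and Technologies, provides ready-made implementations of the same idea.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over token counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # smoothing strength
        self.class_counts = Counter()             # documents seen per category
        self.token_counts = defaultdict(Counter)  # token counts per category
        self.vocabulary = set()

    def fit(self, documents, labels):
        """Learn counts from tokenized documents and their categories."""
        for tokens, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        """Return the category with the highest log-probability score."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            score = math.log(self.class_counts[label] / total_docs)
            total_tokens = sum(self.token_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen during training
                count = self.token_counts[label][token]
                score += math.log(
                    (count + self.alpha)
                    / (total_tokens + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-labelled examples (hypothetical data, whitespace-tokenized).
train_texts = [
    "කණ්ඩායම තරගය ජය",
    "තරගය කණ්ඩායම",
    "ආණ්ඩුව මැතිවරණය",
    "මැතිවරණය ආණ්ඩුව පක්ෂය",
]
train_labels = ["sports", "sports", "politics", "politics"]

model = NaiveBayesClassifier().fit(
    [text.split() for text in train_texts], train_labels
)

if __name__ == "__main__":
    print(model.predict("තරගය කණ්ඩායම".split()))
    print(model.predict("මැතිවරණය".split()))
```

A real system would train on the manually categorized dataset described above rather than on four sentences, and would likely use richer features than raw token counts.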
Impact/Business Value:
- Reduce the time and resources spent on the categorization process
- Improve the consistency of the categorization process
Success Measurements:
- Accuracy of the model on test dataset
- Reduction in time and resources spent on the categorization process
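As a rough illustration of the first measurement, the sketch below computes accuracy on a held-out test set. It also computes per-category precision, recall, and F1, because overall accuracy can hide a weak category when the categories are imbalanced. The function names, category labels, and sample predictions are hypothetical. The choice of per-category metrics is an assumption, since the proposal only commits to accuracy and "various metrics".

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of test items whose predicted category matches the true one."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_category_report(y_true, y_pred):
    """Precision, recall, and F1 for every category seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[t] += 1  # true category t was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


if __name__ == "__main__":
    # Hypothetical true and predicted categories for four test articles.
    y_true = ["sports", "sports", "politics", "business"]
    y_pred = ["sports", "politics", "politics", "business"]
    print(accuracy(y_true, y_pred))
    for label, scores in per_category_report(y_true, y_pred).items():
        print(label, scores)
```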
User Stories/Use Case Scenarios:
- Journalists can use the tool to categorize news items quickly and accurately
- News agencies can use the tool to automate their categorization process and save resources
- News readers can use the tool to view news items in a specific category
Solution Architecture
Tools and Technologies
For natural language processing and machine learning
- NLTK (Natural Language Toolkit)
- Scikit-learn
- Pandas
- NumPy
- PyTorch
Web application development
- MERN stack
Plan of Work
Outline
Considerations for extendability
- New categories and sources can be added in the future.
- The model can be extended to other languages.
- The tool can be integrated with other news platforms.
- A mobile-based application can be developed.
Team, Strengths, and Expertise:
- Machine learning, natural language processing, and web development.
- Our team has experience working with Python programming, machine learning and web development.