Long-reads Binning For Microbial Metagenomics Considering Multi-kingdoms



Table of Contents

  1. Abstract
  2. Background
  3. Related works
  4. Methodology
  5. Experiment Setup and Implementation
  6. Results and Analysis
  7. Conclusion
  8. Publications
  9. Links




Microorganisms thrive in environments across the world and play critical roles in human health, agriculture, food production, climate regulation, and many other processes. Every living organism is made of cells, which provide structure and carry out biological processes. Each cell contains a genome, the complete set of instructions for building and sustaining the organism, including its distinct characteristics and behaviors. In eukaryotes the genome is enclosed in the nucleus, whereas prokaryotes such as bacteria and archaea have no nucleus and keep their genome in the cytoplasm. The genome is organized into chromosomes, which consist of DNA and associated proteins. DNA, the carrier of genetic information, has a double-helix structure of two intertwined strands. Each strand is a chain of nucleotides, denoted by the letters A (adenine), C (cytosine), G (guanine), and T (thymine), which together form the genetic code. Genes, the fundamental units of heredity, are segments of DNA that contain instructions for synthesizing proteins or functional RNA molecules. By passing from parent to offspring, genes carry traits across generations.

Early long-reads Binning Tools
MEGAN-LR is one of the earliest tools and relies on a reference database. It uses a protein-alignment-based approach and introduces two algorithms: one for taxonomic binning (based on the Lowest Common Ancestor) and another for functional binning (based on an interval-tree algorithm).
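
The Lowest Common Ancestor idea behind taxonomic binning can be illustrated with a minimal sketch. This is the plain LCA over a toy taxonomy, not MEGAN-LR's actual algorithm; the `PARENT` table and the function names are hypothetical and exist only for this example.

```python
from typing import Dict, Iterable, List

# Hypothetical toy taxonomy (child -> parent); "root" has no parent.
PARENT: Dict[str, str] = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}


def lineage(taxon: str, parent: Dict[str, str]) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path = [taxon]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def lowest_common_ancestor(taxa: Iterable[str], parent: Dict[str, str]) -> str:
    """Deepest taxon shared by the lineages of all given taxa."""
    lineages = [lineage(t, parent) for t in taxa]
    if not lineages:
        raise ValueError("at least one taxon is required")
    lca = lineages[0][0]
    # Walk the lineages rank by rank until they diverge.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


# A read whose alignments hit both genera is placed at their common ancestor.
print(lowest_common_ancestor(["Escherichia", "Salmonella"], PARENT))
```

A read that aligns to several reference taxa is assigned to the deepest node that covers all of them, so ambiguous reads end up at a higher rank rather than at a wrong species.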

Two other noteworthy tools, MetaProb and BusyBee Web, are reference-independent and contributed to unsupervised metagenomic binning. BusyBee Web also provides a web-based interface that offers visual insight into the binning process. However, both tools scale poorly as the input dataset grows, which prevents them from binning an entire dataset in a single run.

MetaBCC-LR is a reference-free binning tool that uses composition and coverage as read features: trinucleotide frequency vectors for composition and k-mer coverage histograms for coverage. It first clusters reads by coverage and then re-clusters them by composition. Only a sample of reads is used for this clustering, which keeps the computation efficient. In the final stage, it builds a statistical model for each cluster and bins the remaining reads. Despite its high accuracy, it can misclassify reads, particularly those from low-abundance species, and it requires subsampling for large datasets.
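
As a rough illustration of a k-mer coverage histogram, the sketch below counts k-mers across a set of reads and then summarizes, for one read, how often its k-mers occur in the whole dataset. The function names, the tiny `k`, and the bin settings are assumptions for this example only; it does not merge reverse complements and is not MetaBCC-LR's implementation.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer across all reads in the dataset."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(read: str, counts: Counter, k: int,
                       bin_width: int, n_bins: int) -> List[float]:
    """Normalized histogram of dataset-wide counts of the read's k-mers."""
    hist = [0] * n_bins
    total = 0
    for i in range(len(read) - k + 1):
        count = counts[read[i:i + k]]  # Counter returns 0 for unseen k-mers
        # Counts beyond the last bin are clamped into it.
        hist[min(count // bin_width, n_bins - 1)] += 1
        total += 1
    if total == 0:
        return [0.0] * n_bins
    return [h / total for h in hist]


reads = ["ACGTAC", "ACGTAC", "TTTTGG"]
counts = count_kmers(reads, k=3)
print(coverage_histogram("ACGTAC", counts, k=3, bin_width=1, n_bins=4))
```

Reads from an abundant species tend to contain k-mers with high dataset-wide counts, so their histograms peak in higher bins than those of reads from rare species.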

Workflow MetaBCC-LR

LRBinner takes a different approach to reference-free binning: it computes composition and coverage information concurrently for the entire dataset and merges the two feature sets with a variational autoencoder, which removes the need for subsampling and improves binning accuracy. It uses tetranucleotide frequency vectors for composition and k-mer coverage vectors for coverage. However, the tool struggles to distinguish long reads that originate from similar regions shared between different species.
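
A composition vector of this kind can be sketched as follows, assuming that each k-mer is merged with its reverse complement (a common convention, since a read may come from either strand). With that assumption, k = 3 gives 32 dimensions and k = 4 gives 136. The function names are hypothetical and the code is an illustration, not LRBinner's implementation.

```python
from itertools import product
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_index(k: int) -> Dict[str, int]:
    """Map each canonical k-mer to a position in the feature vector."""
    kmers = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {kmer: i for i, kmer in enumerate(kmers)}


def composition_vector(read: str, k: int, index: Dict[str, int]) -> List[float]:
    """Normalized canonical k-mer frequencies of a single read."""
    vector = [0.0] * len(index)
    total = 0
    read = read.upper()
    for i in range(len(read) - k + 1):
        key = canonical(read[i:i + k])
        if key in index:  # skips k-mers with ambiguous bases such as N
            vector[index[key]] += 1
            total += 1
    if total == 0:
        return vector
    return [v / total for v in vector]


tetra_index = canonical_index(4)
print(len(canonical_index(3)), len(tetra_index))
```

Because the vector is normalized, reads of different lengths become comparable, and a read and its reverse complement map to the same vector.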

Workflow LRBinner

OBLR introduces a new strategy for reference-free binning: it uses read overlap graphs to estimate coverage and improve binning outcomes. It selects a sample of reads with a probabilistic downsampling strategy and clusters that sample with HDBSCAN, a hierarchical density-based clustering algorithm. The downsampling yields clusters of similar sizes with fewer isolated points. OBLR then uses inductive learning with the GraphSAGE neural network architecture to assign bins to the remaining reads.
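
The graph-based coverage estimate and the downsampling step can be illustrated with a small sketch. It assumes that a read's degree in the overlap graph serves as its coverage proxy and that reads are sampled with weight 1/degree, so that reads from abundant species do not dominate the sample. Both the weighting and the function names are assumptions made for illustration and are not taken from OBLR's implementation; the HDBSCAN and GraphSAGE steps are omitted.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def node_degrees(overlaps: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Degree of each read in the overlap graph, used as a coverage proxy."""
    degrees: Dict[str, int] = defaultdict(int)
    for a, b in overlaps:
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)


def downsample(degrees: Dict[str, int], sample_size: int, seed: int = 0) -> List[str]:
    """Sample reads without replacement, favoring low-degree reads."""
    rng = random.Random(seed)
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw u in [0, 1), use the key u ** (1 / weight), keep the largest keys.
    # With the assumed weight 1 / degree, the key simplifies to u ** degree.
    keyed = [(rng.random() ** degree, read)
             for read, degree in sorted(degrees.items())]
    keyed.sort(reverse=True)
    return [read for _, read in keyed[:sample_size]]


# Hypothetical overlaps: r1 overlaps three reads, r5 and r6 overlap each other.
overlaps = [("r1", "r2"), ("r1", "r3"), ("r1", "r4"), ("r5", "r6")]
degrees = node_degrees(overlaps)
print(downsample(degrees, sample_size=3, seed=1))
```

The sampled reads would then be passed to a clustering step, and the remaining reads would receive their bins from a model trained on those clusters.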

Workflow OBLR

Proposed Work

We have identified the following as the challenges in existing tools.

Therefore, this project aims to develop a method to bin long reads from multiple metagenomic samples while being aware of the underlying microbial kingdoms. Specifically, it will be a Python-based command-line tool that addresses the scalability issues that arise with massive datasets.


Experiment Setup and Implementation

Results and Analysis