A User-friendly Pipeline for Isolation of Fast-evolving Internal Transcribed Spacer(ITS) Regions from Skim Sequencing Data

Team

Supervisors

Table of content

  1. Abstract
  2. Related works
  3. Methodology
  4. Experiment Setup and Implementation
  5. Results and Analysis
  6. Conclusion
  7. Links

Abstract

DNA that resides within the nucleus of a cell is the primary genetic material that is responsible for genetic behaviours of various eukaryotic organisms such as plants, humans, animals, algae and fungi as well as prokaryotic organisms such as archaea and bacteria. Genome or DNA of an organism has several complex genomic regions. Specific genomic regions consist of several complex proteins. RNA is the replica of DNA involved in protein synthesis using these complex proteins and instructions carried from DNA. Ribosomes are small particles having these RNA molecules. Ribosomal DNAs are repeating units having fast-evolving as well as conserved subregions. Isolation of fast-evolving regions of rDNA is necessary for a more accurate and successful analysis of variation within and between species. ITS1 and ITS2 are the fast-evolving regions of rDNA that are widely preferred to differentiate various species including humans, plants and fungi. These regions have the highest probability of correct identification and ITS is the most frequently used molecular marker in species variation analysis. There are several computational tools and pipelines available in the literature to isolate ITS regions from different sequence datasets and to analyze the inter-species as well as intra-species variation using these fast-evolving genomic regions. Biologists don’t have a clear understanding on which is the most efficient and suitable computational tool or pipeline to extract and analyze the fast-evolving nuclear ribosomal ITS regions associated with various organisms since they don’t have sufficient computer knowledge. Hence, they find it difficult to select the best tool or technology required for the isolation and analysis process. Therefore, the purpose of our project is to compare the pros and cons of existing computational tools and pipelines based on various parameters such as the CPU time taken to run the software, CPU performance, accuracy, required computer memory and disk space and depending on other specifications to come up with an efficient procedure or pipeline to extract and analyze fast-evolving genomic regions from low coverage skim sequence data at a low cost. Hence, the primary objective of this project is to come up with a simple,user-friendly and efficient workflow or pipeline for the biologists to isolate fast-evolving genomic regions from skim sequencing data for phylogenetic studies.

According to our literature review, previous several researches related to ITS regions, prove that ITS is a standard accepted metabarcode marker for several species including plants, fungi and human fungal pathogens. These researches also emphasize that ITS subregions ITS1 and ITS2 have the highest probability of correct identification within and between above mentioned species.

2.1 Justifications on ITS Regions

2.2 Justifications on Software Tools and Pipelines

According to our literature review, there are various computational tools and pipelines used in several previous researches over the last one and a half-decade to extract and analyze ITS regions from different fungal sequencing datasets. Here, we have provided justifications based on those tools and pipelines.

Considering the above literature review justifications, these researches have focused on the importance of using ITS regions as a molecular marker for accurate fungal species variation analysis and the articles related to the software tool and pipelines to extract and analyze ITS regions mostly focusing on Fungal ITS sequences. Hence, these papers could not utilize these tools and pipelines to isolate ITS regions from plant and human skim sequence data. We have not found a review paper comparing the pros and cons of all the tools and pipelines. Therefore, through our research, we are going to do a comparative analysis between these software tools and pipelines to identify the best tool or technology or pipeline. Then, we are going to find whether they are applicable for plant and human skim sequence data and how much data is enough for the analysis.

Methodology

First, we analyzed the software and hardware requirements of the software tools and pipelines that we identified from our literature review such as ITSx, ITSxpress and PIPITS about how much RAM and disk space is needed to install each tool and what kind of input data is needed for each tool. Then, we identified whether we need to input skim data or contigs or scaffolds for these tools. After that, we installed these tools ITSx, ITSxpress and PIPITS with all the other required tools in our department aiken server using the anaconda environment.

Next, we have collected the data that separately contains forward and reverse raw reads of different Cinnamomum species such as Cinnamomum Capparu Coronde, Cinnamomum Verum and Cinnamomum Zeylanicum. Then, we tested these tools using the given cinnamomum capparu coronde data containing the forward raw reads around 19 GB and reverse raw reads around 19 GB. We recorded the output and the CPU time for each tool.We earlier got empty files as output for the tools earlier and it took a very long time to obtain the output in akien server.

As a result, we got some ITS regions and we verified those output ITS regions by blasting agianst NCBI nr/nt database to make sure we exactly got the ITS regions of cinnamomum species. We did a comparative analysis of the tools based on the CPU time and the similarity of the ITS regions to identify the better tool which is much efficient and accurate. Further, we analyzed the steps of the existing pipeline PIPITS to come up with a similar pipeline by improving it.

We used the tool seqtk to partition different sizes of the collected data such as 1GB, 2GB,3GB and 5GB for both forward and reverse raw reads in order to test our pipeline. Earlier, We ran our pipeline in our department aiken server for 1GB and 2GB data of cinnamomum capparu coronde and recorded the respective run times of each tool that performs the relevant step. Then, we shifted to agbc server later because the aiken server was responding too slow. As a result, we have experienced much improved performance for each tool of our pipeline in agbc server compared to aiken server. Hence, we tested the same 1GB and 2GB data in agbc server and recorded the improved run times with respect to each tools of the pipeline.

In the first step of our pipeline, we have done the quality checking of the reads to find the GC content, no of low quality reads and other relevant characteristics of the reads.Then, we filtered out the low quality reads using fastqc. However, we couldn’t filter out both forward and reverse reads simultaneously using that tool. Therefore, we found another tool afterqc to filter out both forward and reverse reads at the same time and tested it successfully.

After quality filtering, we ran ITSx by directly using the good quality forward and reverse raw reads as input to extract ITS regions. However, we failed in the process and we found that the read length 150bp is not sufficient to extract ITS regions using ITSx. Therefore, we needed to assemble the forward and reverse reads to get contigs in order to maximize the read length. Hence, we used the assembler Spades to obtain the contigs in fasta format. In the next step, we ran ITSx using the resultant contigs in fasta format for the aforementioned different sizes of data separately.

Meanwhile, we also converted the obtained contigs from fasta format to fastq format using the tool seqtk since ITSxpress only accepts fastq format input files. Then, we used the resultant contigs in fastq format as input to ITSxpress to extract ITS regions. Here, we looked for more tools associated with fasta to fastq conversion and found the tool bbmap(reformat.sh) in addition to seqtk. Then, we compared the performance between seqtk and bbmap(reformat.sh) for 3GB and 5GB cinnamomum capparu coronde data and observed the difference in the recorded run times. Thereafter, we input the resultant fastq files obtained form both seqtk and bbmap(reformat.sh) to ITSxpress and compared the run times taken to get the output.

Earlier, when we ran ITSx using contigs as input for 1GB data, we got empty output file. Then, we merged the forward and reverse reads using the tool vsearch and ran ITSx again using the obtained merged reads as input. As a result, we got some ITS sequences as output. Then, we have checked the quality of the output and blasted to ensure that we got the exact ITS regions of cinnamomum species. However, we identified multiple sequences in the output.

Further, we found that some of the sequences in contigs which are greater than 100kbp in read length is the reason behind getting these multiple sequences in the output. Therefore, we filtered out those sequences that are greater than 100kbp from the contigs file using the tool bbmap(reformat.sh). After that, we ran ITSx using the filtered contigs to extract ITS regions and we ended up getting some ITS sequences as output for cinnamomum capparu coronde 1GB data. When we checked the quality of the output this time, we found that there were no multiple sequences.

Earlier, we ran ITS using the default mumber of threads which is only one CPU thread and later we increased the number of CPU threads from one to sixteen to run the ITSx. As a result, we found much improvement in the performance of ITSx. After that, we tried with different thread sizes for 1GB cinnamomum capparu coronde data and observed the deviation in the performance of ITSX with respect the increasing or decreasing thread sizes.

Meanwhile, we faced no problem when directly using contigs obtained from spades as input to run ITSxpress and we got an ITS sequence as the output from ITSxpress for the same data. Then, we checked the quality of the ITSxpress output and blasted it to verify that the obtained ITS sequence belongs to cinnamomum species. Next, we compared the ITSxpress output with the ITSx output by doing multiple alignment to find whether both of them are same.

Further, we tested our pipeline for 1GB cinnamomum verum and cinnamomum zeylanicum data as well using the same process and compared all the results of both ITSx and ITSxpress for the three cinnamomum species based on the performance and quality of the output obtained by blasting the output sequences and doing multiple alignment to obtain the distance matrix.

Results and Analysis

We have obtained separate results from ITSx and ITSxpress for different sizes of data associated with different cinnamomum species such as cinnamomum capparu coronde and cinnamomum zeylanicum. Then, we blasted those outputs separately against NCBI/nr/nt database to find and verify whether that the obtained blasted results contain ITS regions of cinnamomum species. After that, we compared the ITS output sequence/sequences that contain ITS regions of cinnamomum species obtained from ITSx with the ITSxpress output sequence/sequences which also contain the ITS regions of cinnamomum species by doing multiple alignment using MAFFT algorithm to check whether both are exactly same sequences or not using the created distance matrix.

The results are as follows:

  1. Comparison 1
  2. Comparison 2
  3. Comparison 3
  4. Comparison 4

Conclusion

We conclude that if the input for ITSx are the merged sequences of reverse and forward reads having the sequence length between 150bp-300bp and the actual ITS region is greater than 300bp, then the extracted ITS region is more likely to be a partial sequence. On the other hand, if the input sequences to the ITSxpress are contigs assembled from Spades, then, there is a high probability to have a complete ITS region.

If the blast results of the ITSx output sequences for a given data size against the NCBI nr/nt database contain the ITS regions of a relevant species and the blast results of the ITSxpress output sequences for the same data size against the NCBI nr/nt database also contain the ITS regions of the same species, then when we compare both these ITSx and ITSxpress sequences using multiple alignment, we can conclude the both the ITSx and ITSxpress output are exactly same for a given data size if the created distance matrix shows no difference.

In future, we suggest to test our pipeline for different datasets associated with different species such as wild rice and strobilathes. Overall, we believe that our work will be very helpful for the biologists to provide them a clear idea on the isolation of ITS regions as efficiently and as accurately as possible from skim sequencing data for their phylogenetic studies.