Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences

dc.contributor.author: Ciftci, Bahar
dc.contributor.author: Tekin, Ramazan
dc.date.accessioned: 2024-12-24T19:26:59Z
dc.date.available: 2024-12-24T19:26:59Z
dc.date.issued: 2024
dc.department: Siirt Üniversitesi
dc.description.abstract: There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Fighting the associated diseases therefore requires an understanding of the genetic structure of viruses. Given the wide diversity and rapid evolution of viruses, there is a critical need to classify viral species and their potential hosts quickly and accurately, both to better understand transmission dynamics and to facilitate the development of targeted therapies. To this end, this study investigates the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic single-stranded RNA viruses of Baltimore groups four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The viral gene sequences in the dataset were encoded using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and a DL algorithm, the Fully Connected Deep Neural Network, to predict viral families and hosts. Classifier performance was analyzed in detail across scenarios with different train-test ratios and different K-Mer word lengths (k-values). The results show that the Fully Connected Deep Neural Network achieved a high success rate of 99.60% in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53%. This study is considered to be the first classification study in the literature on this dataset, which consists of gene sequences of single-stranded RNA viruses and has very large family and host diversity. The detailed investigation of how varying word lengths in K-Mer coding of gene sequences affect classification into viral families and hosts makes this study particularly valuable. This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.
dc.identifier.doi: 10.1016/j.compbiolchem.2024.108114
dc.identifier.issn: 1476-9271
dc.identifier.issn: 1476-928X
dc.identifier.pmid: 38852362
dc.identifier.scopus: 2-s2.0-85195421687
dc.identifier.scopusquality: Q1
dc.identifier.uri: https://doi.org/10.1016/j.compbiolchem.2024.108114
dc.identifier.uri: https://hdl.handle.net/20.500.12604/6444
dc.identifier.volume: 112
dc.identifier.wos: WOS:001358298000001
dc.identifier.wosquality: N/A
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.indekslendigikaynak: PubMed
dc.language.iso: en
dc.publisher: Elsevier Sci Ltd
dc.relation.ispartof: Computational Biology and Chemistry
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: KA_20241222
dc.subject: RNA Viruses
dc.subject: Virus Hosts
dc.subject: Virus Families
dc.subject: K-Mer Coding
dc.subject: Machine Learning
dc.title: Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences
dc.type: Article
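
The abstract describes K-Mer coding as representing each gene sequence through base words of length k. The record does not give the exact encoding the authors used, so the following is only a minimal sketch of one common variant: a normalized frequency vector over all 4^k possible words. The function names, the mapping of U to T, and the choice to skip windows containing ambiguous bases are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product

# Assumption: sequences use the DNA alphabet; RNA "U" is mapped to "T".
ALPHABET = "ACGT"


def kmer_vocabulary(k):
    """Return all 4**k possible words of length k in a fixed (lexicographic) order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(sequence, k):
    """Encode a sequence as a vector of relative k-mer frequencies.

    Windows containing characters outside ALPHABET (e.g. "N") are skipped;
    this is an illustrative choice, not something stated in the source.
    """
    seq = sequence.upper().replace("U", "T")
    counts = Counter()
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(base in ALPHABET for base in word):
            counts[word] += 1
    total = sum(counts.values())
    # The vector has one entry per vocabulary word; all zeros if no valid window exists.
    return [counts[w] / total if total else 0.0 for w in kmer_vocabulary(k)]


if __name__ == "__main__":
    vector = kmer_frequencies("ACGUAC", k=2)
    print(len(vector), vector)
```

Vectors of this kind have a fixed length for a given k, so they can be passed to any standard classifier. The vector length grows as 4^k, which is one reason the choice of word length matters when comparing models.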
