Application of machine learning for determining subtypes, pathotypes and lineages of different viruses using whole genome or gene sequences

Gajdov, Vladimir; Banović Đeri, Bojana; Samojlović, Milena; Lupulović, Diana; Lazić, Gospava; Vidanović, Dejan; Petrović, Tamaš

Date

2022-04-27

Abstract

Many disease-causing viruses are grouped into subtypes, pathotypes, variants or lineages of clinical significance. Most methods for viral genome classification require aligning the input sequence against predefined reference sequences so that algorithms can compare homologous sequence features, which can be computationally expensive. Moreover, highly divergent genome regions may degrade the alignment algorithm's performance. To overcome these obstacles, various machine learning (ML) algorithms have been used for viral genome classification. In this work, an alignment-free artificial intelligence approach was implemented for four tasks: determining avian influenza virus (AIV) subtype from hemagglutinin (HA) and neuraminidase (NA) genomic sequences; differentiating highly pathogenic from low pathogenic H5 AIV using HA gene sequences; differentiating West Nile virus (WNV) lineages 1 and 2 using whole genome sequences; and determining SARS-CoV-2 variants using whole genome sequences. From NCBI GenBank, one hundred publicly available, randomly chosen, unique HA and NA sequences (both complete and partial coding sequences) were retrieved for each H and N subtype, except for H14 and H15, for which only 47 and 23 sequences, respectively, were available; for WNV and SARS-CoV-2, whole genome sequences were retrieved. For training of the ML models, the data were randomly split into training (80%) and test (20%) sets. Accuracy, F1, precision and recall scores were evaluated for all models using a confusion matrix. All models performed the classification task with scores >99%, which suggests that this approach could be applied for accurate classification of viral genome sequences. However, the dataset is relatively small, so further evaluation of these ML models should use more samples and sequences of different lengths.
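
The evaluation workflow described in the abstract (alignment-free features, a random 80/20 split, and scores derived from a confusion matrix) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which features or classifiers were used, so the k-mer frequency encoding, the macro averaging of precision, recall and F1, and all function names (`kmer_vector`, `split_80_20`, `confusion_matrix`, `scores`) are hypothetical choices made for this example.

```python
import random
from collections import Counter
from itertools import product


def kmer_vector(seq, k=3):
    """Assumed alignment-free encoding: relative frequency of each DNA k-mer."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are ignored
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def split_80_20(items, seed=0):
    """Random split into 80% training and 20% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def scores(matrix):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

In such a sketch, each retrieved sequence would be encoded with `kmer_vector`, the labelled examples divided with `split_80_20`, a classifier of choice trained on the training portion, and its test-set predictions passed through `confusion_matrix` and `scores`.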

URI

https://repo.niv.ns.ac.rs/xmlui/handle/123456789/493

Collections

Proceedings (Zbornici)