Alignment-Free Machine Learning Serotype Classification of the Dengue Virus

Gajdov, Vladimir; Prošić, Isidora; Kavran, Mihaela; Bosilkov, Filip; Petrović, Tamaš; Konstantinov, Jelena; Lazić, Gospava

dc.rights.license	CC BY
dc.contributor.author	Gajdov, Vladimir
dc.contributor.author	Prošić, Isidora
dc.contributor.author	Kavran, Mihaela
dc.contributor.author	Bosilkov, Filip
dc.contributor.author	Petrović, Tamaš
dc.contributor.author	Konstantinov, Jelena
dc.contributor.author	Lazić, Gospava
dc.date.accessioned	2026-02-28T21:04:12Z
dc.date.available	2026-02-28T21:04:12Z
dc.date.issued	2026
dc.identifier.issn	1999-4915
dc.identifier.uri	https://repo.niv.ns.ac.rs/xmlui/handle/123456789/1096
dc.description.abstract	Dengue virus (DENV) serotyping is essential for epidemiological surveillance, clinical risk assessment, and vaccine evaluation, as the four dengue serotypes differ in pathogenicity, immune interactions, and population dynamics. Existing subtyping methods largely rely on sequence alignment and phylogenetic inference, which can be computationally intensive and unreliable for short, fragmented, or error-prone sequences commonly generated in diagnostic and surveillance settings. There is a need for fast, alignment-free serotyping approaches that maintain high accuracy across heterogeneous sequence lengths while remaining scalable, transparent, and suitable for real-world diagnostic inputs. We demonstrate that compact 3-mer composition features are sufficient for highly accurate dengue virus serotyping when coupled with a lineage-aware Random Forest classification framework. Using 64 normalized 3-mer frequency features per sequence with ambiguity masking and enforcing strict cluster-aware validation at both 99% and 95% nucleotide identity thresholds, our approach achieved near-perfect accuracy and macro-F1 scores on held-out internal test sets. To further ensure independence, external validation datasets were filtered to remove exact sequence matches and any sequences sharing ≥99% or ≥95% nucleotide identity with internal data. On these strictly independent external datasets, the model maintained 100% accuracy and macro-F1 performance, confirming robust generalization beyond database redundancy. Robustness analyses showed stable performance under contiguous sequence truncation down to 300 bp and in the presence of ambiguous nucleotides, indicating resilience to realistic diagnostic inputs. These results demonstrate that a lightweight, alignment-free machine learning approach can rival alignment-dependent methods while maintaining strict lineage-aware evaluation controls. The proposed framework combines high predictive accuracy, probabilistic reliability, computational efficiency, and reproducible validation design, making it well suited for large-scale genomic surveillance, rapid pre-screening, and diagnostic decision-support applications.	en_US
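The feature representation named in the abstract (64 normalized 3-mer frequencies with ambiguity masking) can be illustrated with a minimal sketch. This is not the authors' code: the function name `three_mer_frequencies` is hypothetical, and "ambiguity masking" is assumed here to mean skipping any 3-mer window that contains a character other than A, C, G, or T. The article itself may define these details differently.

```python
from itertools import product

# All 64 possible 3-mers over A, C, G, T, in a fixed lexicographic order.
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def three_mer_frequencies(sequence):
    """Return a list of 64 normalized 3-mer frequencies for one sequence.

    Assumption (not confirmed by the record): windows containing any
    ambiguous symbol (e.g. N, R, Y) are masked, i.e. simply not counted.
    """
    seq = sequence.upper()
    counts = [0] * len(KMERS)
    # Slide a window of width 3 across the sequence, one base at a time.
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:  # unambiguous window
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        # No countable window (too short, or fully ambiguous input).
        return [0.0] * len(KMERS)
    return [c / total for c in counts]
```

Because each vector has a fixed length of 64 regardless of input length, full genomes and short fragments map into the same feature space, which is what allows a single classifier (a Random Forest in the abstract) to handle heterogeneous sequence lengths without alignment.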
dc.description.sponsorship	This research was funded by the Science Fund of the Republic of Serbia, Grant No. 10945, and by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia (Contract numbers 451-03-34/2026-03/200143, 451-03-33/2026-03/200031, and 451-03-34/2026-03/200117).	en_US
dc.language.iso	en	en_US
dc.publisher	MDPI	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.source	Viruses	en_US
dc.subject	k-mer composition	en_US
dc.subject	viral classification	en_US
dc.subject	vector-borne diseases	en_US
dc.subject	subtyping	en_US
dc.title	Alignment-Free Machine Learning Serotype Classification of the Dengue Virus	en_US
dc.type	Article	en_US
dc.identifier.doi	10.3390/v18030280
dc.citation.volume	18	en_US
dc.citation.rank	M21	en_US
dc.type.version	published	en_US

Files in this item

Name: aliggv26.pdf
Size: 882.6Kb
Format: PDF

This item appears in the following Collection(s)

Scientific papers
Papers published in scientific journals

Except where otherwise noted, this item's license is described as CC BY