
dc.rights.license: CC BY
dc.contributor.author: Gajdov, Vladimir
dc.contributor.author: Prošić, Isidora
dc.contributor.author: Kavran, Mihaela
dc.contributor.author: Bosilkov, Filip
dc.contributor.author: Petrović, Tamaš
dc.contributor.author: Konstantinov, Jelena
dc.contributor.author: Lazić, Gospava
dc.date.accessioned: 2026-02-28T21:04:12Z
dc.date.available: 2026-02-28T21:04:12Z
dc.date.issued: 2026
dc.identifier.issn: 1999-4915
dc.identifier.uri: https://repo.niv.ns.ac.rs/xmlui/handle/123456789/1096
dc.description.abstract: Dengue virus (DENV) serotyping is essential for epidemiological surveillance, clinical risk assessment, and vaccine evaluation, as the four dengue serotypes differ in pathogenicity, immune interactions, and population dynamics. Existing subtyping methods largely rely on sequence alignment and phylogenetic inference, which can be computationally intensive and unreliable for short, fragmented, or error-prone sequences commonly generated in diagnostic and surveillance settings. There is a need for fast, alignment-free serotyping approaches that maintain high accuracy across heterogeneous sequence lengths while remaining scalable, transparent, and suitable for real-world diagnostic inputs. We demonstrate that compact 3-mer composition features are sufficient for highly accurate dengue virus serotyping when coupled with a lineage-aware Random Forest classification framework. Using 64 normalized 3-mer frequency features per sequence with ambiguity masking and enforcing strict cluster-aware validation at both 99% and 95% nucleotide identity thresholds, our approach achieved near-perfect accuracy and macro-F1 scores on held-out internal test sets. To further ensure independence, external validation datasets were filtered to remove exact sequence matches and any sequences sharing ≥99% or ≥95% nucleotide identity with internal data. On these strictly independent external datasets, the model maintained 100% accuracy and macro-F1 performance, confirming robust generalization beyond database redundancy. Robustness analyses showed stable performance under contiguous sequence truncation down to 300 bp and in the presence of ambiguous nucleotides, indicating resilience to realistic diagnostic inputs. These results demonstrate that a lightweight, alignment-free, machine learning approach can rival alignment-dependent methods while maintaining strict lineage-aware evaluation controls.
The proposed framework combines high predictive accuracy, probabilistic reliability, computational efficiency, and reproducible validation design, making it well suited for large-scale genomic surveillance, rapid pre-screening, and diagnostic decision-support applications. [en_US]
dc.description.sponsorship: This research was funded by the Science Fund of the Republic of Serbia, #GRANT No 10945, and by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia (Contract numbers 451-03-34/2026-03/200143, 451-03-33/2026-03/200031, and 451-03-34/2026-03/200117). [en_US]
dc.language.iso: en [en_US]
dc.publisher: MDPI [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.source: Viruses [en_US]
dc.subject: k-mer composition [en_US]
dc.subject: viral classification [en_US]
dc.subject: vector-borne diseases [en_US]
dc.subject: subtyping [en_US]
dc.title: Alignment-Free Machine Learning Serotype Classification of the Dengue Virus [en_US]
dc.type: Article [en_US]
dc.identifier.doi: 10.3390/v18030280
dc.citation.volume: 18 [en_US]
dc.citation.rank: M21 [en_US]
dc.type.version: published [en_US]
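
Illustrative note on the abstract: the feature step it describes (64 normalized 3-mer frequencies per sequence, with ambiguity masking) and the contiguous-truncation robustness check might be sketched roughly as below. This is a minimal sketch, not the authors' code. The names `trimer_frequencies` and `random_window` are hypothetical, and "ambiguity masking" is assumed here to mean skipping any 3-mer window that contains a character other than A, C, G, or T. The Random Forest classifier and the cluster-aware validation are not shown.

```python
import random
from itertools import product

# All 64 possible 3-mers over A, C, G, T, in lexicographic order.
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def trimer_frequencies(sequence):
    """Return 64 normalized 3-mer frequencies for one nucleotide sequence.

    Assumption: "ambiguity masking" is modelled by skipping every window
    that contains a character other than A, C, G, or T (for example N or R).
    """
    seq = sequence.upper()
    counts = [0] * len(KMERS)
    for i in range(len(seq) - 2):
        idx = INDEX.get(seq[i:i + 3])
        if idx is not None:  # window has only unambiguous bases
            counts[idx] += 1
    total = sum(counts)
    if total == 0:  # too short, or every window was masked
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def random_window(sequence, length, rng):
    """Return one contiguous fragment of `length` bases (hypothetical helper)."""
    if length >= len(sequence):
        return sequence
    start = rng.randint(0, len(sequence) - length)
    return sequence[start:start + length]


# "ACGTNACGT": the three windows touching N are skipped, leaving
# ACG twice and CGT twice, so each of those features is 0.5.
features = trimer_frequencies("ACGTNACGT")

# A 300-base contiguous fragment of a toy 400-base sequence.
fragment = random_window("ACGT" * 100, 300, random.Random(0))
```

Under these assumptions every sequence maps to a fixed-length vector regardless of its length, which is what lets full genomes and short fragments share one classifier input format.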


Except where otherwise noted, this item's license is described as CC BY