| dc.description.abstract | Dengue virus (DENV) serotyping is essential for epidemiological surveillance, clinical risk assessment, and vaccine evaluation, as the four dengue serotypes differ in pathogenicity, immune interactions, and population dynamics. Existing subtyping methods largely rely on sequence alignment and phylogenetic inference, which can be computationally intensive and unreliable for the short, fragmented, or error-prone sequences commonly generated in diagnostic and surveillance settings. There is a need for fast, alignment-free serotyping approaches that maintain high accuracy across heterogeneous sequence lengths while remaining scalable, transparent, and suitable for real-world diagnostic inputs. We demonstrate that compact 3-mer composition features are sufficient for highly accurate dengue virus serotyping when coupled with a lineage-aware Random Forest classification framework. Using 64 normalized 3-mer frequency features per sequence with ambiguity masking, and enforcing strict cluster-aware validation at both 99% and 95% nucleotide identity thresholds, our approach achieved near-perfect accuracy and macro-F1 scores on held-out internal test sets. To further ensure independence, external validation datasets were filtered to remove exact sequence matches and any sequences sharing ≥99% or ≥95% nucleotide identity with internal data. On these strictly independent external datasets, the model maintained 100% accuracy and macro-F1, confirming robust generalization beyond database redundancy. Robustness analyses showed stable performance under contiguous sequence truncation down to 300 bp and in the presence of ambiguous nucleotides, indicating resilience to realistic diagnostic inputs. These results demonstrate that a lightweight, alignment-free machine learning approach can rival alignment-dependent methods while maintaining strict lineage-aware evaluation controls. The proposed framework combines high predictive accuracy, probabilistic reliability, computational efficiency, and reproducible validation design, making it well suited for large-scale genomic surveillance, rapid pre-screening, and diagnostic decision-support applications. | en_US |