Leveraging large-scale Mycobacterium tuberculosis whole genome sequence data to characterise drug-resistant mutations using machine learning and statistical approaches

dc.contributor.author: Pruthi, Siddharth Sanjay
dc.contributor.author: Billows, Nina
dc.contributor.author: Thorpe, Joseph
dc.contributor.author: Campino, Susana
dc.contributor.author: Phelan, Jody E.
dc.contributor.author: Mohareb, Fady R.
dc.contributor.author: Clark, Taane G.
dc.date.accessioned: 2024-12-19T16:11:01Z
dc.date.available: 2024-12-19T16:11:01Z
dc.date.freetoread: 2024-12-19
dc.date.issued: 2024-12-01
dc.date.pubOnline: 2024-11-07
dc.description.abstract: Tuberculosis disease (TB), caused by Mycobacterium tuberculosis (Mtb), is a major global public health problem, resulting in >1 million deaths each year. Drug resistance (DR), including the multi-drug form (MDR-TB), is challenging control of the disease. Whilst many DR mutations in the Mtb genome are known, analysis of large datasets generated using whole genome sequencing (WGS) platforms can reveal new variants through the assessment of genotype-phenotype associations. Here, we apply tree-based ensemble methods to a dataset comprised of 35,777 Mtb WGS and phenotypic drug-susceptibility test data across first- and second-line drugs. We compare model performance across models trained using mutations in drug-specific regions and genome-wide variants, and find high predictive ability for both first-line (area under ROC curve (AUC); range 88.3–96.5) and second-line (AUC range 84.1–95.4) drugs. To aggregate information from low-frequency variants, we pool mutations by functional impact and observe large improvements in predictive accuracy (e.g., sensitivity: pyrazinamide +25%; ethionamide +10%). We further characterise loss-of-function mutations observed in resistant phenotypes, uncovering putative markers of resistance (e.g., ndh 293dupG, Rv3861 78delC). Finally, we profile the distribution of known DR-associated single nucleotide polymorphisms across discretised minimum inhibitory concentration (MIC) data generated from phenotypic testing (n = 12,066), and identify mutations associated with highly resistant phenotypes (e.g., inhA −779G>T and 62T>C). Overall, our work demonstrates that applying machine learning to large-scale WGS data is useful for providing insights into predicting Mtb binary drug resistance and MIC phenotypes, thereby potentially assisting diagnosis and treatment decision-making for infection control.
dc.description.journalName: Scientific Reports
dc.description.sponsorship: Biotechnology and Biological Sciences Research Council, Engineering and Physical Sciences Research Council, Medical Research Council
dc.format.medium: Electronic
dc.identifier.citation: Pruthi SS, Billows N, Thorpe J, et al. (2024) Leveraging large-scale Mycobacterium tuberculosis whole genome sequence data to characterise drug-resistant mutations using machine learning and statistical approaches. Scientific Reports, Volume 14, November 2024, Article number 27091 [en_UK]
dc.identifier.eissn: 2045-2322
dc.identifier.elementsID: 559314
dc.identifier.issn: 2045-2322
dc.identifier.paperNo: 27091
dc.identifier.uri: https://doi.org/10.1038/s41598-024-77947-w
dc.identifier.uri: https://dspace.lib.cranfield.ac.uk/handle/1826/23279
dc.identifier.volumeNo: 14
dc.language: English
dc.language.iso: en
dc.publisher: Springer [en_UK]
dc.publisher.uri: https://www.nature.com/articles/s41598-024-77947-w
dc.rights: Attribution 4.0 International [en]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: 31 Biological Sciences [en_UK]
dc.subject: 3102 Bioinformatics and Computational Biology [en_UK]
dc.subject: 32 Biomedical and Clinical Sciences [en_UK]
dc.subject: 3105 Genetics [en_UK]
dc.subject: Rare Diseases [en_UK]
dc.subject: Tuberculosis [en_UK]
dc.subject: Orphan Drug [en_UK]
dc.subject: Infectious Diseases [en_UK]
dc.subject: Antimicrobial Resistance [en_UK]
dc.subject: Emerging Infectious Diseases [en_UK]
dc.subject: Prevention [en_UK]
dc.subject: Human Genome [en_UK]
dc.subject: Biodefense [en_UK]
dc.subject: Machine Learning and Artificial Intelligence [en_UK]
dc.subject: 2.1 Biological and endogenous factors [en_UK]
dc.subject: 2.5 Research design and methodologies (aetiology) [en_UK]
dc.subject: 4.1 Discovery and preclinical testing of markers and technologies [en_UK]
dc.subject: Infection [en_UK]
dc.subject: 3 Good Health and Well Being [en_UK]
dc.subject.mesh: Mycobacterium tuberculosis [en_UK]
dc.subject.mesh: Machine Learning [en_UK]
dc.subject.mesh: Whole Genome Sequencing [en_UK]
dc.subject.mesh: Mutation [en_UK]
dc.subject.mesh: Antitubercular Agents [en_UK]
dc.subject.mesh: Humans [en_UK]
dc.subject.mesh: Genome, Bacterial [en_UK]
dc.subject.mesh: Tuberculosis, Multidrug-Resistant [en_UK]
dc.subject.mesh: Microbial Sensitivity Tests [en_UK]
dc.subject.mesh: Drug Resistance, Multiple, Bacterial [en_UK]
dc.subject.mesh: Drug Resistance, Bacterial [en_UK]
dc.title: Leveraging large-scale Mycobacterium tuberculosis whole genome sequence data to characterise drug-resistant mutations using machine learning and statistical approaches [en_UK]
dc.type: Article
dc.type.subtype: Journal Article
dcterms.dateAccepted: 2024-10-28

Files

Original bundle

Name: Leveraging_large-scale_Mycobacterium_tuberculosis-2024.pdf
Size: 2.68 MB
Format: Adobe Portable Document Format
Description: Published version
