Summary of Clinical microbiology with multi-view deep probabilistic models

Alejandro Jorge Guerrero Lopez

  • 1. Background

    Real-world data are often heterogeneous, mixing real-valued, categorical, multilabel, binary, and time-series data, which makes them challenging to exploit. Such data arise in many domains, including finance, weather, and health. Clinical microbiology is a field of the utmost importance for health in the current century. Pathogenic microorganisms are a major global public health threat, as acknowledged by international health organisations such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC), which makes their identification and discrimination a priority. Their most significant impacts include rapid spread, high morbidity and mortality rates, and the economic burden of treatment and control.

    The differentiation of microorganisms is particularly critical for clinical applications.

    For example, Clostridium difficile (C. diff) is known to increase the morbidity and mortality of healthcare-related infections. Additionally, over the past two decades, other bacteria, such as Klebsiella pneumoniae (K. pneumoniae), have exhibited a remarkable tendency to acquire antibiotic resistance mechanisms. As a result, treatment with ineffective antibiotics could lead to fatal outcomes. Machine learning (ML) has the potential to revolutionise clinical microbiology by automating existing methodologies and providing more efficient personalised treatments.

    However, microbiological data are challenging to exploit owing to the presence of a heterogeneous mix of data types, such as real-valued high-dimensional data, categorical indicators, multilabel epidemiological data, binary targets, or even time-series data representations. This problem, which in the field of ML is known as multi-view or multi-modal representation learning, has been studied in other application fields such as mental health monitoring or haematology. Multi-view learning combines different modalities or views representing the same data to extract richer insights and improve understanding. Each modality or view corresponds to a distinct encoding mechanism for the data, and this dissertation specifically addresses the issue of heterogeneity across multiple views.

    In the probabilistic ML field, multi-view learning is commonly addressed through Bayesian Factor Analysis (FA). Current solutions face limitations when handling high-dimensional data and non-linear associations. Recent research proposes deep probabilistic methods that learn hierarchical representations of the data and can capture intricate non-linear relationships between features. However, some deep learning (DL) techniques rely on complicated representations, which can hinder the interpretation of the outcomes. This lack of transparency may prevent healthcare professionals from fully trusting the models' outputs and lead to incorrect treatment decisions. Interpretable ML models are therefore crucial to enable healthcare professionals to understand the decision-making process and the rationale behind a model's output. The limitations of complex and opaque models need to be overcome to ensure their reliability and effectiveness in clinical microbiology. In addition, some inference methods used in DL approaches are computationally burdensome, which can hinder their practical application in real-world situations. There is thus a demand for more interpretable, explainable, and computationally efficient techniques for high-dimensional data. Multimodal representation learning could provide a better understanding of the microbial world by combining multiple views representing the same information, such as genomic, proteomic, and epidemiologic data.
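
    As a point of reference, a generic multi-view FA model can be written as below. This is a textbook-style sketch rather than the exact formulation used in the dissertation: the symbols (view index m, sample index n, loading matrices W^{(m)}, noise precisions \tau_m, latent dimension K) are notation assumed here for illustration.

```latex
\begin{aligned}
\mathbf{z}_n &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}_K), \\
\mathbf{x}_n^{(m)} \mid \mathbf{z}_n &\sim \mathcal{N}\!\left(\mathbf{W}^{(m)} \mathbf{z}_n,\; \tau_m^{-1} \mathbf{I}\right), \qquad m = 1, \dots, M.
\end{aligned}
```

    Every view is generated from the same latent vector z_n, so z_n has to encode what the views share, while each W^{(m)} describes how that shared information is expressed in view m.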

    2. Objective

    This dissertation seeks to advance research on the application of ML to automate clinical microbiology laboratory procedures. However, microbiology data pose significant challenges due to their multimodal nature and inherent heterogeneity.

    Additionally, the models developed must be interpretable to address the needs of clinical settings effectively. The aim of this thesis is therefore to develop probabilistic deep models that enable interpretability in the automation of clinical microbiology procedures.

    To do so, we propose two theoretical Bayesian models that can handle multimodal and heterogeneous data while ensuring the interpretability of the results. The first model is a kernelised formulation that can handle non-linear data relationships and provide compact representations through the automatic selection of relevant vectors. It applies Automatic Relevance Determination (ARD) over the kernel to determine the relevance of each input feature, thereby ensuring interpretability of the results. The second model is a hierarchical Variational AutoEncoder (VAE) that can handle a wide range of data types, including multilabel, continuous, binary, categorical, and image data, using an explainable FA latent space. Depending on the VAE architecture used, this versatility allows the model to be applied to a broad range of real-world data sets.

    3. Models

    3.1. KSSHIBA

    The first algorithm presented is the Kernelised SSHIBA (KSSHIBA), which extends the SSHIBA approach to handle non-linear data relationships, select relevant vectors (RVs), and determine input feature relevance by applying Automatic Relevance Determination (ARD) over the kernel. To test its capabilities, the model is evaluated on a variety of benchmark datasets.
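
    The idea of ARD over a kernel can be illustrated with a generic RBF kernel that carries one relevance weight per input feature. The sketch below is an assumption made for illustration, not the KSSHIBA implementation: the function name `ard_rbf_kernel` and the weight vector `gamma` are hypothetical, and in a Bayesian model the weights would be inferred rather than fixed by hand.

```python
import numpy as np


def ard_rbf_kernel(X, Y, gamma):
    """RBF kernel with one non-negative relevance weight per input feature.

    k(x, y) = exp(-sum_d gamma[d] * (x[d] - y[d]) ** 2)
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    # Scale each feature by sqrt(gamma_d) so a plain squared distance applies.
    Xs = X * np.sqrt(gamma)
    Ys = Y * np.sqrt(gamma)
    sq_dist = (
        np.sum(Xs**2, axis=1)[:, None]
        + np.sum(Ys**2, axis=1)[None, :]
        - 2.0 * Xs @ Ys.T
    )
    # Guard against tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0))


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma = np.array([1.0, 0.5, 0.0])  # hypothetical weights: third feature switched off
K = ard_rbf_kernel(X, X, gamma)

# Perturbing the switched-off feature leaves the kernel unchanged.
X_perturbed = X.copy()
X_perturbed[:, 2] += 10.0
K_perturbed = ard_rbf_kernel(X_perturbed, X_perturbed, gamma)
```

    A feature whose weight is driven to zero no longer influences the kernel, which is why the learnt weights can be read as a relevance mask over the input space.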

    First, it is tested in a multi-dimensional regression scenario using eight different datasets, where it outperforms similar solutions. Second, the selection of RVs of the kernel is evaluated on the same eight datasets. Finally, feature selection and interpretability are tested on three different image datasets. The results indicate that the proposed formulation achieves competitive performance while reducing the data to a set of interpretable latent variables and a compact model consisting of a reduced subset of RVs. Additionally, the feature relevance criterion can learn relevance masks that provide insight into which parts of the input space matter for the target task.

    In conclusion, the KSSHIBA framework offers a robust solution to the challenges posed by non-linear data relationships and high-dimensional data, particularly in the context of microbiology. Its kernelised formulation enables efficient and effective handling of complex data structures. Furthermore, the interpretability of its results is a critical feature in microbiology, where transparent and clear decision-making is essential.

    3.2. FA-VAE

    The second proposed model is FA-VAE, a hierarchical VAE for heterogeneous data that uses an interpretable FA latent space. This model builds upon the successful KSSHIBA approach, which can handle semi-supervised heterogeneous multi-view problems, and expands its capabilities to handle a wider range of data domains. The FA-VAE model has a hierarchical structure consisting of two levels. At the first level, each heterogeneous view is processed by a marginal VAE, which transforms it into a Gaussian-embedded space. At the second level, the FA model combines the Gaussian-embedded spaces and generates a global latent variable that contains the shared information about the multimodal dataset. The versatile architecture of the VAE can handle various types of data, such as continuous, image, and temporal data.
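
    The second level of this hierarchy can be sketched with a plain linear-Gaussian FA model. In the sketch below the per-view embeddings are generated directly and stand in for the outputs of the marginal VAEs, which are not implemented here. The function name `shared_latent_posterior`, the loading matrices, and the noise precisions are all hypothetical, and the closed-form posterior shown is the standard FA result, not necessarily the inference scheme used in FA-VAE.

```python
import numpy as np


def shared_latent_posterior(embeddings, loadings, precisions):
    """Posterior of the global latent z in a linear-Gaussian FA model.

    Assumed model: z ~ N(0, I) and x_m = W_m z + noise,
    with noise ~ N(0, I / tau_m) for each view m.
    """
    k = loadings[0].shape[1]
    posterior_precision = np.eye(k)  # contribution of the N(0, I) prior
    weighted_sum = np.zeros(k)
    for x_m, W_m, tau_m in zip(embeddings, loadings, precisions):
        posterior_precision += tau_m * W_m.T @ W_m
        weighted_sum += tau_m * W_m.T @ x_m
    cov = np.linalg.inv(posterior_precision)
    mean = cov @ weighted_sum
    return mean, cov


rng = np.random.default_rng(0)
k = 2                      # size of the global latent variable
view_dims = [4, 3]         # sizes of the two per-view embeddings
loadings = [rng.normal(size=(d, k)) for d in view_dims]
precisions = [1e6, 1e6]    # nearly noise-free views, for a clear illustration

# Stand-ins for the Gaussian embeddings a marginal VAE would produce per view.
z_true = np.array([0.5, -1.0])
embeddings = [W @ z_true for W in loadings]

z_mean, z_cov = shared_latent_posterior(embeddings, loadings, precisions)
```

    With nearly noise-free views, the posterior mean `z_mean` should recover the latent vector that generated both embeddings, which illustrates how a single global variable can summarise the information shared across views.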

    The performance of the FA-VAE model is assessed in three distinct experimental settings. First, a two-view scenario with images and categorical data is considered by fine-tuning a pre-trained VAE model. Second, the model's adaptability to new domains is tested by evaluating its performance in a domain adaptation problem.

    Finally, a novel approach is proposed that leverages FA-VAE as a transfer learning mechanism between generative models. The experimental results indicate that the model can be efficiently conditioned to arbitrary labels within 150 epochs, and can even generate emojis from real images. Additionally, the transfer learning approach enhances the posterior of a deeper model through knowledge transfer from a simpler model.

    The results highlight the robustness and versatility of the FA-VAE model for real-world applications, particularly in the field of microbiology. The modular nature of the model enables it to adapt to new data types and perform domain adaptation effectively. These capabilities are especially relevant in the healthcare setting, where data can be highly diverse and epidemiological profiles may vary. Furthermore, the model's ability to transfer knowledge from specific models learnt in the field provides a promising approach for enhancing the posteriors of other models.

    4. Applications

    This thesis has been conducted in partnership with the Instituto de Investigación Sanitaria Gregorio Marañón (IISGM), and as part of this collaboration, the previously proposed models were used to address two clinical microbiological problems.

    Specifically, the KSSHIBA model has been used to automatically determine the resistance of Klebsiella pneumoniae based on Matrix-Assisted Laser Desorption/Ionisation Time-of-Flight mass spectrometry (MALDI-TOF MS) data, while both the FA-VAE and KSSHIBA models have been used to automate ribotyping of Clostridium difficile strains.

    4.1. Automatic antibiotic resistance prediction using KSSHIBA

    We present a novel method for predicting antibiotic resistance in K. pneumoniae, specifically Extended-Spectrum Beta-Lactamase (ESBL) and Carbapenemase (CP) production, that integrates both MALDI-TOF spectra and epidemiological information.

    The method utilises multimodal learning through the KSSHIBA formulation, outperforming state-of-the-art algorithms such as XGBoost, LightGBM, MLP, SVMs, and GPs in terms of AUC. Notably, the proposed method is the first to process raw MALDI-TOF data without requiring external preprocessing and offers dimensionality reduction by integrating MALDI-TOF and epidemiological data into a low-dimensional latent space. The KSSHIBA algorithm provides interpretable results by leveraging epidemiological information through its multi-view architecture and enables automatic tuning of model hyperparameters using Bayesian inference.

    In our study, we applied the proposed KSSHIBA model to two bacterial domains: (1) data from a single hospital and (2) strains grouped from 18 hospitals across different geographic locations, selected based on their phenotypic and genotypic resistance to beta-lactams. Our results revealed that current non-heterogeneous models, such as GPs or SVMs, suffered from overfitting to one of the domains and performed poorly in the smaller domain. Therefore, multimodal models capable of analysing epidemiological information are necessary to predict antibiotic resistance in a fair and unbiased manner between domains. Our experiments demonstrated that it is crucial to account for different data distributions when working with two domains simultaneously. The inclusion of domain information improved the learning process of KSSHIBA, allowing it to properly model different data distributions, thereby overcoming the bias introduced by the data and avoiding overfitting, particularly if there is domain imbalance.

    This application contributes towards the important goal of reducing ineffective antibiotic prescribing by enabling the prediction of possible resistance mechanisms in K. pneumoniae. The implementation of our method in microbiological laboratories has the potential to improve the detection and treatment of multidrug-resistant infections, as well as significantly reduce the time required to obtain resistance results compared to traditional manual methods. This could have a substantial impact on global public health by improving patient outcomes.

    4.2. Automatic ribotyping based on probabilistic techniques

    In this preliminary study, we present a new approach for the automatic ribotyping of C. diff by harnessing the potential of probabilistic deep learning techniques using MALDI-TOF data. We investigate the practical viability of the proposed Bayesian FA models, namely KSSHIBA and FA-VAE, as a solution to this problem.

    To the best of our knowledge, this is the first demonstration of the feasibility of using probabilistic models to perform ribotyping of C. diff. To assess the viability of our approach, we conducted experiments on 275 samples from the Hospital General Universitario Gregorio Marañón (HGUGM) and achieved accuracy rates above 80% for all models, with particular configurations of KSSHIBA even reaching perfect accuracy. We also tested KSSHIBA and FA-VAE in a real-life outbreak scenario at the HGUGM, where FA-VAE performed a successful classification. Our results not only exhibit high accuracy in predicting the ribotype of each strain but also reveal an interpretable latent space, which represents a crucial advancement in the field. Moreover, whereas traditional ribotyping methods typically took 7 days to provide results, our proposed methods produced results on the same day, significantly reducing the time required to take action.

    It is important to note the limitations of this study, as they present opportunities for future research. For example, it would be beneficial to analyse strains with geographical differences in order to determine the generalisability of the findings.

    Moreover, wider testing over time should be performed, along with PCR techniques, to keep evaluating the system. Despite these limitations, this preliminary study successfully demonstrated the potential of using MALDI-TOF-based probabilistic deep learning for automating bacterial ribotyping. The promising results obtained in a real outbreak provide a solid foundation for further advancements in this field.

    The ultimate goal of this study is to establish the viability of MALDI-TOF-based probabilistic models for clinical use and to demonstrate their advantage over traditional methods in terms of time cost. The next step is to conduct a rigorous longitudinal study that uses the FA-VAE and KSSHIBA models in real-world laboratory procedures and compares them to traditional PCR techniques.

    Moreover, the scope of future studies should be expanded to include a diverse range of sample origins, in order to fully understand the impact of epidemiological characteristics on bacterial ribotyping. This will not only deepen our understanding of bacterial ribotyping but also inform more effective public health measures.

    In conclusion, this study is a crucial stepping stone towards realising the full potential of MALDI-TOF for bacterial ribotyping and advancing our ability to tackle bacterial outbreaks.

    5. Conclusions

    In this dissertation, we have presented two technical innovations that address the challenge of incorporating diverse and multiple data sources. Specifically, we have tailored these solutions to handle microbiological data, working in collaboration with the Instituto de Investigación Sanitaria Gregorio Marañón (IISGM), and implemented them in real-world microbiology laboratories. We have developed a novel approach that merges advanced Factor Analysis (FA) techniques with kernel-based methods and powerful generative models such as Variational AutoEncoders (VAEs).

    The result is a set of robust, modular, and easily interpretable models that have been applied to two important microbiological scenarios: the prediction of antibiotic resistance and the automation of the ribotyping procedure.

    These contributions represent an advancement in the field and pave the way for future breakthroughs in microbiological research.

    In light of the results presented in this dissertation, there is immense potential for probabilistic models to revolutionise the field of microbiology and fully automate laboratory procedures. This research can be extended towards broader objectives, such as improving the predictive performance of FA-VAE in unbalanced multi-view problems by modifying the regularisation term of the marginal VAEs. Additionally, we aim to widen the scope of the K. pneumoniae study by adding international strains from a recently published dataset of Swiss MALDI-TOF MS. Furthermore, a longitudinal study of C. diff, building on the work described above, will be performed to test the models in real-time situations at the HGUGM. Finally, a future line of work is to perform a multi-view unsupervised analysis of the spread of E. coli, as it is currently one of the most prevalent pathogens in the HGUGM.

