Abstract of Estratègies de kernel per a la predicció de fenotips complexos (Kernel strategies for the prediction of complex phenotypes)

Elies Ramon Gurrea

  • The relationship between phenotype and genotypic information is intricate. Machine Learning (ML) methods have been used successfully for phenotype prediction across a wide range of problems in genetics and genomics. However, biological data are often structured and belong to ‘nonstandard’ data types, which can pose a challenge to most ML methods. Among ML approaches, kernel methods offer a versatile way to handle different types of data and problems through a family of functions called kernels.

    The main goal of this PhD thesis is the development and evaluation of specific kernel approaches for phenotypic prediction, focusing on biological problems with structured data types or study designs.

    In the first part, we predict drug resistance from mutated HIV protein sequences (protease, reverse transcriptase and integrase). We propose two categorical kernel functions (Overlap and Jaccard) that take into account particularities of HIV data, such as allele mixtures. The proposed kernels are coupled with Support Vector Machines (SVM) and compared against two well-known standard kernel functions (Linear and RBF) and two nonkernel methods: Random Forests (RF) and the Multilayer Perceptron neural network. We also incorporate relative weights into the aforementioned kernels, representing the importance of each protein residue for drug resistance. Taking into account both the categorical nature of the data and the presence of mixtures consistently delivers better predictions. The effect of weighting is stronger for reverse transcriptase and integrase inhibitors, which may be related to the different mutational patterns of drug resistance in the viral enzymes.

    In the second part, we extend the previous study to account for the fact that protein positions are not independent. Mutated sequences are modeled as graphs, with edges weighted by the Euclidean distance between residues, obtained from three-dimensional crystal structures. A kernel for graphs (the exponential random walk kernel) that integrates the previous Overlap and Jaccard kernels is then computed. Despite the potential advantages of this graph kernel, no improvement in predictive ability over the kernels of the first study is observed.

    In the third part, we propose a kernel framework to unify unsupervised and supervised microbiome analyses. To do so, we use the same kernel matrix to perform phenotype prediction via SVMs and visualization via kernel Principal Components Analysis (kPCA). We define two kernels for compositional data (Aitchison-RBF and compositional linear) and discuss the transformation of beta-diversity measures into kernels. The compositional linear kernel also allows the retrieval of taxa importances (microbial signatures) from the SVM model. Spatial and time-structured datasets are handled with Multiple Kernel Learning and kernels for time series, respectively. We illustrate the kernel framework with three datasets: a single-point soil dataset, a human dataset with a spatial component, and a previously unpublished longitudinal dataset concerning pig production. Analyses across the three case studies include a comparison with the original reports (for the first two datasets), as well as a contrast with results from RF. The kernel framework not only allows a holistic view of the data but also gives good results in both unsupervised and supervised learning. In unsupervised analyses, the main patterns detected in the original reports are conserved in kPCA. In supervised analyses, SVM performs better than (or, in some cases, on par with) RF, while microbial signatures are consistent with the original studies and previous literature.
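
As an illustration of the compositional kernels mentioned for the third part, the following is a minimal sketch, not the thesis implementation. It assumes that the compositional linear kernel is a dot product of centered log-ratio (clr) transformed samples and that the Aitchison-RBF kernel is an RBF kernel on the Aitchison distance (the Euclidean distance in clr space). The function names, the `pseudocount` used to handle zeros, and the `gamma` parameter are all hypothetical choices made for this example.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centered log-ratio transform of one sample (a list of abundances).

    Adding a small pseudocount to avoid log(0) is an assumption of this
    sketch; other zero-replacement strategies exist.
    """
    shifted = [x + pseudocount for x in composition]
    total = sum(shifted)
    logs = [math.log(x / total) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def compositional_linear(x, y):
    """Dot product of the clr-transformed samples (assumed definition)."""
    return sum(a * b for a, b in zip(clr(x), clr(y)))


def aitchison_rbf(x, y, gamma=1.0):
    """RBF kernel on the squared Aitchison distance (assumed definition)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(clr(x), clr(y)))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(samples, kernel):
    """Evaluate a kernel on every pair of samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Toy abundance table: three samples, three taxa (made-up numbers).
samples = [[10, 20, 70], [1, 2, 7], [50, 30, 20]]
K = kernel_matrix(samples, aitchison_rbf)
```

Because the result is an ordinary kernel matrix, the same `K` could in principle feed both a classifier and a projection, for example scikit-learn's `SVC(kernel="precomputed")` and `KernelPCA(kernel="precomputed")`. That tooling is a suggestion of this note, not something the abstract specifies.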
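
The categorical kernels of the first part can be sketched in a similar spirit. This is a simplified illustration under stated assumptions, not the thesis formulation: each aligned protein position is represented as a set of observed alleles (so a mixture is a set with more than one element), the per-position similarities are averaged with optional non-negative residue weights, and the Overlap variant treats a mixture as its own category. All function names and the toy sequences are hypothetical.

```python
def overlap_position(a, b):
    """Return 1.0 if the two allele sets are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def jaccard_position(a, b):
    """Return |a & b| / |a | b| for two allele sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sequence_kernel(seq_x, seq_y, position_kernel, weights=None):
    """Weighted average of a per-position kernel over two aligned sequences.

    `weights` stands in for the per-residue importance mentioned in the
    abstract; uniform weights are used when none are given.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    if weights is None:
        weights = [1.0] * len(seq_x)
    total = sum(weights)
    weighted = sum(
        w * position_kernel(a, b) for w, a, b in zip(weights, seq_x, seq_y)
    )
    return weighted / total


# Two toy three-residue sequences; the second position of x is an M/I mixture.
x = [{"L"}, {"M", "I"}, {"V"}]
y = [{"L"}, {"I"}, {"A"}]

k_overlap = sequence_kernel(x, y, overlap_position)
k_jaccard = sequence_kernel(x, y, jaccard_position)
k_jaccard_weighted = sequence_kernel(x, y, jaccard_position, weights=[1.0, 2.0, 1.0])
```

In this toy case the Jaccard variant gives partial credit at the mixture position, where the two sequences share the allele I, while the Overlap variant counts that position as a mismatch.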

