Documat


The role of blood DNA methylation in environment-related chronic disease: A biostatistical toolkit

  • Author: Arce Domingo Relloso
  • Thesis supervisors: José Domingo Bermúdez Edo (supervisor), María Téllez Plaza (co-supervisor)
  • Defense: Universitat de València (Spain), 2023
  • Language: English
  • Thesis committee: Juan Ramón González Ruiz (chair), Carmen Iñiguez Hernández (secretary), Elena Colicino (member)
  • Abstract
    • Epigenetic changes refer to modifications that alter gene expression without changing the genomic sequence. Environmental and behavioral factors are well-known epigenetic modifiers, leading to heritable changes that might disrupt essential biological processes and, in turn, influence the development of disease.

      DNA methylation is the most widely studied epigenetic mark. Scientific evidence supports an association between environmental factors, such as smoking and metals, and DNA methylation dysregulation. Evidence also supports an association between DNA methylation dysregulation and chronic disease, especially cancer. However, it is unknown whether these associations are causal or arise because DNA methylation is a biomarker of other disrupted biological processes.

      To evaluate the potential role of genome-wide DNA methylation in the association between environmental factors and chronic disease, appropriate statistical methods for the analysis of ultra-high dimensional and highly correlated data are needed.

      To begin with, we need to select which methylation sites in the genome are associated with our outcome of interest. Existing methods for variable selection and effect estimation lose predictive ability and are subject to bias in ultra-high dimensional settings. Additionally, they are not able to quantify statistical uncertainty.

      Once the set of epigenomic features associated with our outcome has been selected, mediation analysis is a valuable tool to quantify the potential intermediate effect of these methylation sites on the association between environmental factors and chronic disease. The most biologically plausible scenario is that several correlated DNA methylation marks (as opposed to a single one) are mediators between an exposure and an outcome. In addition, it is common to consider time-to-event outcomes in epidemiological settings, in order to incorporate the time at which the outcome happened into the statistical model. However, to date, no mediation analysis algorithms able to deal with multiple correlated mediators and survival outcomes have been developed.
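      As a point of reference for the multiple-mediator setting described above, the following is a minimal sketch of a joint mediated effect for a continuous outcome, using the product-of-coefficients approach with ordinary least squares. It assumes linear models without exposure-mediator interaction, and it is not the multimediate algorithm or a survival model. The helper `ols`, the variable names, and the simulated effect sizes are all hypothetical.

```python
import random


def ols(columns, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Augmented system (X'X | X'y).
    aug = [[sum(r[a] * r[b] for r in design) for b in range(k)]
           + [sum(r[a] * yi for r, yi in zip(design, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [aug[r][k] / aug[r][r] for r in range(k)]


# Simulated data (assumed effect sizes, for illustration only).
rng = random.Random(7)
n = 500
exposure = [rng.gauss(0, 1) for _ in range(n)]
shared = [rng.gauss(0, 1) for _ in range(n)]  # makes the two mediators correlated
m1 = [0.8 * a + 0.6 * s + rng.gauss(0, 1) for a, s in zip(exposure, shared)]
m2 = [0.5 * a + 0.6 * s + rng.gauss(0, 1) for a, s in zip(exposure, shared)]
outcome = [0.3 * a + 0.7 * x1 + 0.4 * x2 + rng.gauss(0, 1)
           for a, x1, x2 in zip(exposure, m1, m2)]

# Exposure -> mediator slopes.
alpha1 = ols([exposure], m1)[1]
alpha2 = ols([exposure], m2)[1]
# Outcome model with exposure and both mediators.
_, direct, beta1, beta2 = ols([exposure, m1, m2], outcome)
# Total effect from the outcome model without mediators.
total = ols([exposure], outcome)[1]

joint_indirect = alpha1 * beta1 + alpha2 * beta2
proportion_mediated = joint_indirect / total
print(f"direct={direct:.3f} indirect={joint_indirect:.3f} "
      f"proportion mediated={proportion_mediated:.3f}")
```

      In this linear setting the total effect equals the direct effect plus the joint indirect effect, so the proportion mediated is well defined; splitting the joint effect among correlated mediators is the harder problem that dedicated algorithms address.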

      Thus, this thesis has two main objectives, the first one related to variable selection in ultra-high dimensional settings, and the second one focused on multiple mediation analysis with survival outcomes.

      Abstract of objective 1. The first objective of this thesis arises from the need to extend the Iterative Sure Independence Screening (ISIS) statistical tool, which conducts variable selection for ultra-high dimensional data, in order to improve its predictive accuracy and effect estimation and to incorporate statistical uncertainty. The objective was to pair the ISIS algorithm with two shrinkage methods, elastic-net and adaptive elastic-net (Aenet), and to include an algorithm for calculating bootstrap-based confidence intervals. This extension of ISIS has been added to the SIS R package, which is available in the CRAN repository.
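      The screening idea that ISIS builds on, together with a bootstrap interval, can be sketched as follows. This is an illustrative simplification and not the SIS package's implementation: features are ranked once by absolute marginal correlation, with no iterative refitting and no elastic-net penalty, and the interval is a plain percentile bootstrap for one marginal correlation. The function names `screen` and `bootstrap_ci` and the simulated data are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def screen(X, y, d):
    """Indices of the d columns with the largest absolute marginal correlation with y."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:d]


def bootstrap_ci(X, y, j, n_boot=200, alpha=0.05, rng=None):
    """Percentile bootstrap interval for the marginal correlation of column j."""
    rng = rng or random.Random(0)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        draws.append(pearson([X[i][j] for i in idx], [y[i] for i in idx]))
    draws.sort()
    lower = draws[int((alpha / 2) * n_boot)]
    upper = draws[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data with more features than observations; only columns 0 and 1 matter.
rng = random.Random(42)
n, p = 100, 300
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

selected = screen(X, y, d=10)
ci = bootstrap_ci(X, y, j=0)
print("selected columns:", sorted(selected))
print("95% bootstrap interval for column 0:", ci)
```

      The iterative version repeats the screening on what earlier steps leave unexplained and refits a penalized model at each pass, which is where the choice of shrinkage method enters.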

      As part of this first objective, this dissertation shows two applications of the ISIS algorithm. For this purpose, we used data from the Strong Heart Study (SHS), the largest and longest prospective cohort of American Indians. The first application aimed to evaluate the improvements introduced by our extension of ISIS (Aenet, elastic-net, MSAenet) as compared to other shrinkage methods implemented in the original version. The ISIS algorithm paired with Aenet provides increased predictive ability as compared to the original ISIS version, especially for continuous and binary outcomes. Additionally, pairing ISIS with Aenet yields more consistent effect estimation because Aenet fulfills the oracle property. Our bioinformatics analysis reveals that it also leads to more robust variable selection in terms of subsequent biological pathway enrichment.

      The second application is an epidemiologic study in which we evaluate the potential intermediate role of single DNA methylation sites in the well-documented association between arsenic and cardiovascular disease (CVD). We used the ISIS algorithm paired with Aenet to select methylation sites associated with CVD, and we subsequently conducted a simple mediation analysis (one marker at a time) in the selected sites. We found statistically significant mediated effects for 21 and 15 differentially methylated positions (DMPs) for CVD incidence and mortality, respectively. In addition, six of the 21 DMPs showing statistically significant mediated effects for CVD incidence were replicated in three independent American cohorts (the Framingham Heart Study, the Women's Health Initiative and the Multi-Ethnic Study of Atherosclerosis) with the same direction of association. The genes annotated to methylation sites with statistically significant mediated effects were also replicated in a mouse model. The biological plausibility of those genes in CVD lends additional support to the results.

      Abstract of objective 2. The second objective of this thesis focuses on the extension of the multimediate algorithm, which conducts mediation analysis in the context of multiple correlated mediators, to survival outcomes. Jerolon and colleagues developed this algorithm for continuous and binary outcomes. Using the Lin-Ying additive models, we extended the multimediate algorithm, as well as the theoretical results for identification of mediated effects, to time-to-event data. In addition, we adapted the multimediate algorithm to incorporate potential exposure-mediator interactions. The extension of the algorithm to survival outcomes is available in the following GitHub repository: https://github.com/AllanJe/multimediate. The extension including exposure-mediator interactions will soon be posted in the same repository.
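      For orientation, an additive hazards model of the Lin-Ying type combined with a linear mediator model can be written as below. The notation (exposure A, mediator vector M, covariates C, exposure levels a and a*) is an assumption introduced here, and the product-form expressions are the standard result for the case of constant coefficients, normally distributed mediator errors and no exposure-mediator interaction. The identification results derived in the thesis may be stated differently.

```latex
\begin{align}
\lambda(t \mid A, M, C) &= \lambda_0(t) + \beta_A A + \beta_M^{\top} M + \beta_C^{\top} C, \\
M &= \alpha_0 + \alpha_A A + \alpha_C C + \varepsilon,
  \qquad \varepsilon \sim N(0, \Sigma), \\
\text{joint indirect effect} &= \beta_M^{\top} \alpha_A \,(a - a^{*}), \\
\text{direct effect} &= \beta_A \,(a - a^{*}).
\end{align}
```

      Both effects are on the hazard-difference scale. A non-diagonal covariance matrix Σ is what allows the mediators to be correlated beyond their shared dependence on the exposure.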

      As part of this second objective, we also included two applications of this algorithm. The first is a simulation study in which we show that the multimediate algorithm performs better than simple mediation analysis, even in settings with uncorrelated mediators.

      The second application is an epidemiologic study in which we investigate the potential intermediate role of multiple, potentially correlated, DNA methylation marks in the association between smoking and smoking-related cancers using data from the SHS. We first used the ISIS algorithm paired with elastic-net to select DNA methylation sites associated with cancer. Subsequently, we applied the multimediate algorithm to evaluate several methylation sites as potential mediators of the association between smoking and cancer. The algorithm identified a joint mediated effect of 81.3% attributable to three DMPs for lung cancer, and of 64.4% attributable to four DMPs for a combined endpoint including all smoking-related cancers available (lung, esophagus-stomach, colorectal, liver, pancreatic and kidney). The results of the mediation analysis were largely replicated in an independent population (the Framingham Heart Study), in which we also conducted functional validation using gene expression data. In general, we found an inverse association between DNA methylation and gene expression for the methylation sites identified in our mediation analysis.

      In addition to these two main objectives, this thesis presents a short section focused on gene expression, the biological process directly influenced by DNA methylation, which points to future lines of research. Even if mediated effects of DNA methylation on the association between environmental factors and chronic disease are identified, this does not necessarily imply causality, as unmeasured confounders and other sources of bias might exist. Thus, investigating the biological processes influenced by DNA methylation might provide functional support for its role in chronic disease.

      In particular, gene expression measured in single cells (scRNAseq) is at the forefront of omics data research, as it enables the characterization of cell heterogeneity. However, these data present statistical challenges due to high proportions of zeros obtained in gene expression measurements for each individual gene and cell.

      In addition to differences in mean gene expression across groups, differences in variability have been shown to be biologically relevant. Several methods have been developed for the evaluation of differential variability in omics data. However, these methods are not specific to scRNAseq data. In this thesis, we used simulations to evaluate the impact of high proportions of zero counts on statistical methods for the identification of differentially variable genes in scRNAseq data. We found that high proportions of zeros lead to inflated variances and p-values, as well as higher false discovery rates. The distinct algorithm, which uses permutation tests to identify differences in distributions across groups, shows the best performance in terms of compromise between false discovery and true positive rates.
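      To illustrate what a permutation test for differential variability looks like on zero-heavy data, here is a minimal sketch. It is not the distinct algorithm, which compares whole distributions; this version permutes group labels and compares a single statistic, the absolute difference in variance. The function names, the zero-inflated generator and all parameter values are hypothetical.

```python
import random
import statistics


def variance_diff(a, b):
    """Absolute difference between the population variances of two samples."""
    return abs(statistics.pvariance(a) - statistics.pvariance(b))


def permutation_pvalue(a, b, n_perm=500, rng=None):
    """Permutation p-value for a difference in variance between groups a and b."""
    rng = rng or random.Random(0)
    observed = variance_diff(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if variance_diff(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def zero_inflated(n, mean, sd, p_zero, rng):
    """Draw n non-negative values, each set to zero with probability p_zero."""
    return [0.0 if rng.random() < p_zero else max(0.0, rng.gauss(mean, sd))
            for _ in range(n)]


# Two groups with the same zero fraction and mean but different spread.
rng = random.Random(1)
group_a = zero_inflated(200, mean=5.0, sd=1.0, p_zero=0.3, rng=rng)
group_b = zero_inflated(200, mean=5.0, sd=3.0, p_zero=0.3, rng=rng)

p_value = permutation_pvalue(group_a, group_b)
print("permutation p-value:", p_value)
```

      Raising `p_zero` in both groups is one way to explore how a growing share of zeros affects the test, which is the kind of question the simulations in the thesis address.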

      In summary, this thesis has contributed to the field of omics data research, both by providing novel statistical methods for DNA methylation data analysis, which can also be used for other omics, and by contributing to the body of epidemiological evidence that supports a role of environmental epigenetics in chronic disease.

