Compositional methodology and statistical inference of family relationships using genetic markers

Iván Galván-Femenía

Authors: Iván Galván-Femenía
Thesis supervisors: Jan Graffelman (supervisor), Carles Barceló i Vidal (co-supervisor)
Defense: at the Universitat de Girona (Spain) in 2020
Language: Spanish
Thesis examination committee: M. Luz Calle Rosingana (chair), Glòria Mateu Figueras (secretary), Jaume Bertranpetit Busquets (member)
Links
- Open-access thesis available at: TDX
Abstract
- The present thesis is a compendium of three research articles produced between 2015 and 2019. The three articles share a common thread: each is a contribution based on compositional statistical methodology and the statistical inference of genetic relatedness. In brief, Compositional Data are random vectors with strictly positive components whose sum is constant. These components represent parts of a whole and carry only relative information. Compositional Data are therefore usually represented as proportions or percentages. Relatedness is based on the principle of allele sharing between individuals for a given set of genetic markers. The larger the proportion of alleles shared between a pair of individuals, the more likely they are to be related.
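As a small illustration of the definition above, the following sketch closes a vector of strictly positive parts to a constant sum and checks that the ratios between parts, which carry the relative information, are unchanged. The function name `closure` and the counts are hypothetical and chosen only for this example; they are not taken from the thesis.

```python
import numpy as np

def closure(x, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    return total * x / x.sum(axis=-1, keepdims=True)

# Hypothetical numbers of markers at which a pair shares 0, 1 or 2 alleles.
counts = np.array([120.0, 480.0, 400.0])

proportions = closure(counts)               # parts sum to 1
percentages = closure(counts, total=100.0)  # parts sum to 100

# Ratios between parts do not depend on the chosen total.
ratio_counts = counts[0] / counts[1]
ratio_props = proportions[0] / proportions[1]
```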
  
  In the first work presented in this thesis, we review the classical graphical methods used to detect relatedness and introduce the analysis of Compositional Data for relatedness research. For any genetic marker, two individuals can share 0, 1 or 2 alleles. Allele sharing analysis is based on alleles identical by state (IBS) and alleles identical by descent (IBD). Two alleles are IBS if they are identical in terms of their DNA composition; they do not necessarily come from a common ancestor. In contrast, two alleles are IBD if they are derived from a common ancestor. A notable difference between the two is that IBD status is unobservable, so the probabilities of sharing 0, 1 or 2 IBD alleles have to be estimated by maximum likelihood procedures. The IBD probabilities are essential for relatedness research because each family relationship category has reference values for them, which makes it possible to classify pairs of individuals. Classical graphical methods based on IBS alleles depict the mean and the standard deviation of the number of shared IBS alleles over genetic variants. The scatterplot of the proportions of markers sharing zero and two IBS alleles has also been considered in the literature. Both representations of allele sharing data are able to detect outliers, which correspond to potentially related individuals. Regarding graphics based on IBD alleles, some authors represent the data in a scatterplot of any combination of two out of the three IBD probabilities. We therefore propose the use of tools from Compositional Data analysis, such as the ternary diagram and the isometric log-ratio transformation of the IBS/IBD probabilities. The ternary diagram represents all three IBS/IBD allele probabilities simultaneously, in contrast to the classical two-dimensional scatterplot. In addition, we introduce the isometric log-ratio transformation to overcome the problems of interpreting Euclidean distances in the constrained space of the IBS/IBD allele sharing data.
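The quantities named in this paragraph can be sketched in a few lines. The example below assumes biallelic markers with genotypes coded as 0, 1 or 2 copies of one allele, so that the number of IBS alleles shared at a marker is 2 minus the absolute genotype difference. The isometric log-ratio coordinates use one particular orthonormal basis chosen here for illustration; the thesis may use a different one. The function names and genotype vectors are hypothetical.

```python
import numpy as np

def ibs_summary(g1, g2):
    """Mean, standard deviation and proportions of 0/1/2 shared IBS alleles."""
    g1 = np.asarray(g1, dtype=int)
    g2 = np.asarray(g2, dtype=int)
    shared = 2 - np.abs(g1 - g2)          # IBS alleles shared per marker
    p = np.bincount(shared, minlength=3) / shared.size
    return shared.mean(), shared.std(ddof=1), p

def ilr3(x):
    """Isometric log-ratio coordinates of a three-part composition.

    Assumed basis: z1 contrasts part 0 with part 1, and z2 contrasts
    parts 0 and 1 with part 2. Zero parts must be replaced beforehand.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("ilr is undefined for zero or negative parts")
    z1 = np.log(x[0] / x[1]) / np.sqrt(2.0)
    z2 = np.log(x[0] * x[1] / x[2] ** 2) / np.sqrt(6.0)
    return np.array([z1, z2])

# Hypothetical genotypes of two individuals at eight markers.
g1 = [0, 1, 2, 1, 0, 2, 1, 1]
g2 = [0, 1, 1, 2, 2, 2, 0, 1]

mean_ibs, sd_ibs, p = ibs_summary(g1, g2)  # classical (mean, sd) and (p0, p2) plots
z = ilr3(p)                                # unconstrained coordinates of (p0, p1, p2)
```

In the unconstrained `z` coordinates, ordinary Euclidean distances between pairs can be interpreted directly, which is the motivation given above for the transformation. Compositions with zero parts, which can arise for IBD probabilities (for example, a relationship with a reference probability of zero for sharing two IBD alleles), need a zero-replacement step that this sketch does not cover.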
  
  In the second article, we propose the analysis of IBS genotype sharing data instead of the classical IBS allele sharing data. This allows us to analyse the genetic data in more than three dimensions. We consider genotype sharing counts as a six-part composition and explore the data using log-ratio biplots based on principal component analysis. Classification of pairs of individuals into family relationship categories is performed using linear discriminant analysis. In this context, the log-ratio biplot approach is compared with the classical plots in a simulation study. In a non-inbred homogeneous population, the classification rate of the log-ratio principal component approach outperforms that of the classical graphics across the whole allele frequency spectrum. Furthermore, the log-ratio biplot is able to accurately identify family relationships up to and including the fourth degree. The log-ratio biplot methodology also led to the detection of three-quarter siblings, a family relationship that has received very little attention in the literature. Consequently, the third article concludes the thesis by developing an additional statistical methodology, the likelihood ratio approach, to infer three-quarter siblings in genetic databases. We derive the IBD probabilities for three-quarter siblings and calculate likelihood ratios to distinguish three-quarter siblings from full siblings and half siblings.
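A minimal sketch of the log-ratio principal component step is given below. It assumes that the six parts are the counts of unordered genotype pairs (0/0, 0/1, 0/2, 1/1, 1/2, 2/2) over biallelic markers, that zero counts are handled with a simple pseudo-count, and that the centred log-ratio transformation precedes the principal component analysis. These are assumptions made for the illustration and may differ from the choices made in the article. The genotypes are simulated for unrelated individuals only, so no family structure is expected in the scores.

```python
import numpy as np
from itertools import combinations

# Index of the unordered genotype pair (a, b), genotypes coded 0/1/2.
LOOKUP = np.array([[0, 1, 2],
                   [1, 3, 4],
                   [2, 4, 5]])

def genotype_sharing_counts(g1, g2):
    """Six-part vector of genotype-pair counts for two individuals."""
    idx = LOOKUP[np.asarray(g1), np.asarray(g2)]
    return np.bincount(idx, minlength=6)

def clr(x):
    """Centred log-ratio transformation, applied row-wise."""
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

def logratio_pca(counts, pseudo=0.5):
    """PCA of clr-transformed counts via the singular value decomposition."""
    z = clr(np.asarray(counts, dtype=float) + pseudo)  # pseudo-count is an assumption
    z = z - z.mean(axis=0)                             # column-centre before the SVD
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u * s                                     # coordinates of the pairs
    loadings = vt.T                                    # directions of the six parts
    explained = s ** 2 / np.sum(s ** 2)
    return scores, loadings, explained

# Simulated genotypes of unrelated individuals under Hardy-Weinberg proportions.
rng = np.random.default_rng(0)
n_ind, n_markers = 8, 500
freqs = rng.uniform(0.1, 0.9, size=n_markers)
genotypes = rng.binomial(2, freqs, size=(n_ind, n_markers))

pairs = list(combinations(range(n_ind), 2))
counts = np.array([genotype_sharing_counts(genotypes[i], genotypes[j])
                   for i, j in pairs])
scores, loadings, explained = logratio_pca(counts)
```

Plotting the first two columns of `scores` together with the rows of `loadings` gives a biplot of pairs and parts. A classifier such as linear discriminant analysis could then be trained on the leading scores of pairs with known relationships, as the abstract describes; that step is omitted here.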
  
  To illustrate the results of this doctoral thesis, we use genetic markers from worldwide human population projects such as the Human Genome Diversity Project and the 1000 Genomes Project, as well as from a local prospective human cohort, the Genomes of Catalonia (GCAT).
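The likelihood ratio approach described for the third article can be illustrated with a simplified sketch. It assumes independent biallelic markers in Hardy-Weinberg proportions with known allele frequencies, no genotyping error, and no inbreeding; the method in the article may relax some of these assumptions. The reference IBD probabilities used for three-quarter siblings, (3/8, 1/2, 1/8), follow from a pedigree in which the pair shares one parent and the two other parents are first-degree relatives; this is the usual definition and is assumed here rather than quoted from the abstract. All function names and data are hypothetical.

```python
import numpy as np

# Reference IBD probabilities (k0, k1, k2) for non-inbred relationships.
IBD = {
    "full siblings": (1 / 4, 1 / 2, 1 / 4),
    "three-quarter siblings": (3 / 8, 1 / 2, 1 / 8),
    "half siblings": (1 / 2, 1 / 2, 0.0),
}

def pair_probabilities(p):
    """P(G1, G2 | IBD = 0, 1, 2) for one marker, as an array [ibd, g1, g2].

    Genotypes are coded as the number of copies (0, 1, 2) of the allele
    with frequency p.
    """
    q = 1.0 - p
    hwe = np.array([q * q, 2 * p * q, p * p])
    ibd0 = np.outer(hwe, hwe)                       # independent genotypes
    ibd1 = np.array([[q ** 3, p * q ** 2, 0.0],     # one allele shared IBD
                     [p * q ** 2, p * q, p ** 2 * q],
                     [0.0, p ** 2 * q, p ** 3]])
    ibd2 = np.diag(hwe)                             # identical genotypes
    return np.stack([ibd0, ibd1, ibd2])

def log_likelihood(g1, g2, freqs, k):
    """Log-likelihood of a genotype pair under IBD probabilities k."""
    total = 0.0
    for a, b, p in zip(g1, g2, freqs):
        probs = pair_probabilities(p)[:, a, b]      # one value per IBD state
        total += np.log(np.dot(k, probs))
    return total

def log10_lr(g1, g2, freqs, numerator, denominator):
    """log10 likelihood ratio of two relationship hypotheses."""
    diff = (log_likelihood(g1, g2, freqs, IBD[numerator])
            - log_likelihood(g1, g2, freqs, IBD[denominator]))
    return diff / np.log(10.0)

# Hypothetical genotypes and allele frequencies at five markers.
freqs = [0.2, 0.5, 0.7, 0.4, 0.9]
g1 = [0, 1, 2, 1, 2]
g2 = [0, 1, 1, 0, 2]

lr_vs_full = log10_lr(g1, g2, freqs, "three-quarter siblings", "full siblings")
lr_vs_half = log10_lr(g1, g2, freqs, "three-quarter siblings", "half siblings")
```

Positive values of `lr_vs_full` or `lr_vs_half` favour three-quarter siblings over the alternative. With only a handful of markers the ratios are close to zero, which is why such comparisons are made over large marker sets in practice.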