The human genome is organized into 23 pairs of chromosomes; in each pair, one copy is inherited from the father and the other from the mother. A chromosome is an organized structure of DNA that contains many genes, regulatory elements, and other nucleotide sequences.
When a nucleotide site of a specific chromosome region shows statistically significant variability within a population, it is called a Single Nucleotide Polymorphism (SNP).
Specifically, a site is considered a SNP if a certain nucleotide is observed for a minority of the population (the least frequent allele) while a different nucleotide is observed for the rest of the population (the most frequent allele). The least frequent allele, or mutant type, is generally encoded as '1', as opposed to the most frequent allele, or wild type, generally encoded as '0'. A haplotype is a set of alleles or, more formally, a string of length $p$ over the alphabet $\Sigma = \{0, 1\}$.
The diploid nature of humans implies that, for a given SNP, an individual can be either homozygous of type 0 or type 1 (i.e., the paternal and maternal alleles are equal) or heterozygous (i.e., the paternal and maternal alleles are different). When extracting the SNPs from an individual (i.e., when genotyping an individual), the information about which haplotype (maternal or paternal) a given allele belongs to is lost, and only the homozygous or heterozygous nature of the site is known. Hence, the genotype of an individual can be thought of as a string of length $p$ over the alphabet $\Sigma = \{0, 1, 2\}$, where the symbols '0' and '1' denote homozygous sites and the symbol '2' denotes a heterozygous site. As an example, the sequence $\langle 0, 1, 2 \rangle$ denotes a genotype in which the first SNP is homozygous of wild type, the second SNP is homozygous of mutant type, and the third SNP is heterozygous.
Haplotyping a set of genotypes means recovering, from a set of genotypes, the corresponding generating haplotypes. This task is fundamental for the diagnosis and treatment of human diseases. For example, haplotypes are necessary in evolutionary studies to extract the information needed to detect diseases and to reduce the number of tests to be carried out. In functional genomics, haplotypes are used to discover a functional gene or to study an altered response of an organism to a particular therapy. In human pharmacogenetics, haplotypes explain why people react differently to different types or amounts of drugs. In fact, since SNPs affect the structure and function of proteins and enzymes, they may influence the way in which a drug is absorbed and metabolized.
However, haplotyping a set of genotypes is not an easy task, because current molecular sequencing methods provide only genotype information rather than haplotype information. When family-based genetic information on a population of individuals is available, haplotypes can be retrieved experimentally. In the most general case, however, the experimental approach is laborious, cost-prohibitive, requires advanced molecular isolation strategies, and is sometimes not even possible. In the absence of family-based genetic information, computational methods provide a valid alternative to the experimental approach.
To recover the haplotype set of a population of individuals, computational methods have to solve an optimization problem, called the Haplotype Estimation Problem (HEP), which consists of finding a set of haplotypes that, appropriately combined, generates the set of analyzed genotypes. It is worth noting that the number of possible generating haplotype pairs for a given genotype $g$ grows exponentially with the number of heterozygous sites of $g$. Specifically, if $n \geq 1$ is the number of heterozygous sites in a genotype $g$, then there exist $2^{n-1}$ possible pairs of haplotypes that may generate $g$. As an example, the genotype $\langle 0, 1, 2, 2 \rangle$ may be generated by appropriately combining either the pair of haplotypes $\{\langle 0, 1, 0, 0 \rangle, \langle 0, 1, 1, 1 \rangle\}$ or the pair $\{\langle 0, 1, 1, 0 \rangle, \langle 0, 1, 0, 1 \rangle\}$. This ambiguity calls for a criterion to select pairs of haplotypes among the possible alternatives.
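The counting argument above can be checked with a short Python sketch. It is only an illustration: the function name `resolving_pairs` and the tuple encoding of genotypes are assumptions made here and do not come from the original work. The sketch enumerates the unordered haplotype pairs compatible with a genotype by fixing the first heterozygous site of one haplotype to 0, so that each pair is listed once.

```python
from itertools import product


def resolving_pairs(genotype):
    """Enumerate the unordered haplotype pairs that can generate `genotype`.

    `genotype` is a sequence over {0, 1, 2}, where 2 marks a heterozygous site.
    (Hypothetical helper, written for illustration only.)
    """
    het_sites = [s for s, allele in enumerate(genotype) if allele == 2]
    if not het_sites:
        # A fully homozygous genotype is generated by one haplotype taken twice.
        h = tuple(genotype)
        return [(h, h)]

    pairs = []
    # Fix the first heterozygous site of h1 to 0 so that each unordered
    # pair is produced exactly once: 2^(n-1) assignments remain.
    for bits in product((0, 1), repeat=len(het_sites) - 1):
        assignment = (0,) + bits
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, assignment):
            h1[site] = bit
            h2[site] = 1 - bit  # the two haplotypes differ at heterozygous sites
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    for pair in resolving_pairs((0, 1, 2, 2)):
        print(pair)
```

For the genotype $\langle 0, 1, 2, 2 \rangle$ the sketch lists the two pairs given in the text, in agreement with $2^{n-1} = 2$ for $n = 2$.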
The analysis of low-rate recombination genes of different molecular functions (e.g., chaperone, ligase, isomerase, kinase, and transferase) has shown that the number of distinct haplotypes existing in a large population of individuals is generally much smaller than the overall number of distinct genotypes observed in that population. Hence, for low-rate recombination genes at least, the criterion of minimizing the overall number of haplotypes necessary to explain a set of genotypes has a good chance of recovering the biological haplotype set. This criterion, first introduced by Gusfield, is known as the pure parsimony criterion of haplotype estimation and can be formalized as follows.
Given a pair of haplotypes $\{h_i, h_j\}$, define the sum operator between $h_i$ and $h_j$ as the genotype $g = h_i \oplus h_j$ whose $p$-th entry is $h_{ip}$ if $h_{ip} = h_{jp}$, and 2 otherwise. As an example, the genotype obtained by summing the haplotypes $h_i = \langle 0, 1, 1, 0 \rangle$ and $h_j = \langle 1, 1, 0, 0 \rangle$ is $g = \langle 2, 1, 2, 0 \rangle$. We say that a genotype $g_k$ is resolved by a pair of haplotypes $\{h_i, h_j\}$ if $g_k = h_i \oplus h_j$. Haplotyping a set of genotypes under the pure parsimony criterion hence consists of solving the following optimization problem:
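The sum operator can be written in a few lines of Python. This is a minimal sketch; the function name `haplotype_sum` and the tuple encoding of haplotypes are assumptions made for illustration.

```python
def haplotype_sum(h_i, h_j):
    """Combine two haplotypes over {0, 1} into a genotype over {0, 1, 2}.

    Sites where the haplotypes agree keep the common allele;
    sites where they differ are marked 2 (heterozygous).
    (Hypothetical helper, written for illustration only.)
    """
    if len(h_i) != len(h_j):
        raise ValueError("haplotypes must have the same number of SNPs")
    return tuple(a if a == b else 2 for a, b in zip(h_i, h_j))


if __name__ == "__main__":
    # The example from the text.
    print(haplotype_sum((0, 1, 1, 0), (1, 1, 0, 0)))
```

Note that the operator is symmetric and that summing a haplotype with itself returns a fully homozygous genotype.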
Pure Parsimony Haplotype (PPH): Given a set $G$ of $m$ genotypes, having $p$ SNPs each, find a minimum-cardinality set $H$ of haplotypes such that for each genotype $g_k \in G$ there exists a pair of haplotypes $\{h_i, h_j\}$, with $h_i, h_j \in H$, resolving $g_k$.
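To make the problem statement concrete, the following Python sketch solves very small PPH instances by exhaustive search. It is not one of the methods reviewed in this work, only a naive illustration that takes exponential time. All function names are hypothetical. It assumes that only haplotypes compatible with at least one genotype need to be considered, since any other haplotype could be removed from a solution without affecting feasibility.

```python
from itertools import combinations, product


def haplotype_sum(h_i, h_j):
    """Genotype generated by two haplotypes (2 marks a disagreement)."""
    return tuple(a if a == b else 2 for a, b in zip(h_i, h_j))


def candidate_haplotypes(genotypes):
    """All haplotypes compatible with at least one genotype."""
    candidates = set()
    for g in genotypes:
        het_sites = [s for s, allele in enumerate(g) if allele == 2]
        for bits in product((0, 1), repeat=len(het_sites)):
            h = list(g)
            for site, bit in zip(het_sites, bits):
                h[site] = bit
            candidates.add(tuple(h))
    return sorted(candidates)


def is_resolved(genotype, haplotypes):
    """True if some pair from `haplotypes` (possibly a haplotype paired
    with itself) sums to `genotype`."""
    target = tuple(genotype)
    return any(haplotype_sum(a, b) == target
               for a in haplotypes for b in haplotypes)


def pure_parsimony_bruteforce(genotypes):
    """Smallest haplotype set resolving every genotype, found by trying
    subsets of candidates in order of increasing size."""
    candidates = candidate_haplotypes(genotypes)
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(is_resolved(g, subset) for g in genotypes):
                return set(subset)
    return set()  # reached only when `genotypes` is empty


if __name__ == "__main__":
    G = [(0, 1, 2, 2), (0, 1, 0, 0), (0, 1, 1, 1)]
    print(pure_parsimony_bruteforce(G))
```

In this toy instance, the two homozygous genotypes force the haplotypes $\langle 0, 1, 0, 0 \rangle$ and $\langle 0, 1, 1, 1 \rangle$ into any solution, and these two also resolve $\langle 0, 1, 2, 2 \rangle$, so two haplotypes suffice.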
PPH is known to be polynomially solvable when each genotype has at most two heterozygous sites, and APX-hard when each genotype has at least three heterozygous sites. This result has justified the recent development of enumerative optimization methods and of approximation algorithms. In this work we provide an overview of PPH and review the main solution approaches that appear in the literature.