In recent years, the falling cost of sequencing has made the genomes of many organisms available to the research community, making it necessary to transform these data into useful information for disciplines such as medicine, biology and agriculture. One of the first steps in interpreting a genome, and also one of the most decisive, is recognizing the genes it contains and predicting their structures. This task is known as gene prediction, gene finding or gene recognition, and it is far from trivial. Despite the important advances of recent decades, the difficulty of the task is evident in the fact that current gene prediction techniques are still far from reliable.
Initially, gene prediction was approached as an experimental task in the laboratory. The growth in the amount of data made it necessary to adopt automatic methods that incorporate knowledge from information theory. Today, even the manual annotation of genomes relies on automatic gene prediction techniques. The current trend is towards approaches that integrate both perspectives. The gene prediction problem can therefore be addressed with Machine Learning techniques, since it can be formulated partly as a classification task and partly as an optimization process.
Gene prediction aims to identify the parts of a DNA sequence that carry the information needed to encode functional biological molecules, such as proteins or non-coding RNA (ncRNA). From a computational point of view, a DNA sequence is a string over the alphabet {A, T, G, C}, whose letters correspond to the nitrogenous bases of the nucleotides: Adenine, Thymine, Guanine and Cytosine. The main goal of the task is to label each element correctly as belonging to a specific type of region: coding, non-coding, intergenic, and so on. At its boundaries, each type of region is delimited by specific signals that are recognized by the cellular machinery. In practice, almost every gene prediction system has a component that identifies these boundary points, called functional sites. In this work we propose several approaches to functional site recognition that employ Machine Learning techniques while also taking into account biological aspects of the problem. Specifically, the methodologies presented for functional site recognition in DNA sequences have the following features:
- From a pure Machine Learning perspective, we demonstrate the class-imbalanced nature of the problem and show how useful it is to apply approaches specifically developed to deal with this undesirable feature. Likewise, we present a study that reveals the benefits, in terms of improved performance and interpretability, of applying feature selection techniques to the problem.
- We propose a new site recognition methodology based on the idea that the training patterns used to build the classification models can be grouped into more than two classes. The premise is that the patterns exhibit different biological features with varying levels of importance in the classification process, so the classical binary positive/negative separation is abandoned. Additionally, more than one type of classifier is considered, following the principle that a set of classifiers usually performs better than any single one, especially when the classifiers are diverse in nature.
- Finally, we propose a new methodology for functional site classification based on building a model that combines classifiers drawing on as many different sources of evidence as possible from several informant genomes, and on as many different types of classifier as needed. In particular, the resulting models use five types of classifier, each of a different nature, and information from the genomes of twenty species.
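To make the class imbalance mentioned in the list above concrete, the following minimal Python sketch extracts candidate sites from a DNA string and counts how many are true sites. It is an illustration only, not the method of the thesis: the choice of the dinucleotide "GT" as the candidate signal, the window size, and the names `candidate_windows`, `label_candidates` and `imbalance_ratio` are all assumptions introduced here.

```python
from collections import Counter


def candidate_windows(seq, motif="GT", flank=3):
    """Yield (position, window) for each motif occurrence that has full flanks.

    Positions are 0-based. The motif and flank size are illustrative
    assumptions, not values taken from the thesis.
    """
    seq = seq.upper()
    last_start = len(seq) - len(motif) - flank
    for i in range(flank, last_start + 1):
        if seq[i:i + len(motif)] == motif:
            yield i, seq[i - flank:i + len(motif) + flank]


def label_candidates(seq, true_sites, motif="GT", flank=3):
    """Pair every candidate window with True if its position is an annotated site."""
    return [(window, pos in true_sites)
            for pos, window in candidate_windows(seq, motif, flank)]


def imbalance_ratio(labelled):
    """Return the number of negative patterns per positive pattern."""
    counts = Counter(is_site for _, is_site in labelled)
    if counts[True] == 0:
        return float("inf")
    return counts[False] / counts[True]


# Toy sequence with three "GT" occurrences, only one of which is annotated.
toy_seq = "AAAGTAAACCGTTTTGGGTCCC"
toy_labelled = label_candidates(toy_seq, true_sites={10})
toy_ratio = imbalance_ratio(toy_labelled)
```

On real genomic sequence, occurrences of a short signal vastly outnumber the annotated sites, so the ratio computed this way would be far larger than in the toy case, which is why imbalance-aware training is worth considering.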
Once the points in the sequence that are highly likely to be functional sites have been identified, the next step is to integrate all the available information in order to produce a correct gene structure as the result of the process. Therefore, orthogonally to functional site prediction, we present guidelines for designing a global gene prediction framework that addresses the problem from a general perspective. This framework treats the problem as a search process in which as many information sources as are available, or needed, are combined to find the most likely gene structures while respecting the biological constraints of the problem. The optimization process is carried out using the principles of Evolutionary Computation. A population of individuals, each of which represents a solution to the problem, is used to simulate an evolutionary process. During this process, the best-adapted individuals, that is, those with the statistical features that characterize coding regions, are determined by a function based on different sensors. The search power shown by the system in an especially difficult and complex solution space, together with its flexibility in handling the biological constraints of the problem, makes the evolutionary approach very attractive for the gene prediction task.
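The evolutionary search described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the framework of the thesis: an individual is a single (start, end) segment, the only "sensor" is a hypothetical GC-content contrast (`gc_sensor`), and the only biological constraint enforced is that the segment length is a multiple of three. The names, encoding and parameters are all introduced here for illustration.

```python
import random


def gc_fraction(s):
    """Fraction of G/C letters in s (0.0 for an empty string)."""
    return sum(c in "GC" for c in s) / len(s) if s else 0.0


def gc_sensor(seq, start, end):
    """Hypothetical sensor: GC content inside the segment minus GC content outside."""
    return gc_fraction(seq[start:end]) - gc_fraction(seq[:start] + seq[end:])


def is_valid(seq, start, end):
    """Toy constraint: segment lies within seq and spans whole codons."""
    return 0 <= start < end <= len(seq) and (end - start) % 3 == 0


def random_individual(seq, rng):
    """Draw a random valid (start, end) segment; assumes len(seq) >= 3."""
    start = rng.randrange(len(seq) - 2)
    codons = rng.randint(1, (len(seq) - start) // 3)
    return start, start + 3 * codons


def mutate(seq, individual, rng):
    """Shift one boundary by one or two codons; keep the parent if invalid."""
    start, end = individual
    step = 3 * rng.choice([-2, -1, 1, 2])
    candidate = (start + step, end) if rng.random() < 0.5 else (start, end + step)
    return candidate if is_valid(seq, *candidate) else individual


def evolve(seq, fitness=gc_sensor, pop_size=30, generations=40, seed=0):
    """Elitist evolutionary loop; returns the best individual and the
    best fitness observed at the start of each generation."""
    rng = random.Random(seed)
    population = [random_individual(seq, rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(seq, *ind), reverse=True)
        history.append(fitness(seq, *population[0]))
        survivors = population[: pop_size // 2]  # best half survives unchanged
        offspring = [mutate(seq, rng.choice(survivors), rng)
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring
    best = max(population, key=lambda ind: fitness(seq, *ind))
    return best, history


# Toy sequence: a GC-rich block flanked by AT-rich blocks.
toy_genome = "AT" * 15 + "GC" * 15 + "AT" * 15
best_segment, fitness_history = evolve(toy_genome)
```

Because the best half of each generation is carried over unchanged, the best fitness can never decrease between generations. A realistic system would replace the single sensor with a combination of several evidence sources and use a richer encoding of gene structures, as the text indicates.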
The experiments carried out on several chromosomes of several organisms, mainly on the human genome, show the usefulness of the presented techniques for the gene prediction problem. The results obtained surpass those of the best methods described in the literature to date. The research performed during this thesis has led to presentations at international conferences and the publication of several articles in leading journals in the area.