As the amount of stored data grows, the relational schemas needed to organize it become more complex, increasing the number of relations in the database. A direct consequence of this growth is the need to write SQL queries that involve an increasing number of relations. A clear example of this problem is found in advanced applications such as SAP R/3, which often need to combine a large set of tables to reconstruct complex business objects. For instance, an SAP schema may contain more than 10,000 relations and may join more than 20 of them in a single SQL query.
Once a SQL query is submitted to the DBMS, several steps must be performed before the corresponding results are returned. One of the first steps is query optimization. It is the task of the query optimizer to transform a declarative SQL statement into a query execution plan (QEP) by constructing alternative QEPs, estimating the cost of each one, and selecting the cheapest one for execution.
State-of-the-art query optimizers, which typically employ dynamic programming techniques, have problems handling such a large number of joins due to the exponential explosion of the search space. In these situations, optimizers either resort to heuristics or fall back to greedy algorithms, which iteratively construct a QEP by first selecting the cheapest two-way join, then determining the cheapest join with a third table, and so forth until all tables have been joined. However, greedy algorithms do not consider the entire search space and may therefore overlook the optimal plan, resulting in poor query performance that can cause queries to run for hours instead of seconds.
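To make the greedy strategy concrete, the following is a minimal, illustrative sketch of greedy join ordering. The left-deep plan representation and the `join_cost` estimator are assumptions introduced purely for this example; they do not correspond to any particular commercial optimizer.

```python
# Hypothetical sketch of greedy join ordering over left-deep plans.
# join_cost(joined, candidate) is an assumed cardinality/cost estimator.

def greedy_join_order(relations, join_cost):
    """Build a left-deep plan by always picking the cheapest next join."""
    remaining = set(relations)

    # Start with the cheapest two-way join.
    first, second = min(
        ((a, b) for a in remaining for b in remaining if a != b),
        key=lambda pair: join_cost(frozenset([pair[0]]), pair[1]),
    )
    plan = [first, second]
    joined = frozenset(plan)
    remaining -= joined
    total = join_cost(frozenset([first]), second)

    # Greedily extend the plan with the cheapest remaining relation.
    while remaining:
        nxt = min(remaining, key=lambda r: join_cost(joined, r))
        total += join_cost(joined, nxt)
        plan.append(nxt)
        joined = joined | {nxt}
        remaining.discard(nxt)
    return plan, total
```

Because each step commits to the locally cheapest join, the procedure never revisits earlier choices, which is exactly why it can miss the globally optimal plan.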
The problem of accessing a large number of relations in a single query has been named the Very Large Join Query Problem. Randomized search techniques such as Iterative Improvement or Simulated Annealing remedy the exponential explosion of dynamic programming techniques by iteratively exploring the search space and converging to a nearly optimal solution. The great advantage of this type of algorithm is its constant space overhead.
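As an illustration of such a randomized technique, below is a hedged sketch of Iterative Improvement over join orders. The `plan_cost` estimator, the swap-based neighborhood, and the restart and move limits are assumptions made for the example and are not taken from any specific system; note that only a handful of plans are kept in memory at any time, reflecting the constant space overhead mentioned above.

```python
import random

# Minimal sketch of Iterative Improvement for join ordering.
# Plans are permutations of relations (left-deep plans);
# plan_cost(plan) is an assumed cost estimator.

def iterative_improvement(relations, plan_cost, restarts=10, max_moves=200):
    best_plan, best_cost = None, float("inf")
    for _ in range(restarts):
        # Random starting point in the search space.
        plan = list(relations)
        random.shuffle(plan)
        cost = plan_cost(plan)
        for _ in range(max_moves):
            # Neighbouring plan: swap two relations in the join order.
            i, j = random.sample(range(len(plan)), 2)
            neighbour = plan[:]
            neighbour[i], neighbour[j] = neighbour[j], neighbour[i]
            neighbour_cost = plan_cost(neighbour)
            if neighbour_cost < cost:  # accept only improving moves
                plan, cost = neighbour, neighbour_cost
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost
```

Simulated Annealing follows the same scheme but occasionally accepts worsening moves with a temperature-dependent probability, which helps it escape local minima.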
As an alternative to random-walk algorithms, optimizing queries that involve a large number of joins with evolutionary approaches was introduced by Bennet et al. and later tested by Steinbrunn et al., who showed that it can be a very competitive approach.
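The following is a simplified, genetic-algorithm-style sketch of an evolutionary join-order optimizer in this spirit: plans are encoded as permutations of relations, and the `plan_cost` estimator together with all parameter values are assumptions made purely for illustration. It is not the optimizer developed in this Thesis, only an indication of how such an approach can be structured.

```python
import random

# Illustrative evolutionary optimizer for join ordering.
# Chromosomes are permutations of relations; plan_cost is an assumed estimator.

def evolutionary_join_optimizer(relations, plan_cost,
                                pop_size=50, generations=100, mutation_rate=0.1):
    def random_plan():
        p = list(relations)
        random.shuffle(p)
        return p

    def crossover(a, b):
        # Order crossover: keep a slice of parent a, fill the rest from parent b.
        i, j = sorted(random.sample(range(len(a)), 2))
        child = a[i:j]
        child += [r for r in b if r not in child]
        return child

    def mutate(p):
        if random.random() < mutation_rate:
            i, j = random.sample(range(len(p)), 2)
            p[i], p[j] = p[j], p[i]
        return p

    population = [random_plan() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=plan_cost)
        survivors = population[: pop_size // 2]  # truncation selection
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=plan_cost)
```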
However, the application of evolutionary algorithms in this scenario has not been studied in detail. Two main aspects remain unclear and need further investigation, namely the quality of the results and the speed of convergence to an optimal solution.
In this Thesis, we implement a query optimizer based on genetic programming algorithms. We compare the results yielded by our optimizer with those produced by the DB2 UDB optimizer, especially for very large join queries, and we show that the genetic approach can be very useful in such scenarios. We also present a statistical model that allows us to establish a set of criteria to parameterize our randomized approach, and we study the internal behavior of the optimizer during the optimization process. From this analysis, we derive a set of proposals to improve both the quality and the speed of convergence of our optimizer, proposing a new class of algorithm that in some cases achieves improvements of three and four orders of magnitude with respect to the original purely genetic optimizer. We have also compared our approach with some of the most efficient randomized algorithms proposed in the literature, such as Two-Phase Optimization (2PO). Our studies show that our new prototype outperforms these algorithms when the number of relations involved in the queries grows beyond twenty. In general, the larger the number of relations involved in the query, the larger the benefit obtained by our new class of algorithm.