Nowadays, there are many applications that manage very large databases of objects in which the searches do not rely on comparisons of the equality/inequality of two objects, but on how similar the objects are. This is the case of problems such as searching for similar fingerprints to one given as a query, content-based image retrieval in multimedia databases, sequence searching in genome databases, duplicate document detection in web search engines, and spam detection, just to name a few of them. Metric spaces provide a generic and useful framework for those problems where the exact comparison of two objects is not possible or does not make sense since it is useless. Inside this framework different methods for indexing and search have been proposed. In this work we provide an extensive description of the state of the art.
In all these domains where a similarity search is needed, a distance function to compute the value of the similarity or proximity between any two objects in the database is available. The naive implementation of similarity search would consist in sequentially scanning the entire database. However, that is not a feasible solution in practice, since the computation of the distance between two objects, that is, the use of the distance function, is in general very costly. Methods for searching in metric spaces make similarity search efficient by indexing the database and reducing as much as possible the number of distance computations needed for solving a query. That is, methods for searching in metric spaces avoid to compare (using the distance function) all the objects in the database with the query.
In order to index the database, all methods using the metric space approach select a set of reference objects from the database and store in the index the distances between those reference objects and the rest of objects in the database. The way these distances are stored and used allows us to classify existing methods in pivot-based methods, which store the distances from the reference objects to the rest of objects in the database, and clustering-based methods, which partition the space into a set of clusters around the reference objects. In both cases, the selection of good reference objects determines the effectiveness of the index for pruning the search space and thus reduce the cost of the search.
The main contribution of this thesis is a new method for the selection of effective reference objects. An important difference with previous methods is that the reference objects selected with our method are well distributed in the space. Our method for selecting reference objects has also other interesting properties: it automatically determines the optimal number of reference objects, it works with both discrete and continuous distances, it adapts the set of reference objects to the characteristics of the space, and the references are selected dynamically as new objects are inserted into an initially empty database with no extra cost.
Other contributions of this thesis are:
- A pivot-based method that uses our method for the selection of reference objects. This method is competitive when compared with previous methods in search cost, and is clearly better in other aspects: it is dynamic and adaptive, it determines by itself the optimal number of pivots, and it does not impose an additional cost for the selection of pivots.
- A clustering-based method that uses our method for the selection of reference objects to create an unbalanced index structure adapted to the characteristics of the space. By using an unbalanced structure to index the space, it clearly outperforms previous clustering-based methods.
- A criteria for refining the set of references by removing those that do not contribute to increase the capability of the set of references for discarding objects. The result is a smaller set of references that conserves its capacity for pruning the search space.
- Finally, we introduce the concept of {\em nested metric spaces} to explain certain irregularities that appear in real problems. We show how these irregularities can affect the search performance of some methods, and we show that by detecting and treating them is it possible to improve the search performance.
Therefore, this thesis contributes to the state of the art in this field with methods that combine an improvement of the search cost with other practical aspects important for their application to real problems.
© 2008-2024 Fundación Dialnet · Todos los derechos reservados