
Documat


Resumen de Density-based clustering: algorithms and evaluation techniques

Eduardo Pla Sacristán

  • This technical summary is structured as follows. First, we provide an introduction to the context and motivations of this work, including the objectives set for the thesis. After that, a brief analysis of the state of the art and related relevant techniques is presented. Subsequently, the novel scientific contributions of this thesis are introduced and explained. To conclude, we briefly discuss the results obtained, drawing conclusions from them, and we analyze potential future lines of research.

    1. Introduction Density-based clustering algorithms constitute a relevant subset of the methods developed for cluster analysis, which is one of the fundamental pillars of unsupervised learning. While the origins of clustering can be traced to the early 20th century, it was not until the 1990s that the concerns that would lead to the development of density-based clustering algorithms were raised. In 1996, the most popular density-based clustering algorithm to date (DBSCAN) was published and, with it, many applications for density-based clustering were found within increasingly diverse fields over the following decades.

    Density-based clustering techniques are used in a wide range of scenarios. Since they were originally conceived for spatial databases, the most typical scenarios involve this type of data. In this context, much of the recent research within the field of density-based clustering is still proposed generically with spatial datasets in mind, although other lines of research have proposed alternative applications for density-based clustering.

    A frequent characteristic among scenarios where density-based clustering algorithms are applicable is the scarcity of labeled data. Moreover, even when labels are available, there is often a disparity of criteria in the annotations. Therefore, the development of specific evaluation techniques for density-based clustering is of paramount importance. Furthermore, there is a need to standardize these evaluation techniques to provide a sensible consensus. In this context, the acquisition of relevant data to form an evaluation benchmark is a robust option to cover the mentioned needs. Benchmark datasets are chosen to represent different characteristics in data distributions and, hence, reflect the behaviour of an algorithm in such circumstances. Such data can be either real-world or synthetic, and the pros and cons of each option have been discussed extensively in the literature.

    An important factor that should be accounted for when considering a benchmark is the choice of the evaluation metric. There are two types of metrics to be considered when evaluating clustering partitions: (1) external metrics, which rely on some ground-truth labeled partition to be compared to the partition under analysis and provide a measure of their similarity; and (2) internal metrics, which are based on the characteristics of the data that belong to each cluster, usually providing a measure of cluster compactness and separation. When labeled data is available, the use of external evaluation is the obvious choice, since it is based on the true nature of the data. However, this is not always the case. Particularly, since unsupervised learning techniques do not require labels for their development and adjustment, the databases are often not annotated. Thus, the development of reliable internal metrics for density-based clustering is crucial.
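    The distinction between the two metric families can be illustrated with a minimal sketch (ours, not from the thesis; the toy blob data, k-means and the specific scikit-learn indices are illustrative choices):

```python
# Comparing an external metric (Adjusted Mutual Information, needs ground
# truth) with an internal one (Silhouette, needs only the data) on the
# same clustering result.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External: compares the predicted labels against the ground truth.
ami = adjusted_mutual_info_score(y_true, y_pred)
# Internal: judges compactness and separation from the data alone.
sil = silhouette_score(X, y_pred)
print(f"AMI={ami:.2f}  Silhouette={sil:.2f}")
```

    On unlabeled data only the second call would be available, which is exactly the situation that motivates the internal index proposed later in this summary.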

    It is worth mentioning that this thesis has been partially developed within the framework of an existing research project, which focused on obtaining touristic landmarks from publicly available data (geo-located images and some additional textual information). As a result, the research presented in this thesis is often applied to the task of landmark discovery, which becomes the standard case study for the proposed algorithms and evaluation techniques. The main goal of automatic landmark discovery is to determine, within a certain distribution, concealed regions or points that are particularly relevant, given a specific criterion. Although this task has been approached from a wide range of angles, there are problems that have not been properly solved yet (scale and density variability along the sample set), so there was an opportunity for additional research in this context through the development of novel clustering techniques.

    Taking advantage of the research opportunities discussed above, the following objectives were established:

    - To design and implement a new density-based clustering technique tailored to the task of automatic landmark discovery, i.e., suitable for scenarios where clusters may have varying shapes and densities.

    - To develop a method that automatically generates synthetic data with a flexible, parametrizable system that truthfully reflects the nature of data distributions with arbitrary characteristics such as dimensionality, noise, shape, density variations, etc.

    - To design and implement a new internal evaluation technique that matches, as closely as possible, the behaviour of external metrics within the field of density-based clustering.

    - To gather a sufficiently representative collection of datasets, both synthetic and real, to be able to build an evaluation benchmark for density-based clustering.

    2. State of the Art Clustering can be defined as a type of classification imposed on a finite set of objects. The most important particularity of clustering among classification techniques is its unsupervised, intrinsic nature. In other words, only the nature of the data itself, and not its labels, class or category (if known), is used for cluster analysis. Among the different ways to tackle the clustering task, density-based methods are characterized by establishing a sample's affiliation to each cluster through the analysis of its surrounding density. Consequently, clusters are considered to be high-density regions of the data distribution. Basic density-based clustering approaches have two convenient characteristics that make them suitable to work in certain environments: (1) the number of resulting clusters does not need to be known a priori, and (2) no assumptions regarding the nature of the underlying data distributions are made. This allows density-based clustering algorithms to generate clusters with arbitrary shapes, unlike most alternative approaches, which usually lead to convex clusters with globular shapes.

    The most well-known density-based clustering algorithm is probably DBSCAN, a method that groups samples together based on their vicinity. Many modifications and variations of this algorithm have been proposed for different tasks since it was published in 1996. We can group these algorithms into two main categories based on their approach: (1) local algorithms, which start from a point and expand the clusters without taking into account the whole distribution; and (2) global algorithms, which first analyze the sample space and then use the obtained information to cluster the samples. Among the local ones, algorithms like OPTICS have been frequently employed. Furthermore, hierarchical approaches (usually through modifications of the original DBSCAN) have been proposed. On the other hand, global algorithms that make use of Kernel Density Estimations to discover the underlying distribution of the data, e.g., DENCLUE, are also popular.
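    The arbitrary-shape property mentioned above can be shown with a minimal example (ours, not from the thesis): DBSCAN recovers two interleaved, non-convex half-moon clusters that centroid-based methods typically split incorrectly. The eps and min_samples values are illustrative choices for this toy data:

```python
# DBSCAN separating two non-convex, interleaved clusters by expanding
# dense neighbourhoods rather than fitting globular shapes.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks noise; the remaining labels are the discovered clusters.
n_clusters = len(set(labels) - {-1})
print("clusters found:", n_clusters)
```

    Note that the number of clusters is an output of the algorithm, not an input, which is the first of the two convenient characteristics listed above.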

    As mentioned above, evaluation of density-based clustering techniques is a challenge. Obtaining data and, particularly, labeled data for evaluation can be a challenging task. Online platforms (Instagram, Foursquare, Flickr) can be used to this end, but privately owned databases are not always accessible, and generating real databases by gathering and annotating the data personally can be a strenuous and tedious process. A suitable solution to the scarcity of data is resorting to synthetic data generation. However, data generation systems are quite diverse, and finding one that appropriately fits our needs is not always possible. One existing system is promising, since it is able to generate data with varying densities and arbitrary shapes and sizes for the groups of samples. However, the generated data looks rather synthetic and, thus, unrealistic.

    Once the data is obtained, the evaluation metric needs to be selected. The external evaluation indices Rand, Fowlkes-Mallows, V-Measure and Mutual Information are considered in this research. Furthermore, we also consider the internal indices Dunn, Davies-Bouldin, Calinski-Harabasz, Silhouette, Density-Based Clustering Validation (DBCV) and Separation-Variance (SV). The choice of metric depends heavily on the available data and the task at hand: a metric that is useful for evaluating clustering partitions in a certain context need not be useful in another.

    3. Scientific Contributions The fulfilment of the objectives presented above led to several novel contributions. Particularly: two density-based clustering algorithms, a public dataset for touristic landmark detection, a data generation system that reflects varying densities and shapes, an internal evaluation index and a benchmark for density-based clustering. These contributions are explained below.

    - A Kernel-based variation of DBSCAN (KDBSCAN), published as part of this thesis, whose goal is to identify arbitrarily-shaped groups of points within a significantly sparse sample space, without previous knowledge of the number of resulting clusters. KDBSCAN first estimates the underlying distribution in the sample space and then selects its peaks. After the main attractors of the distribution have been selected, the samples are assigned to them using an iterative process based on DBSCAN. This algorithm will be applied to the discovery of non-connected human settled areas (towns, villages) within a region of analysis.
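    The first stage described for KDBSCAN (density estimation followed by peak selection) can be sketched in one dimension as follows; this is our hypothetical illustration, not the thesis's implementation, and the bandwidth (SciPy's default) and evaluation grid are illustrative choices:

```python
# Hypothetical sketch of a KDE-then-peaks first stage: estimate the
# underlying density, then select its local maxima as attractors.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Two sparse 1-D groups separated by an empty region.
data = np.concatenate([rng.normal(0.0, 0.3, 200), rng.normal(5.0, 0.3, 200)])

kde = gaussian_kde(data)            # estimate of the underlying distribution
grid = np.linspace(-2.0, 7.0, 500)
density = kde(grid)

# A grid point is a peak if its density exceeds both of its neighbours.
is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
peaks = grid[1:-1][is_peak]
print("attractors near:", np.round(peaks, 1))
```

    The selected peaks would then seed the DBSCAN-based assignment step described above.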

    - A multi-scale variation of DBSCAN (VDBSCAN) that takes into account the variations in density when moving away from the centroid of the data. It is a hierarchical modification of the DBSCAN algorithm: at each new iteration, it works at a finer resolution, subdividing clusters into smaller ones following an established criterion based on compactness and separation. This algorithm, also published as part of this thesis, will be used for the discovery of relevant landmarks within a connected region, considering user-provided, geodesic and text-based information.
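    The multi-scale idea can be sketched as follows; this is a hypothetical illustration of hierarchical refinement, where the eps-halving schedule and fixed depth are stand-ins for the thesis's actual compactness/separation criterion:

```python
# Hypothetical sketch of multi-scale refinement: cluster at a coarse eps,
# then recurse into each resulting cluster at a finer resolution.
import numpy as np
from sklearn.cluster import DBSCAN

def multiscale(X, eps, depth):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    clusters = [X[labels == k] for k in set(labels) - {-1}]  # -1 is noise
    if depth == 0:
        return clusters
    out = []
    for c in clusters:
        out.extend(multiscale(c, eps / 2.0, depth - 1))  # finer resolution
    return out

rng = np.random.default_rng(0)
# Two tight groups that a coarse eps merges into a single cluster.
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (150, 2)),
               rng.normal([2.0, 0.0], 0.1, (150, 2))])
fine = multiscale(X, eps=2.0, depth=1)
print("fine clusters:", len(fine))
```

    At the coarse scale the whole set forms one cluster; the second pass, at half the radius, splits it into its two underlying groups.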

    - A novel public dataset, made from user-generated content, which contains six partially labeled heterogeneous locations, from single conurbations to larger regions with non-connected cores. We have considered six diverse regions, including a small village, Tapia de Casariego (Asturias), with approximately 3.7K inhabitants registered in the 2020 population census; two medium-sized cities, Jerez de la Frontera (Andalucia) and Getafe (Madrid), with approximately 213K and 185K inhabitants respectively; a large city, Valencia (Comunitat Valenciana), with approximately 800K inhabitants; a region in Guadalajara (including various villages with Black Architecture); and a region in Euskadi (between rivers Lea and Oria, with several coastal touristic villages). This dataset will be used to evaluate the aforementioned algorithms and compare them with other relevant algorithms.

    - A novel synthetic data generation system with flexible parametrization (SynFlex) that reflects real data characteristics. It follows an iterative probabilistic data generation process (controlled by specifying a few interpretable parameters) that can be divided into two different modules: (1) cluster generation, which describes the generation process and parametrization (shape, density, sample cardinality, dimensionality) of a single, isolated data cluster, and (2) collection generation, which defines the whole sample space through a hierarchical generation of clusters and establishes the relationships between them (cluster cardinality, hierarchy level, noise presence). Therefore, the latter makes use of the former to generate the distribution. This system will be used to generate realistic, synthetic data to evaluate density-based clustering algorithms.
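    The two-module structure can be sketched as follows; this is a hypothetical illustration of the cluster-generation/collection-generation split, not SynFlex itself, and all parameter names and distributions are our own assumptions:

```python
# Hypothetical two-module generator: module 1 builds one isolated cluster,
# module 2 places clusters in the sample space and adds background noise.
import numpy as np

rng = np.random.default_rng(0)

def generate_cluster(center, spread, n_samples, dim):
    """Module 1: one isolated cluster (here, an isotropic Gaussian)."""
    return rng.normal(loc=center, scale=spread, size=(n_samples, dim))

def generate_collection(n_clusters, dim, noise_ratio):
    """Module 2: relationships between clusters, plus uniform noise."""
    parts = [generate_cluster(center=rng.uniform(-10.0, 10.0, dim),
                              spread=rng.uniform(0.2, 1.0),
                              n_samples=int(rng.integers(50, 150)),
                              dim=dim)
             for _ in range(n_clusters)]
    n_noise = int(noise_ratio * sum(len(p) for p in parts))
    parts.append(rng.uniform(-12.0, 12.0, size=(n_noise, dim)))
    return np.vstack(parts)

X = generate_collection(n_clusters=4, dim=2, noise_ratio=0.05)
print("generated samples:", X.shape)
```

    Extending module 1 with shape and density parameters, and module 2 with hierarchy levels, would move this toy sketch toward the parametrization described above.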

    - A new internal evaluation index (FLAG) based on intra-cluster fluctuation and the agglomeration of elements in the local environment of each cluster (inter-cluster dispersion). Unlike most internal metrics, which rely on sample separation to compute intra-cluster dispersion, our metric focuses on density fluctuation. Furthermore, when computing inter-cluster dispersion, most of the discussed metrics rely on cluster centroid positions to calculate absolute distances between clusters, which works appropriately for globular clusters but is not representative of arbitrarily shaped clusters or clusters with different scales. Instead, our metric takes into account the distance between cluster boundaries, capturing the agglomeration of the compared clusters while respecting their respective scales. This metric will be used to evaluate the performance of clustering algorithms when annotated real-world data is unavailable.
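    The motivation for boundary-based distances can be shown with a small example (ours, not the FLAG implementation): for two thin, elongated clusters lying end to end, the centroid distance suggests strong separation while the boundaries nearly touch:

```python
# Centroid vs. boundary distance for two elongated clusters placed end to
# end: the centroid distance overstates their separation, while the
# boundary (minimum pairwise) distance shows they almost touch.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = np.column_stack([rng.uniform(0.0, 10.0, 200), rng.normal(0.0, 0.05, 200)])
b = np.column_stack([rng.uniform(10.5, 20.5, 200), rng.normal(0.0, 0.05, 200)])

centroid_dist = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
boundary_dist = cdist(a, b).min()
print(f"centroid={centroid_dist:.2f}  boundary={boundary_dist:.2f}")
```

    A centroid-based index would judge these clusters well separated; a boundary-based one correctly reports them as agglomerated.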

    - A benchmark for density-based clustering. We have identified the following as the primary factors to be studied: (1) the shape of the clusters to be discovered, (2) the density variability of the sample space, (3) the amount of noise in the distribution, (4) the depth of the hierarchy in the data generation process and (5) the dimensionality of the data. A series of databases exploring each of these factors is generated. This is done by exploring different values of the parameters that control the SynFlex generation system. Furthermore, two additional studies are presented: (6) a difficulty study, which combines several factors of impact, creating four levels of clustering difficulty, and (7) a study of unlabeled real-world databases using the FLAG index. The objective of these last two experiments is to provide a realistic performance evaluation of the algorithms in generic scenarios.

    4. Conclusions One of the main goals of this work was to find clustering algorithms able to adapt to data distributions with complex density variations. In this dissertation, we proposed two novel density-based clustering algorithms: 1) KDBSCAN is designed to identify, from the underlying distributions of the data, clusters of points with high intra-cluster density variations that may be separated by sparse areas; and 2) VDBSCAN is designed to exploit data distributions where density decreases radially from a center of mass.

    Additionally, we have presented automatic touristic landmark discovery as a direct application for these two methods. Finding touristic landmarks is a demanding endeavor, not only because of the subjective nature of the task, but also because of the difficulty of data gathering and meta-data analysis. Nonetheless, understanding how and why people visit different places might help in the challenge of building more sustainable tourism models in popular destinations, as well as attracting interest to less popular ones. Furthermore, manually crafting travel guides is a tedious process that hinders the availability of guides for less mainstream destinations. This particular scenario raises the need for automatic tools to generate valuable content. To tackle this problem, we have made use of Flickr, a public web platform that stores user-generated multimedia content. The development of this research resulted in the generation of a dataset containing all the studied regions, which has been made publicly available.

    Taking all the above into account, we have assessed our algorithms in a real scenario (the dataset obtained from Flickr). First, KDBSCAN was used to discriminate independent conurbations (Places of Interest) within a certain region. After that, we applied VDBSCAN to each resulting conurbation to provide an estimation of the landmark distribution within the region. The obtained results prove that this approach outperforms the current methods in the literature regarding landmark discovery (6% increase over the second best for individual cities or towns). This improvement is particularly significant when analyzing regions with multiple conurbations (15% increase over the second best method in the literature). It is worth noting that when algorithms KDBSCAN and VDBSCAN were designed, certain considerations were taken into account, the main one being the availability of metadata in the samples (i.e., descriptive tags) to be combined with the geographical information. Nevertheless, later experiments showed that, in the absence of this additional metadata, HDBSCAN often proved to be a more effective method for landmark discovery. This emphasises the relevance of the proposed combined distance and, furthermore, opens up potential lines of research that will be discussed later.

    Another important problem considered in this thesis was the lack of proper evaluation techniques in the context of density-based clustering. To tackle this issue, we have proposed a synthetic data generation system with flexible parameters (SynFlex) that allows modeling different characteristics regarding the nature of the data (density, shape, dimensionality, hierarchy, noise, etc.). This system was proven able to flexibly produce realistic datasets in the context of landmark discovery, and its highly flexible parametrization makes it potentially applicable to a wide range of fields. Also related to the evaluation demand, an internal index based on intra-cluster fluctuation and inter-cluster agglomeration (FLAG) was proposed. This index showed significant correlation with the external metric used to evaluate annotated data in this context (Adjusted Mutual Information), outperforming other generic, well-known indices (Dunn, Davies-Bouldin, Calinski-Harabasz and Silhouette) as well as some task-specific metrics (SV, DBCV).

    To conclude, these two contributions were combined to build a new evaluation benchmark for density-based clustering. The benchmark consists of real and synthetic data gathered with the goal of providing a systematic evaluation of the performance of clustering algorithms regarding five crucial factors: shape, density, noise, hierarchy and dimensionality. A set of well-known clustering algorithms has been assessed using the benchmark, and the obtained results outlined some of their strengths and weaknesses. Particularly, noise was proven to be a dominant factor when analyzing the algorithms' performance, with DBSCAN and its modifications (HDBSCAN and K+VDBSCAN) being less affected by its presence. Also worth mentioning is the swap in performance at the top of the ranking with respect to the experiments on landmark discovery. In those experiments, we were dealing with spatial and textual features and, while K+VDBSCAN showed a better performance when additional textual information was available, HDBSCAN performed slightly better in most generic scenarios with only spatial information.

    In summary, we have demonstrated the usefulness of our benchmark and we firmly believe that it will become a powerful tool to support the development of new algorithms and techniques in the field of density-based clustering.

    Although the research conducted for this dissertation allowed us to extract some significant conclusions, there are still some potential lines of research that could further enhance the impact of this thesis. Regarding the research performed for the development of density-based clustering algorithms, one would be to explore the inclusion of visual analysis in the clustering procedure, in the same way that we have included textual analysis. However, image processing techniques are often extremely time-consuming, so this should be done only if an exceptionally robust input for the subsequent processing blocks of the project were required. On the other hand, regarding the research on evaluation techniques, the main line of research that we identify is the inclusion of additional experiments in the evaluation benchmark. Additionally, the experiments on the benchmark also showed that, in the absence of additional metadata, HDBSCAN proved to be more competitive than other clustering alternatives, so it would be interesting to pursue a line of research that combined this method with complementary algorithms, such as the proposed KDBSCAN, to further improve its performance.

    Finally, we are confident that the proposed benchmark can significantly boost research in the field, as it offers a fair, systematic and complete evaluation mechanism. For instance, deep learning models for metric learning, very useful for clustering tasks, could be trained using large, labeled, synthetic datasets generated using SynFlex. Furthermore, the benchmark can be the basis to approach scenarios where algorithms provide poor performances (e.g., scenarios with a strong noise presence) and attempt to design and develop new clustering algorithms.

