Documat


Multidimensional clustering with Bayesian networks

  • Author: Fernando Rodríguez Sánchez-Gontán
  • Thesis advisors: Pedro Larrañaga Múgica (co-advisor), Concha Bielza Lozoya (co-advisor)
  • Defended: at the Universidad Politécnica de Madrid (Spain) in 2021
  • Language: English
  • Abstract
    • Advances in communication and ongoing globalization have resulted in ever larger quantities of data being stored. However, data have increased not only in volume but also in complexity: more and more data are collected using different measurement methods. In this context, traditional clustering algorithms are unable to comprehensively describe all of the information the data contain. That is why new clustering techniques that consider multiple dimensions of data are more necessary than ever. One of these techniques is multidimensional clustering, which extends model-based clustering by learning mixture models with multiple categorical latent variables. Each latent variable identifies a dimension along which the data are partitioned into clusters, and each dimension comprises a different subset of the domain variables.

      Bayesian networks are useful in multidimensional clustering for several reasons. First, their graphical structure allows for an easier interpretation, showing which variables are relevant for each clustering. Second, their conditional independences result in more compact models that are easier to learn. Finally, Bayesian networks support probabilistic inference, which is useful for making predictions, diagnoses and explanations.

      In this dissertation we explore the problem of learning Bayesian network models for multidimensional clustering. Although there is an extensive literature on multidimensional clustering methods for categorical data and for continuous data, little work addresses mixed data (i.e., data composed of both categorical and continuous variables). For this reason, we propose approaches that deal efficiently with mixed data by exploiting the Bayesian network factorization and the variational Bayes framework. More specifically, we make the following contributions.

      First, we present an incremental algorithm for learning conditional linear Gaussian Bayesian networks with categorical latent variables whose structures are restricted to forests. The learning process is divided into two phases. In the first phase, the forest structure is expanded with a new arc or latent variable. In the second phase, the cardinalities of the latent variables are estimated. Furthermore, we devise a variant of this algorithm that considers only a subset of the possible structures, and we demonstrate the effectiveness of the approach.

      Second, we develop a greedy algorithm for learning conditional linear Gaussian Bayesian networks with categorical latent variables that are not restricted to tree-like structures. To this end, the proposed method hill-climbs the space of models using a series of latent operators and a variational Bayesian version of the structural expectation-maximization algorithm.

      Finally, we present a multidimensional clustering study of Parkinson's disease data in which we apply the proposed methodology. We consider data from a large, multi-center, international, and well-characterized cohort of patients. As a result, eight sets of motor and non-motor symptoms are identified. Each of them provides a different way to group patients: impulse control issues, overall non-motor symptoms, presence of dyskinesias and psychosis, fatigue, axial symptoms and motor fluctuations, autonomic dysfunction, depression, and excessive sweating.
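
The idea of multidimensional clustering described in the abstract, where several categorical latent variables each partition the data along a different subset of variables, can be illustrated with a toy sketch. The example below is not the thesis's algorithm: the model, its parameters, and the names `MODEL`, `Z_a`, `Z_b`, `x1` to `x4`, `posterior`, and `cluster` are all hypothetical, and the structure is fixed by hand rather than learned. It assumes two independent latent variables, each with Gaussian children, and only shows how such a model yields one cluster assignment per dimension.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical hand-specified model (an assumption for illustration only).
# Each latent variable has a prior over its states and governs its own
# disjoint subset of continuous variables; each child stores one
# (mean, std) pair per latent state.
MODEL = {
    "Z_a": {
        "prior": [0.5, 0.5],
        "children": {
            "x1": [(0.0, 1.0), (4.0, 1.0)],
            "x2": [(0.0, 1.0), (3.0, 1.0)],
        },
    },
    "Z_b": {
        "prior": [0.7, 0.3],
        "children": {
            "x3": [(0.0, 1.0), (5.0, 1.0)],
            "x4": [(0.0, 1.0), (2.0, 1.0)],
        },
    },
}


def posterior(latent, observation):
    """Posterior over the states of one latent variable given its children.

    Because the latent variables in this toy model are independent and have
    disjoint children, each posterior depends only on that latent's subset.
    """
    spec = MODEL[latent]
    weights = []
    for state, prior in enumerate(spec["prior"]):
        likelihood = prior
        for variable, params in spec["children"].items():
            mean, std = params[state]
            likelihood *= gaussian_pdf(observation[variable], mean, std)
        weights.append(likelihood)
    total = sum(weights)
    return [w / total for w in weights]


def cluster(observation):
    """Assign the most probable state of every latent variable."""
    assignment = {}
    for latent in MODEL:
        probs = posterior(latent, observation)
        assignment[latent] = max(range(len(probs)), key=probs.__getitem__)
    return assignment
```

A single observation thus receives one label per dimension, for example `cluster({"x1": 4.0, "x2": 3.0, "x3": 0.0, "x4": 0.0})` assigns it to different clusters along `Z_a` and `Z_b`. In the setting the abstract describes, the structure, the latent cardinalities, and the parameters would be learned from data rather than fixed as here.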

