Knowledge discovery in non-linear structures

Aída Angela Jiménez Moscoso del Prado

Ayuda

Knowledge discovery in non-linear structures

Autores: Aída Angela Jiménez Moscoso del Prado
Directores de la Tesis: Fernando Berzal (dir. tes.) , Juan Carlos Cubero Talavera (dir. tes.)
Lectura: En la Universidad de Granada ( España ) en 2011
Idioma: inglés
ISBN: 9788469429198
Tribunal Calificador de la Tesis: María Amparo Vila Miranda (presid.) , Nicolás Marín Ruiz (secret.) , José Hernández Orallo (voc.) , Rudolf Kruse (voc.) , Ernestina Menasalvas (voc.)
Enlaces
- Tesis en acceso abierto en: DIGIBUG
Resumen
- The use of non-linear data structures in many interesting problems has spurred the interest of data mining researchers in the development of efficient and scalable data mining techniques for these special data structures. Trees, in particular, appear in many different problem domains, ranging from the Web and XML documents to bioinformatics and computer networks. They have recently attracted the attention of the research community, in part because they are particularly amenable to efficient pattern mining techniques. Identifying frequent patterns in a database of trees is an important task in solving many tree mining problems. These frequent patterns, also known as frequent subtrees, represent common substructures in a tree database.
  
  Many tree mining algorithms have been proposed in the literature but they cannot be directly applied to partially-ordered trees because they typically work either on completely-ordered or on completely-unordered trees. Partially-ordered trees are useful when the order within some sets of siblings is important but it is not necessary to establish an order relationship within all the sets of sibling nodes.
  
  An example of application can be taken from a typical scenario in market basket analysis, where the information of the orders of a customer in a supermarket it is stored in an XML document. In this example, it is important to preserve the order among the different orders placed by the same customer, while the order among the items in the same order is irrelevant. In practice, XML schemata define the structure of XML documents and they determine which sets of sibling nodes are ordered and which ones are unordered.
  
  In this thesis we propose a new tree mining algorithm which is able to work with partially-ordered trees. Furthermore, the algorithm devised has been improved to be more effient in execution time as well as in memory consumption.
  
  On the other hand, we have applied tree mining tecnhiques in two different areas: multirelational data mining and identifying motifs in music. Furthermore, as a parallel line of work, we have studied the pesence of groups in a database, and how they can be defined using a new kind of association rules that we have called group association rules.
  
  Obtained results The results of this thesis can be grouped in two different categories, those which are directely related with tree pattern mining techniques and those which refers to applications.
  
  Frequent tree pattern mining ¿ We have realized a full survey of the state of the art in the tree mining area, studying the different tree mining algorithms proposed in the literature. This review provides a classification of the algorithms according to the kinds of input trees they can be applied to and the kinds of subtrees they are able to identify within their input trees. We have also examined these algorithms and compare them in order to highlight their similarities and differences [1].
  
  ¿ We have devised a new algorithm, called POTMiner, that is able to work with partially-ordered trees, as well as completely ordered and completely unordered trees. This algorithm is as efficient and scalable as existing algorithms that exclusively work on either ordered or unordered trees [2].
  
  ¿ POTMiner algorithm has been improved in order to reduce its execution time and its memory consumption. In order to improve the CPU time, a parallel version of POTMiner has been implemented. Parallel POTMiner lets us exploit the architecture of modern multi-core processors and multiprocessors. Furthermore, POTMiner Light and POTMiner DP, another two variants of POTMiner, have been devised in order to reduce its memory consumption [3].
  
  Applications ¿ Multirelational data mining: Several algorithms have been developed in order to mine multirelational databases. Those that resort to join-based techniques do not preserve the proper support counting, and others are specially oriented to classiffication or clustering. Our proposal is to transform the multirelational database into a set of trees; these way efficient pattern mining algorithms that exist for working with trees can be applied. Furthermore, we can represent a knowlegde that previous techniques could not extract from the multirelational databases. Several kinds of tree patterns can be extracted from the multirelational datbase and we have studied the equivalence between these sets of patterns. Moreover, we have extracted association rules from the multirelational database using these patterns [4].
  
  ¿ Group association rules: Databases can naturally contain groups of individuals that share some characteristics. The goal of our approach is discover such groups in the databases and define using a new kind of association rules that we have called group association rules. we describe how to adapt some of the interestingness measures that have been defined for association rules to group association rules and how these modified measures will help us to rank the different groups in a database in order to highlight the most interesting ones [5].
  
  ¿ Identifying transposed motif in music: The study of the frequent patterns, called motifs in this context, that appears in the melody of a song is useful, for example, to determine the chorus of the song. The problem of motif extraction in a piece of music can be seen as a particular case of mining frequent patterns from a sequence when using the symbolic representation of the song. Then, we have adapted our algorithm POTMiner to deal with sequences. We consider the problem of the presence of transposed motifs. These motifs preserve the interval distances within notes although the notes are not the same that in the original motif. Transposed motifs should be counted as if they were exact repetitions [6].