Documat


Summary of Classification of imbalanced data sets, data with missing attributes and streaming data

Mónica Millán Giraldo

  • This Ph.D. thesis proposes strategies for classifying data sets with class imbalance, data sets with missing attributes, and data streams with missing and delayed attributes.

    Although the classification of imbalanced data sets has been extensively studied in the feature space, very few approaches explore the class imbalance problem in the dissimilarity space. This thesis investigates two questions: how using an imbalanced or a balanced representation set affects the representation of data by dissimilarities, and how prototype selection methods perform when used to under-sample the dissimilarity matrix. In the first case, the best results are obtained when the training set is first preprocessed with a resampling algorithm. In the second case, the results indicate that the proposed strategy, which selects the representation set from the majority class, performs better than classical prototype selection methods applied over all classes.
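
A dissimilarity representation replaces each object's feature vector with its vector of distances to the objects of a representation set. The following is a minimal sketch of that idea, not the thesis's method: the Euclidean distance, the toy data, and the helper `balanced_representation_set` (which simply keeps the first few objects of each class) are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def balanced_representation_set(train, per_class):
    """Hypothetical selector: keep the first `per_class` objects of each class.

    A real prototype selection method would go here; this is a placeholder.
    """
    counts = {}
    selected = []
    for features, label in train:
        if counts.get(label, 0) < per_class:
            selected.append(features)
            counts[label] = counts.get(label, 0) + 1
    return selected


def dissimilarity_representation(objects, representation_set, dist=euclidean):
    """Map every object to its distances to each representation object."""
    return [[dist(obj, proto) for proto in representation_set] for obj in objects]


# Toy imbalanced training set: six majority objects (0), two minority objects (1).
train = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
    ((1.0, 1.0), 0), ((0.5, 0.5), 0), ((0.2, 0.8), 0),
    ((4.0, 4.0), 1), ((5.0, 4.0), 1),
]

R = balanced_representation_set(train, per_class=2)
D = dissimilarity_representation([features for features, _ in train], R)
```

Each row of `D` is the new representation of one training object, with one column per representation object. Any classifier that accepts numeric vectors can then be trained on `D` in place of the original features.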

    The problem of classifying data sets with missing attributes is analyzed in both the feature space and the dissimilarity space. In the feature-based representation, a modification of the imputation technique based on support vector regression (SVR) is developed. In the dissimilarity space, two approaches that handle incomplete data without any imputation process are proposed. Additionally, the performance of the techniques proposed in both spaces is analyzed in credit scoring applications. The results show that the modified SVR-based technique outperforms other methods in terms of accuracy and type I and type II errors when it is used in combination with the nearest neighbor classifier. Similarly, the proposed dissimilarity-based methods provide better accuracy rates than other methods when they are used with the linear, quadratic and Fisher classifiers.

    For the classification of data streams containing objects for which one attribute arrives only after a given delay, three on-line learning strategies are proposed. These strategies address how to classify the incomplete objects, whether to wait for the delayed attribute before performing any classification, and when and how to update the training set. The results reveal that the proposed on-line strategies, despite their simplicity, may outperform classifiers that use only the original, labeled-and-complete samples as a fixed training set. In other words, learning is possible by properly exploiting the unlabeled, incomplete samples and their delayed attributes.
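
One way to picture such an on-line strategy is sketched below. It is an illustrative assumption, not one of the three strategies from the thesis: an incomplete object is classified immediately with a nearest neighbor rule restricted to the attributes that have arrived, and once the delayed attribute shows up, the completed object is added to the training set with its predicted label. The class `DelayedAttributeLearner` and all its method names are hypothetical.

```python
import math


def nn_predict(train, x):
    """1-NN prediction using only the attributes of x that are not None."""
    known = [i for i, value in enumerate(x) if value is not None]

    def dist(sample):
        features, _ = sample
        return math.sqrt(sum((features[i] - x[i]) ** 2 for i in known))

    return min(train, key=dist)[1]


class DelayedAttributeLearner:
    """Hypothetical on-line learner for objects with one delayed attribute."""

    def __init__(self, labeled):
        self.train = list(labeled)  # complete, labeled samples
        self.pending = {}           # object id -> (incomplete features, predicted label)

    def classify(self, obj_id, features):
        """Classify an incomplete object now and remember it for later."""
        label = nn_predict(self.train, features)
        self.pending[obj_id] = (list(features), label)
        return label

    def complete(self, obj_id, index, value):
        """Fill in the delayed attribute and add the object to the training set."""
        features, label = self.pending.pop(obj_id)
        features[index] = value
        self.train.append((tuple(features), label))


labeled = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
learner = DelayedAttributeLearner(labeled)

first = learner.classify(1, (0.5, None))   # second attribute has not arrived yet
learner.complete(1, 1, 0.4)                # delayed attribute arrives
second = learner.classify(2, (4.0, None))
```

Because the added samples carry predicted rather than true labels, a wrong prediction can propagate into later decisions; when and whether to update the training set is exactly the kind of choice the strategies above have to make.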

    Furthermore, because conventional methods for dealing with incomplete objects perform differently on different data sets, a more general approach is required to make the learning and classification procedures more data-independent. To that end, two new perspectives based on reinforcement learning (RL) are proposed. Unlike existing RL approaches, the proposed RL algorithms regard the data stream itself as the environment and the learning/classification decisions as the possible actions. As evidenced empirically on a range of data sets, these approaches make the algorithm exhibit adaptive behavior and develop an ability to use the available individual methods more effectively than when they are applied separately.

