Summary of Novel support vector machines for diverse learning paradigms

Gabriela Angela Melki

  • 1. Introduction / Motivation In traditional classification and regression problems, learning algorithms uncover dependencies and patterns that exist between given inputs (samples) and their outputs (categorical or continuous), using training data. Identifying these patterns is a non-trivial task due to many factors, such as the high dimensionality of the data, as well as dataset size. Over the past decade, dataset sizes have grown disproportionately to the speed of processors and memory capacity, making computational time a limiting factor for machine learning methods. Many real-world applications, such as human activity recognition, operations research, and video/signal processing, require algorithms that are scalable and accurate, while being able to provide insightful information in a timely fashion.

    More recently, these classic methods have been extended to accommodate various types of data paradigms [1]. Examples include Multiple Target (MT) learning, Multiple Instance (MI) learning, and Data Stream learning. These emerging paradigms require algorithms to be robust, while accommodating and exploiting their different and complex data representations.

    In this thesis, various approaches are devised for solving classification and regression problems within the traditional and non-traditional learning paradigms mentioned, using support vector machines.

    Multi-target learning is a challenging task that consists of creating predictive models for problems with multiple simultaneous outputs [2, 3, 4]. Learning under this paradigm has the capacity to generate models representing a wide variety of real-world applications, ranging from natural language processing [5] to bioinformatics [6]. MT learning includes multi-target regression (MTR), which addresses the prediction of continuous targets; multilabel classification [7], which focuses on binary targets; and multi-dimensional classification, which describes the prediction of discrete targets [8]. One contribution of this dissertation focuses on the multi-target regression problem, also known as multi-output, multi-variate, or multi-response regression [8].

    A characteristic of multi-target data is that the outputs have some structure, in the form of inter-relationships, correlations, and dependencies. Although simultaneously modeling the multi-variate nature of, and possibly complex relationships between, the target variables is challenging, past empirical work has shown that the targets are more accurately represented by a single multi-target model [3, 9]. The most valuable advantage of multi-target techniques is that they exploit not only the relationships between the sample variables and the targets, but also the relationships among the targets themselves [2, 3]. This yields a better representation and interpretability of real-world problems that produce multiple outputs than a series of single-target (or traditional) models [10]. In addition, MT models can be considerably more computationally efficient to train than multiple single-target models trained individually [11].

    Several methods have been proposed for solving such multi-target tasks, and they can be categorized into two groups. The first comprises problem transformation methods, also known as local methods, in which the multi-target problem is transformed into multiple single-target (ST) problems, each solved separately using standard classification and regression algorithms. The second comprises algorithm adaptation methods, also known as global or big-bang methods, which adapt existing traditional algorithms to predict all the target variables simultaneously [8]. Algorithm adaptation methods tend to outperform problem transformation methods, but they are deemed more challenging since they must predict, model, and interpret multiple outputs simultaneously.
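
    As an illustration, the following is a minimal sketch of the problem transformation (local) approach: one independent single-target regressor per output. It uses scikit-learn's SVR on synthetic data purely for illustration.

```python
# A minimal sketch of the problem transformation (local) approach to
# multi-target regression: one independent SVR per target column.
# The data below are synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # 200 samples, 5 input features
Y = rng.normal(size=(200, 3))   # 3 continuous targets

# Transform the multi-target problem into 3 single-target problems.
models = [SVR(kernel="rbf").fit(X, Y[:, j]) for j in range(Y.shape[1])]

# Predict each target independently and stack the columns back together.
Y_hat = np.column_stack([m.predict(X) for m in models])
```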

    Multi-instance learning (MIL) is a generalization of supervised learning that has recently been gaining interest because of its applicability to many real-world problems such as image classification and annotation [12], human action recognition [13], predicting student performance [14], and drug activity prediction [15]. The difference between MIL and traditional learning lies in the nature of the data. In the multi-instance classification setting, a sample is a bag that contains multiple instances and is associated with a single label. The individual instance labels within a bag are unknown, and bag labels are assigned based on a multi-instance assumption, or hypothesis. Introduced by Dietterich et al. [15], the standard MI assumption states that a bag is labeled positive if and only if it contains at least one positive instance, and is negative otherwise. In other words, the bag-level class label is the disjunction of the instance-level class labels. Other hypotheses have been proposed by Foulds and Frank [16] to encompass a wider range of applications with MI data, but the scope of this thesis is the standard MI assumption.
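
    In code form, the standard MI assumption is simply a disjunction over the (hidden) instance labels; the toy check below makes the rule concrete, assuming +1/-1 labels.

```python
# The standard MI assumption [15]: a bag is positive if and only if it
# contains at least one positive instance. Instance labels are hidden in
# practice; they appear here only to make the labeling rule concrete.
def bag_label(instance_labels):
    """Bag-level label as the disjunction of instance-level labels (+1/-1)."""
    return +1 if any(y == +1 for y in instance_labels) else -1

assert bag_label([-1, -1, +1]) == +1   # one positive instance suffices
assert bag_label([-1, -1, -1]) == -1   # an all-negative bag is negative
```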

    Multi-instance classification methods are typically categorized by how the information within the data is exploited. Under the Instance-Space (IS) paradigm, discriminative information is considered to be at the instance level, where instance-level classifiers aim to separate instances in positive bags from those in negative ones. Given a new bag, the classifier predicts the bag label by aggregating the instance-level scores using some MI assumption. The IS paradigm is based on local, or instance-level, information, where learning is not concerned with global characteristics of the entire bag. Unlike the IS paradigm, the Bag-Space (BS) paradigm considers the information provided by the bag as a whole, also known as global, or bag-level, information. Another approach for dealing with multi-instance data falls under the Embedded-Space (ES) paradigm, where each bag is mapped to a single feature vector that summarizes the information contained within it. The original bag space is mapped to a vector space, where a classifier is then trained. Under this paradigm, the multi-instance problem is transformed into a traditional supervised learning problem, to which any classifier can be applied.
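
    As a hedged sketch of the Embedded-Space idea, the snippet below maps each variable-size bag to a fixed-length vector via simple mean/max pooling (one of many possible embeddings, not a method from this thesis) and then trains a standard SVM on the embedded data.

```python
# Embedded-Space sketch: summarize each bag as one fixed-length vector,
# then apply any standard supervised classifier. Mean/max pooling is an
# illustrative embedding choice; the bags and labels are synthetic.
import numpy as np
from sklearn.svm import SVC

def embed(bag):
    """Map a (n_instances, n_features) bag to a single feature vector."""
    return np.concatenate([bag.mean(axis=0), bag.max(axis=0)])

rng = np.random.default_rng(1)
bags = [rng.normal(size=(rng.integers(3, 10), 4)) for _ in range(50)]
labels = rng.choice([-1, 1], size=50)       # illustrative bag labels

X = np.vstack([embed(b) for b in bags])     # bag space -> vector space
clf = SVC(kernel="rbf").fit(X, labels)      # traditional learner applies
```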

    A recent survey [17] organized the various problems and complexities associated with MIL into four broad categories, each raising different challenges: prediction level, bag composition, label ambiguity, and data distribution. As mentioned previously, when instances are grouped into bags, predictions can be performed at two levels: the bag level or the instance level. Certain types of algorithms are often better suited for one of these two types of prediction. The composition of each bag, such as the proportion of instances of each class and the relationships between instances, also affects the performance of MIL methods. The ambiguity among instance labels, stemming from label noise and unclear relationships between an instance and its class, is another complexity that should be considered. Finally, the underlying distributions of the positive and negative classes affect MIL algorithms depending on their assumptions about the data.

    One of the major complexities that this thesis tackles is the ambiguity of the relationship between a bag label and the instances within the bag. This issue stems from the standard MI assumption, where the underlying distribution among instances within positive bags is unknown. There have been different attempts to overcome this complexity, such as “flattening” the MIL datasets, meaning each instance contained in a positive bag adopts a positive label, allowing the use of classical supervised learning techniques [18]. This approach assumes that positive bags contain a significant number of positive instances, which may not be the case, causing the classifier to mislabel negative instances within the bag and decreasing the power of the MI model. To overcome this issue, a different MIL approach was proposed, where subsets of instances are selected from positive bags for classifier training [19]. One drawback of this type of method is that the resulting training datasets become imbalanced towards positive instances. Model performance deteriorates further when more instances are selected as subsets than needed [20]. The MIL contribution of this thesis aims to deal with these drawbacks by minimizing class imbalance, which is achieved by optimally selecting bag-representatives from both classes.

    The Data Stream learning paradigm has recently become a more prominent area of research due to advancements in software and hardware technologies, which have vastly increased the amount and frequency of available data [21]. The classic machine learning process, whether for traditional supervised learning or non-traditional learning paradigms, is divided into two phases: model building and model testing on static datasets. It is often assumed that the data-generating process is stationary, i.e. the data are drawn from a fixed, yet unknown, probability distribution; however, in many real-world scenarios this is not the case. Rather, data are now being made available in an online, or streamed, fashion, and are usually generated by an evolving, or drifting, phenomenon [22, 23, 24]. A stream drawn from a fixed distribution is described as stationary, while one generated by a drifting phenomenon is non-stationary. These drifts can be due to a number of events, such as seasonality effects, changes in user preferences, or hardware/software faults. Due to the fluid nature of the data-generation environment, where the probabilistic characteristics of the data can change over time, traditional machine learning methods are bound to perform sub-optimally at best or completely fail at worst. These complexities prompted the need for effective, efficient, and accurate algorithms for learning from stationary streams as well as for adapting to drifting environments.

    Applications of data stream classification can vary from astronomical and geophysical operations [25] to real-time recommender systems for business and industrial uses [26, 27].

    Adapting traditional classification methods to these types of scenarios is usually a non-trivial task. The algorithms need to perform classification immediately upon request, since it may not be possible to control the rate at which test samples arrive. Another hurdle stems from the possibility of drifting concepts; if they occur, the classifier will most likely become outdated after a period of time [28]. Two strategies are commonly used when learning from non-stationary data streams, referred to as active and passive approaches [23, 29]. They differ in the methods they employ for adapting to a possibly evolving data stream. The active approach relies on an explicit drift detector in order to invoke an appropriate adaptation mechanism, while the passive approach continuously updates the model as data are received, without the need for an explicit drift detector. Deciding which approach to use depends on the application (whether there are sudden concept drifts, and whether the data arrive online or in batches), the computational resources that are available, and any prior assumptions about the data distribution [30].

    Generally, passive approaches have been shown to be more effective in classification settings where drifts are gradual [29]. Although active approaches can also detect and deal with gradual drifts, the change detection is considerably more difficult [31]. Active approaches work well in settings where the concept drift is abrupt. Additionally, passive approaches generally perform better in batch learning settings, whereas active approaches have been shown to work well in the online setting [32, 33].
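
    The contrast between the two strategies can be sketched as follows. The drift detector below is a toy moving-average heuristic (not a published detector such as ADWIN or DDM), and scikit-learn's SGDClassifier stands in for any incremental learner.

```python
# A compact sketch of passive vs. active adaptation on a stream of
# (x, y) pairs. "passive" always updates the model; "active" also resets
# it when a toy error-rate heuristic flags a drift.
import numpy as np
from collections import deque
from sklearn.linear_model import SGDClassifier

def run_stream(stream, classes, mode="passive", window=50):
    model, fitted = SGDClassifier(loss="hinge"), False
    recent, errs, seen = deque(maxlen=window), 0, 0
    for x, y in stream:
        X = np.asarray(x, dtype=float).reshape(1, -1)
        if fitted:
            err = int(model.predict(X)[0] != y)
            recent.append(err); errs += err; seen += 1
            drift = (mode == "active" and seen > 100 and
                     sum(recent) / len(recent) > 2 * errs / seen + 0.05)
            if drift:                      # explicit detection -> adapt
                model, fitted = SGDClassifier(loss="hinge"), False
        model.partial_fit(X, [y], classes=classes)   # always incorporate
        fitted = True
    return model
```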

    The contributions of this thesis aim to deal with the drawbacks that exist within these non-traditional learning paradigms, using traditional and novel solvers for support vector machines. Support vector machines (SVMs), proposed by Cortes and Vapnik [34], are popular linear and non-linear (kernelized) learning algorithms based on the idea of a large-margin classifier. They have been shown to improve generalization performance for binary classification problems. SVMs are similar to other machine learning techniques, but the literature shows that they usually outperform them in terms of scalability, computational efficiency, and robustness against outliers. They are known for creating sparse and non-linear classifiers, making them suitable for handling large datasets.
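
    For reference, the soft-margin L1-SVM of Cortes and Vapnik [34] solves the following standard primal problem, where the slack variables absorb margin violations:

```latex
% Standard soft-margin (L1) SVM primal problem [34]
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_{i}
\quad \text{s.t.} \quad y_{i}\left( w^{\top}\phi(x_{i}) + b \right) \ge 1 - \xi_{i},
\qquad \xi_{i} \ge 0, \quad i = 1, \dots, n
```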

    A traditional approach for training SVMs is the Sequential Minimal Optimization (SMO) algorithm [35], a method for solving the L1-SVM's Quadratic Programming (QP) task. Although SMO provides an exact solution to the SVM QP problem, its performance is highly dependent on the SVM hyperparameters. More recent approaches, which have been shown to surpass SMO in terms of scalability while remaining competitive in accuracy, include the LASVM algorithm [36], the Minimal Norm SVM (MNSVM) [37], and the Non-Negative Iterative Single Data Algorithm (NNISDA) [38, 39]. Despite the various efforts aimed at solving the SVM task efficiently, this area still requires investigation.

    To deal with issues of scalability, this thesis introduces a different approach that focuses on the minimization of the regularized L1-SVM through Stochastic Gradient Descent (SGD), a well-known, simple, yet efficient technique for learning classifiers under convex loss functions. Recently, SGD algorithms have been shown to have considerable performance and generalization capabilities in the context of large-scale learning [40], and have been used to solve the SVM problem, for example in NORMA [41] and PEGASOS [42, 43].
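
    As an illustration of this family of methods, below is a minimal PEGASOS-style SGD sketch for the regularized linear L1-SVM (hinge loss), following [42]; the hyperparameter values are arbitrary and the optional projection step of the full algorithm is omitted.

```python
# PEGASOS-style SGD for the regularized hinge loss (linear L1-SVM) [42].
# At step t the learning rate is 1/(lambda_ * t).
import numpy as np

def pegasos(X, y, lambda_=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lambda_ * t)          # decaying step size
            if y[i] * (w @ X[i]) < 1:          # hinge loss is active
                w = (1 - eta * lambda_) * w + eta * y[i] * X[i]
            else:                              # only the regularizer acts
                w = (1 - eta * lambda_) * w
    return w
```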

    2. Research Contributions This dissertation introduces novel support vector machines (SVMs) for the following traditional and non-traditional learning paradigms: online classification, multi-target regression, multiple-instance classification, and data stream classification.

    Three multi-target support vector regression (SVR) models are first presented. The first involves building independent, single-target SVR models for each target. The second builds an ensemble of randomly chained models using the first single-target method as a base model. The third calculates the targets' correlations and forms a maximum correlation chain, which is used to build a single chained SVR model, improving the model's prediction performance while reducing computational complexity.
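
    The chaining idea behind the third model can be sketched as follows; this is a rough illustration of a maximum correlation chain (one plausible greedy construction), not the thesis implementation. At prediction time, the earlier models' predictions would be fed as the extra features in place of the true targets.

```python
# Rough sketch of a maximum correlation chain for chained SVR: order the
# targets greedily by absolute pairwise correlation, then train one SVR
# per target, appending each earlier target as an extra input feature.
import numpy as np
from sklearn.svm import SVR

def correlation_chain(Y):
    """Greedy chain over targets, following strongest correlations."""
    C = np.abs(np.corrcoef(Y.T))
    np.fill_diagonal(C, 0.0)
    i, j = np.unravel_index(C.argmax(), C.shape)
    chain = [int(i), int(j)]
    left = set(range(Y.shape[1])) - set(chain)
    while left:                                # extend from the chain's tail
        nxt = max(left, key=lambda k: C[chain[-1], k])
        chain.append(nxt); left.remove(nxt)
    return chain

def fit_chained_svr(X, Y, chain):
    models, Z = [], X
    for t in chain:
        models.append(SVR(kernel="rbf").fit(Z, Y[:, t]))
        Z = np.column_stack([Z, Y[:, t]])      # target becomes a feature
    return models
```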

    Under the multi-instance paradigm, a novel SVM multiple-instance formulation and an algorithm with a bag-representative selector, named Multi-Instance Representative SVM (MIRSVM), are presented. The contribution trains the SVM based on bag-level information and is able to identify the instances that highly impact classification, i.e. bag-representatives, for both positive and negative bags, while finding the optimal class-separating hyperplane. Unlike other multi-instance SVM methods, this approach eliminates possible class imbalance issues by allowing both positive and negative bags to have at most one representative each, these being the instances that contribute most to the model.

    Due to the shortcomings of current popular SVM solvers, especially in the context of large-scale learning, the third contribution presents a novel stochastic, i.e. online, learning algorithm for solving the L1-SVM problem in the primal domain, dubbed OnLine Learning Algorithm using Worst-Violators (OLLAWV). Unlike other stochastic methods, this algorithm provides a novel stopping criterion and eliminates the need for a regularization term, using early stopping instead. Because of these characteristics, OLLAWV was shown to efficiently produce sparse models while maintaining competitive accuracy.
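
    A schematic sketch of the worst-violator idea is given below: the sample that most violates the current decision function is selected and its coefficient updated, and learning stops once no violators remain (early stopping). The kernel choice, step-size schedule, and update rule here are illustrative assumptions, not the exact OLLAWV updates.

```python
# Hypothetical sketch of worst-violator online SVM learning: repeatedly
# update the kernel expansion with the sample whose margin y_i * f(x_i)
# is smallest, stopping once no violators remain. The RBF kernel, step
# size, and bias update are illustrative choices, not the OLLAWV ones.
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def worst_violator_sketch(X, y, C=1.0, gamma=0.5, max_iter=1000):
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = rbf_kernel(X, gamma)
    f = np.zeros(n)                       # cached decision values f(x_i)
    for t in range(1, max_iter + 1):
        wv = int(np.argmin(y * f))        # index of the worst violator
        if y[wv] * f[wv] > 0:             # no violators left:
            break                         # early stopping acts as regularizer
        step = 2.0 * C / (t + 1)          # illustrative decaying step size
        alpha[wv] += step * y[wv]
        b += step * y[wv] / n
        f += step * y[wv] * K[:, wv] + step * y[wv] / n
    return alpha, b                       # sparse: only updated samples are SVs
```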

    3. Conclusion This thesis introduced several novel SVM algorithms for learning from the following diverse machine learning paradigms: multi-target regression, multi-instance classification, traditional supervised classification, and data stream classification.

    Three unique approaches for multi-target regression were proposed: a baseline problem-transformation support vector regressor (SVR), an ensemble of randomly generated chains using this base model (SVRRC), and a maximally correlated chained model (SVRCC). The results highlighted the strong performance of SVR as a base model; however, because it generates independent regressors, the possible correlations amongst the targets are lost in the final model. SVRRC was designed to test whether taking these correlations into account would benefit the final learning model, and the results showed a performance increase. However, due to the random nature of SVRRC and its limit on the number of generated chains, capturing target correlations is not guaranteed. SVRCC was designed to remedy this issue using a maximum correlation chain, and it proved to capture target correlations accurately, providing the best results among the contributions as well as against the methods compared.

    A novel multi-instance bag-level formulation and algorithm with a bag-representative selector, MIRSVM, were also proposed. MIRSVM trains the model on bag-level information, iteratively selecting the best representative instance for each positive and negative bag while finding the optimal separating hyperplane. Unlike other existing approaches, this one eliminates possible class imbalance issues by allowing both positive and negative bags to be represented. The experimental and statistical study showed that bag-level learners outperform instance-level learners and wrapper methods. MIRSVM outperformed contemporary multi-instance SVMs, as well as other algorithms of different classes, over several metrics.

    After the previous experimental studies, it became evident that the existing popular SVM solvers that were used suffer from several disadvantages in both traditional and non-traditional settings. This prompted the design and implementation of a novel online, also known as stochastic, learning algorithm for solving the primal L1-SVM problem, dubbed OLLAWV. Unlike other online methods, OLLAWV eliminates the need for specifying the number of iterations, as well as the use of a regularization term; the proposed algorithm uses early stopping as its regularizer. OLLAWV was also designed with a novel stopping criterion, a trait that most stochastic methods lack. The experimental study, involving strict nested cross-validation, evaluated and compared the proposal with current popular SVM kernel methods that have been shown to outperform the traditional and widely used approaches for solving L1-SVMs, such as SMO and quadratic programming solvers. The results, along with complementary statistical analysis, showed the superior performance of OLLAWV against the compared methods, including 5 contemporary non-SVM methods. OLLAWV was shown to produce sparse models at very fast speeds without sacrificing accuracy. This, along with its online nature, prompted the investigation of its performance in the data stream setting.

    The final contribution of this thesis involved applying OLLAWV in a batch data stream classification setting, due to its major success in traditional classification. Before OLLAWV's implementation and experimentation were complete, its parent, OLLA-L2, was the first contender for a data stream algorithm due to its online nature, competitive performance against the popular SMO algorithm, and ability to produce sparse models at very fast rates. After preliminary experiments between the two algorithms, OLLAWV was shown to outperform OLLA-L2 in terms of accuracy. The prequential experimental study analyzed the performance of OLLAWV against 12 popular data stream algorithms, most of which are capable of adapting to drift. The goal of the study was to assess OLLAWV's performance as a baseline model in the data stream environment, stationary or not. The results highlighted OLLAWV's better performance against the compared methods. They also indicated that adaptive algorithms and methods with drift detectors were superior to the rest.
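
    For context, the prequential protocol referred to above is the interleaved test-then-train scheme sketched below: each arriving sample is first used to evaluate the current model and only then to update it, so accuracy is always estimated on unseen data. The incremental learner here is a stand-in, not OLLAWV itself.

```python
# Minimal prequential (interleaved test-then-train) evaluation loop.
# Every sample tests the current model before being used for training.
import numpy as np
from sklearn.linear_model import SGDClassifier

def prequential_accuracy(stream, classes):
    model, fitted = SGDClassifier(loss="hinge"), False
    correct = seen = 0
    for x, y in stream:
        X = np.asarray(x, dtype=float).reshape(1, -1)
        if fitted:                                  # test first...
            correct += int(model.predict(X)[0] == y)
            seen += 1
        model.partial_fit(X, [y], classes=classes)  # ...then train
        fitted = True
    return correct / max(seen, 1)
```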

    4. Bibliography [1] P. N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005. ISBN: 0321321367.

    [2] J. Baxter. “A Bayesian/information theoretic model of learning to learn via multiple task sampling”. In: Machine Learning 28.1 (1997), pp. 7–39.

    [3] R. Caruana. “Multitask learning”. In: Machine Learning 28.1 (1997), pp. 41–75.

    [4] S. Thrun. “Is learning the n-th thing any easier than learning the first?” In: Advances in Neural Information Processing Systems. 1996, pp. 640–646.

    [5] M. Jeong and G. G. Lee. “Multi-domain spoken language understanding with transfer learning”. In: Speech Communication 51.5 (2009), pp. 412–424.

    [6] Q. Liu et al. “Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study”. In: BMC Bioinformatics 11.1 (2010), pp. 181–196.

    [7] M. L. Zhang and Z. H. Zhou. “A review on multi-label learning algorithms”. In: IEEE Transactions on Knowledge and Data Engineering 26.8 (2014), pp. 1819–1837.

    [8] H. Borchani et al. “A survey on multi-output regression”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5.5 (2015), pp. 216–233.

    [9] T. Evgeniou, C. A. Micchelli, and M. Pontil. “Learning multiple tasks with kernel methods”. In: Journal of Machine Learning Research 6 (2005), pp. 615–637.

    [10] S. Ben-David and R. Schuller. “Exploiting task relatedness for multiple task learning”. In: Learning Theory and Kernel Machines. Springer, 2003, pp. 567–580.

    [11] T. Evgeniou and M. Pontil. “Regularized multi–task learning”. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004, pp. 109–117.

    [12] G. Herman et al. “Region-based image categorization with reduced feature set”. In: Proceedings of the 10th IEEE Workshop on Multimedia Signal Processing. 2008, pp. 586–591.

    [13] Y. Yi and M. Lin. “Human action recognition with graph-based multiple-instance learning”. In: Pattern Recognition 53 (2016), pp. 148–162.

    [14] A. Zafra, C. Romero, and S. Ventura. “Multiple instance learning for classifying students in learning management systems”. In: Expert Systems with Applications 38.12 (2011), pp. 15020–15031.

    [15] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. “Solving the multiple instance problem with axis-parallel rectangles”. In: Artificial Intelligence 89 (1997), pp. 31–71.

    [16] J. Foulds and E. Frank. “A review of multi-instance learning assumptions”. In: The Knowledge Engineering Review 25.1 (2010), pp. 1–24.

    [17] M. A. Carbonneau et al. “Multiple instance learning: A survey of problem characteristics and applications”. In: Pattern Recognition 77 (2018), pp. 329–353.

    [18] S. Ray and M. Craven. “Supervised versus multiple instance learning: an empirical comparison”. In: Proceeding of the International Conference on Machine Learning. 2005, pp. 697–704.

    [19] O. Maron and T. Lozano-Pérez. “A framework for multiple-instance learning”. In: Neural Information Processing Systems 3201 (1998), pp. 570–576.

    [20] M. A. Carbonneau et al. “Robust multiple-instance learning ensembles using random subspace instance selection”. In: Pattern Recognition 58 (2016), pp. 83–99.

    [21] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy. “Mining data streams: a review”. In: ACM Sigmod Record 34.2 (2005), pp. 18–26.

    [22] National Research Council. Frontiers in massive data analysis. National Academies Press, 2013.

    [23] G. Ditzler et al. “Learning in nonstationary environments: A survey”. In: Computational Intelligence Magazine 10.4 (2015), pp. 12–25.

    [24] Z. H. Zhou et al. “Big data opportunities and challenges: Discussions from data analytics perspectives [discussion forum]”. In: Computational Intelligence Magazine 9.4 (2014), pp. 62–74.

    [25] M. C. Burl et al. “Diamond Eye: A distributed architecture for image data mining”. In: Data Mining and Knowledge Discovery: Theory, Tools, and Technology. Vol. 3695. International Society for Optics and Photonics. 1999, pp. 197–207.

    [26] H. Kargupta et al. “MobiMine: Monitoring the stock market from a PDA”. In: SIGKDD Explorations Newsletter 3.2 (2002), pp. 37–46.

    [27] H. Kargupta et al. “VEDAS: A mobile and distributed data stream mining system for real-time vehicle monitoring”. In: Proceedings of the SIAM International Conference on Data Mining. SIAM. 2004, pp. 300–311.

    [28] J. Gama et al. “A survey on concept drift adaptation”. In: ACM computing surveys (CSUR) 46.4 (2014), p. 44.

    [29] R. Elwell and R. Polikar. “Incremental learning of concept drift in nonstationary environments”. In: IEEE Transactions on Neural Networks 22.10 (2011), pp. 1517–1531.

    [30] C. Alippi. Intelligence for embedded systems. Springer, 2014.

    [31] C. Alippi, G. Boracchi, and M. Roveri. “Just in time classifiers: Managing the slow drift case”. In: Proceedings of the International Joint Conference on Neural Networks. IEEE, 2009, pp. 114–120.

    [32] A. Bifet and R. Gavalda. “Learning from time-changing data with adaptive windowing”. In: Proceedings of the SIAM international conference on data mining. SIAM. 2007, pp. 443–448.

    [33] J. Gama et al. “Learning with drift detection”. In: Proceedings of the 7th Brazilian symposium on artificial intelligence, lecture notes in computer science. Springer. 2004, pp. 286–295.

    [34] C. Cortes and V. Vapnik. “Support-vector networks”. In: Machine Learning 20.3 (1995), pp. 273–297.

    [35] J. Platt. “Sequential minimal optimization: A fast algorithm for training support vector machines”. In: Technical Report: MSR-TR-98-14 (1998).

    [36] A. Bordes et al. “Fast kernel classifiers with online and active learning”. In: Journal of Machine Learning Research 6 (2005), pp. 1579–1619.

    [37] R. Strack. “Geometric approach to support vector machines learning for large datasets”. PhD thesis. Virginia Commonwealth University, 2013.

    [38] V. Kecman and L. Zigic. “Algorithms for direct L2 support vector machines”. In: Proceedings of the IEEE International Symposium on Innovations in Intelligent Systems and Applications. 2014, pp. 419–424.

    [39] L. J. Zigic. “Direct L2 support vector machine”. PhD thesis. Virginia Commonwealth University, 2016.

    [40] L. Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of the Nineteenth International Conference on Computational Statistics. 2010, pp. 177–186.

    [41] J. Kivinen, A. J. Smola, and R. C. Williamson. “Online learning with kernels”. In: IEEE Transactions on Signal Processing 52.8 (2004), pp. 2165–2176.

    [42] S. Shalev-Shwartz et al. “Pegasos: Primal estimated sub-gradient solver for SVM”. In: Mathematical Programming 127.1 (2011), pp. 3–30.

    [43] T. Zhang. “Solving large scale linear prediction problems using stochastic gradient descent algorithms”. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM. 2004, p. 116.

