Due to recent scientific and technological advances in information systems, it is now possible to continuously record data at high speeds on a wide range of devices. The need to make sense of such massive amounts of data creates an opportunity for new data stream classification techniques that model and predict the behavior of streaming data.
When learning from data streams, the problem of concept drift means that the underlying data distributions can change over time. This has a strong impact on classification techniques, as predictive models become invalid and have to be updated. Furthermore, these changes in concept are usually a consequence of changes in context, and this relationship could be exploited to handle concept drift.
Concept recurrence is a particular case of concept drift, in which concepts that have drifted can suddenly reoccur. In this situation it may be possible to avoid relearning these previously observed concepts. However, the few existing approaches that take advantage of concept recurrence are designed neither to take context into consideration nor to account for the resources required to store representations of past concepts. Both issues are of particular significance for ubiquitous data stream mining, where the learning process is executed in dynamically changing environments using resource-constrained devices.
Moreover, most existing techniques assume that the underlying data stream feature space is static. However, in many real-world applications the set of features and their relevance to the target concept may change over time. Despite its importance, this issue has received little attention, particularly on how it can be efficiently addressed when tracking recurring concepts.
Sharing knowledge among ubiquitous devices to collaboratively improve the modeling of local concepts is another interesting idea which has not been properly explored. This could improve the accuracy of the local model as it would benefit from patterns similar to the local concept that were observed in other ubiquitous devices, but not yet locally.
In addition, the deployment of data stream classification as an autonomous and adaptive service to support the data analysis requirements of ubiquitous applications remains an open and under-researched issue in the field of ubiquitous data stream mining.
This PhD thesis addresses the aforementioned open issues, focusing on learning anytime, anywhere classification models from data streams in ubiquitous environments, where the underlying concepts may change over time, with special emphasis on recurring concepts. Four main contributions are presented:
- The MReC (Mining Recurring Concepts) approach that integrates context with previously learned concepts to improve the adaptation to recurring concepts. Moreover, to deal with situations of resource constraints, an intelligent strategy to discard models is also proposed.
- The MReC-DFS (Mining Recurring Concepts in a Dynamic Feature Space) approach, which extends MReC to address the challenges of a dynamic feature space while simultaneously reducing the memory cost of storing past models. In addition, a novel incremental feature selection method is proposed that dynamically determines the threshold used to select the most relevant features for a given concept.
- A Collaborative Data Stream Mining (Coll-Stream) approach that explores the knowledge available in the community to improve local classification accuracy. Coll-Stream integrates community knowledge using an ensemble method where the classifiers are selected and weighted based on their local accuracy for different partitions of the instance space.
- A UDSM (Ubiquitous Data Stream Mining) Service to support the data analysis requirements of ubiquitous applications. As the basis for our service we describe a general mechanism, which autonomously adapts the execution of the data stream classification process to each situation, using context and resource awareness.
Finally, the experimental validation of the proposed contributions using synthetic and real datasets allows us to achieve the objectives and answer the research questions proposed for this dissertation.
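
As an illustration of the ensemble idea behind the Coll-Stream contribution listed above, the following minimal Python sketch weights each community classifier by its locally observed accuracy in the region of the instance space where the incoming instance falls. It is not the thesis implementation: the names `region_of` and `RegionWeightedEnsemble`, the fixed-width binning of the first feature, and the Laplace smoothing of the accuracy estimate are all assumptions made for this example.

```python
from collections import defaultdict


def region_of(x, width=1.0):
    """Hypothetical partitioning: bucket the first feature into fixed-width bins."""
    return int(x[0] // width)


class RegionWeightedEnsemble:
    """Weighted vote over community models, using per-region local accuracy."""

    def __init__(self, models):
        # Each model is assumed to be a callable mapping an instance to a label.
        self.models = models
        # For each model and region: [correct predictions, instances seen].
        self.stats = [defaultdict(lambda: [0, 0]) for _ in models]

    def accuracy(self, i, region):
        correct, seen = self.stats[i][region]
        # Laplace smoothing (an assumption): unseen regions default to 0.5.
        return (correct + 1) / (seen + 2)

    def predict(self, x):
        region = region_of(x)
        votes = defaultdict(float)
        for i, model in enumerate(self.models):
            votes[model(x)] += self.accuracy(i, region)
        return max(votes, key=votes.get)

    def update(self, x, y):
        """Record how each model performed on a locally labeled instance."""
        region = region_of(x)
        for i, model in enumerate(self.models):
            entry = self.stats[i][region]
            entry[1] += 1
            if model(x) == y:
                entry[0] += 1
```

In use, `update` would be called whenever a locally labeled instance becomes available, so that a model received from another device gains influence only in the regions where it has agreed with local data.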