Framework for data quality in knowledge discovery tasks

  • Author: David Camilo Corrales Muñoz
  • Thesis supervisor: Agapito Ismael Ledezma Espino
  • Defence: Universidad Carlos III de Madrid (Spain), 2018
  • Language: English
  • Thesis committee: Fernando Fernández Rebollo (chair), Gustavo Adolfo Ramírez González (secretary), Juan Pedro Caraça-Valente Hernández (member)
  • Abstract
    • The creation and consumption of data continue to grow by leaps and bounds. Owing to advances in Information and Communication Technologies (ICT), the data explosion in the digital universe is a new trend. Knowledge Discovery in Databases (KDD) gains importance because of this abundance of data, and a successful knowledge discovery process requires careful data treatment. Experts estimate that the preprocessing phase takes 50% to 70% of the total time of the knowledge discovery process.

      Software tools based on knowledge discovery methodologies offer algorithms for data preprocessing. According to the Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms, KNIME, RapidMiner, SAS, Alteryx, and H2O.ai are the leading tools for knowledge discovery. These tools provide different techniques and facilitate the evaluation of data analysis; however, they offer no guidance as to which techniques can or should be used in which contexts. Consequently, choosing suitable data cleaning techniques is a headache for inexpert users, who have no idea which methods can be confidently applied and often resort to trial and error.

      This thesis addresses the data quality issues in knowledge discovery (KD) tasks (classification and regression) through three contributions: (i) a conceptual framework that provides the user with guidance to address data problems, (ii) an ontology that represents the knowledge about data cleaning, and (iii) a case-based reasoning system that recommends suitable data cleaning algorithms. Each contribution is aligned with one of the specific objectives:

      Objective 1: Define a conceptual framework to guide the user through data quality issues in knowledge discovery tasks (classification and regression).

      The conceptual framework provides the user with guidance to address data quality issues in knowledge discovery tasks. To build the conceptual framework, we followed these phases:

      • Mapping the selected data sources: we identified the data quality issues present in classification and regression tasks. We reviewed four relevant methodologies: Knowledge Discovery in Databases, Cross Industry Standard Process for Data Mining (CRISP-DM), Sample, Explore, Modify, Model and Assess (SEMMA), and the Data Science Process. We also drew on a taxonomy of data quality challenges in empirical software engineering (ESE) based on a literature review. Noise, missing values, outliers, high dimensionality, inconsistency, redundancy, amount of data, heterogeneity, and timeliness were the data quality issues found in the knowledge discovery methodologies and the ESE taxonomy.

      • Understanding the selected data: in this phase, we explained the data quality issues found in the knowledge discovery methodologies and the ESE taxonomy.

      • Identifying and categorizing components: we organized and filtered the data quality issues according to their meaning:

      o Inconsistency, redundancy, and timeliness were renamed mislabelled class, duplicate instances, and data obsolescence, respectively.

      o We considered the following kinds of noise: missing values, outliers, high dimensionality, imbalanced class, mislabelled class, and duplicate instances.

      o Amount of data, heterogeneity, and data obsolescence are issues of the data collection process. These data quality issues were grouped into a new category called Provenance.

      • Integrating components: we defined the data cleaning tasks that address the data quality issues and then proposed the conceptual framework as the integration of those tasks (a minimal sketch of this issue-to-task mapping appears after this list).

      • Validation: the conceptual framework (CF) was evaluated on 48 datasets (28 for classification and 20 for regression) from the UCI Repository of Machine Learning Databases. The datasets cleaned with our conceptual framework were used to train the same algorithms proposed by the authors of the UCI datasets. For the classification datasets, 85.71% of the models (trained on datasets cleaned with the CF) achieved higher precision and AUC than the models proposed by the dataset authors. For the regression datasets, 90% of the models reached a lower Mean Absolute Error (MAE) than the models proposed by the dataset authors. Concerning the mini-challenges, in 4 of 6 classification mini-challenges the classifiers trained on datasets cleaned with the CF achieved the highest accuracy and AUC, while in 2 of 3 regression mini-challenges the models trained on datasets cleaned with the CF reached the lowest MAE.
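
      As an illustration of the integration step, the issue-to-task mapping encoded by the conceptual framework can be sketched as a simple lookup. The Java sketch below is only illustrative; the enum and class names are ours, not the identifiers used in the thesis.

        import java.util.EnumMap;
        import java.util.Map;

        // Minimal sketch of the issue-to-task mapping described above;
        // names are illustrative, not the thesis' exact identifiers.
        public class CleaningTaskCatalog {

            enum DataQualityIssue {
                MISSING_VALUES, OUTLIERS, HIGH_DIMENSIONALITY,
                IMBALANCED_CLASS, MISLABELLED_CLASS, DUPLICATE_INSTANCES
            }

            enum DataCleaningTask {
                IMPUTATION, OUTLIERS_DETECTION, DIMENSIONALITY_REDUCTION,
                CLASSES_BALANCING, LABEL_CORRECTION, REMOVE_DUPLICATE_INSTANCES
            }

            private static final Map<DataQualityIssue, DataCleaningTask> CATALOG =
                    new EnumMap<>(DataQualityIssue.class);

            static {
                CATALOG.put(DataQualityIssue.MISSING_VALUES, DataCleaningTask.IMPUTATION);
                CATALOG.put(DataQualityIssue.OUTLIERS, DataCleaningTask.OUTLIERS_DETECTION);
                CATALOG.put(DataQualityIssue.HIGH_DIMENSIONALITY, DataCleaningTask.DIMENSIONALITY_REDUCTION);
                CATALOG.put(DataQualityIssue.IMBALANCED_CLASS, DataCleaningTask.CLASSES_BALANCING);
                CATALOG.put(DataQualityIssue.MISLABELLED_CLASS, DataCleaningTask.LABEL_CORRECTION);
                CATALOG.put(DataQualityIssue.DUPLICATE_INSTANCES, DataCleaningTask.REMOVE_DUPLICATE_INSTANCES);
            }

            /** Returns the data cleaning task that addresses the given issue. */
            public static DataCleaningTask taskFor(DataQualityIssue issue) {
                return CATALOG.get(issue);
            }
        }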

      Objective 2: Establish strategies that advise the user on suitable data cleaning algorithms for solving data quality issues.

      The Data Cleaning Ontology (DCO) represents the knowledge about data quality issues in classification and regression tasks and about the data cleaning tasks that address them. We used METHONTOLOGY as the methodology to create DCO and followed five phases to build it:

      • Build glossary of terms: in this phase, we identified the set of terms included in the Data cleaning ontology, such as Dataset, Attribute, Data quality issue, Data cleaning task, Classes balancing, Dimensionality reduction, Imputation, Label correction, Outliers detection, Remove duplicate instances, Model, and Performance.

      • Build concept taxonomies: we presented seven taxonomies for the classes Attribute, Data cleaning task, Imputation, Outliers Detection, Classes balancing, Label correction, and Dimensionality Reduction.

      • Build ad hoc binary relation diagrams: in this phase, we defined the relations between DCO classes:

      o A Dataset has Data Quality Issue.

      o A Data Quality Issue is resolved with Data cleaning task.

      o A Dataset uses Data cleaning tasks.

      o An Attribute is part of a Dataset.

      o An Attribute has Data Quality Issue.

      o A Model is built with a Dataset.

      o A Model has Performance.

      • Build concept dictionary: this phase described the instances and features of the DCO classes. We presented three subsections: the first described the Dataset and Data quality issue classes, the second the Data cleaning task class, and the third the Model and Performance classes.

      • Describe rules: the rules were written in the Semantic Web Rule Language (SWRL). We built 19 rules: 4 to detect data quality issues and 15 to select the available algorithms of the data cleaning approaches (an illustrative sketch follows this list).
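
      To illustrate how the DCO classes, relations, and rules fit together, a detection rule could take a form such as Dataset(?d) ∧ hasAttribute(?d, ?a) ∧ hasMissingValues(?a, true) → hasDataQualityIssue(?d, MissingValues); these names are illustrative, not the published rule bodies. Likewise, the Java sketch below only assumes a Jena-based controller with a hypothetical ontology file, namespace, and property names:

        import org.apache.jena.ontology.Individual;
        import org.apache.jena.ontology.OntClass;
        import org.apache.jena.ontology.OntModel;
        import org.apache.jena.ontology.OntModelSpec;
        import org.apache.jena.rdf.model.ModelFactory;
        import org.apache.jena.rdf.model.Property;
        import org.apache.jena.rdf.model.Resource;
        import org.apache.jena.rdf.model.Statement;
        import org.apache.jena.rdf.model.StmtIterator;

        // Illustrative sketch: the ontology file, namespace, and property names
        // are assumptions, not the exact identifiers published with DCO.
        public class DcoSketch {
            public static void main(String[] args) {
                String ns = "http://example.org/dco#";   // hypothetical namespace

                // OWL model backed by Jena's built-in rule reasoner (inference engine)
                OntModel dco = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF);
                dco.read("dco.owl");                     // hypothetical file name

                // Describe a dataset individual and attach a detected data quality issue
                OntClass datasetClass = dco.createClass(ns + "Dataset");
                Individual dataset = datasetClass.createIndividual(ns + "userDataset");
                Resource issue = dco.createResource(ns + "MissingValues");
                dataset.addProperty(dco.createProperty(ns + "hasDataQualityIssue"), issue);

                // Ask which data cleaning task resolves the detected issue
                Property resolvedWith = dco.createProperty(ns + "isResolvedWith");
                StmtIterator it = issue.listProperties(resolvedWith);
                while (it.hasNext()) {
                    Statement s = it.nextStatement();
                    System.out.println("Suggested data cleaning task: " + s.getObject());
                }
            }
        }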

      Objective 3: Build a mechanism that gathers data cleaning algorithms to solve the data quality issues identified by the framework.

      We built a case-based reasoning (CBR) system for data cleaning. The aim of our CBR system is to recommend data cleaning algorithms to the inexpert data analyst in order to prepare the dataset for classification and regression tasks. The CBR system is composed of the following stages:

      • Case-base construction: a case is composed of a problem space and a solution space. We represented the problem space by the meta-features of the dataset, its attributes, and the target variable. The solution space contains the data cleaning algorithms used for each dataset. We represent the cases through DCO.

      • Case retrieval: the case retrieval mechanism is composed of a filter phase and a similarity phase. In the first phase, we defined two filter approaches based on clustering and quartile analysis; these filters retrieve a reduced number of relevant cases. The second phase ranks the cases retrieved by the filter approaches, scoring the similarity between the new case and each retrieved case (a minimal sketch follows this list).

      • Case reuse: if the problem space of the new case is similar to that of the retrieved case, the old data cleaning solution is copied. If the problem space of the new case differs from the retrieved case, DCO recommends data cleaning algorithms similar to the algorithm proposed in the solution space of the retrieved case.

      • Case retain: to retain a case, we proposed verifying the case quality through human experts, supported by three data quality dimensions: Accuracy, Completeness, and Validity.
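
      Since this summary does not specify the exact similarity measure, the Java sketch below only illustrates the filter-then-rank idea: cases are first filtered by dataset size (a crude stand-in for the clustering and quartile filters) and the surviving cases are ranked by a normalized distance over a few meta-features. The meta-features, threshold, and similarity function are assumptions, not the thesis' actual measures.

        import java.util.Comparator;
        import java.util.List;
        import java.util.stream.Collectors;

        // Illustrative sketch of the filter + similarity retrieval idea; the
        // meta-features, the filter, and the similarity function are assumptions,
        // not the exact measures used by the thesis' CBR system.
        public class CaseRetrievalSketch {

            /** Problem space of a case: a few dataset meta-features (illustrative). */
            record Case(String id, double numInstances, double numAttributes, double missingRate) {}

            /** Similarity in [0, 1] derived from a roughly normalized Euclidean distance. */
            static double similarity(Case a, Case b) {
                double d = Math.sqrt(
                        Math.pow((a.numInstances() - b.numInstances()) / 1_000_000.0, 2)
                      + Math.pow((a.numAttributes() - b.numAttributes()) / 10_000.0, 2)
                      + Math.pow(a.missingRate() - b.missingRate(), 2));
                return 1.0 / (1.0 + d);
            }

            /** Filter phase (a crude size-based cut standing in for the clustering and
             *  quartile filters), followed by similarity ranking of the surviving cases. */
            static List<Case> retrieve(Case query, List<Case> caseBase, int topK) {
                return caseBase.stream()
                        .filter(c -> Math.abs(c.numInstances() - query.numInstances())
                                <= 0.5 * Math.max(query.numInstances(), 1.0))
                        .sorted(Comparator.comparingDouble((Case c) -> similarity(query, c)).reversed())
                        .limit(topK)
                        .collect(Collectors.toList());
            }
        }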

      We evaluated the retrieval mechanism through a panel of judges who scored the similarity between a query case and all cases in the case-base. The retrieval mechanism reached an average precision against the judges' ranking of 94.5% in the top 3, 84.55% in the top 7, and 78.35% in the top 10.

      Objective 4: Develop and evaluate experimentally a prototype that tests the mechanisms and strategies of the framework for data quality in knowledge discovery tasks.

      We developed a prototype called Hygeia data, which implements the conceptual framework that guides the user in addressing data problems, the DCO that represents the data cleaning knowledge, and the CBR system that recommends suitable data cleaning algorithms. The tool guides the user through the data cleaning process and recommends the data cleaning algorithms suitable for the user's dataset. The system architecture of the Hygeia data tool is represented by a logical view, which organizes the software classes into packages and three layers.

      Application layer. The application layer provides the functionalities to a Hygeia user. This layer is composed of a single package:

      • Graphical user interface, which contains the software classes and forms that provide the visual representation. It enables the user to interact with the Hygeia tool functionalities through graphical elements such as text, windows, icons, buttons, text fields, and combo boxes. We developed the forms with the Swing API in NetBeans IDE 8.2.

      Mediation layer. The mediation layer contains software classes named controllers, which implement the logic of the Ontology, CBR, and Conceptual framework packages; the controllers also take user requests and pass them on to the foundation layer.

      • Ontology contains a set of software classes that map the structure of the ontology. The mapped classes allow communication between the Data cleaning ontology and the CBR controllers.

      • CBR implements the Retrieval, Reuse, and Retain modules through software classes. Additionally, this package sends the case retrieved from the case-base to the Graphical user interface.

      • Conceptual framework is composed of a set of software classes that guide the user through the data cleaning process; this package also requests the parameters of the data cleaning methods from the Graphical user interface and sends the results of those methods back to it.

      Foundation layer. The foundation layer comprises the supporting software used in the Hygeia data tool.

      • Apache Jena 3.6.0 is a Java framework that includes functionality for RDF, RDFS, OWL, and SPARQL, as well as an inference engine. Apache Jena enables the communication between the Data cleaning ontology and the Ontology controllers.

      • MongoDB 3.6.1 is a NoSQL database that stores data in JSON documents. We used MongoDB as a backup of the case-base; the discarded cases are also stored in MongoDB. The case-base is located at: http://artemisa.unicauca.edu.co/~dcorrales/case-base/cb_v.0.6.tar.

      • OpenCSV 4.0 is a CSV parser library for Java. It was used for preprocessing the new datasets in the Conceptual framework package.

      • Commons-lang 3.3.6 and Commons-io 2.6 provide Java utilities, mainly for string manipulation, numerical methods, creation and serialization, and system properties.

      • Rserve 1.7.3 acts as a socket server (TCP/IP or local sockets) that responds to requests from the Conceptual framework controllers. It listens for incoming connections and processes the requests. In other words, Rserve allows R code to be embedded within the Conceptual framework controllers (a minimal sketch follows this list).

      • Rengine is the engine of the R statistical program. The data cleaning algorithms and charts belong to R packages, which are collections of functions developed by the R community. We used R version 3.4.2 with the missForest and mice packages for the imputation task, the Rlof and fpc packages for the outliers detection task, the UBL and smotefamily packages for class balancing, and the FSelector package for the dimensionality reduction task. For removing duplicate instances and for label correction, we used base R functions.
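
      As an example of how a Conceptual framework controller can embed R through Rserve, the Java sketch below runs missForest imputation on a CSV file. The class and file names are hypothetical, and an Rserve instance must already be listening.

        import org.rosuda.REngine.REXPMismatchException;
        import org.rosuda.REngine.REngineException;
        import org.rosuda.REngine.Rserve.RConnection;

        // Illustrative sketch: the file names are hypothetical, and an Rserve
        // instance (R> library(Rserve); Rserve()) must already be running.
        public class ImputationControllerSketch {
            public static void main(String[] args) throws REngineException, REXPMismatchException {
                RConnection r = new RConnection();       // TCP/IP connection to Rserve
                try {
                    r.voidEval("library(missForest)");
                    r.voidEval("data <- read.csv('dataset.csv')");              // hypothetical input file
                    r.voidEval("imputed <- missForest(data)$ximp");             // random-forest imputation
                    r.voidEval("write.csv(imputed, 'dataset_imputed.csv', row.names = FALSE)");
                    int rows = r.eval("nrow(imputed)").asInteger();
                    System.out.println("Imputed dataset rows: " + rows);
                } finally {
                    r.close();
                }
            }
        }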

      The results of this PhD thesis were published as articles in several scientific journals:

      • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2018). “From theory to practice: a data quality framework for classification tasks”. Symmetry, 10(7) (JCR: Q2).

      • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2018). “How to Address the Data Quality Issues in Regression Models: A Guided Process for Data Cleaning”. Symmetry, 10(4) (JCR: Q2).

      • Corrales, D. C., Lasso, E., Ledezma, A., & Corrales, J. C. (2018). “Feature selection for classification tasks: Expert knowledge or traditional methods?”. Journal of Intelligent & Fuzzy Systems. In press (JCR: Q3).

      • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2016). “A systematic review of data quality issues in knowledge discovery tasks”. Revista Ingenierías Universidad de Medellín, 15(28), 125-150.

      • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). “A conceptual framework for data quality in knowledge discovery tasks (FDQ-KDT): A Proposal”. Journal of Computers, 10(6), 396-405 (SJR: Q3).

      We also published other papers related to data preprocessing in application domains such as coffee rust, water quality, and network intrusion detection:

      • Corrales, D. C., Lasso, E., Figueroa, A., Ledezma, A., & Corrales, J. C. (2018). “Estimation of coffee rust infection and growth through two-level classifier ensembles based on expert knowledge”. International Journal of Business Intelligence and Data Mining (IJBIDM), 13(4), 369-387.

      • Castillo, E., Corrales, D. C., Lasso, E., Ledezma, A., & Corrales, J. C. (2017). “Water quality detection based on a data mining process on the California estuary”. International Journal of Business Intelligence and Data Mining, 12(4), 406-424.

      • Corrales, D. C., Gutierrez, G., Rodriguez, J. P., Ledezma, A., & Corrales, J. C. (2017). “Lack of Data: Is It Enough Estimating the Coffee Rust with Meteorological Time Series?”. In International Conference on Computational Science and Its Applications (pp. 3-16). Springer, Cham.

      • Corrales, D. C., Corrales, J. C., Sanchis, A., & Ledezma, A. (2016). “Sequential classifiers for network intrusion detection based on data selection process”. In IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 001827-001832). IEEE.

      • Castillo, E., Corrales, D. C., Lasso, E., Ledezma, A., & Corrales, J. C. (2016). “Data Processing for a Water Quality Detection System on Colombian Rio Piedras Basin”. In International Conference on Computational Science and Its Applications (pp. 665-683). Springer, Cham.

      • Corrales, D. C., Figueroa, A., Ledezma, A., & Corrales, J. C. (2015). “An empirical multi-classifier for coffee rust detection in Colombian crops”. In International Conference on Computational Science and Its Applications (pp. 60-74). Springer, Cham.

      Finally, from this PhD thesis we can conclude the following:

      The conceptual framework is a useful data cleaning process for classification and regression tasks. We validated it with datasets from the UCI Repository of Machine Learning Databases: we cleaned the datasets following the conceptual framework and applying the data cleaning algorithms, repeating these algorithms until the results were equal to or better than those reported for the UCI datasets. The datasets cleaned with our conceptual framework were used to train the same algorithms proposed by the authors of the UCI datasets. In this sense, 85.71% of the classification models achieved higher precision and AUC than the models proposed by the dataset authors, while 90% of the regression models reached a lower Mean Absolute Error. In summary, 87.85% of the models (classification and regression) generated from the datasets cleaned with the conceptual framework (without knowledge of the dataset domain) reached good performance compared with the models proposed by the dataset authors.

      However, this validation of the CF is not sufficient, because the dataset authors omit details about the data preparation process, such as the creation and modification of attributes from the original ones, the model validation technique (cross-validation, test set, etc.), or the experimental configuration of the models. In addition, the original dataset and the dataset cleaned with the CF are different. Thus, we proposed mini-challenges to enrich the validation process. The CF achieved the highest accuracy and AUC in 4 of 6 classification mini-challenges, while for regression tasks the CF reached the lowest Mean Absolute Error in 2 of 3 mini-challenges. In conclusion, the conceptual framework gains importance when the user has no knowledge of the dataset domain. Compared with the data preparation effort and prior domain knowledge of the dataset authors, the conceptual framework offers a general data cleaning solution tested on 56 datasets from the UCI Repository.

      However, the user must know the data cleaning algorithms in order to apply the suitable method. To solve this problem, we proposed a case-based reasoning (CBR) system that recommends suitable data cleaning algorithms to inexperienced users of the conceptual framework. As retrieval is the main phase in a CBR system, we focused on the validation of the case retrieval mechanism. It was evaluated through a panel of judges using three queries for each knowledge discovery task (classification and regression). The first query (Q1) corresponds to a case contained in the case-base, the second query (Q2) is a modified case from the case-base, and the third query (Q3) is a new case. For classification tasks and all queries, the retrieval mechanism reached a position precision against the judges' ranking of 100% at top 1 (P-P@1) and top 2 (P-P@2), and 50% at top 3 (P-P@3). For regression tasks, the retrieval mechanism achieved a position precision of 100% at top 1 (P-P@1), top 2 (P-P@2), and top 3 (P-P@3). In other words, we can guarantee the retrieval of the two most similar cases for all queries.

      To support the CBR system, we proposed the Data cleaning ontology (DCO). The knowledge acquired in the construction and application of the conceptual framework (data quality issues found in datasets, data cleaning tasks, approaches, and algorithms used) was conceptualized in the Data cleaning ontology for case representation. This considerably reduces the knowledge acquisition bottleneck of data quality in knowledge discovery tasks. Also, representing cases through the Data cleaning ontology allows integration with ontologies of specific domains to support some data quality issues, such as the selection of relevant attributes based on expert knowledge.

      Finally, our proposal can be improved through domain knowledge. For example, in the dimensionality reduction task, domain knowledge could support the construction of new attributes based on the original ones; such new attributes can be relevant for building a model. In the outliers detection task, domain knowledge allows the range of allowed values to be defined for each attribute.

