A systematic review of data quality issues in knowledge discovery tasks

David Camilo Corrales Muñoz; Agapito Ismael Ledezma Espino; Juan Carlos Corrales

Corrales, David Camilo [2]; Ledezma, Agapito Ismael [1]; Corrales, Juan Carlos
1. [1] Universidad Carlos III de Madrid, Madrid, Spain
2. [2] Universidad del Cauca - Universidad Carlos III de Madrid
Published in: Revista de Ingenierías: Universidad de Medellín, ISSN 1692-3324, Vol. 15, No. 28, 2016, pp. 125-149
Language: Spanish
DOI: 10.22395/rium.v15n28a7
Abstract
- Data volumes are growing because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to an agricultural disease, coffee rust.