Missing information in datasets is a common problem in statistical surveys and machine learning. Although missing data may involve numeric or categorical variables, this thesis deals with the case of categorical variables, because they display significant features that are not always taken into account.
Techniques to tackle the problem have been developed over time, and literature on the issue is widely available, especially on the topic of nonresponse in surveys. In the domain of machine learning and data mining it has more recently become an active area of research. These two approaches - statistical surveys and machine learning - address the problem from different perspectives and with different objectives, which generates discrepancies in the classification of the procedures and in the criteria used to evaluate them.
On the one hand, the traditional statistical inference approach considers the dataset to be a random sample from a probability distribution, and the aim is to estimate some of the parameters that characterize this distribution; nonresponse is then an estimation problem that can be dealt with from different perspectives. On the other hand, a number of techniques within the machine learning area - neural networks, decision trees... - may be used to deal with missing data by substituting the missing values with estimates obtained from the observed values. Generally speaking, prediction procedures are used to impute continuous numeric data and classifiers to impute categorical data; in the latter case, the classes are matched with the categories of the categorical variable to be imputed.
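As a minimal illustration of classifier-based imputation of a categorical variable, the sketch below fills a missing category by majority vote among the most similar complete records. It is not the procedure developed in the thesis: the function name `impute_categorical`, the field names, the toy records and the simple-matching similarity are all assumptions chosen for brevity.

```python
from collections import Counter

def impute_categorical(rows, target, k=3):
    """Return copies of `rows` where missing `target` values (None) are
    replaced by the majority category among the k most similar complete rows."""
    complete = [r for r in rows if r[target] is not None]
    imputed = []
    for row in rows:
        filled = dict(row)  # copy, so the input records are left untouched
        if row[target] is None:
            # Similarity = number of other fields with identical values
            # (simple matching; a hypothetical choice for this sketch).
            def similarity(other, row=row):
                return sum(1 for key in row
                           if key != target and row[key] == other[key])
            neighbours = sorted(complete, key=similarity, reverse=True)[:k]
            votes = Counter(n[target] for n in neighbours)
            # The classes of the classifier are the categories of `target`.
            filled[target] = votes.most_common(1)[0][0]
        imputed.append(filled)
    return imputed

# Hypothetical poll records; the last respondent did not answer "vote".
rows = [
    {"age_group": "young", "region": "north", "vote": "A"},
    {"age_group": "young", "region": "north", "vote": "A"},
    {"age_group": "old", "region": "south", "vote": "B"},
    {"age_group": "old", "region": "south", "vote": "B"},
    {"age_group": "young", "region": "north", "vote": None},
]
imputed = impute_categorical(rows, "vote")
```

Here the categories of `vote` play the role of the classes, and any classifier (a decision tree or a neural network, for instance) could replace the nearest-neighbour vote.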
The starting point for this thesis was the need to improve the results of state-of-the-art procedures for treating missing categorical data in opinion polls. Some of the solutions proposed from the statistical inference perspective were found to rest on working assumptions far from real situations and to be poorly suited to the task, since they merely reuse procedures originally designed for numeric variables. This led us directly to test other methods from the field of machine learning, extending and adapting existing procedures where needed to work with the kind of input data found in polls. The design of a new neural network architecture and learning algorithm, based on a previous model, is the most novel contribution that arose as a result.
Given the lack of a suitable taxonomy, the study starts by proposing a classification intended to cover all techniques for tackling missing data problems in numeric and categorical variables. Three groups of techniques that are not mutually exclusive - deletion procedures, imputation procedures and tolerance procedures - are described together with their main applications, and criteria to evaluate and compare the methods are presented. It is also shown that contradictions seem to exist between the theoretical framework underlying the statistical approach and the practical recommendations to approximate categorical data models for categorical data imputation.
Fuzzy min-max neural networks are a hybrid machine learning procedure. Neuro-fuzzy computing is one of the most popular hybridisations in the artificial intelligence literature because it offers the generic benefits of neural networks - like massive parallelism and robustness - and, at the same time, uses fuzzy logic to model vague or qualitative knowledge and convey uncertainty. However, all input variables in the network are required to be continuous-valued, which constrains its use as a classifier for imputing categorical data in polls, where numeric and categorical data appear together. The problem lies in the lack of a suitable measure of distance among the categories of the input categorical variables. This thesis proposes a definition of such a distance, which allows new fuzzy set membership functions to be defined. A new neural network architecture and new learning algorithms are built from these.
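To make the continuous-input constraint concrete, the sketch below computes the hyperbox membership function commonly given for the original fuzzy min-max classifier, assuming inputs scaled to [0, 1]. It covers only the continuous case: the categorical distance and the extended membership functions proposed in the thesis are not reproduced here, and the names `hyperbox_membership` and `gamma` and the example values are illustrative assumptions.

```python
def hyperbox_membership(a, v, w, gamma=4.0):
    """Membership of point `a` in the hyperbox with min corner `v` and
    max corner `w` (all coordinates assumed to lie in [0, 1]).

    Returns 1.0 inside the box and decays linearly outside it, at a rate
    set by the sensitivity parameter `gamma`.
    """
    n = len(a)
    total = 0.0
    for a_i, v_i, w_i in zip(a, v, w):
        # Penalty for exceeding the max corner along this dimension.
        above = max(0.0, 1.0 - max(0.0, gamma * min(1.0, a_i - w_i)))
        # Penalty for falling below the min corner along this dimension.
        below = max(0.0, 1.0 - max(0.0, gamma * min(1.0, v_i - a_i)))
        total += above + below
    return total / (2 * n)

# Hypothetical two-dimensional hyperbox and two test points.
v, w = (0.2, 0.2), (0.4, 0.6)
inside = hyperbox_membership((0.3, 0.5), v, w)
outside = hyperbox_membership((0.6, 0.5), v, w)
```

Both penalty terms rely on the differences `a_i - w_i` and `v_i - a_i`, which are meaningful only for ordered numeric values; this is why a distance among categories is needed before categorical inputs can be used.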
Subsequent chapters present the specific missing data problem to be solved, in a single categorical variable from an opinion poll, and establish the basis of the experiments conducted to assess the performance of the methods. The imputation results of the extended fuzzy min-max neural networks are compared with those of the method most recommended from the statistical inference approach - multiple imputation by logistic regression - and show a remarkable improvement.
The concluding remarks suggest next steps for advancing missing data imputation in surveys using machine learning procedures, together with lines of further research. Advances in this direction may have a significant impact on the computation of official statistical figures: surveys and studies can take advantage of the possibilities for automation and of the robust estimates of missing data that machine learning methods seem to produce.