Documat


Summary of Towards data wrangling automation through dynamically-selected background knowledge

Lidia Contreras Ochando

  • Data science is essential for the extraction of value from data. However, the most tedious part of the process, data wrangling, involves a range of mostly manual formatting, identification and cleansing manipulations. Data wrangling still resists automation, partly because the problem depends strongly on domain information, which becomes a bottleneck for state-of-the-art systems as the diversity of domains, formats and structures of the data increases.

    In this thesis we focus on developing algorithms that take advantage of domain knowledge to automate parts of the data wrangling process. We illustrate how general program induction techniques, instead of domain-specific languages, can be applied flexibly to problems where knowledge is important, through the dynamic use of domain-specific knowledge. More generally, we argue that a combination of knowledge-based and dynamic learning approaches leads to successful solutions. We propose several strategies to automatically select or construct the appropriate background knowledge for several data wrangling scenarios. The key idea is to choose the best specialised background primitives according to the context of the particular problem to solve.
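The idea of choosing primitives according to the context of the problem can be illustrated with a minimal Python sketch. Everything named here (the meta-features, the `DOMAIN_PRIMITIVES` table, the thresholds and the primitive names) is a hypothetical illustration and is not taken from the thesis; the actual systems may select background knowledge quite differently.

```python
import re

# Hypothetical background knowledge: primitive names grouped by domain.
DOMAIN_PRIMITIVES = {
    "date": ["get_day", "get_month", "get_year", "month_name_to_number"],
    "phone": ["strip_non_digits", "get_country_prefix", "get_last_digits"],
    "name": ["get_first_word", "get_last_word", "get_initial", "to_title_case"],
}
# Hypothetical general-purpose primitives that are always available.
GENERIC_PRIMITIVES = ["to_lower", "to_upper", "concat", "split_on_space"]


def meta_features(examples):
    """Compute a few simple, illustrative meta-features over the input strings."""
    text = "".join(examples)
    n = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        "alpha_ratio": sum(c.isalpha() for c in text) / n,
        "has_date_sep": any(re.search(r"\d[/\-.]\d", e) for e in examples),
        "has_plus": any(e.strip().startswith("+") for e in examples),
    }


def guess_domain(examples):
    """Map meta-features to a domain with hand-written (assumed) rules."""
    f = meta_features(examples)
    if f["has_plus"] or (f["digit_ratio"] > 0.8 and not f["has_date_sep"]):
        return "phone"
    if f["has_date_sep"] or (f["digit_ratio"] > 0.3 and f["alpha_ratio"] > 0.2):
        return "date"
    if f["alpha_ratio"] > 0.7:
        return "name"
    return None


def select_primitives(examples):
    """Return the generic primitives plus those of the guessed domain, if any."""
    domain = guess_domain(examples)
    return GENERIC_PRIMITIVES + DOMAIN_PRIMITIVES.get(domain, [])
```

With these assumed rules, `guess_domain(["25/03/1983", "1-4-2001"])` returns `"date"`, so the search for a transformation would only consider the generic and date-related primitives rather than the whole library.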

    We address two scenarios. In the first one, we handle personal data (names, dates, telephone numbers, etc.) that are presented in very different string formats and have to be transformed into a unified format. The problem is how to build a compositional transformation from a large set of primitives in the domain (e.g., handling months, years, days of the week, etc.). We develop a system (BK-ADAPT) that guides the search through the background knowledge by extracting several meta-features from the examples characterising the column domain. In the second scenario, we face the transformation of data matrices in generic programming languages such as R, using an input matrix and some cells of the output matrix as examples. We also develop a system guided by a tree-based search (AUTOMAT[R]IX) that uses several constraints, prior primitive probabilities and textual hints to efficiently learn the transformations.
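The second scenario, learning a matrix transformation from an input matrix and a few known output cells, can be sketched in the same hedged spirit. The code below is an assumption-laden toy written in Python rather than R: the three primitives, their prior probabilities and the depth-bounded enumeration are all invented for illustration, and the textual hints and richer constraints mentioned above are left out.

```python
from itertools import product
from math import prod


def transpose(m):
    return [list(row) for row in zip(*m)]


def flip_rows(m):
    return m[::-1]


def flip_cols(m):
    return [row[::-1] for row in m]


# Hypothetical primitives, each with a made-up prior probability.
PRIMITIVES = {
    "transpose": (transpose, 0.5),
    "flip_rows": (flip_rows, 0.3),
    "flip_cols": (flip_cols, 0.2),
}


def matches(matrix, known_cells):
    """Check a candidate output against the few known output cells."""
    for (i, j), value in known_cells.items():
        if i >= len(matrix) or j >= len(matrix[i]) or matrix[i][j] != value:
            return False
    return True


def search(input_matrix, known_cells, max_depth=3):
    """Enumerate compositions by depth, trying the most probable ones first."""
    for depth in range(1, max_depth + 1):
        candidates = sorted(
            product(PRIMITIVES, repeat=depth),
            key=lambda names: prod(PRIMITIVES[n][1] for n in names),
            reverse=True,
        )
        for names in candidates:
            result = input_matrix
            for name in names:
                result = PRIMITIVES[name][0](result)
            if matches(result, known_cells):
                return list(names)
    return None
```

For the input `[[1, 2, 3], [4, 5, 6]]` and the known output cells `{(0, 0): 1, (0, 1): 4}`, `search` returns `["transpose"]`. Trying shallow, high-prior compositions first is only one possible ordering; it is used here to show how priors can cut down the number of candidates examined.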

    With these systems, we show that the combination of inductive programming with the dynamic selection of the appropriate primitives from the background knowledge is able to improve the results of other state-of-the-art (and more specific) data wrangling approaches.

