Ir al contenido

Documat


Resumen de Link prediction in large directed graphs

Dario García Gasulla

  • The first chapter introduces an approach to machine learning (ML) were data is understood as a network of connected entities. This strategy seeks inter-entity information for knowledge discovery, in contrast with traditional intra-entity approaches based on instances and their features. We discuss the importance of this connectivist ML (which we refer to as graph mining) in the current context where large, topology-based data sets have been made available. Chapter ends by introducing the Link Prediction (LP) problem, together with its current computational and performance limitations. The second chapter discusses early contributions to graph mining, and introduces problems frequently tackled through this paradigm. Later the chapter focuses on the state-of-the-art of LP. It presents three different approaches to the problem of finding links in a relational set, and argues about the importance of the most computationally scalable one: similarity-based algorithms. It categorizes similarity-based algorithms in three types of LP scores. For the most scalable type, local similarity-based algorithms, the chapter identifies and formally describes the most competitive proposals according to the bibliography. Chapter three analyses the LP problem, partly as a classic binary classification problem. A list of graph properties such as directionality, weights and time are discussed in the context of LP. Follows a formal time and space complexity analysis of similarity-based scores of LP. The chapter ends with an study of the class imbalance found in LP problems. In chapter four a novel similarity-based score of LP is introduced. The chapter first elaborates on the importance of hierarchies for representing knowledge through directed graphs. Several modifications to the proposed score are also defined. This chapter presents a modified version of the most competitive undirected scores of LP, to adapt them to directed graphs. The evaluation methodologies of LP are analyzed in the fifth chapter. It starts by discussing the problem of evaluating domains with a huge class imbalance, identifying the most appropriate methodologies for it. A modification of the most appropriate evaluation methodology according to the bibliography is presented, with the goal of focusing on relevant predictions. Follows a discussion on the faithful estimation of the precision of predictors. Chapter six describes the graphs used for score evaluation, as well as how data was transformed into a directed graph. Reasons on why these particular domains were chosen are given, making a special case of webgraphs and their well known relation with hierarchies. The most basic properties of each resultant graph are shown. Tests performed are presented in chapter seven. The three most competitive LP scores currently available are tested among themselves, and against a proposed version of those same scores for directed graphs. Our proposed score and its modifications are tested against the scores obtaining the best results in the previous tests. The case of LP in webgraphs is considered separately, testing six different webgraphs. The chapter ends with a discussion on the limitations of this formal analysis, showing examples of predictions obtained. Chapter eight includes the computational aspects of the work done. It starts with a discussion on the importance of memory management for determining the computational cost of LP algorithms. A proposal on how to reduce this cost through precision reduction is presented. Follows a section focused on the parallelization of code, which includes two different implementations on one graph-specific programming model (Pregel) and on one generic programming model (OpenMP). The chapter ends with a specification of the computational resources used for the tests done. The conclusions of this thesis proposal are presented in nine. Chapter ten contains several future lines of work.


Fundación Dialnet

Mi Documat