Documat


Abstract of On the convergence of big data analytics and high-performance computing: a novel approach for runtime interoperability

Silvina Caino Lores

  • The information technology ecosystem is currently in transition to a new generation of applications requiring intensive data acquisition, processing and storage. As a result of this shift towards data-intensive computing, there is a growing confluence between high-performance computing (HPC) and Big Data analytics (BDA), given that many HPC applications produce Big Data to be manipulated with analytics techniques, while BDA is a growing consumer of HPC capabilities.

    More precisely, HPC scientific applications are key tools in many research areas that rely on multiple, diverse, and distributed operations over various datasets, usually yielding significant computational complexity and data dependencies. Nowadays, HPC applications are increasingly demanding data analysis and visualisation over major datasets, which is shifting these originally computationally intensive systems towards parallel data-intensive problems. On the other hand, BDA applications are demanding the performance level of the supercomputing ecosystem, thus requiring acceleration and increased scalability. As a result, this general trend is leading to greater confluence between the HPC and BDA paradigms.

    Nevertheless, HPC and BDA systems have traditionally been built to solve different problems: HPC focuses on computationally-intensive, tightly-coupled applications, while BDA tackles large volumes of loosely-coupled tasks. These objectives have determined the underlying architectures of HPC and BDA infrastructures. In a typical HPC infrastructure, compute and data subsystems are fully decoupled, using parallel file systems for data storage but connected through high-speed interconnects, as in grids or clusters. On the other hand, BDA systems co-locate computation and data on the same node and focus on elasticity, which makes clouds their preferred infrastructure.

    The tools and cultures of HPC and BDA have also diverged to solve their canonical problems. However, between both worlds there are different degrees of intermediate HPC-BDA applications that present mixed requirements. These applications could be executed on either platform, but neither is fully ideal in its current state, mainly due to requirements such as scalability, performance, and resource efficiency. In this scenario, upcoming applications will suffer from the lack of an environment able to cope with both their computing and data requirements. Recent works have suggested combining the HPC and BDA approaches to alleviate this issue. For example, typical BDA programming models have been considered as substitutes for MPI parallelism induction mechanisms, following a data-centric approach. This opportunity also affects the underlying computing infrastructures: typical BDA infrastructures like clouds could inspire hybrid platforms for exascale scientific workflows.

    Many challenges remain unsolved, and the situation has been worsened by the appearance of new application domains that are completely hybrid in nature, such as autonomous vehicles, surveillance, e-science with Big Data sources, monitoring of large-scale infrastructures, and smart cities. These domains share the need to support the simulation of very complex models, assimilating voluminous and variable real-time data in order to generate refined models for a better understanding of the domain, to prescribe pattern-based control actions, or to predict future behaviour. In these circumstances, borrowing features from the other paradigm proves insufficient, and deeper convergence becomes necessary to cope with mixed requirements, new infrastructures, and upcoming performance expectations.
    Applying some of these BDA mechanisms can improve scalability in parameter-based HPC applications that rely on a large pool of loosely-coupled tasks. However, other types of applications cannot benefit from this, as they do not fit the prototypical structural model of BDA platforms. For these reasons, there is currently growing agreement on the need for these ecosystems to converge, producing environments that combine the performance of HPC with the usability and flexibility of the BDA stack. Consequently, our research question is: how can we build a platform able to manage applications built for computationally-intensive simulations, data-intensive analysis, or both, without hurting performance and data-awareness?
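The contrast between process-centric and data-centric execution discussed above can be illustrated with a minimal, purely hypothetical Python sketch (not the thesis's implementation): the same reduction is expressed once as explicit rank-based gathering in the style of MPI, and once as a functional map/reduce pipeline in the style of BDA frameworks.

```python
from functools import reduce

# Hypothetical input: partitions of a dataset spread across workers/ranks.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Process-centric style (MPI-like, simulated in-process): each "rank"
# computes a partial sum, then rank 0 explicitly gathers and combines.
def process_centric_sum(parts):
    partials = [sum(p) for p in parts]   # local computation per rank
    gathered = partials                  # stands in for an MPI_Gather call
    return sum(gathered)                 # final reduction at rank 0

# Data-centric style (BDA-like): the runtime owns the data layout and the
# user expresses the computation as operators over the whole collection.
def data_centric_sum(parts):
    mapped = map(sum, parts)                    # map: per-partition sums
    return reduce(lambda a, b: a + b, mapped)   # reduce: global combine

assert process_centric_sum(partitions) == data_centric_sum(partitions) == 45
```

The point of the sketch is where control lives: in the process-centric version the programmer orchestrates communication between ranks, while in the data-centric version the runtime decides how partitions are placed and combined.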

    To answer this question, this thesis explores the key features of BDA and HPC ecosystems, focusing on the platform layer and the core runtimes that support BDA and HPC processing frameworks, which we refer to in this document as data-centric and process-centric runtimes, respectively. We took this knowledge as a baseline to elaborate a theoretical frame for the development of generalist solutions for the runtime convergence of BDA and HPC platforms. The main goal of this thesis was to research new approaches that facilitate the convergence of the HPC and BDA paradigms by providing common abstractions and mechanisms for improving scalability, data locality exploitation, and execution adaptivity on large-scale systems, while preserving the most relevant features for their corresponding communities, in order to provide a system suitable for the composition of applications with BDA and HPC stages.

    In fulfilling this objective, the thesis yields several contributions: a data-centric enablement methodology aimed at reshaping iterative HPC scientific applications to match the data-centric paradigm of BDA platforms; a formal definition of a generic unified distributed data abstraction (UDDA) and its associated unified operational model (UOM), which set the foundation of a theoretical frame for the analysis and definition of composite HPC-BDA applications; a generalist runtime interoperability architecture for HPC-BDA applications; an implementation of this architecture, based on Spark and MPI, which we named Spark-DIY; and an implementation of a real-world use case from the hydrogeology domain, enriched with features enabled by our architecture.
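The UDDA and UOM are defined formally in the thesis itself; purely as an illustration of the kind of interface such an abstraction might expose, the following hypothetical sketch models a partitioned collection with map and reduce operators that either a data-centric or a process-centric runtime could back. All names here (`PartitionedData`, its methods) are illustrative assumptions, not the thesis's actual definitions.

```python
from functools import reduce as _reduce
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class PartitionedData(Generic[T]):
    """Hypothetical sketch of a unified distributed data abstraction:
    a collection split into partitions, exposing operators that a
    backing runtime (data-centric or process-centric) could execute."""

    def __init__(self, partitions: List[List[T]]):
        self.partitions = partitions

    def map(self, fn: Callable[[T], U]) -> "PartitionedData[U]":
        # Element-wise transform; preserves the partitioning so the
        # runtime can keep exploiting data locality.
        return PartitionedData([[fn(x) for x in p] for p in self.partitions])

    def reduce(self, fn: Callable[[T, T], T]) -> T:
        # Per-partition reduction followed by a global combine, mirroring
        # both a BDA shuffle-reduce and an MPI-style Reduce.
        partials = [_reduce(fn, p) for p in self.partitions if p]
        return _reduce(fn, partials)

data = PartitionedData([[1, 2], [3, 4], [5]])
result = data.map(lambda x: x * x).reduce(lambda a, b: a + b)
assert result == 55  # 1 + 4 + 9 + 16 + 25
```

A single interface of this shape is what lets an application stage written against the abstraction be scheduled on either kind of runtime without rewriting the stage itself.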
