Efficient distributed algorithms for distance join queries in Spark-based spatial analytics systems

  • Francisco García-García [1] ; Antonio Corral [1] ; Luis Iribarne [1] ; Michael Vassilakopoulos [2]
    1. [1] Universidad de Almería, Almería, Spain
    2. [2] University of Thessaly, Dimos Volos, Greece

  • Published in: Actas de las XXVII Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2023) / coordinated by Amador Durán Toro, 2023
  • Language: English
  • Full text not available
  • Abstract
    • Apache Sedona (formerly GeoSpark) is a new in-memory cluster computing system for processing large-scale spatial data, which extends the core of Apache Spark to support spatial datatypes, spatial partitioning techniques, spatial indexes, and spatial operations (e.g., spatial range, nearest neighbor, and spatial join queries). It is under active development by the Apache Software Foundation and recently graduated to an Apache Top-Level Project. Other Spark-based spatial analytics systems have also been proposed in the literature, such as Simba and LocationSpark, but they have not been updated for a long time. Distance-based Join Queries (DJQs), such as the k nearest neighbor join query (kNNJQ) and the k closest pairs query (kCPQ), are used in numerous spatial applications (e.g., GIS, location-based systems, and continuous monitoring streaming systems), but they are not supported by Apache Sedona. Therefore, in this paper, we investigate how to design and implement efficient distributed DJQ algorithms in Apache Sedona, using the most appropriate spatial partitioning, spatial indexing, and other optimization techniques (e.g., repartitioning and less data). The results of an extensive set of experiments with real-world datasets are presented, demonstrating that the proposed kNNJQ and kCPQ distributed algorithms are efficient (in terms of total execution time and memory requirements), scalable (with respect to varying k values, dataset sizes, and number of executors), and robust in Apache Sedona. Moreover, we have experimentally compared Apache Sedona, LocationSpark, and Simba: Apache Sedona shows the best performance for kCPQ in all cases and for kNNJQ when the joined datasets are medium-sized, LocationSpark is the winner for kNNJQ when the combined datasets are large, and Simba shows the lowest performance in all considered cases. Finally, we conclude that Apache Sedona shows the best performance for kCPQ and competitive results for kNNJQ.
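
For readers unfamiliar with the two query types named in the abstract, the following is a minimal single-machine, brute-force sketch of what kCPQ and kNNJQ compute. It is not the paper's distributed algorithm and does not use the Apache Sedona API. The function names `k_closest_pairs` and `knn_join`, the tuple representation of points, and the sample data are all assumptions made for illustration, and Euclidean distance is assumed as the metric.

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+


def k_closest_pairs(P, Q, k):
    """kCPQ (illustrative): the k pairs (p, q), p in P and q in Q,
    with the smallest distances, returned as (distance, p, q) tuples."""
    pairs = ((dist(p, q), p, q) for p in P for q in Q)
    return heapq.nsmallest(k, pairs)


def knn_join(P, Q, k):
    """kNNJQ (illustrative): for every point p in P,
    its k nearest neighbours in Q."""
    return {p: heapq.nsmallest(k, Q, key=lambda q: dist(p, q)) for p in P}


# Hypothetical sample data: points as (x, y) tuples.
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (4.0, 5.0), (10.0, 10.0)]

print(k_closest_pairs(P, Q, 2))
print(knn_join(P, Q, 1))
```

Both functions examine every pair of points, so their cost grows with |P| × |Q|. Avoiding that quadratic cost on large datasets is what the spatial partitioning and indexing techniques mentioned in the abstract are meant to address.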

