Efficient distributed algorithms for distance join queries in Spark-based spatial analytics systems

Francisco García García; Antonio Corral Liria; Luis Fernando Iribarne Martínez; Michael Vassilakopoulos

Francisco García-García [1]; Antonio Corral [1]; Luis Iribarne [1]; Michael Vassilakopoulos [2]
1. [1] Universidad de Almería, Almería, Spain
2. [2] University of Thessaly, Dimos Volos, Greece
Published in: Actas de las XXVII Jornadas de Ingeniería del Software y Bases de Datos (JISBD 2023) / coordinated by Amador Durán Toro, 2023
Language: English
Abstract
- Apache Sedona (formerly GeoSpark) is a new in-memory cluster computing system for processing large-scale spatial data, which extends the core of Apache Spark to support spatial data types, spatial partitioning techniques, spatial indexes, and spatial operations (e.g., spatial range, nearest neighbor, and spatial join queries). It is under active development at the Apache Software Foundation and recently graduated to an Apache Top-Level Project. Other Spark-based spatial analytics systems, such as Simba and LocationSpark, have also been proposed in the literature, but they have not been updated for a long time. Distance-based Join Queries (DJQs), such as the k nearest neighbor join query (kNNJQ) and the k closest pairs query (kCPQ), are used in numerous spatial applications (e.g., GIS, location-based systems, and continuous monitoring streaming systems), but they are not supported by Apache Sedona. Therefore, in this paper, we investigate how to design and implement efficient distributed DJQ algorithms in Apache Sedona, using the most appropriate spatial partitioning, spatial indexing, and other optimization techniques (e.g., repartitioning and using less data). The results of an extensive set of experiments with real-world datasets are presented, demonstrating that the proposed kNNJQ and kCPQ distributed algorithms are efficient (in terms of total execution time and memory requirements), scalable (when varying k values, dataset sizes, and the number of executors), and robust in Apache Sedona. Moreover, we have experimentally compared Apache Sedona, LocationSpark, and Simba. Apache Sedona shows the best performance for kCPQ in all cases and for kNNJQ when the joined datasets are medium-sized, whereas LocationSpark is the winner for kNNJQ when the combined datasets are large, and Simba shows the lowest performance in all considered cases. Finally, we conclude that Apache Sedona shows the best performance for kCPQ and competitive results for kNNJQ.
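
For readers unfamiliar with the two queries named in the abstract, the following is a minimal single-machine sketch of what kCPQ and kNNJQ compute, assuming 2D points and Euclidean distance. It is a brute-force illustration of the query semantics only, not the distributed algorithms proposed in the paper and not part of the Apache Sedona API; the function names `k_closest_pairs` and `knn_join` are hypothetical.

```python
import heapq
import math


def k_closest_pairs(P, Q, k):
    """kCPQ semantics: the k pairs (p, q), p in P and q in Q, with the smallest distances.

    Brute force over all |P| * |Q| pairs (illustration only).
    """
    pairs = ((math.dist(p, q), p, q) for p in P for q in Q)
    return heapq.nsmallest(k, pairs)


def knn_join(P, Q, k):
    """kNNJQ semantics: for every point p in P, its k nearest neighbors in Q."""
    return {p: heapq.nsmallest(k, Q, key=lambda q: math.dist(p, q)) for p in P}


# Tiny hypothetical datasets (points as (x, y) tuples).
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (4.0, 5.0), (10.0, 10.0)]

closest = k_closest_pairs(P, Q, 2)  # two closest (distance, p, q) triples overall
joined = knn_join(P, Q, 1)          # nearest point of Q for each point of P
```

Note the asymmetry: kCPQ returns k pairs in total across both datasets, whereas kNNJQ returns k neighbors for each point of the first dataset. That difference is one plausible reason the two queries can behave differently as dataset sizes grow.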