Abstract
In this paper we present a detailed description of a high-performance distributed-memory implementation of balancing domain decomposition preconditioning techniques. This coverage provides a pool of implementation hints and considerations that can be very useful for scientists that are willing to tackle large-scale distributed-memory machines using these methods. On the other hand, the paper includes a comprehensive performance and scalability study of the resulting codes when they are applied for the solution of the Poisson problem on a large-scale multicore-based distributed-memory machine with up to 4096 cores. Well-known theoretical results guarantee the optimality (algorithmic scalability) of these preconditioning techniques for weak scaling scenarios, as they are able to keep the condition number of the preconditioned operator bounded by a constant with fixed load per core and increasing number of cores. The experimental study presented in the paper complements this mathematical analysis and answers how far can these methods go in the number of cores and the scale of the problem to still be within reasonable ranges of efficiency on current distributed-memory machines. Besides, for those scenarios where poor scalability is expected, the study precisely identifies, quantifies and justifies which are the main sources of inefficiency.
Similar content being viewed by others
Notes
The spaces \(\widehat{V}\) and V i can also be understood in a functional setting as the global and local spaces of discrete harmonic functions (see [6]).
In a functional setting, functions in \(\widehat{V}\) are uni-valued on Γ. On the contrary, since every node p in Γ h is replicated n(p) times in \(\varPi_{i=1}^{n_{\mathrm{sbd}}}\varGamma_{i}\), functions in V can take different values at different subdomains. As in [42], \(\widehat{\left . \cdot \right . }\) is used to denote uni-valued functions on Γ.
It is based on the observation that SI 0 v 0 for v 0∈H 0 can readily be obtained as a linear combination of Sϕ i quantities, that have already been computed when assembling S 0 at the preconditioner set-up. This observation, combined with a slight modification of the PCG recurrence described in [2], spares one Schur complement-vector product per PCG iteration.
This definition of {G a :a=1,…,n cts } generates a unique partition of Γ h .
BDD methods can certainly be reformulated as preconditioners for the global linear system (1). This enables approximate solvers (e.g., AMG [35, 40, 43]) to be used in conjunction with BDD methods. Although this approach relaxes the arithmetic/memory demands of sparse direct solvers (particularly in 3D), it would result in a new source of difficulties (e.g., robustness/complexity trade-off evaluation, parameter tuning, load unbalancing issues, poor flop rates) that deserve further research.
For the BNN method, we assume that the initial value x 0 is such that r 0 is balanced (see [28]).
An explicit assembly of S i is required for some DD preconditioners (see e.g. [17]). Besides, this approach allows one to exploit the dense level 2 BLAS for the Schur complement-vector product, which can only compensate the expensive set-up of S i for non-scalable preconditioners with high iteration counts.
For example, PARDISO is based on the sparse Cholesky factorization without pivoting for symmetric positive definite problems, while for symmetric indefinite problems, it uses a more expensive sparse LDL T factorization which, for numerical stability purposes, combines static (prior-to-factorization) pivoting via symmetric weighted matchings and classical Bunch-Kaufman dynamic (during factorization) pivoting only applied inside the supernodes [36, 37].
References
Badia S, Codina R (2008) Algebraic pressure segregation methods for the incompressible Navier-Stokes equations. Arch Comput Methods Eng 15:343–369
Badia S, Martín AF, Príncipe J (2012) Enhanced balancing Neumann-Neumann preconditioning in computational fluid and solid mechanics. Int J Numer Methods Eng (in press)
Baker AH, Gamblin T, Schulz M, Yang UM (2011) Challenges of scaling algebraic multigrid across modern multicore architectures In: Parallel distributed processing symposium (IPDPS). 2011 IEEE international, pp 275–286
Balay S, Brown J, Buschelman K, Eijkhout V, Gropp WD, Kaushik D, Knepley MG, McInnes LC, Smith BF, Zhang H (2012) PETSc users manual. Technical report ANL-95/11, Argonne National Laboratory
Balay S, Brown J, Buschelman K, Gropp WD, Kaushik D, Kne- pley MG, McInnes LC, Smith BF, Zhang H (2012). PETSc web page. http://www.mcs.anl.gov/petsc
Brenner SC, Scott R (2010) The mathematical theory of finite element methods, 3rd edn. Springer, Berlin
Conceição D, Goldfeld P, Sarkis M (2007) Robust two-level lower-order preconditioners for a higher-order stokes discretization with highly discontinuous viscosities. In: High performance computing for computational science—VECPAR 2006, pp 319–333
Davis TA (2006) Direct methods for sparse linear systems, vol 2. SIAM, Philadelphia
De Roeck YH, Le Tallec P (1991) Analysis and test of a local domain decomposition preconditioner. In: Fourth international symposium on domain decomposition methods for partial differential equations, 112 pp
Dohrmann CR (2003) A preconditioner for substructuring based on constrained energy minimization. SIAM J Sci Comput 25(1):246–258
Farhat C, Roux F-X (1991) A method of finite element tearing and interconnecting and its parallel solution algorithm. Int J Numer Methods Eng 32(6):1205–1227
Farhat C, Pierson K, Lesoinne M (2000) The second generation FETI methods and their application to the parallel solution of large-scale linear and geometrically non-linear structural analysis problems. Comput Methods Appl Mech Eng 184(2–4):333–374
Ferencz RM, Hughes TJR (1998) Iterative finite element solutions in nonlinear solid mechanics. In: Numerical methods for solids (part 3). Handbook of numerical analysis, vol VI
Filippone S, Buttari A (2012) Object-oriented techniques for sparse matrix computations in Fortran 2003. ACM Trans Math Softw 38(4):23:1–23:20
Filippone S, Colajanni M (2000) PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Trans Math Softw 26(4):527–550
George A (1973) Nested dissection of a regular finite element mesh. SIAM J Numer Anal 10(2):345–363
Giraud L, Haidar A, Watson LT (2008) Parallel scalability study of hybrid preconditioners in three dimensions. Parallel Comput 34(6–8):363–379
Glowinski R, Wheeler MF (1988) Domain decomposition and mixed finite element methods for elliptic problems. In: First international symposium on domain decomposition methods for partial differential equations, pp 144–172
Goldfeld P (2003) Balancing Neumann-Neumann preconditioners for the mixed formulation of almost-incompressible linear elasticity. PhD thesis
Goldfeld P, Pavarino LF, Widlund OB (2003) Balancing Neumann-Neumann preconditioners for mixed approximations of heterogeneous problems in linear elasticity. Numer Math 95:283–324
Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message passing interface, vol 1. MIT Press, Cambridge
Gropp W, Lusk E, Thakur R (1999) Using MPI-2: advanced features of the message passing interface. MIT Press, Cambridge
Heroux MA, Willenbring JM (2003) Trilinos users guide. Technical report SAND2003-2952, Sandia National Laboratories
Heroux MA, Bartlett RA, Howle VE, Hoekstra RJ, Hu JJ, Kolda TG, Lehoucq RB, Long KR, Pawlowski RP, Phipps ET, Salinger AG, Thornquist HK, Tuminaro RS, Willenbring JM, Williams A, Stanley KS (2005) An overview of the Trilinos project. ACM Trans Math Softw 31(3):397–423
Hoefler T, Traff JL (2009) Sparse collective operations for MPI. In: Proceedings of the 2009 IEEE international symposium on parallel & distributed processing, pp 1–8
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
Lin PT, Shadid JN, Sala M, Tuminaro RS, Hennigan GL, Hoekstra RJ (2009) Performance of a parallel algebraic multilevel preconditioner for stabilized finite element semiconductor device modeling. J Comput Phys 228(17):6250–6267
Mandel J (1993) Balancing domain decomposition. Commun Numer Methods Eng 9(3):233–241
Mandel J, Dohrmann CR (2003) Convergence of a balancing domain decomposition by constraints and energy minimization. Numer Linear Algebra Appl 10(7):639–659
Mandel J, Sousedík B, Dohrmann C (2008) Multispace and multilevel BDDC. Computing 83:55–85
Saad Y (1995) Data structures and algorithms for domain decomposition and distributed sparse matrix computations. Technical report 95-014, Department of Computer Science, University of Minnesota, Minneapolis, MN
Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia
Saad Y, Sosonkina M (1999) Distributed Schur complement techniques for general sparse linear systems. SIAM J Sci Comput 21(4):1337–1356
Sahni O, Carothers CD, Shephard MS, Jansen KE (2009) Strong scaling analysis of a parallel, unstructured, implicit solver and the influence of the operating system interference. Sci Program 17(3):261–274
Sala M, Tuminaro R (2008) A new Petrov-Galerkin smoothed aggregation preconditioner for nonsymmetric linear systems. SIAM J Sci Comput 31(1):143–166
Schenk O, Gärtner K (2004) Solving unsymmetric sparse systems of linear equations with PARDISO. Future Gener Comput Syst 20(3):475–487
Schenk O, Gärtner K (2006) On fast factorization pivoting methods for sparse symmetric indefinite systems. Electron Trans Numer Anal 23:158–179
Šístek J, Sousedík B, Burda P, Mandel J, Novotný J (2011) Application of the parallel BDDC preconditioner to the Stokes flow. Comput Fluids 46(1):429–435
Strang G (2006) Linear algebra and its applications. Thomson Brooks/Cole, Belmont
Stüben K (2001) A review of algebraic multigrid. J Comput Appl Math 128(1–2):281–309
Tang J, Nabben R, Vuik C, Erlangga Y (2009) Comparison of two-level preconditioners derived from deflation, domain decomposition and multigrid methods. J Sci Comput 39(3):340–370
Toselli A, Widlund O (2005) Domain decomposition methods—algorithms and theory. Springer, Berlin
Vaněk P, Mandel J, Brezina M (1996) Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing 56:179–196
Acknowledgements
This work has been funded by the European Research Council under the FP7 Programme Ideas through the Starting Grant No. 258443—COMFUS: Computational Methods for Fusion Technology and the FUSSIM project (Ref. ENE2011-285556), financed by the Spanish Ministry of Science and Innovation (MICINN), under the programme “Proyectos de Investigación fundamental no orientada”. A.F. Martín was also partially funded by the UPC postdoctoral grants under the programme “BKC-5-Atracció i Fidelització de talent al BKC”. The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the Red Española de Supercomputación and the Juelich Supercomputing Centre in the exploitation of the HPC for Fusion (HPC-FF) under the EFDA HPC Implementing Agreement (EFDA (08) 39/4.1).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Badia, S., Martín, A.F. & Principe, J. Implementation and Scalability Analysis of Balancing Domain Decomposition Methods. Arch Computat Methods Eng 20, 239–262 (2013). https://doi.org/10.1007/s11831-013-9086-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11831-013-9086-4