Linear algebra software for large-scale accelerated multicore computing*

A. Abdelfattah; H. Anzt; J. Dongarra; M. Gates; A. Haidar; J. Kurzak; P. Luszczek; S. Tomov; I. Yamazaki; A. YarKhan

doi:10.1017/S0962492916000015

Linear algebra software for large-scale accelerated multicore computing*

Published online by Cambridge University Press: 23 May 2016

A. Abdelfattah ,

H. Anzt ,

J. Dongarra ,

M. Gates ,

A. Haidar ,

J. Kurzak ,

P. Luszczek ,

S. Tomov ,

I. Yamazaki and

A. YarKhan

Show author details

A. Abdelfattah: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
H. Anzt: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
J. Dongarra: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
M. Gates: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
A. Haidar: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
J. Kurzak: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
P. Luszczek: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
S. Tomov: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
I. Yamazaki: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu
A. YarKhan: Affiliation:
Innovative Computing Laboratory, University of Tennessee, 1122 Volunteer BoulevardSuite 203 Claxton, Knoxville, TN 37996, USA E-mail: ahmad@icl.utk.edu, hanzt@icl.utk.edu, dongarra@icl.utk.edu, mgates3@icl.utk.edu, haidar@icl.utk.edu, kurza@icl.utk.edu, luszczek@icl.utk.edu, tomov@icl.utk.edu, iyamazak@icl.utk.edu, yarkhan@icl.utk.edu

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

Many crucial scientific computing applications, ranging from national security to medical advances, rely on high-performance linear algebra algorithms and technologies, underscoring their importance and broad impact. Here we present the state-of-the-art design and implementation practices for the acceleration of the predominant linear algebra algorithms on large-scale accelerated multicore systems. Examples are given with fundamental dense linear algebra algorithms – from the LU, QR, Cholesky, and LDLT factorizations needed for solving linear systems of equations, to eigenvalue and singular value decomposition (SVD) problems. The implementations presented are readily available via the open-source PLASMA and MAGMA libraries, which represent the next generation modernization of the popular LAPACK library for accelerated multicore systems.

To generate the extreme level of parallelism needed for the efficient use of these systems, algorithms of interest are redesigned and then split into well-chosen computational tasks. The task execution is scheduled over the computational components of a hybrid system of multicore CPUs with GPU accelerators and/or Xeon Phi coprocessors, using either static scheduling or light-weight runtime systems. The use of light-weight runtime systems keeps scheduling overheads low, similar to static scheduling, while enabling the expression of parallelism through sequential-like code. This simplifies the development effort and allows exploration of the unique strengths of the various hardware components. Finally, we emphasize the development of innovative linear algebra algorithms using three technologies – mixed precision arithmetic, batched operations, and asynchronous iterations – that are currently of high interest for accelerated multicore systems.

Type: Research Article
Information: Acta Numerica , Volume 25 , 01 May 2016 , pp. 1 - 160

DOI: https://doi.org/10.1017/S0962492916000015 [Opens in a new window]
Copyright: © Cambridge University Press, 2016

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

REFERENCES18

Aasen, J. (1971), ‘On the reduction of a symmetric matrix to tridiagonal form’, BIT 11, 233–242.CrossRef Google Scholar

Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S. and Tomov, S. (2010), Faster, cheaper, better: A hybridization methodology to develop linear algebra software for GPUs. In GPU Computing Gems (Mei, W. and Hwu, W., eds), Vol. 2, Morgan Kaufmann.Google Scholar

Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P. and Tomov, S. (2009a), ‘Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects’, J. Phys. Conf. Ser. 180, #1.Google Scholar

Agullo, E., Hadri, B., Ltaief, H. and Dongarra, J. (2009b), Comparative study of one-sided factorizations with multiple software packages on multi-core hardware. In SC ’09: Proc. Conference on High Performance Computing Networking, Storage and Analysis, ACM, #20.Google Scholar

Alvarado, F. L. and Schreiber, R. (1993), ‘Optimal parallel solution of sparse triangular systems’, SIAM J. Sci. Comput. 14, 446–460.CrossRef Google Scholar

Amestoy, P. R., Duff, I. S. and L’Excellent, J.-Y. (2000), ‘Multifrontal parallel distributed symmetric and unsymmetric solvers’, Comput. Methods Appl. Mech. Eng. 184, 501–520.Google Scholar

Amestoy, P. R., Duff, I. S., L’Excellent, J.-Y. and Koster, J. (2001), ‘A fully asynchronous multifrontal solver using distributed dynamic scheduling’, SIAM J. Matrix Anal. Appl. 23, 15–41.Google Scholar

Amestoy, P. R., Guermouche, A., L’Excellent, J.-Y. and Pralet, S. (2006), ‘Hybrid scheduling for the parallel solution of linear systems’, Parallel Comput. 32, 136–156.CrossRef Google Scholar

Anderson, E. and Dongarra, J. (1989), Evaluating block algorithm variants in LAPACK. In Proc. 4th Conference on Parallel Processing for Scientific Computing, pp. 3–8.Google Scholar

Anderson, E. and Dongarra, J. (1990a) Implementation guide for LAPACK. Technical report UT-CS-90-101, Computer Science Department, University of Tennessee. LAPACK Working Note 18.Google Scholar

Anderson, E. and Dongarra, J. (1990b), Evaluating block algorithm variants in LAPACK. Technical report UT-CS-90-103, Computer Science Department, University of Tennessee. LAPACK Working Note 19.Google Scholar

Anderson, E. and Saad, Y. (1989), ‘Solving sparse triangular systems on parallel computers’, Int. J. High Speed Comput. 1, 73–96.Google Scholar

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Ducroz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D. (1999), LAPACK Users’ Guide, third edition, SIAM.Google Scholar

Anderson, M., Sheffield, D. and Keutzer, K. (2012), A predictive model for solving small linear algebra problems in GPU registers. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium: IPDPS 2012, pp. 2–13.Google Scholar

Andrews, H. C. and Patterson, C. L. (1976), ‘Singular value decompositions and digital image processing’, IEEE Trans. Acoust. Speech Signal Process. 24, 26–53.Google Scholar

Anzt, H. (2012) Asynchronous and multiprecision linear solvers: Scalable and fault-tolerant numerics for energy efficient high performance computing. PhD thesis, Institute for Applied and Numerical Mathematics, Karlsruhe Institute of Technology.Google Scholar

Anzt, H., Chow, E. and Dongarra, J. (2015), Iterative sparse triangular solves for preconditioning. In Euro-Par 2015: Parallel Processing, (Träff, J. L., Hunold, S. and Versaci, F., eds), Vol. 9233 of Lecture Notes in Computer Science, Springer, pp. 650–661.Google Scholar

Anzt, H., Tomov, S., Dongarra, J. and Heuveline, V. (2013), ‘A block-asynchronous relaxation method for graphics processing units’, J. Parallel Distrib. Comput. 73, 1613–1626.Google Scholar

Arioli, M. and Duff, I. S. (2008) Using FGMRES to obtain backward stability in mixed precision. Technical report RAL-TR-2008-006, Rutherford Appleton Laboratory.Google Scholar

Arioli, M., Demmel, J. W. and Duff, I. S. (1989), ‘Solving sparse linear systems with sparse backward error’, SIAM J. Matrix Anal. Appl. 10, 165–190.Google Scholar

Ashcraft, C., Grimes, R. and Lewis, J. (1998), ‘Accurate symmetric indefinite linear equation solvers’, SIAM J. Matrix Anal. Appl. 20, 513–561.CrossRef Google Scholar

Ashcraft, C., Grimes, R., Lewis, J., Peyton, B. W. and Simon, H. (1987), ‘Progress in sparse matrix methods in large sparse linear systems on vector supercomputers’, Int. J. Supercomput. Appl. 1, 10–30.Google Scholar

Axelsson, O. and Vassilevski, P. S. (1991), ‘A black box generalized conjugate gradient solver with inner iterations and variable-step preconditioning’, SIAM J. Matrix Anal. Appl. 12, 625–644.Google Scholar

Baboulin, M., Becker, D. and Dongarra, J. (2012), A parallel tiled solver for dense symmetric indefinite systems on multicore architectures. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium: IPDPS 2012, pp. 14–24. LAPACK Working Note 261.Google Scholar

Baboulin, M., Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Langou, J., Luszczek, P. and Tomov, S. (2009), ‘Accelerating scientific computations with mixed precision algorithms’, Comput. Phys. Comm. 180, 2526–2533.Google Scholar

Baboulin, M., Dongarra, J., Herrmann, J. and Tomov, S. (2013), ‘Accelerating linear system solutions using randomization techniques’, ACM Trans. Math. Softw. 39, #8.Google Scholar

Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2013), Implementing a blocked Aasen’s algorithm with a dynamic scheduler on multicore architectures. In Proc. IEEE 27th International Parallel and Distributed Processing Symposium: IPDPS, pp. 895–907.Google Scholar

Ballard, G., Becker, D., Demmel, J., Dongarra, J., Druinsky, A., Peled, I., Schwartz, O., Toledo, S. and Yamazaki, I. (2014), ‘Communication-avoiding symmetric-indefinite factorization’, SIAM J. Matrix Anal. Appl. 35, 1364–1406.Google Scholar

Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C. and van der Vorst, H. (1994a), Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, second edition, SIAM. Postscript file available at www.netlib.org/templates/Templates.html Google Scholar

Barrett, R. F., Chan, T. H. F., D’Azevedo, E. F., Jaeger, E. F., Wong, K. and Wong, R. Y. (2010), ‘Complex version of high performance computing LINPACK benchmark (HPL)’, Concurrency Computat. Pract. Exper. 22, 573–587.Google Scholar

Bartels, R. H., Golub, G. H. and Saunders, M. (1971), Numerical techniques in mathematical programming. In Nonlinear Programming, Academic Press, pp. 123–176.Google Scholar

Becker, D., Baboulin, M. and Dongarra, J. (2012), Reducing the amount of pivoting in symmetric indefinite systems. In 9th International Conference on Parallel Processing and Applied Mathematics: PPAM 2011, (Wyrzykowski, R., Dongarra, J., Karczewski, K. and Wasniewski, J., eds), Vol. 7203 of Lecture Notes in Computer Science, Springer, pp. 133–142.Google Scholar

Bendali, A., Boubendir, Y. and Fares, M. (2007), ‘A FETI-like domain decomposition method for coupling finite elements and boundary elements in large-size problems of acoustic scattering’, Comput. Struct. 85, 526–535.CrossRef Google Scholar

Benzi, M., Joubert, W. and Mateescu, G. (1999), ‘Numerical experiments with parallel orderings for ILU preconditioners’, Electron. Trans. Numer. Anal. 8, 88–114.Google Scholar

Bergman, K. et al. (2008), Exascale computing study: Technology challenges in achieving exascale systems. DARPA IPTO ExaScale Computing Study.Google Scholar

Bientinesi, P., Igual, F. D., Kressner, D. and Quintana-Ortí, E. S. (2010), Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures. In Proc. 8th international Conference on Parallel Processing and Applied Mathematics: PPAM 2009, Part I, Springer, pp. 387–395.CrossRef Google Scholar

Bischof, C. (1993), A summary of block schemes for reducing a general matrix to Hessenberg form. Technical report ANL/MCS-TM-175, Argonne National Laboratory.Google Scholar

Bischof, C. and Van Loan, C. (1987), ‘The WY representation for products of Householder matrices’, SIAM J. Sci. Statist. Comput. 8, s2–s13.Google Scholar

Bischof, C. H., Lang, B. and Sun, X. (2000), ‘Algorithm 807: The SBR Toolbox – software for successive band reduction’, ACM Trans. Math. Softw. 26, 602–616.Google Scholar

Björck, Å (1996), Numerical Methods for Least Squares Problems, SIAM.CrossRef Google Scholar

Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J. J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D. and Whaley, R. C. (1997), ScaLAPACK Users’ Guide, SIAM. www.netlib.org/scalapack/slug/Google Scholar

Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H. and Zhou, Y. (1995), Cilk: An efficient multithreaded runtime system. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: PPOPP ’95, ACM, pp. 207–216.Google Scholar

Braman, K., Byers, R. and Mathias, R. (2002a), ‘The multishift QR algorithm I: Maintaining well-focused shifts and level 3 performance’, SIAM J. Matrix Anal. Appl. 23, 929–947.Google Scholar

Braman, K., Byers, R. and Mathias, R. (2002b), ‘The multishift QR algorithm II: Aggressive early deflation’, SIAM J. Matrix Anal. Appl. 23, 948–973.Google Scholar

Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S. and Namyst, R. (2010), hwloc: a generic framework for managing hardware affinities in HPC applications. In Proc. IEEE 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing: PDP 2010, pp. 180–186.Google Scholar

Bunch, J. and Kaufman, L. (1977), ‘Some stable methods for calculating inertia and solving symmetric linear systems’, Math. Comp. 31, 163–179.Google Scholar

Buttari, A., Dongarra, J., Kurzak, J., Langou, J., Luszczek, P. and Tomov, S. (2006), The impact of multicore on math software. In Applied Parallel Computing: State of the Art in Scientific Computing, PARA 2006, (Kågström, B., Elmroth, E., Dongarra, J. and Wasniewski, J., eds), Vol. 4699 of Lecture Notes in Computer Science, Springer, pp. 1–10.Google Scholar

Buttari, A., Dongarra, J., Kurzak, J., Luszczek, P. and Tomov, S. (2008a), ‘Using mixed precision for sparse matrix computations to enhance the performance while achieving 64-bit accuracy’, ACM Trans. Math. Softw. 34, #17.Google Scholar

Buttari, A., Dongarra, J., Langou, J., Langou, J., Luszczek, P. and Kurzak, J. (2007), ‘Mixed precision iterative refinement techniques for the solution of dense linear systems’, Int. J. High Performance Comput. Appl. 21, 457–466.Google Scholar

Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. (2008b), ‘Parallel tiled QR factorization for multicore architectures’, Concurrency Computat. Pract. Exper. 20, 1573–1590.CrossRef Google Scholar

Buttari, A., Langou, J., Kurzak, J. and Dongarra, J. (2009), ‘A class of parallel tiled linear algebra algorithms for multicore architectures’, Parallel Comput. Syst. Appl. 35, 38–53.Google Scholar

Castaldo, A. and Whaley, R. (2010), Scaling LAPACK panel operations using parallel cache assignment. In Proc. 15th AGM SIGPLAN Symposium on Principle and Practice of Parallel Programming, pp. 223–232.Google Scholar

Chan, E., Quintana-Orti, E. S., Quintana-Orti, G. and van de Geijn, R. (2007), Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA ’07: Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM, pp. 116–125.Google Scholar

Choi, J. (1995), A proposal for a set of parallel basic linear algebra subprograms. Technical report UT-CS-95-292, University of Tennessee, Knoxville. LAPACK Working Note 100.Google Scholar

Chow, E. and Patel, A. (2015), ‘Fine-grained parallel incomplete LU factorization’, SIAM J. Sci. Comput. 37, C169–C193.Google Scholar

Chow, E., Anzt, H. and Dongarra, J. (2015), Asynchronous iterative algorithm for computing incomplete factorizations on GPUs. In High Performance Computing, Vol. 9137 of Lecture Notes in Computer Science, pp. 1–16.Google Scholar

Cook, R. D., Malkus, D. S., Plesha, M. E. and Witt, R. J. (2007), Concepts and Applications of Finite Element Analysis, Wiley.Google Scholar

Cuthill, E. and Mckee, J. (1969), Reducing the bandwidth of sparse symmetric matrices. In Proc. 1969 24th National Conference: ACM ’69, ACM, pp. 157–172.Google Scholar

Datta, B. D. (1995), Numerical Linear Algebra and Applications, Brooks Cole.Google Scholar

Davis, T. A. and Hu, Y. (1994), ‘The University of Florida sparse matrix collection’, ACM Trans. Math. Softw. 38, #1.Google Scholar

Demmel, J. W. (1997), Applied Numerical Linear Algebra, SIAM.CrossRef Google Scholar

Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2008a), Implementing communication-optimal parallel and sequential QR factorizations. arXiv:0809.2407 Google Scholar

Demmel, J., Grigori, L., Hoemmen, M. and Langou, J. (2012), ‘Communication-optimal parallel and sequential QR and LU factorizations’, SIAM J. Sci. Comput. 34, A206–A239. Also available as technical report UCB/EECS-2008-89, EECS Department, University of California, Berkeley.Google Scholar

Demmel, J., Hida, Y., Kahan, W., Li, X. S., Mukherjee, S. and Riedy, E. J. (2006), ‘Error bounds from extra-precise iterative refinement’, ACM Trans. Math. Softw. 32, 325–351.Google Scholar

Demmel, J., Hida, Y., Li, X. S. and Riedy, E. J. (2007), Extra-precise iterative refinement for overdetermined least squares problems. Technical report EECS-2007-77, UC Berkeley. LAPACK Working Note 188.Google Scholar

Demmel, J., Marques, O., Parlett, B. N. and Vomel, C. (2008b), ‘Performance and accuracy of LAPACK’s symmetric tridiagonal eigensolvers’, SIAM J. Sci. Comput. 30, 1508–1526.Google Scholar

Doi, S. (1991), ‘On parallelism and convergence of incomplete LU factorizations’, Appl. Numer. Math. 7, 417–436.Google Scholar

Dong, T., Dobrev, V., Kolev, T., Rieben, R., Tomov, S. and Dongarra, J. (2014), A step towards energy efficient computing: Redesigning a hydrodynamic application on CPU–GPU. In Proc. IEEE 28th International Parallel and Distributed Processing Symposium: IPDPS 2014, pp. 972–981.Google Scholar

Dongarra, J. J. (1983), ‘Improving the accuracy of computed singular values’, SIAM J. Sci. Statist. Comput. 4, 712–719.Google Scholar

Dongarra, J. and Whaley, R. C. (1995), A user’s guide to the BLACS v1.1, Technical report UT-CS-95-281, University of Tennessee, Knoxville. LAPACK Working Note 94, updated 5 May 1997 (version 1.1).Google Scholar

Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W. (1979), LINPACK Users’ Guide, SIAM.Google Scholar

Dongarra, J. J., Du Croz, J., Hammarling, S. and Duff, I. S. (1990), ‘A set of Level 3 Basic Linear Algebra Subprograms’, ACM Trans. Math. Softw. 16, 1–17.Google Scholar

Dongarra, J., Faverge, M., Ltaief, H. and Luszczek, P. (2011), Achieving numerical accuracy and high performance using recursive tile LU factorization. Technical report ICL-UT-11-08, Computer Science Department, University of Tennessee, Knoxville.Google Scholar

Dongarra, J., Faverge, M., Ltaief, H. and Luszczek, P. (2012), ‘Exploiting fine-grain parallelism in recursive LU factorization’, Advances in Parallel Computing Special Issue 22, 429–436.Google Scholar

Dongarra, J. J., Gustavson, F. G. and Karp, A. (1984), ‘Implementing linear algebra algorithms for dense matrices on a vector pipeline machine’, SIAM Review 26, 91–112.CrossRef Google Scholar

Dongarra, J., Haidar, A., Kurzak, J., Luszczek, P., Tomov, S. and Yarkhan, A. (2014), ‘Model-driven one-sided factorizations on multicore accelerated systems’, Int. J. Supercomputing Frontiers and Innovations 1, #1.Google Scholar

Dongarra, J. J., Moler, C. B. and Wilkinson, J. H. (1983), ‘Improving the accuracy of computed eigenvalues and eigenvectors’, SIAM J. Numer. Anal. 20, 23–45.Google Scholar

Dongarra, J. J., Sorensen, D. C. and Hammarling, S. J. (1989), ‘Block reduction of matrices to condensed forms for eigenvalue computations’, J. Comput. Appl. Math. 27, 215–227.Google Scholar

Ducroz, J., Dongarra, J. J. and Higham, N. J. (1992), ‘Stability of methods for matrix inversion’, IMA J. Numer. Anal. 12, 1–19.Google Scholar

Duff, I. S. and Meurant, G. A. (1989), ‘The effect of ordering on preconditioned conjugate gradients’, BIT 29, 635–657.CrossRef Google Scholar

Duff, I. S. and Reid, J. K. (1983), ‘The multifrontal solution of indefinite sparse symmetric linear equations’, ACM Trans. Math. Softw. 9, 302–325.Google Scholar

Edelman, A. (1993), ‘Large dense numerical linear algebra in 1993: the parallel computing influence’, Int. J. High Performance Comput. Appl. 7, 113–128.Google Scholar

Farra, V. and Madariaga, R. (1988), ‘Non-linear reflection tomography’, Geophys. J. Int. 95, 135–147.Google Scholar

Frommer, A. and Szyld, D. B. (2000), ‘On asynchronous iterations’, J. Comput. Appl. Math. 123, 201–216.Google Scholar

Gansterer, W. N., Kvasnicka, D. F. and Ueberhuber, C. W. (1999), Multi-sweep algorithms for the symmetric eigenproblem. In Vector and Parallel Processing: VECPAR’98, Vol. 1573 of Lecture Notes in Computer Science, Springer, pp. 20–28.Google Scholar

Gates, M., Haidar, A. and Dongarra, J. (2014), Accelerating computation of eigenvectors in the dense nonsymmetric eigenvalue problem. In High Performance Computing for Computational Science: VECPAR 2014, Springer, pp. 182–191.Google Scholar

Golub, G. H. and Kahan, W. (1965), ‘Calculating the singular values and pseudoinverse of a matrix’, SIAM J. Numer. Anal. 2, 205–224.Google Scholar

Golub, G. H. and Reinsch, C. (1971), Singular value decomposition and least squares solutions. In Handbook for Automatic Computation II: Linear Algebra (Wilkinson, J. and Reinsch, C., eds), Springer, pp. 403–420.Google Scholar

Golub, G. H. and Van Loan, C. (1996), Matrix Computations, third edition, Johns Hopkins University Press.Google Scholar

Golub, G. H. and Wilkinson, J. H. (1976), ‘Ill-conditioned eigensystems and the computation of the Jordan canonical form’, SIAM Rev. 18, 578–619.Google Scholar

Golub, G. H. and Ye, Q. (2000), ‘Inexact preconditioned conjugate gradient method with inner–outer iteration’, SIAM J. Sci. Comput. 21, 1305–1320.Google Scholar

Grcar, J. F. (2011), ‘Mathematicians of Gaussian elimination’, Notices Amer. Math. Soc. 58, 782–792.Google Scholar

Grigori, L., Demmel, J. and Xiang, H. (2011), ‘CALU: a communication optimal LU factorization algorithm’, SIAM. J. Matrix Anal. Appl. 32, 1317–1350.Google Scholar

Gu, M. and Eisenstat, S. C. (1995), ‘A divide-and-conquer algorithm for the bidiagonal SVD’, SIAM J. Matrix Anal. Appl. 16, 79–92.Google Scholar

Gustavson, F. (1997), ‘Recursive leads to automatic variable blocking for dense linear-algebra algorithms’, IBM J. Research and Development 41, 737–755.Google Scholar

Hadri, B., Ltaief, H., Agullo, E. and Dongarra, J. (2009), Enhancing parallelism of tile QR factorization for multicore architectures. LAPACK Working Note 222.Google Scholar

Hadri, B., Ltaief, H., Agullo, E. and Dongarra, J. (2010), Tile QR factorization with parallel panel processing for multicore architectures. In Proc. 24th IEEE International Parallel and Distributed Processing Symposium: IPDPS 2010. INRIA HAL Technical report inria-00548899: https://hal.inria.fr/inria-00548899 Google Scholar

Haidar, A., Cao, C., Yarkhan, A., Luszczek, P., Tomov, S., Kabir, K. and Dongarra, J. (2014), Unified development for mixed multi-GPU and multi-coprocessor environments using a lightweight runtime environment. In Proc. IEEE 28th International Parallel and Distributed Processing Symposium: IPDPS, pp. 491–500.Google Scholar

Haidar, A., Kurzak, J. and Luszczek, P. (2013a), An improved parallel singular value algorithm and its implementation for multicore hardware. In Proc. SC ’13: International Conference for High Performance Computing, Networking, Storage and Analysis, #90.Google Scholar

Haidar, A., Ltaief, H. and Dongarra, J. (2011), Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels. In Proc. SC ’11: International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, #8.Google Scholar

Haidar, A., Ltaief, H., Luszczek, P. and Dongarra, J. (2012), A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction. In Proc. IEEE 26th International Parallel and Distributed Processing Symposium: IPDPS 2012, pp. 25–35.Google Scholar

Haidar, A., Luszczek, P. and Dongarra, J. (2014b), New algorithm for computing eigenvectors of the symmetric eigenvalue problem. In 15th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing: PDSEC 2014 (Best Paper), pp. 1150–1159.Google Scholar

Haidar, A., Solcà, R., Gates, M., Tomov, S., Schulthess, T. C. and Dongarra, J. (2013b), Leading edge hybrid multi-GPU algorithms for generalized eigenproblems in electronic structure calculations. In Supercomputing, (Kunkel, J. M., Ludwig, T. and Meuer, H. W., eds), Vol. 7905 of Lecture Notes in Computer Science, Springer, pp. 67–80.Google Scholar

Haidar, A., Tomov, S., Dongarra, J., Solcà, R. and T., Schulthess (2014), ‘A novel hybrid CPU–GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks’, Int. J. High Performance Comput. Appl. 28, 196–209.Google Scholar

Hammarling, S., Dongarra, J., Du Croz, J. and Hanson, R. (1988), ‘An extended set of Fortran Basic Linear Algebra Subprograms’, ACM Trans. Math. Softw. 14, 1–32.Google Scholar

Hammond, S. W. and Schreiber, R. (1992), ‘Efficient ICCG on a shared memory multiprocessor’, Int. J. High Speed Comput. 4, 1–21.Google Scholar

Hanson, R. J. (1971), ‘A numerical method for solving Fredholm integral equations of the first kind using singular values’, SIAM J. Numer. Anal. 8, 616–626.Google Scholar

Harrington, R. (1990), ‘Origin and development of the method of moments for field computation’, IEEE Antennas Propag. 32, 31–35.Google Scholar

Hess, J. L. (1990), ‘Panel methods in computational fluid dynamics’, Annu. Rev. Fluid Mech. 22, 255–274.Google Scholar

Higham, N. J. (2002), Accuracy and Stability of Numerical Algorithms, second edition, SIAM.Google Scholar

Hotelling, H. (1933), ‘Analysis of a complex of statistical variables into principal components’, J. Educ. Psych. 24, 417–441; 498–520.Google Scholar

Hotelling, H. (1935), ‘Simplified calculation of principal components’, Psychometrica 1, 27–35.Google Scholar

Im, E.-J., Yelick, K. and Vuduc, R. (2004), ‘Sparsity: Optimization framework for sparse matrix kernels’, Int. J. High Perform. Comput. Appl. 18, 135–158.Google Scholar

Intel (2014) , Intel^®64 and IA-32 architectures software developer’s manual. http://download.intel.com/products/processor/manual/Google Scholar

Jaeger, E., Harvey, R., Berry, L., Myra, J., Dumont, R., Philips, C., Smithe, D., Barrett, R., Batchelor, D., Bonoli, P., Carter, M., D’Azevedo, E., D’Ippolito, D., Moore, R. and Wright, J. (2006), ‘Global-wave solutions with self-consistent velocity distributions in ion cyclotron heated plasmas’, Nuclear Fusion 46, S397–S408.Google Scholar

Jessup, E. R. and Sorensen, D. C. (1989), A divide and conquer algorithm for computing the singular value decomposition. In Proc. Third SIAM Conference on Parallel Processing for Scientific Computing, SIAM, pp. 61–66.Google Scholar

Jiang, E. P. and Berry, M. W. (1998), Information filtering using the Riemannian SVD (R-SVD). In Solving Irregularly Structured Problems in Parallel: Proc. 5th International Symposium, IRREGULAR ’98, (Ferreira, A., Rolim, J. D. P., Simon, H. D. and Teng, S.-H., eds), Vol. 1457 of Lecture Notes in Computer Science, Springer, pp. 386–395.Google Scholar

Joffrain, T., Quintana-Orti, E. S. and van de Geijn, R. A. (2006), Rapid development of high-performance out-of-core solvers. In Applied Parallel Computing: State of the Art in Scientific Computing, (Dongarra, J., Madsen, K. and Waśniewski, J., eds), Vol. 3732 of Lecture Notes in Computer Science, Springer, pp. 413–422.Google Scholar

Kabir, K., Haidar, A., Tomov, S. and Dongarra, J. (2015), Performance analysis and design of a Hessenberg reduction using stabilized blocked elementary transformations for new architectures. In Proc. Symposium on High Performance Computing, Society for Computer Simulation International, pp. 135–142.Google Scholar

Kågström, B., Kressner, D. and Shao, M. (2012), On aggressive early deflation in parallel variants of the QR algorithm. In Applied Parallel and Scientific Computing, (Jónasson, K., ed.), Vol. 7133 of Lecture Notes in Computer Science, Springer, pp. 1–10.Google Scholar

Karlsson, L. and Kågström, B. (2011), ‘Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures’, Parallel Computing 37, 771–782.Google Scholar

Kurzak, J. and Dongarra, J. J. (2007), ‘Implementation of mixed precision in solving systems of linear equations on the Cell processor’, Concurrency Computat. Pract. Exper. 19, 1371–1385.Google Scholar

Kurzak, J., Buttari, A. and Dongarra, J. J. (2008), ‘Solving systems of linear equations on the CELL processor using Cholesky factorization’, IEEE Trans. Parallel and Distributed Systems 19, 1–11.Google Scholar

Lang, B. (1999), ‘Efficient eigenvalue and singular value computations on shared memory machines’, Parallel Computing 25, 845–860.CrossRef Google Scholar

Langou, J., Langou, J., Luszczek, P., Kurzak, J., Buttari, A. and Dongarra, J. J. (2006), Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proc. 2006 ACM/IEEE Conference on Supercomputing.Google Scholar

Lawson, C. L., Hanson, R. J., Kincaid, D. and Krogh, F. T. (1979), ‘Basic linear algebra subprograms for Fortran usage’, ACM Trans. Math. Softw. 5, 308–323.Google Scholar

Ltaief, H., Luszczek, P., Haidar, A. and Dongarra, J. (2012), Enhancing parallelism of tile bidiagonal transformation on multicore architectures using tree reduction. In 9th International Conference on Parallel Processing and Applied Mathematics: PPAM 2011, (Wyrzykowski, R., Dongarra, J., Karczewski, K and Wasniewski, J., eds), Vol. 7203 of Lecture Notes in Computer Science, Springer, pp. 661–670.Google Scholar

Lukarski, D. (2012), Parallel sparse linear algebra for multi-core and many-core platforms: Parallel solvers and preconditioners. PhD thesis, Karlsruhe Institute of Technology (KIT), Germany.Google Scholar

Luszczek, P. and Dongarra, J. (2012), Anatomy of a globally recursive embedded LINPACK benchmark. In Proc. 2012 IEEE Conference on High Performance Extreme Computing Conference: HPEC 2012.Google Scholar

Luszczek, P., Ltaief, H. and Dongarra, J. (2011), Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures. In IEEE International Parallel and Distributed Processing Symposium: IPDPS 2011, IEEE, pp. 944–955.Google Scholar

Mayer, J. (2009), ‘Parallel algorithms for solving linear systems with sparse triangular matrices’, Computing 86, 291–312.Google Scholar

Messer, O., Harris, J., Parete-Koon, S. and Chertkow, M. (2012), Multicore and accelerator development for a leadership-class stellar astrophysics code. In Proc. PARA 2012: Applied Parallel and Scientific Computing, (Manninen, P. and Öster, P., eds), Vol. 7782 of Lecture Notes in Computer Science, pp. 92–106.Google Scholar

Meuer, H. W., Strohmaier, E., Dongarra, J. J. and Simon, H. D. (2011), TOP500 Supercomputer Sites, 38th edition. www.netlib.org/benchmark/top500.html Google Scholar

Moler, C. B. (1967), ‘Iterative refinement in floating point’, J. Assoc. Comput. Mach. 14, 316–321.Google Scholar

Moler, C. B. (1972), ‘Matrix computations with Fortran and paging’, Comm. Assoc. Comput. Mach. 15, 268–270.Google Scholar

Moore, B. C. (1981), ‘Principal component analysis in linear systems: Controllability, observability, and model reduction’, IEEE Trans. Automatic Control 26, 17–32.Google Scholar

Moore, G. E. (1965), ‘Cramming more components onto integrated circuits’, Electronics 38(8).Google Scholar

Naumov, M. (2011), Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. Technical report NVR-2011-001, NVIDIA.Google Scholar

Nédélec, J.-C. (2001), Acoustic and Electromagnetic Equations: Integral Representations for Harmonic Problems, Springer.Google Scholar

Notay, Y. (2000), ‘Flexible conjugate gradients’, SIAM J. Sci. Comput. 22, 1444–1460.Google Scholar

Oettli, W. and Prager, W. (1964), ‘Compatibility of approximate solution of linear equations with given error bounds for coefficients and right-hand sides’, Numer. Math. 6, 405–409.Google Scholar

Park, J., Smelyanskiy, M., Sundaram, N. and Dubey, P. (2014), Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In Supercomputing, Vol. 8488 of Lecture Notes in Computer Science, Springer, pp. 124–140.Google Scholar

Parker, D. S. (1995a), Random butterfly transformations with applications in computational linear algebra. Technical report CSD-950023, Computer Science Department, University of California.Google Scholar

Parker, D. S. (1995b), A randomizing butterfly transformation useful in block matrix computations. Technical report CSD-950024, Computer Science Department, University of California.Google Scholar

Poole, E. L. and Ortega, J. M. (1987), ‘Multicolor ICCG methods for vector computers’, SIAM J. Numer. Anal. 24, 1394–1417.Google Scholar

Pothen, A. and Alvarado, F. (1992), ‘A fast reordering algorithm for parallel sparse triangular solution’, SIAM J. Sci. Statist. Comput. 13, 645–653.Google Scholar

Quaranta, E. and Drikakis, D. (2009), ‘Noise radiation from a ducted rotor in a swirling-translating flow’, J. Fluid Mech. 641, 463–473.Google Scholar

Rotem, E., Naveh, A., Rajwan, D., Ananthakrishnan, A. and Weissmann, E. (2012), ‘Power-management architecture of the Intel microarchitecture code-named Sandy Bridge’, IEEE Micro 32, 20–27.Google Scholar

Rozloz̆ník, M., Shklarski, G. and Toledo, S. (2011), ‘Partitioned triangular tridiagonalization’, ACM Trans. Math. Softw. 37, 1–16.Google Scholar

Saad, Y. (1993), ‘A flexible inner–outer preconditioned GMRES algorithm’, SIAM J. Sci. Comput. 14, 461–469.CrossRef Google Scholar

Saad, Y. (2003), Iterative Methods for Sparse Linear Systems, SIAM.Google Scholar

Saltz, J. H. (1990), ‘Aggregation methods for solving sparse triangular systems on multiprocessors’, SIAM J. Sci. Statist. Comput. 11, 123–144.Google Scholar

Schreiber, R. and Van Loan, C. (1989), ‘A storage-efficient WY representation for products of Householder transformations’, SIAM J. Sci. Statist. Comput. 10, 53–57.Google Scholar

Simoncini, V. and Szyld, D. B. (2003), ‘Flexible inner–outer Krylov subspace methods’, SIAM J. Numer. Anal. 40, 2219–2239.Google Scholar

Singh, D. J. (1994), Planewaves, Pseudopotentials, and the LAPW Method, Kluwer.Google Scholar

Skeel, R. D. (1980), ‘Iterative refinement implies numerical stability for Gaussian elimination’, Math. Comput. 35, 817–832.Google Scholar

Solcà, R., Kozhevnikov, A., Haidar, A., Tomov, S., Dongarra, J. and Schulthess, T. C. (2015), Efficient implementation of quantum materials simulations on distributed CPU–GPU systems. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, pp. 1–10.Google Scholar

Stewart, G. W. (1973), Introduction to Matrix Computations, Academic Press.Google Scholar

Toledo, S. (1997), ‘Locality of reference in LU decomposition with partial pivoting’, SIAM J. Matrix Anal. Appl. 18, 1065–1081.Google Scholar

Tomov, S., Nath, R. and Dongarra, J. (2010a), ‘Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing’, Parallel Computing 36, 645–654.Google Scholar

Tomov, S., Nath, R., Du, P. and Dongarra, J. (2009), MAGMA version 0.2 Users’ Guide.Google Scholar

Tomov, S., Nath, R., Ltaief, H. and Dongarra, J. (2010b), Dense linear algebra solvers for multicore with GPU accelerators. In Proc. IEEE IPDPS ’10, pp. 1–8.Google Scholar

Trefethen, L. and Schreiber, R. (1990), ‘Average case analysis of Gaussian elimination’, SIAM J. Math. Anal. Appl. 11, 335–360.Google Scholar

Tuma, M. and Benzi, M. (1998), ‘A comparative study of sparse approximate inverse preconditioners’, Appl. Numer. Math 30, 305–340.Google Scholar

Turk, M. A. and Pentland, A. P. (1991), Face recognition using eigenfaces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1991: Proc. CVPR’91, IEEE, pp. 586–591.Google Scholar

Turner, K. and Walker, H. F. (1992), ‘Efficient high accuracy solutions with GMRES(m)’, SIAM J. Sci. Statist. Comput. 13, 815–825.Google Scholar

van Duin, A. C. N. (1996), ‘Scalable parallel preconditioning with the sparse approximate inverse of triangular matrices’, SIAM J. Matrix Anal. Appl. 20, 987–1006.Google Scholar

van den Eshof, J., Sleijpen, G. L. G. and van Gijzen, M. B. (2005), ‘Relaxation strategies for nested Krylov methods’, J. Comput. Appl. Math. 177, 347–365.Google Scholar

van der Vorst, H. (1982), ‘A vectorizable variant of some ICCG methods’, SIAM J. Sci. Statist. Comput. 3, 350–356.Google Scholar

van der Vorst, H. A. and Vuik, C. (1994), ‘GMRESR: A family of nested GMRES methods’, Numer. Linear Algebra Appl. 1, 369–386.Google Scholar

Villa, O., Fatica, M., Gawande, N. A. and Tumeo, A. (2013a), Power/performance trade-offs of small batched LU based solvers on GPUs. In 19th International Conference on Parallel Processing: Euro-Par 2013, Vol. 8097 of Lecture Notes in Computer Science, pp. 813–825.Google Scholar

Villa, O., Gawande, N. A. and Tumeo, A. (2013b), Accelerating subsurface transport simulation on heterogeneous clusters. In 2013 IEEE International Conference on Cluster Computing: CLUSTER IEEE, pp. 1–8.Google Scholar

Volkov, V. and Demmel, J. W. (2008), LU, QR and Cholesky factorizations using vector capabilities of GPUs. Technical report UCB/EECS-2008-49, University of California, Berkeley. LAPACK Working Note 202.Google Scholar

Vuik, C. (1995), ‘New insights in GMRES-like methods with variable preconditioners’, J. Comput. Appl. Math. 61, 189–204.Google Scholar

Wainwright, I. and Sweden, H. P. C. (2013), Optimized LU-decomposition with full pivot for small batched matrices. In 2013 NVIDIA GPU Tech. Conf.Google Scholar

Wilkinson, J. H. (1963), Rounding Errors in Algebraic Processes, Prentice Hall.Google Scholar

Wilkinson, J. H. (Ed.) (1988), The Algebraic Eigenvalue Problem, Oxford University Press.Google Scholar

Wolf, M., Heroux, M. and Boman, E. (2011), Factors impacting performance of multithreaded sparse triangular solve. In High Performance Computing for Computational Science: VECPAR 2010, Vol. 6449 of Lecture Notes in Computer Science, Springer, pp. 32–44.Google Scholar

Yamazaki, I., Tomov, S. and Dongarra, J. (2015), ‘Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs’, SIAM J. Sci. Comput. 37, C307–C330.Google Scholar

Yamazaki, I., Tomov, S., Dong, T. and Dongarra, J. (2014), Mixed-precision orthogonalization scheme and adaptive step size for improving the stability and performance of CA-GMRES on GPUs. In 11th International Conference on High Performance Computing for Computational Science: VECPAR 2014, Revised Selected Papers (Best Paper Award), pp. 17–30.Google Scholar

YarKhan, A., Kurzak, J. and Dongarra, J. (2011), QUARK Users’ Guide, Innovative Computing Laboratory, Electrical Engineering and Computer Science, University of Tennessee.Google Scholar

Yeralan, S. N., Davis, T. A. and Ranka, S. (2013), Sparse multifrontal QR on the GPU. Technical report, University of Florida.Google Scholar

Yeung, M.-C. and Chan, T. F. (1995), Probabilistic analysis of Gaussian elimination without pivoting. Technical report CAM95-29, Department of Mathematics, University of California, Los Angeles.Google Scholar

Ypma, T. J. (1995), ‘Historical development of the Newton–Raphson method’, SIAM Review 37, 531–551.Google Scholar

Zhang, Y., Taylor, M., Sarkar, T., Moon, H. and Yuan, M. (2008), ‘Solving large complex problems using a higher-order basis: Parallel in-core and out-of-core integral-equation solvers’, IEEE Antennas Propag. 50, 13–30.Google Scholar

Article contents

Linear algebra software for large-scale accelerated multicore computing*

Abstract

Access options

References

REFERENCES18

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests