RELEASE NOTES
A neural network library in C++ aimed at providing a simple, modularized framework for deep learning that is accelerated for heterogeneous architectures. MagmaDNN's releases are located at https://bitbucket.org/icl/magmadnn.
Latest Version 1.6.0
As of version 1.0, MagmaDNN offers a strong tensor core with standard machine learning and DNN functionality built around it.
Version 1.6.0 adds support for cuDNN LSTM and attention operations.
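As a hedged illustration of that tensor core, the sketch below builds and evaluates a tiny compute graph. The names used here (magmadnn_init, Tensor, op::var, op::matmul, op::sigmoid) follow the 1.x examples as we understand them; exact signatures may differ between releases, so treat the examples/ folder as authoritative.

    // Minimal sketch of the MagmaDNN 1.x tensor/operation core.
    // Names and signatures are assumed from the shipped examples;
    // verify against examples/ for the release you have installed.
    #include "magmadnn.h"

    using namespace magmadnn;

    int main() {
        magmadnn_init();  // initialize the library (and CUDA/cuDNN in GPU builds)

        // A tensor takes a shape, a filler, and a memory type (DEVICE = GPU).
        Tensor<float> x({64, 784}, {UNIFORM, {-1.0f, 1.0f}}, DEVICE);
        Tensor<float> w({784, 10}, {UNIFORM, {-0.1f, 0.1f}}, DEVICE);

        // Operations compose lazily into a compute graph...
        auto *vx = op::var<float>("x", &x);
        auto *vw = op::var<float>("w", &w);
        auto *y  = op::sigmoid(op::matmul(vx, vw));

        // ...which is only executed on evaluation.
        Tensor<float> *out = y->eval();
        (void) out;

        magmadnn_finalize();
        return 0;
    }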
MagmaDNN is optimized for heterogeneous architectures (multi-core CPU and GPU), so it is best used with a modern NVIDIA GPU. However, MagmaDNN also supports a CPU-only install; this is mainly intended for testing and is not nearly as optimized as the GPU version.
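To sketch what that difference looks like in code: tensors are placed into a memory space when constructed, so the same program can fall back to host memory on a CPU-only install. The memory_t values below (HOST, DEVICE) are assumed from the 1.x headers; DEVICE is only meaningful in a GPU build.

    // Sketch: choosing host vs. GPU memory for a tensor. In a CPU-only
    // install only HOST is usable; DEVICE requires the GPU build.
    // (memory_t names assumed from the MagmaDNN 1.x headers.)
    #include "magmadnn.h"

    using namespace magmadnn;

    Tensor<float> *make_weights(bool use_gpu) {
        memory_t mem = use_gpu ? DEVICE : HOST;
        // 256x128 weight matrix, zero-filled, in the chosen memory space.
        return new Tensor<float>({256, 128}, {CONSTANT, {0.0f}}, mem);
    }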
The documentation can be found on the docs site. For the most recent version of the documentation, see the build & install tutorial on how to build the docs from source. The todo page contains information on the future of the package, and the troubleshooting page walks through common issues and their solutions.
There are several tutorials in docs/tutorials. These give an introduction to installing and using the library.
For examples of what MagmaDNN code looks like, see the examples/ folder. Once MagmaDNN is downloaded and installed, the examples can be built and run with make examples.
Current contributors: Stephen Qiu, Julian Halloy, Pierluigi Cambie-Fabris, Kwai Wong, Stanimire Tomov
Previous contributors: Daniel Nichols, Florent Lopez, Sedrick Keh, Rocco Febbo, Jerry Ho, Joshua Zingale, Edward Karak, Spencer Smith, Nicole Chow
PUBLICATIONS

Please use any of these publications to reference MAGMA.

| “Accelerating Homotopy Continuation with GPUs: Application to Trifocal Pose Estimation,” 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Milano, Italy, IEEE, July 2025. DOI: 10.1109/IPDPS64566.2025.00110 |
| “MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures,” The International Journal of High Performance Computing Applications, June 2024. DOI: 10.1177/10943420241261960 |
| “GPU-based LU Factorization and Solve on Batches of Matrices with Band Structure,” SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, Denver, CO, ACM, November 2023. DOI: 10.1145/3624062.3624247 |
| “PAQR: Pivoting Avoiding QR factorization,” 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), St. Petersburg, FL, USA, IEEE, 2023. DOI: 10.1109/IPDPS54959.2023.00040 |
| “A Python Library for Matrix Algebra on GPU and Multicore Architectures,” 2022 IEEE 19th International Conference on Mobile Ad Hoc and Smart Systems (MASS), Denver, CO, IEEE, December 2022. DOI: 10.1109/MASS56207.2022.00121 |
| “Extending MAGMA Portability with OneAPI,” The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), Ninth Workshop on Accelerator Programming Using Directives (WACCPD 2022), Dallas, TX, November 2022. |
| Extending MAGMA Portability with OneAPI, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), ACM Student Research Competition, Dallas, TX, November 2022. |
| “Mixed-Precision Algorithm for Finding Selected Eigenvalues and Eigenvectors of Symmetric and Hermitian Matrices,” ICL Technical Report, no. ICL-UT-21-05, August 2021. |
| “Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems,” IEEE Access, 2021. DOI: 10.1109/ACCESS.2021.3106054 |
| “Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations,” 2020 IEEE/ACM Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS): IEEE, November 2020. |
| “High-Order Finite Element Method using Standard and Device-Level Batch GEMM on GPUs,” 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA): IEEE, November 2020. |
| “MAGMA Templates for Scalable Linear Algebra on Emerging Architectures,” The International Journal of High Performance Computing Applications, vol. 34, issue 6, pp. 645-658, November 2020. DOI: 10.1177/1094342020938421 |
| “Matrix Multiplication on Batches of Small Matrices in Half and Half-Complex Precisions,” Journal of Parallel and Distributed Computing, vol. 145, pp. 188-201, November 2020. DOI: 10.1016/j.jpdc.2020.07.001 |
| “Mixed-Precision Iterative Refinement using Tensor Cores on GPUs to Accelerate Solution of Linear Systems,” Proceedings of the Royal Society A, vol. 476, issue 2243, November 2020. DOI: 10.1098/rspa.2020.0110 |
| “Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,” 2020 IEEE High Performance Extreme Computing Virtual Conference: IEEE, September 2020. |
| “Mixed Precision LU Factorization on GPU Tensor Cores: Reducing Data Movement and Memory Footprint,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-13: University of Tennessee, September 2020. |
| “Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-12: University of Tennessee, August 2020. |
| “Ginkgo: A High Performance Numerical Linear Algebra Library,” Journal of Open Source Software, vol. 5, issue 52, August 2020. DOI: 10.21105/joss.02260 |
| “Integrating Deep Learning in Domain Sciences at Exascale,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-10: University of Tennessee, August 2020. |
| “Integrating Deep Learning in Domain Sciences at Exascale,” 2020 Smoky Mountains Computational Sciences and Engineering Conference (SMC 2020), August 2020. |
| “Multiprecision Block-Jacobi for Iterative Triangular Solves,” European Conference on Parallel Processing (Euro-Par 2020): Springer, August 2020. DOI: 10.1007/978-3-030-57675-2_34 |
| “Translational Process: Mathematical Software Perspective,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-11, August 2020. |
| hipMAGMA v2.0: Zenodo, July 2020. DOI: 10.5281/zenodo.3928667 |
| “heFFTe: Highly Efficient FFT for Exascale,” International Conference on Computational Science (ICCS 2020), Amsterdam, Netherlands, June 2020. DOI: 10.1007/978-3-030-50371-0_19 |
| “Sparse Linear Algebra on AMD and NVIDIA GPUs – The Race is On,” ISC High Performance: Springer, June 2020. DOI: 10.1007/978-3-030-50743-5_16 |
| “Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures,” Workshop on Scalable Deep Learning over Parallel And Distributed Infrastructures (ScaDL 2020), May 2020. |
| “CEED ECP Milestone Report: Improve Performance and Capabilities of CEED-Enabled ECP Applications on Summit/Sierra,” ECP Milestone Reports: Zenodo, May 2020. DOI: 10.5281/zenodo.3860804 |
| “Reducing the Amount of out-of-core Data Access for GPU-Accelerated Randomized SVD,” Concurrency and Computation: Practice and Experience, April 2020. DOI: 10.1002/cpe.5754 |
| “Asynchronous SGD for DNN Training on Shared-Memory Parallel Architectures,” Innovative Computing Laboratory Technical Report, no. ICL-UT-20-04: University of Tennessee, Knoxville, March 2020. |
| hipMAGMA v1.0: Zenodo, March 2020. DOI: 10.5281/zenodo.3908549 |
| “Load-Balancing Sparse Matrix Vector Product Kernels on GPUs,” ACM Transactions on Parallel Computing, vol. 7, issue 1, March 2020. DOI: 10.1145/3380930 |
| MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines, 2020 NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) Principal Investigator Meeting, Seattle, WA, February 2020. |
| “FFT-ECP API and High-Performance Library Prototype for 2-D and 3-D FFTs on Large-Scale Heterogeneous Systems with GPUs,” ECP Milestone Report, no. FFT-ECP STML13-27: Innovative Computing Laboratory, University of Tennessee, January 2020. |
| “Project-Based Research and Training in High Performance Data Sciences, Data Analytics, and Machine Learning,” The Journal of Computational Science Education, vol. 11, issue 1, pp. 36-44, January 2020. DOI: 10.22369/issn.2153-4136/11/1/7 |
| “Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation,” Workshop on Exascale MPI (ExaMPI) at SC19, Denver, CO, November 2019. |
| “Towards Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs,” ScalA19: 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, IEEE, November 2019. |
| CEED ECP Milestone Report: Performance Tuning of CEED Software and 1st and 2nd Wave Apps: Zenodo, October 2019. DOI: 10.5281/zenodo.3477618 |
| “FFT-ECP Implementation Optimizations and Features Phase,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-12: University of Tennessee, October 2019. |
| “GPUDirect MPI Communications and Optimizations to Accelerate FFTs on Exascale Systems,” EuroMPI'19 Posters, Zurich, Switzerland, no. ICL-UT-19-06: ICL, September 2019. |
| “Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators,” IEEE High Performance Extreme Computing Conference (HPEC 2019), Best Paper Finalist, Waltham, MA, IEEE, September 2019. |
| “Progressive Optimization of Batched LU Factorization on GPUs,” IEEE High Performance Extreme Computing Conference (HPEC'19), Waltham, MA, IEEE, September 2019. |
| “Distributed-Memory Lattice H-Matrix Factorization,” The International Journal of High Performance Computing Applications, vol. 33, issue 5, pp. 1046-1063, August 2019. DOI: 10.1177/1094342019861139 |
| “MagmaDNN: Accelerated Deep Learning Using MAGMA,” Practice and Experience in Advanced Research Computing (PEARC '19), Chicago, IL, ACM, July 2019. |
| “OpenDIEL: A Parallel Workflow Engine and Data Analytics Framework,” Practice and Experience in Advanced Research Computing (PEARC '19), Chicago, IL, ACM, July 2019. |
| “Hands-on Research and Training in High-Performance Data Sciences, Data Analytics, and Machine Learning for Emerging Environments,” ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019. |
| “MagmaDNN: Towards High-Performance Data Analytics and Machine Learning for Data-Driven Scientific Computing,” ISC High Performance, Frankfurt, Germany, Springer International Publishing, June 2019. DOI: 10.1007/978-3-030-34356-9_37 |
| “Fast Batched Matrix Multiplication for Small Sizes using Half Precision Arithmetic on GPUs,” 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, IEEE, May 2019. |
| CEED ECP Milestone Report: Public release of CEED 2.0: Zenodo, April 2019. DOI: 10.5281/zenodo.2641316 |
| “Design and Implementation for FFT-ECP on Distributed Accelerated Systems,” Innovative Computing Laboratory Technical Report, no. ICL-UT-19-05: University of Tennessee, April 2019. |
| Optimizing Batch HGEMM on Small Sizes Using Tensor Cores, GPU Technology Conference (GTC), San Jose, CA, March 2019. |
| “Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,” Parallel Computing, vol. 81, pp. 1-21, January 2019. DOI: 10.1016/j.parco.2018.10.003 |
| FFT-ECP Fast Fourier Transform, 2019 ECP Annual Meeting (Research Poster), Houston, TX, January 2019. |
| “Variable-Size Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioning on Graphics Processors,” Parallel Computing, vol. 81, pp. 131-146, January 2019. DOI: 10.1016/j.parco.2017.12.006 |
| “Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 12, pp. 2700-2712, December 2018. DOI: 10.1109/TPDS.2018.2842785 |
| Accelerating 2D FFT: Exploit GPU Tensor Cores through Mixed-Precision, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), ACM Student Research Poster, Dallas, TX, November 2018. |
| MATEDOR: MAtrix, TEnsor, and Deep-learning Optimized Routines, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC18), Research Poster, Dallas, TX, November 2018. |
| “The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale,” SIAM Review, vol. 60, issue 4, pp. 808-865, November 2018. DOI: 10.1137/17M1117732 |
| “Evaluation and Design of FFT for Distributed Accelerated Systems,” ECP WBS 2.3.3.09 Milestone Report, no. FFT-ECP ST-MS-10-1216: Innovative Computing Laboratory, University of Tennessee, October 2018. |
| “Task Based Cholesky Decomposition on Xeon Phi Architectures using OpenMP,” International Journal of Computational Science and Engineering (IJCSE), vol. 17, no. 3, October 2018. DOI: 10.1504/IJCSE.2018.095851 |
| “Algorithms and Optimization Techniques for High-Performance Matrix-Matrix Multiplications of Very Small Matrices,” Innovative Computing Laboratory Technical Report, no. ICL-UT-18-09: Innovative Computing Laboratory, University of Tennessee, September 2018. |
| “Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization,” IEEE High Performance Extreme Computing Conference (HPEC'18), Waltham, MA, IEEE, September 2018. |
| “Variable-Size Batched Condition Number Calculation on GPUs,” SBAC-PAD, Lyon, France, September 2018. |
| Using GPU FP16 Tensor Cores Arithmetic to Accelerate Mixed-Precision Iterative Refinement Solvers and Reduce Energy Consumption, ISC High Performance (ISC18), Best Poster Award, Frankfurt, Germany, June 2018. |
| “A Guide for Achieving High Performance with Very Small Matrices on GPUs: A Case Study of Batched LU and Cholesky Factorizations,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, issue 5, pp. 973-984, May 2018. DOI: 10.1109/TPDS.2017.2783929 |
| “Accelerating the SVD Bi-Diagonalization of a Batch of Small Matrices using GPUs,” Journal of Computational Science, vol. 26, pp. 237-245, May 2018. DOI: 10.1016/j.jocs.2018.01.007 |
| “Accelerating the SVD Two Stage Bidiagonal Reduction and Divide and Conquer Using GPUs,” Parallel Computing, vol. 74, pp. 3-18, May 2018. DOI: 10.1016/j.parco.2017.10.004 |
| “Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters,” IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, IEEE, May 2018. |
| “Batched One-Sided Factorizations of Tiny Matrices Using GPUs: Challenges and Countermeasures,” Journal of Computational Science, vol. 26, pp. 226-236, May 2018. DOI: 10.1016/j.jocs.2018.01.005 |
| MAtrix, TEnsor, and Deep-learning Optimized Routines (MATEDOR), NSF PI Meeting, Poster, Washington, DC, April 2018. DOI: 10.6084/m9.figshare.6174143.v3 |
| Harnessing GPU's Tensor Cores Fast FP16 Arithmetic to Speedup Mixed-Precision Iterative Refinement Solvers and Achieve 74 Gflops/Watt on Nvidia V100, GPU Technology Conference (GTC), Poster, San Jose, CA, March 2018. |
| “Optimization and Performance Evaluation of the IDR Iterative Krylov Solver on GPUs,” The International Journal of High Performance Computing Applications, vol. 32, no. 2, pp. 220-230, March 2018. DOI: 10.1177/1094342016646844 |
| Tensor Contractions using Optimized Batch GEMM Routines, GPU Technology Conference (GTC), Poster, San Jose, CA, March 2018. |
| “A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs,” SBAC-PAD, Lyon, France, IEEE, 2018. |
| “POMPEI: Programming with OpenMP4 for Exascale Investigations,” Innovative Computing Laboratory Technical Report, no. ICL-UT-17-09: University of Tennessee, December 2017. |
| “Sampling Algorithms to Update Truncated SVD,” IEEE International Conference on Big Data, Boston, MA, IEEE, December 2017. |
| “Flexible Batched Sparse Matrix-Vector Product on GPUs,” 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA '17), Denver, CO, ACM Press, November 2017. DOI: 10.1145/3148226.3148230 |
| “Out of Memory SVD Solver for Big Data,” 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Waltham, MA, IEEE, September 2017. |
| “Power-aware Computing: Measurement, Control, and Performance Analysis for Intel Xeon Phi,” 2017 IEEE High Performance Extreme Computing Conference (HPEC'17), Best Paper Finalist, Waltham, MA, IEEE, September 2017. DOI: 10.1109/HPEC.2017.8091085 |
| “Towards Numerical Benchmark for Half-Precision Floating Point Arithmetic,” 2017 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, IEEE, September 2017. DOI: 10.1109/HPEC.2017.8091031 |
| “A Framework for Out of Memory SVD Algorithms,” ISC High Performance 2017, pp. 158-178, June 2017. DOI: 10.1007/978-3-319-58667-0_9 |
| “Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures,” Procedia Computer Science, vol. 108, pp. 606-615, June 2017. DOI: 10.1016/j.procs.2017.05.250 |
| “Novel HPC Techniques to Batch Execution of Many Variable Size BLAS Computations on GPUs,” International Conference on Supercomputing (ICS '17), Chicago, Illinois, ACM, June 2017. DOI: 10.1145/3079079.3079103 |
| “Optimizing the SVD Bidiagonalization Process for a Batch of Small Matrices,” International Conference on Computational Science (ICCS 2017), Zurich, Switzerland, Procedia Computer Science, June 2017. DOI: 10.1016/j.procs.2017.05.237 |
| “Fast Cholesky Factorization on GPUs for Batch and Native Modes in MAGMA,” Journal of Computational Science, vol. 20, pp. 85-93, May 2017. DOI: 10.1016/j.jocs.2016.12.009 |
| “Structure-aware Linear Solver for Realtime Convex Optimization for Embedded Systems,” IEEE Embedded Systems Letters, vol. 9, issue 3, pp. 61-64, May 2017. DOI: 10.1109/LES.2017.2700401 |
| “With Extreme Computing, the Rules Have Changed,” Computing in Science & Engineering, vol. 19, issue 3, pp. 52-62, May 2017. DOI: 10.1109/MCSE.2017.48 |
| “Small Tensor Operations on Advanced Architectures for High-Order Applications,” University of Tennessee Computer Science Technical Report, no. UT-EECS-17-749: Innovative Computing Laboratory, University of Tennessee, April 2017. |
| “High-performance Cholesky Factorization for GPU-only Execution,” Proceedings of the General Purpose GPUs (GPGPU-10), Austin, TX, ACM, February 2017. DOI: 10.1145/3038228.3038237 |
| “Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers,” ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Denver, CO, ACM, November 2017. |
| “Non-GPU-resident Dense Symmetric Indefinite Factorization,” Concurrency and Computation: Practice and Experience, November 2016. DOI: 10.1002/cpe.4012 |
| “Towards Achieving Performance Portability Using Directives for Accelerators,” The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Third Workshop on Accelerator Programming Using Directives (WACCPD), Salt Lake City, Utah, Innovative Computing Laboratory, University of Tennessee, November 2016. |
| “Stability and Performance of Various Singular Value QR Implementations on Multicore CPU with a GPU,” ACM Transactions on Mathematical Software (TOMS), vol. 43, issue 2, October 2016. |
| Accelerating Tensor Contractions for High-Order FEM on CPUs, GPUs, and KNLs, Smoky Mountains Computational Sciences and Engineering Conference (SMC16), Poster, Gatlinburg, TN, September 2016. |
| “LU, QR, and Cholesky Factorizations: Programming Model, Performance Analysis and Optimization Techniques for the Intel Knights Landing Xeon Phi,” IEEE High Performance Extreme Computing Conference (HPEC'16), Waltham, MA, IEEE, September 2016. |
| “Performance Analysis and Acceleration of Explicit Integration for Large Kinetic Networks using Batched GPU Computations,” 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), Waltham, MA, IEEE, September 2016. |
| “High-performance Matrix-matrix Multiplications of Very Small Matrices,” 22nd International European Conference on Parallel and Distributed Computing (Euro-Par'16), Grenoble, France, Springer International Publishing, August 2016. |
| “MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs,” Innovative Computing Laboratory Technical Report, no. ICL-UT-16-02: University of Tennessee, August 2016. |
| “High-Performance Tensor Contractions for GPUs,” International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016. |
| “Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs,” International Conference on Computational Science (ICCS'16), San Diego, CA, June 2016. |
| “Performance, Design, and Autotuning of Batched GEMM for GPUs,” The International Supercomputing Conference (ISC High Performance 2016), Frankfurt, Germany, June 2016. |
| “Efficiency of General Krylov Methods on GPUs – An Experimental Study,” The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), Chicago, IL, IEEE, May 2016. DOI: 10.1109/IPDPSW.2016.45 |
| “Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures,” 30th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 2016. |
| “Heterogeneous Streaming,” The Sixth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), IPDPS 2016, Chicago, IL, IEEE, May 2016. |
| “Linear Algebra Software for Large-Scale Accelerated Multicore Computing,” Acta Numerica, vol. 25, pp. 1-160, May 2016. DOI: 10.1017/S0962492916000015 |
| “On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures,” The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016), IPDPS 2016, Chicago, IL, IEEE, May 2016. |
| A Standard for Batched BLAS Routines, 17th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP16), Paris, France, April 2016. |