TNL: NUMERICAL LIBRARY FOR MODERN PARALLEL ARCHITECTURES
DOI: https://doi.org/10.14311/AP.2021.61.0122

Keywords: Parallel computing, GPU, explicit schemes, semi-implicit schemes, C++ templates

Abstract
We present the Template Numerical Library (TNL, www.tnl-project.org) with native support for modern parallel architectures such as multi-core CPUs and GPUs. The library offers an abstraction layer for accessing these architectures via a unified interface tailored for the easy and fast development of high-performance algorithms and numerical solvers. The library is written in C++ and benefits from template meta-programming techniques. In this paper, we present the most important data structures and algorithms in TNL, together with their scalability on multi-core CPUs and speed-up on GPUs supporting CUDA.
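As an illustration of the template-based device abstraction described in the abstract, the following C++ sketch shows how a container parameterized by a value type and a device tag can expose one interface for both CPU and GPU back-ends. The names used here (Host, Cuda, Vector) are hypothetical stand-ins, not TNL's actual API; only a host back-end is implemented, while a CUDA back-end would specialize the same template with device memory and kernel launches.

#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical device tags mirroring the CPU/GPU abstraction described
// in the abstract; the real TNL types may be named differently.
struct Host {};
struct Cuda {};

// A container templated by value type and device tag. Only the Host
// specialization is sketched here; a Cuda specialization would allocate
// device memory and launch kernels behind the same interface.
template< typename Value, typename Device >
class Vector;

template< typename Value >
class Vector< Value, Host >
{
public:
   explicit Vector( std::size_t size ) : data( size ) {}

   void setValue( const Value& v )
   {
      // On the Host device this is a plain loop; a GPU back-end would
      // perform the same operation in a CUDA kernel.
      for( auto& x : data )
         x = v;
   }

   Value sum() const
   {
      Value s = Value{ 0 };
      for( const auto& x : data )
         s += x;
      return s;
   }

private:
   std::vector< Value > data;
};

int main()
{
   // The algorithm is written once against the unified interface;
   // switching Host to Cuda would retarget it to the GPU back-end.
   Vector< double, Host > v( 1000 );
   v.setValue( 0.5 );
   std::cout << "sum = " << v.sum() << std::endl;
}

Because the device is a compile-time template parameter, switching architectures is a one-line change in user code and incurs no run-time dispatch overhead, which is the pattern the abstraction layer relies on.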
License
Copyright (c) 2021 Tomáš Oberhuber, Jakub Klinkovský, Radek Fučík
This work is licensed under a Creative Commons Attribution 4.0 International License.
Accepted 2020-05-11
Published 2021-02-10