TNL: NUMERICAL LIBRARY FOR MODERN PARALLEL ARCHITECTURES
DOI: https://doi.org/10.14311/AP.2021.61.0122

Keywords: Parallel computing, GPU, explicit schemes, semi-implicit schemes, C++ templates

Abstract
We present the Template Numerical Library (TNL, www.tnl-project.org) with native support for modern parallel architectures such as multi-core CPUs and GPUs. The library offers an abstraction layer for accessing these architectures via a unified interface tailored for the easy and fast development of high-performance algorithms and numerical solvers. The library is written in C++ and benefits from template meta-programming techniques. In this paper, we present the most important data structures and algorithms in TNL, together with their scalability on multi-core CPUs and speed-up on GPUs supporting CUDA.
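As an illustration of the template-based device abstraction described in the abstract, the following C++ sketch shows how a container parameterized by a value type and a device tag can expose one interface for both CPU and GPU back-ends. The names used here (Host, Cuda, Vector) are hypothetical stand-ins, not TNL's actual API; only a host back-end is implemented, while a CUDA back-end would specialize the same template with device memory and kernel launches.

#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical device tags mirroring the CPU/GPU abstraction described
// in the abstract; the real TNL types may be named differently.
struct Host {};
struct Cuda {};

// A container templated by value type and device tag. Only the Host
// specialization is sketched here; a Cuda specialization would allocate
// device memory and launch kernels behind the same interface.
template< typename Value, typename Device >
class Vector;

template< typename Value >
class Vector< Value, Host >
{
public:
   explicit Vector( std::size_t size ) : data( size ) {}

   void setValue( const Value& v )
   {
      // On the Host device this is a plain loop; a GPU back-end would
      // perform the same operation in a CUDA kernel.
      for( auto& x : data )
         x = v;
   }

   Value sum() const
   {
      Value s = Value{ 0 };
      for( const auto& x : data )
         s += x;
      return s;
   }

private:
   std::vector< Value > data;
};

int main()
{
   // The algorithm is written once against the unified interface;
   // switching Host to Cuda would retarget it to the GPU back-end.
   Vector< double, Host > v( 1000 );
   v.setValue( 0.5 );
   std::cout << "sum = " << v.sum() << std::endl;
}

Because the device is a compile-time template parameter, switching architectures is a one-line change in user code and incurs no run-time dispatch overhead, which is the pattern the abstraction layer relies on.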
License
Copyright (c) 2021 Tomáš Oberhuber, Jakub Klinkovský, Radek Fučík
This work is licensed under a Creative Commons Attribution 4.0 International License.
Accepted 2020-05-11
Published 2021-02-10