TNL: NUMERICAL LIBRARY FOR MODERN PARALLEL ARCHITECTURES
Keywords: Parallel computing, GPU, explicit schemes, semi-implicit schemes, C++ templates
We present the Template Numerical Library (TNL, www.tnl-project.org), which natively supports modern parallel architectures such as multi-core CPUs and GPUs. The library provides an abstraction layer that exposes these architectures through a unified interface designed for easy and fast development of high-performance algorithms and numerical solvers. TNL is written in C++ and relies on template meta-programming techniques. In this paper, we present the most important data structures and algorithms in TNL, together with their scalability on multi-core CPUs and their speed-up on CUDA-capable GPUs.
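To make the idea of a unified, template-based interface concrete, the following is a minimal sketch of how one algorithm can be written once and dispatched to different back ends through a device template parameter. All names here (`Sequential`, `Host`, `ParallelFor`, `axpy`) are hypothetical and chosen for illustration only; they are not taken from TNL, and a GPU back end is omitted because it would require CUDA. The `Host` back end uses OpenMP only when the compiler enables it and otherwise falls back to a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical device tags (illustrative names, not TNL's API).
struct Sequential {};
struct Host {};

// One loop interface, specialized per device tag at compile time.
template <typename Device>
struct ParallelFor;

template <>
struct ParallelFor<Sequential> {
    template <typename F>
    static void exec(std::size_t begin, std::size_t end, F f) {
        assert(begin <= end);
        for (std::size_t i = begin; i < end; ++i) f(i);
    }
};

template <>
struct ParallelFor<Host> {
    template <typename F>
    static void exec(std::size_t begin, std::size_t end, F f) {
        assert(begin <= end);
        // A signed index keeps the loop valid for older OpenMP versions.
        const std::ptrdiff_t b = static_cast<std::ptrdiff_t>(begin);
        const std::ptrdiff_t e = static_cast<std::ptrdiff_t>(end);
#ifdef _OPENMP
#pragma omp parallel for
#endif
        for (std::ptrdiff_t i = b; i < e; ++i) f(static_cast<std::size_t>(i));
    }
};

// The numerical kernel is written once; the back end is a template argument.
// Assumes x and y have the same length.
template <typename Device>
std::vector<double> axpy(double a,
                         const std::vector<double>& x,
                         const std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> z(x.size());
    // Each iteration writes a distinct element, so iterations are independent.
    ParallelFor<Device>::exec(0, x.size(),
                              [&](std::size_t i) { z[i] = a * x[i] + y[i]; });
    return z;
}
```

Under these assumptions, calling `axpy<Sequential>(2.0, x, y)` and `axpy<Host>(2.0, x, y)` runs the same kernel on either back end, and the choice is resolved at compile time with no run-time dispatch cost. A GPU back end could in principle be added as a further specialization without changing `axpy`.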
Copyright (c) 2021 Tomáš Oberhuber, Jakub Klinkovský, Radek Fučík
This work is licensed under a Creative Commons Attribution 4.0 International License.