CLOCK MATH — A SYSTEM FOR SOLVING SLES EXACTLY

In this paper, we present a GPU-accelerated hybrid system that solves ill-conditioned systems of linear equations exactly. Exactly means without rounding errors due to using integer arithmetics. First, we scale floating-point numbers up to integers, then we solve dozens of SLEs within different modular arithmetics and then we assemble sub-solutions back using the Chinese remainder theorem. This approach effectively bypasses current CPU floating-point limitations. The system is capable of solving Hilbert’s matrix without losing a single bit of precision, and with a significant speedup compared to existing CPU solvers.


Introduction
Solving linear algebraic equations solution is quite a frequent task in numerical mathematics.One may often find difficulties, while solving problems of the illconditioned matrix.Stability of the solution cannot be ensured for large dense sets of linear equations.Rounding error during the numerical computation cannot be tolerated.Methods have been developed that minimize the influence of rounding errors on the solution.
One method that we use relies on modular arithmetics [2] in order to solve dense systems of linear equations precisely.The underlying idea sounds quite simple -bypass floating point rounding limitations by using integer arithmetics.It consists of three partsconverting floating point numbers into integers, solving multiple systems of linear equations within their modules, and finally converting sub-solutions back using the Chinese remainder theorem.
Nowadays, GPGPU (General-Purpose computing on the GPU) is a trending topic in high-performance computing.Since GPGPU began in 2006 hundreds of articles have been published every year.GPGPU is usually used to accelerate data-parallel algorithms, and that is our case.Dozens of systems of linear equations emerge during the second step of our computation.Each of them can be solved in parallel on the GPU, hence speedup is achieved.
In this paper, we present a GPU-accelerated solver of ill-conditioned systems of linear equations.Section 3 gives a brief overview of the mathematics that is used.Section 4 describes the GPU architecture in general, and presents issues that we faced while optimizing the computation.Finally, Section 5 shows our measured results -the speedup and a comparison with existing systems solving similar problem.

State of the Art
Nearly one half of the problems solved in numerical mathematics lead to problems in linear algebra.The primary objective of the numerical methods used in linear algebra is to solve sets of linear equations . . .They are solved in a chosen computer arithmetic, most often floating-point arithmetic.Floating-point arithmetic's well known advantages have led to its widespread usage, yet it also has its important disadvantages, sometimes resulting in severe problems.
The disadvantages of floating-point arithmetic include mainly the generation and accumulation of rounding errors during the calculation, and nonuniform distribution of values in the real number subset.There are a variety of alternatives to floating point representation and associated arithmetic operations, including modular, logarithmic, p-adic, and continued fraction arithmetic.
Many numerical problems lead to SLEs with dense and large matrices.The rounding error sensitivity of such systems can be so grave that they can be considered to be ill-conditioned.To solve the problems, many numerical direct and iterative methods have been developed that tackle these systems with higher or lower success.
Apart from many numerical methods, various programming and computing tools have been developed for solving rounding error sensitive SLEs.There are many computer libraries that allow the user to choose an optimal virtual length of a computer word that fully or partially eliminates the destructive influence of rounding errors.There are also arithmetic units that tackle this problem by grouping arithmetic operations, and thus they create a longer computer word for operations sensitive to rounding.Software tools that eliminate rounding errors include libraries for precision computing, such as GMP (GNU Multiple Precision Arithmetic Library).Unfortunately, these tools lead to such great time complexity when they are used to solve rounding error sensitive SLEs that they may not be applicable to practical problems.
Another limiting factor when solving SLEs with a square matrix is the inherent algorithmic time complexity O(n 3 ), where n is the dimension of the matrix.To solve large SLEs with a large number of equations, special computers and processors have been developed.These are vector computers.Until recently, the vector processors available on the market were CRAY SV1ex (2001), Fujitsu VPP5000 (1999) and VMIPS (2001).IBM Virtual Vector Architecture brings vector processing features to its most recent POWER6 mainframe processors (2007).Another suitable architecture for solving rounding error sensitive SLEs is SIMD (Single Instruction, Multiple Data).Today the SIMD features can be found as extensions of the standard processor instruction set, for example, Intel-SSE, AMD-3DNow!or Motorola-Altivec.These are used in multimedia applications such as video, 3D graphics, animation, audio, etc.The recent trend is GPGPU.Today, at least three of ten top supercomputers in the top 500 list use GPU for numerical acceleration.Both major GPU manufacturers, NVIDIA and AMD, offer the possibility to run general purpose computation on their GPUs.
At present there are no single-purpose numerical accelerators for solving rounding error sensitive SLEs.We use common NVIDIA Fermi graphics cards together with Intel Core processors to accelerate the solution.

Mathematical Background
Let us take a system of linear equations (see [1]): where A ∈ R N ×N is the matrix of a system of N equations of N unknowns, b ∈ R N is the right-side vector and x ∈ R N is the desired vector, the solution of our system of linear equations.

Matrix Scaling
Matrix scaling is the first thing to do.The goal of matrix scaling is to adjust all the floating-point numbers of the matrix to their corresponding integer versions.Basically, every matrix row is multiplied by its scaling constant (scalar multiplication).This has to be done without losing a single bit of precision.This condition will be achieved only when the scaling constant is 2 n , where n ∈ N 1 .
The question now is how to determine the scaling constant.First, we should continue by finding the smallest absolute value element (closest to zero) in each row.Then, we extract the absolute value of its exponent and multiply it by the 2 53 constant, because a significant mantissa bit size in the IEEE 754 [6] double precision floating point number format is 53 bits long.Finally, the scaling constant s is computed: where exp is a function that extracts and returns the exponent (as an integer -power of 2) and min row is the element in the row closest to zero.The approach being used is explained in greater detail (and with alternatives) in [2].

Solving a System of Linear Equations
From the previous step we have an properly scaled-up system of linear equations: where a ij matrix A elements and x i and b i vector elements are just big integers.We solve it using a multi-modulus arithmetics over the commutative ring (Z β , ⊕, ) with a base vector β that is equivalent to the single-modulus arithmetics over (Z M , +, •) and module M .
M has to be a big enough positive integer to avoid rounding errors during our computation.Hadamard's determinant D estimate of matrix A can be used to estimate the maximum M value: The highest value of M that could appear during the computation: and gcd(M, D) = 1.
We also need the following conditions for the vector β = (m 1 , m 2 , . . ., m r ) and module M to be satisfied: • m 1 , m 2 , . . ..m r are prime numbers.
The following condition for the SLE (given by Eq. ( 3)) determinant is satisfied when: then the SLE (from Eq. ( 3)) solved within (Z β , ⊕, ) due to vector β can be expressed as: or, for individual modules m i of vector β: The following expression is also valid for Eq. ( 3) within (Z β , ⊕, ): and where E is the identity matrix.To solve the SLE from Eq. ( 9) within the specific modular arithmetics we will use the Gauss-Jordan elimination algorithm with non-zero pivotization.The difference from the original Gauss-Jordan elimination is the usage of modular arithmetics for all algorithm steps.There is also pivotization simplification -we do not need to find the greatest element in the elimination step, a nonzero element is good enough.
Using the GJ elimination algorithm, let us have an matrix W of dimensions n × (n + 1) consisting of matrix A and vector b: The goal of the algorithm is to eliminate all W elements one by one to get the resulting x vector to Eq. ( 11) after ≈ n 3 elimination steps.

Inverse Transformation
After completing the two previous steps, we have a set of vectors x within each module from β = (m 1 , m 2 , . . ., m r ).They represent a sub-solution for the specific (Z mi , +, •) arithmetic.Now we advance to the inverse transformation back to (Z β , ⊕, ).The necessary condition for each β module solution x is its non-zero determinant from Eq. ( 7).
The algorithm used for this transformation is the Chinese remainder theorem.The product of our last step consists of the vector x elements in the form of fractions that contain the solution of SLE (3).

GPU and Optimizations
GPGPU is the area of our research, so we optimize on the PC graphical hardware.There were several platforms to choose from: • NVIDIA CUDA, the mainstream platform today, was rejected; it is proprietary, and its usage is limited to NVIDIA hardware;

Profiling
The whole SLE solving process is a rather timeconsuming task (Fig. 1), so optimizations had to be made.First, all the big number ALU operations -Eq.( 5) and Section 3.3 -are performed using the GMP library, which is tuned for this platform.Solving dozens of systems of linear equations using modular arithmetics (Section 3.2) within the β vector is the most computation-intensive part of our task, see Fig. 1.
According to Ahmdal's Law (expressed in Eq. ( 13), the benefit (overall speedup) from optimizing part of the computation is equal to: where P is the proportion of the computation we are optimizing and S is its speedup.Because modular arithmetics SLEs take a great amount of time to solve, we optimize the matrix elimination.

GPU architecture
All modern common GPUs have a similar architecture (from our optimization type point of view).They consist of a global memory (1-4 GB) and several (2-24) SMP 1 processors (example in Fig. 2).SMP features a scheduler of lightweight threads that have zero overhead.Each SMP contains dozens (8-48) of processor cores that share the SMP memory (shared memory ≈ 64 kB) and execute the same code.If it happens that one processor core branches differently, the performance goes down dramatically.Each processor core also has its own integer and floating-point unit.More details with examples and a description of the CUDA platform are given in [5].
The processor cores have access to the GPU global RAM, but the memory access has to be aligned.Otherwise performance will be affected.We used SMP processor cores to load part of the matrix (the row Figure 2. NVIDIA Fermi SMP [5]. that contains the current pivot).We used it for computation and then stored the elimination step results back to global memory.

GPU Kernels, Optimizations
The OpenCL framework defines the term kernel.It is an function written in slightly modified C99 language which is to be executed on an OpenCL capable device.
In parallel, OpenCL runtime executes the same kernel on every processor core within an SMP.In the example of the SAXPY function, we are able to process (8-48) vector elements at once.However, our case is not so simple.
Our system is currently limited to the matrix size of 4096 × 4096.The row with the current pivot is used in n multiply-add-modulo vector operations.Hence, it is quite useful performance-wise to cache it in the SMP shared memory.Current generation NVIDIA GPUs have shared memory size of 48 kB.One matrix row fits the local memory easily (4096 * 8 = 32758 B).The processor cores a matrix row from global memory and compute the inverse of the pivot element.Then the processor cores multiply the cached row and multiply-add-modulo other rows in global memory.More details, including the processor core synchronizations and source code samples, are available online2 .

Results
After embedding architecture-related optimizations into our Clock Math solver and verifying the correctness of our results, we were finally able to benchmark them.There are not many up-to-date solvers like ours currently.We benchmarked primarily against linSolve-0.7.15 [7], a highly optimized solver with performance critical parts written in x86 assembler.The results (Tab. 1) are very satisfying.Clock Math solver outperforms linSolve-0.7.15 almost twice for larger matrices.

Conclusion
We have presented the working system for solving ill-conditioned systems of linear equations exactly.We have managed to effectively bypass floating-point rounding error by using modular arithmetics.Most of the entire computation has taken place by solving SLEs within their corresponding modules.This part of the computation can be (and is) solved in parallel on the common-grade GPU.GPU is capable of solving several SLEs at once (8-16, depending on its memory size) and also each system in parallel on its SPE (8-40 cores).Double parallelism has been utilized, so significant speedup is achieved compared to other implementations.We are currently working on advanced kernel optimizations with larger matrix support.

Future Work
As we now have a working system, we would like to proceed to: • add support for both single and double precision for matrix elimination, as the β vector module count and GPU performance may differ; • add support for larger matrices than 4096 × 4096optimize SMP shared memory usage; • add support for automatic kernel group size tuning for larger matrices, as the group size lowers the memory access time on different GPUs/architectures; • examine the modulus operation performance across different GPU architectures and further opti-mize the SAXPY, respectively DAXPY functions (mod m); • utilize OpenMPI library to add cluster support; • adjust and run the solver on our university STAR cluster to test AMD's OpenCL CPU implementation.

Table 1 .
Measured results: comparison of execution times for Clock Math Solver, for linSolve-0.7.15 and corresponding speedups.