FPU-Supported Running Error Analysis

A-posteriori forward rounding error analyses tend to give sharper error estimates than a-priori ones, as they use actual data quantities. One such a-posteriori analysis, running error analysis, uses expressions consisting of two parts: one generates the error and the other propagates input errors to the output. This paper suggests replacing the error generating term with an FPU-extracted rounding error estimate, which produces a sharper error bound.


Introduction
Rounding error analyses are used to discover the stability of numerical algorithms and to provide information on whether the computed result is valid. They are typically performed in the forward or backward direction, and either a-priori or a-posteriori. Backward error analysis [6] treats all operations as if they were exact but performed on perturbed data, and asks for which input data we have solved our problem. Backward error analysis is preferred, as each algorithm that is backward stable is automatically numerically stable [5], while this does not hold for forward error analyses. Forward error analyses, on the other hand, treat operations as sources of errors and tell us how far the computed solution can lie from the exact one. Since error analyses performed a-priori can often be difficult if the algorithm being analyzed is complex, a-posteriori analyses may provide a more feasible alternative. (MSC 2000 Classification: 65G20, 65G50)
An objective of an a-priori analysis is to find the worst case bound for an algorithm, if it exists, while a-posteriori analyses calculate the error bound concurrently with the evaluation of the result and work with actual data quantities, providing a tighter error bound.
This paper presents a replacement for the error generating term in error bounds calculated with forward a-posteriori error estimates (also known as running error analysis), obtaining sharper error bounds by exploiting the behavior of the computer's FPU. We expect the calculation to be portable (conforming strictly to [3]) and assume an Intel x86 compatible FPU which supports double extended precision (80-bit) floating point registers, which are essential for this method.

Running error analysis and error bounds
Running error analysis [6,7] is a type of a-posteriori forward error analysis. The algorithm being analyzed is extended to calculate a partial error bound alongside the normal calculation. As the algorithm proceeds, the bounds are accumulated, yielding a total error bound estimate. From here on, binary double precision floating point arithmetic with a guard digit and the round to nearest rounding mode is assumed.
Definition 1 Let F_t ⊂ Q be a binary floating point set with a precision of t, where Q denotes the set of rational numbers.
Definition 2 Let y = ±m · 2^(e−t) ∈ F_t be a binary floating point number, where m stands for the mantissa, e for the exponent, and t for the precision. Let ex(y) = e.
Lemma 3 Let x̂ = fl(x) be the floating point representation of an exact number x, obtained by rounding x to the nearest element of F_t. The rounding process can be described as fl(x) = x(1 + δ), |δ| ≤ u_t, where u_t = 2^(−t) is the unit roundoff in precision t. To conserve some space, and for clarity, u without a subscript means u_53.
Assumption 4 (Standard model) We assume that the operations ♦ ∈ {+, −, ·, /} and the square root follow the Standard model [5] and are evaluated with a relative error no greater than u: ŝ = fl(x ♦ y) = (x ♦ y)(1 + δ), |δ| ≤ u, where ŝ denotes the computed result of an operation.
The Standard model will also be used in the form ŝ = (x ♦ y)/(1 + δ), |δ| ≤ u, where useful.
Table 1 summarizes the error bound estimates for basic operations and the square root, with products of errors neglected. We also assume that the data carry absolute (static) errors bounded as |x̂ − x| ≤ α, |ŷ − y| ≤ β, and |ŝ − s| ≤ σ, where α, β, and σ stand for the absolute (static) errors. The error generating term u|ŝ| used in Tab. 1 has two shortcomings:
1. It can overestimate the committed error by up to a factor of two: the error actually committed by rounding to nearest never exceeds half an ulp of ŝ, while u|ŝ| = u · 2^ex(ŝ) Σ_i ŝ_i 2^(−i) can approach 2u · 2^(ex(ŝ)−1), i.e. a whole ulp (here ŝ_i refers to the i-th bit of ŝ).
2. The error generation term always generates an error, even in the case when no physical error was committed.
The expression u|ŝ| therefore tends to give a higher error bound estimate than we would need. However, we can revise this expression if we are interested in obtaining a tighter error bound estimate, and we can do so by exploiting the FPU.

Analysis
Many computers use a processor with the Intel x86 architecture [4], which features an FPU with 8 double extended precision floating point registers. Once a number is loaded into the FPU, regardless of its precision, it is automatically converted into the double extended precision format. Further calculations are performed in this format but rounded to the precision specified in the floating point control word register (FCW). The default FCW settings specify double extended precision, the round to nearest rounding mode, and mask out all floating point exceptions. If necessary, the FCW can be read with the fstcw instruction and changed with the fldcw instruction. Once the result is stored back into a memory location, it is rounded to the precision determined by the type of store instruction.

Obtaining a tighter bound from the FPU
When rounding to the nearest element of F_t, the relative error is no worse than u_t, and we can extract the static error from the FPU by subtracting the computed value in double extended precision from its rounded value. The idea resembles compensated summation [5], which, in contrast to our approach, is implemented entirely in software. Subtracting the two values provides an 11-bit estimate of the real rounding error, which is no greater than u. The entire idea can be written as:
Theorem 5 Let Ŝ be the result of an operation performed in double extended precision and let ŝ be the result in double precision, obtained by rounding Ŝ to 53 bits with round to nearest as defined by the IEEE 754 Standard. The absolute rounding error of the operation can then be calculated as Δ = |ŝ − Ŝ|, where |Δ| ≤ u|ŝ|.
PROOF. We can extract the 64-bit mantissa intermediate result (Ŝ) from the FPU before it is rounded to 53 bits (ŝ) and calculate Δ = |ŝ − Ŝ|. Fig. 1 shows the principle. In all cases, the relative error can be computed in the same way as |ŝ − Ŝ|/|ŝ| ≤ u and is always bounded by u. □
Note 1 Theorem 5 is proved for positive numbers, but it can be proved for negative numbers analogously.

Corollary 6
The expression Δ = |ŝ − Ŝ| provides an 11-bit estimate of the static rounding error and, since |Δ| ≤ u|ŝ|, it can safely be substituted for all occurrences of u|ŝ| in Tab. 1, providing a tighter error bound.

Implementation highlights
Our approach is implemented in C++ as a class fdouble with GCC AT&T-style inline assembly FPU statements [9]. The fdouble class overloads most operators and some standard functions, which are beyond the scope of this paper, providing a handy replacement for the double data type. The transition from the double data type to fdouble is performed just by changing the name of the data type; the rest is handled by the class. For an illustration of how the class works, the source of the plus operator fdouble::operator+ and its assembly portion op_plus, which performs the FPU-supported running error analysis with the terms above, follows (only the operand constraint portion of the listing is reproduced here):

    "m"(b),      // 13. %1 = b
    "m"(a),      // 14. %2 = a
    "m"(result), // 15. %3 = result
    "m"(a_err),  // 16. %4 = a_err
    "m"(b_err)   // 17. %5 = b_err
    );
}

The op_plus function first loads both double precision operands a and b onto the floating point stack (lines 1 and 2), where the FPU extends them to double extended precision. The sum is then calculated (line 3), popping a from the FPU stack and replacing b with the sum. Since the result Ŝ is still in double extended precision, it has to be stored back into memory (line 4) to obtain the result rounded after each operation as the Standard requires, and the rounded result ŝ is pushed back onto the floating point stack (line 5). Ŝ − ŝ is calculated (line 6) and its absolute value is determined (line 7). Errors are propagated according to Tab. 1 (lines 8 through 11), and the requested absolute error is found at the top of the FPU stack (line 12). Lines 13 through 17 specify the input operand constraints for the __asm__ directive. Other operators and functions are implemented in a similar way.
Traditional running error analysis is implemented likewise and provides the class radouble. Its radouble::operator+ needs no assembly language inlines, but special compiler flags are necessary to obtain the same results as with the fdouble class. These GCC flags are -ffloat-store, which ensures that the result of each floating point operation is stored back into memory and rounded to the defined precision (as required by IEEE 754), and -mfpmath=387, which ensures that the x87 floating point unit is used. Without the second flag, the optimizer could use SSE instructions and produce differently rounded results. Readers interested in obtaining the complete source code for all classes should refer to [9].

Complexity and implementation notes
The traditional running error analysis needs to calculate u|ŝ|, for which four x86 floating point instructions are necessary. These are: fld to load u onto the floating point stack, fmul to calculate uŝ, fabs to obtain |uŝ| = u|ŝ|, and fstl to store the result back into memory, rounding it to double precision. Our approach substitutes |ŝ − Ŝ| for u|ŝ|, and a typical evaluation requires an fldl instruction to load ŝ onto the floating point stack, fsub to calculate ŝ − Ŝ, fabs to obtain the absolute value |ŝ − Ŝ|, and fstl to store the result into memory while rounding it to double precision.
According to instruction latency tables [1], the fsub instruction latency is 3 cycles, while the fmul latency is 5. The following table (Tab. 2) compares the two approaches from the latency point of view, showing a speed advantage: our approach not only provides a better error bound estimate, but does so at a higher speed, and we should expect about a 20 per cent speed up.

Case study
The case study compares the presented approach to traditional running error analysis, and also to a forward error bound determined a-priori, on the following mathematical identity:

log 2 = Σ_{n=1}^{∞} (−1)^(n+1)/n . (5)

C++ and Mathematica [8] software are used to verify the results. A rounding error analysis of (5) for double precision floating point arithmetic and N addends is given with the following assumptions:

x_n = (−1)^(n+1)/n , γ_k = k u/(1 − k u) . (6)

The computed ŷ_N is expressed as:

ŷ_N = Σ_{n=1}^{N} x_n (1 + θ_n) , (7)

where

|θ_n| ≤ γ_{N−1} , n = 1, …, N . (8)

The backward error is immediately visible from equation (7), in which we can consider the sum an exact sum of perturbed data entries, each perturbed by a relative value certainly bounded by γ_{N−1}. The forward a-priori error bound is then calculated as:

|ŷ_N − y_N| ≤ γ_{N−1} Σ_{n=1}^{N} |x_n| , (9)

and presents the worst case error estimate, which can be far from the true error bound. The following section demonstrates this statement.

Results
Results are provided for two summation orders. The first case performs the summation in decreasing order of magnitude, where a poor, insufficiently accurate result quickly becomes visible. Figure 2 depicts this scenario, where N stands for the number of iterations and the y-axis shows the base 10 logarithm of the order of the error. The first line, marked with □, presents the absolute error, log10(|ŷ_N − log 2|), which gets smaller with each further accumulation of the sum. Rounding errors go against this error from bottom to top and are represented by the remaining three lines. From top to bottom: the line marked ×+ presents the worst case, that is, the a-priori forward error bound obtained from the right hand side of (9). The second line, marked with +, stands for the a-posteriori running error bound; we can see that it is slightly better than the a-priori bound. The last line (×) presents our approach, which accounts for real rounding errors, which is why the first two iterations show no error (cf. the × at the top).
We can also observe that with an a-priori bound we would have to terminate the accumulation somewhere after 2^26 iterations, after 2^27 iterations with the running error bound, and after approximately 2^28 iterations with our approach. Calculating more iterations beyond the bound makes no sense in this case, because each addend is smaller than the error of the result. The second case performs the sum in the reverse order, i.e. going from the numbers of smallest magnitude towards greater ones (see figure 3). Note that this scenario demonstrates the claim that an error bound obtained a-priori (i.e. the forward error bound) is often too pessimistic and presents the worst case; using it would lead to premature termination of the sum evaluation.
One more thing is demonstrated: we should accumulate numbers in increasing order of magnitude. If we do not, numbers of small magnitude contribute less and less to a proportionally huge sum value, and sooner or later further additions do not change the value of the sum.
Due to this fact, the harmonic series converges in finite arithmetic and has a finite sum depending on the type of arithmetic.
The following table presents selected portions of the evaluation time for both a-posteriori error analyses, including the original computation. In addition, the table shows the run time for the objdouble class, which wraps the double data type and is used, after subtracting the double column, to determine the amount of C++ object overhead.
The results were obtained with the POSIX getrusage API, which provides, among other information, the amount of CPU time used, in seconds and microseconds, since the start of the process. For measuring purposes, these are internally converted into milliseconds, and the difference of two millisecond values constitutes a measurement that does not interfere with other processes or the operating system. Each value was measured 50 times and the results were statistically evaluated [2].
When we compare the traditional running error analysis run time to a simple evaluation in double precision, we find that object-oriented running error analysis with the radouble class is approximately 2.81 times slower. This slowdown includes the C++ overhead, consisting of object construction and operator calls. The overhead was measured with the objdouble class, and the results show that evaluation with the objdouble class took twice as long as with the double data type. Our approach is only 2.35 times slower than the original approach with the double data type, and it is 1.19 times faster than the traditional approach. This speed up is what we expected from Tab. 2.

Conclusions
The traditional running error analysis approach uses the term u|ŝ| to obtain a rounding error estimate and, as we have shown, this estimate can be up to two times larger than the actual rounding error. Moreover, this term always generates an error, regardless of whether an error was physically committed. By exploiting the behavior of the floating point unit of the Intel x86 platform, we are able to obtain an 11-bit error estimate by subtracting the rounded result ŝ from its not yet rounded equivalent Ŝ; divided by |ŝ|, this estimate is always less than or equal to the unit roundoff u. Our approach is very similar to the traditional one; it also needs four FPU instructions, but it replaces a multiplication by a subtraction, which saves 2 CPU cycles per evaluation and yields almost a 20 per cent speedup over traditional running error analysis.
We encourage the use of running error analysis in all iterative tasks where critical cancellation can occur, such as during the evaluation of a numeric derivative. As we have seen in the case study, an error bound determined a-priori can be far from the actual error and, especially with the provided classes, running error analysis provides a very quick and feasible replacement. This, however, costs 2.81 times the evaluation time; when replaced by FPU-supported running error analysis, the cost factor drops to 2.35 while providing a yet tighter bound. With a tighter bound, we are able to calculate more iterations and be sure that the result is still valid.

Fig. 1: Rounding a double extended precision number to double precision.

Fig. 2: Calculated value of ŷ_N and its error bounds (forward sum direction).

Fig. 3: Calculated value of ŷ_N and its error bounds (reverse sum direction).

Table 1: Error bound estimates for basic operations.

The fdouble::operator+ method calls the op_plus function, which contains the assembly inline and is common for all fdouble::operator+ overloads.

Table 2: Comparing the FPU instructions necessary to evaluate the absolute error.

Table 3: Run times for reverse sum evaluations of selected N operations with four data types: the double data type with no error analysis; the radouble data type, an object-oriented traditional running error analysis; the fdouble data type, which performs FPU-supported running error analysis; and the objdouble class, which wraps the double data type and is used to measure the C++ overhead. All values are rounded to whole milliseconds.

log2 N | double | radouble | fdouble | objdouble