Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions

by Erin Carson and Nick Higham

[Figure: Half precision LU factorization (H,S,D) can deliver full single precision accuracy (Algorithm New), just like traditional iterative refinement (S,S,D: Algorithm Trad).]

With the growing availability of half precision arithmetic in hardware and quadruple precision arithmetic in software, it is natural to ask whether we can harness these different precisions, along with the standard single and double precisions, to solve problems faster or more accurately.

We have been investigating this question for linear systems Ax = b with a nonsingular matrix A, for which the standard solution process is by LU factorization. By making use of iterative refinement, we are able to harness an LU factorization computed in lower precision to solve the problem up to twice as fast and with greater accuracy than with the standard approach.

Iterative refinement is an old technique for improving the accuracy of an approximate solution to a linear system Ax = b. It was used by Wilkinson and his colleagues in the 1940s in the early days of digital computing. The traditional usage employs two precisions, and fixed precision refinement has also been in use since the late 1970s.

We have found that using three different precisions, rather than the usual two, can bring major benefits. A scenario of particular interest is a mix of half precision, single precision, and double precision, with single precision the working precision in which A, b, and the iterates x_i are stored. Here is the traditional algorithm followed by the new algorithm. All computations are in single precision (unit roundoff 6.0 \times 10^{-8}) except where stated.

Algorithm Trad: two-precision refinement (single, double).
Factorize $PA = LU$.
Solve $Ax_0 = b$ using the LU factors.
for $i=0:\infty$
   Form $r_i = b - Ax_i$ in *double precision*
        and round $r_i$ to *single precision*.
   Solve $Ad_i = r_i$ using the LU factors.
   $x_{i+1} = x_i + d_i$.
end

Algorithm New: three-precision refinement (half, single, double).
Factorize $PA = LU$ in *half precision*.
Solve $Ax_0 = b$ at *half precision* using the LU factors; store $x_0$ in *single*.
for $i=0:\infty$
   Form $r_i = b - Ax_i$ in *double precision*
        and round $r_i$ to *half precision*.
   Solve $Ad_i = r_i$ at *half precision*
        using the LU factors; store $d_i$ in *single*.
   $x_{i+1} = x_i + d_i$.
end
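
For concreteness, here is a minimal NumPy sketch of Algorithm New that simulates the three precisions with float16, float32, and float64. The function names (lu_fp16, lusolve_fp16, ir_hsd) are ours, the factorization is a simple unblocked routine rather than a tuned one, and the scaling of the residual needed in practice to avoid float16 underflow is omitted.

```python
import numpy as np

def lu_fp16(A):
    """Unblocked LU factorization with partial pivoting, rounding every
    intermediate to float16 to simulate a half precision factorization."""
    n = A.shape[0]
    LU = A.astype(np.float16)
    piv = np.arange(n)
    for k in range(n - 1):
        p = k + int(np.argmax(np.abs(LU[k:, k])))   # partial pivoting
        if p != k:
            LU[[k, p]] = LU[[p, k]]
            piv[[k, p]] = piv[[p, k]]
        LU[k+1:, k] /= LU[k, k]                     # multipliers, kept in fp16
        LU[k+1:, k+1:] -= np.outer(LU[k+1:, k], LU[k, k+1:])
    return LU, piv

def lusolve_fp16(LU, piv, b):
    """Forward and back substitution carried out in float16."""
    n = LU.shape[0]
    x = b.astype(np.float16)[piv]
    for i in range(1, n):                           # L y = P b (unit diagonal)
        x[i] -= LU[i, :i] @ x[:i]
    for i in range(n - 1, -1, -1):                  # U x = y
        x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
    return x

def ir_hsd(A, b, tol=1e-7, max_iter=10):
    """Algorithm New (HSD): factorize in half precision, store the iterates
    in single precision, compute residuals in double precision."""
    A32 = np.asarray(A, dtype=np.float32)
    b32 = np.asarray(b, dtype=np.float32)
    LU, piv = lu_fp16(A32)
    x = lusolve_fp16(LU, piv, b32).astype(np.float32)            # x_0 in single
    for _ in range(max_iter):
        r = b32.astype(np.float64) - A32.astype(np.float64) @ x  # r_i in double
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b32, np.inf):
            break
        # In practice r is scaled before rounding, to dodge fp16 underflow.
        d = lusolve_fp16(LU, piv, r.astype(np.float16))          # solve in half
        x = x + d.astype(np.float32)                             # update in single
    return x
```

Swapping the float16 factorization and solves for float32 ones (and rounding $r_i$ to single rather than half) recovers Algorithm Trad.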

Speed

Algorithm Trad does O(n^3) flops at single precision and O(n^2) flops at double precision. Algorithm New, however, does its O(n^3) flops at half precision and only O(n^2) flops at single and double precision. Both statements assume, of course, that iterative refinement converges in a small number of iterations. Algorithm New therefore offers a potential twofold speedup over Algorithm Trad, since half precision arithmetic runs at twice the speed of single precision arithmetic on (for example) NVIDIA and AMD GPUs.
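
A back-of-the-envelope model makes the factor of two explicit. Take the factorization to cost 2n^3/3 flops, let single precision flops execute at rate f and half precision flops at rate 2f, and lump each refinement step's O(n^2) work into a constant c (this two-rate model is our simplification and ignores memory traffic):

T_{\mathrm{Trad}} \approx \frac{2n^3}{3f} + \frac{cn^2}{f}, \qquad T_{\mathrm{New}} \approx \frac{2n^3}{6f} + \frac{cn^2}{f}, \qquad \frac{T_{\mathrm{Trad}}}{T_{\mathrm{New}}} \to 2 \quad \text{as } n \to \infty.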

Accuracy

Algorithm Trad converges as long as \kappa_{\infty}(A) \le 10^8, and it yields a forward error (defined by \|x-\widehat{x}\|_{\infty}/\|x\|_{\infty}, where \widehat{x} is the computed solution) and a backward error both of order 10^{-8}, as standard analysis shows. Our new rounding error analysis shows that Algorithm New achieves the same error bounds but has the more stringent convergence requirement \kappa_{\infty}(A) \le 10^4.
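
For a particular matrix one can check which regime applies by computing \kappa_{\infty}(A) directly. A minimal sketch (the helper name is ours, and forming A^{-1} explicitly is sensible only at modest n; LAPACK-style condition estimators avoid that O(n^3) cost):

```python
import numpy as np

def kappa_inf(A):
    """kappa_inf(A) = ||A||_inf * ||A^{-1}||_inf, computed in double."""
    A = np.asarray(A, dtype=np.float64)
    return np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)

# kappa_inf(A) <= 1e4: Algorithm New applies;
# kappa_inf(A) <= 1e8: Algorithm Trad (or GMRES-IR, below) applies.
```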

GMRES-IR

Now we replace the solve step in the loop of Algorithm New by an application of GMRES to the preconditioned system

\widetilde{A} d_i \equiv \widehat{U}^{-1}\widehat{L}^{-1}Ad_i = \widehat{U}^{-1}\widehat{L}^{-1}r_i,

where matrix–vector products with \widetilde{A} are done at double precision and all other computations are done at single precision. Algorithm New now converges as long as \kappa_{\infty}(A) \le 10^8 and it yields forward and backward errors of order 10^{-8}. In other words, it has the same numerical properties as Algorithm Trad but potentially does half the work (depending on the number of GMRES iterations needed to converge).
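
Here is a SciPy sketch of this correction solve, reusing the half precision factors (and row permutation) from the earlier NumPy sketch as \widehat{L} and \widehat{U}. The function name gmres_correction is ours, and SciPy's GMRES does not let us control its internal precision, so the casts below only approximate the prescribed mix: products with \widetilde{A} at double precision, everything else at single.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, gmres

def gmres_correction(A, LU, piv, r):
    """One correction solve of GMRES-IR: apply GMRES to the explicitly
    preconditioned system U^{-1} L^{-1} P A d = U^{-1} L^{-1} P r."""
    n = A.shape[0]
    L32 = np.tril(LU).astype(np.float32)        # unit diagonal assumed below
    U32 = np.triu(LU).astype(np.float32)

    def prec(y):                                # y -> U^{-1} L^{-1} P y, in single
        y = np.asarray(y, dtype=np.float32)[piv]
        y = solve_triangular(L32, y, lower=True, unit_diagonal=True)
        return solve_triangular(U32, y, lower=False)

    def matvec(d):                              # A d at double, then precondition
        return prec(A.astype(np.float64) @ d)

    Atilde = LinearOperator((n, n), matvec=matvec, dtype=np.float32)
    d, _ = gmres(Atilde, prec(r))               # SciPy's default tolerance
    return np.asarray(d, dtype=np.float32)
```

Within the refinement loop, this call replaces the half precision triangular solves for $d_i$.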

Other Choices of Precision

Let H, S, D, and Q denote half precision, single precision, double precision, and quadruple precision, respectively. Algorithm New can be described as “HSD”, where the three letters indicate the precision of the factorization, the working precision, and the precision with which residuals are computed, respectively. Various combinations of letters produce feasible algorithms (20 in all, if we include fixed precision refinement algorithms, such as “SSS”), of which HSD, HSQ, HDQ, and SDQ use three different precisions. Similar results to those above apply to the latter three combinations.
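
The count of 20 can be checked by brute force if we take a combination to be feasible when the factorization precision is no more accurate than the working precision and the residual precision is no less accurate. That reading of feasibility is ours, but it reproduces both the total and the four three-precision combinations named above:

```python
from itertools import product

precisions = "HSDQ"                     # ordered least to most accurate
rank = {p: i for i, p in enumerate(precisions)}

combos = ["".join(c) for c in product(precisions, repeat=3)
          if rank[c[0]] <= rank[c[1]] <= rank[c[2]]]

print(len(combos))                               # 20
print([c for c in combos if len(set(c)) == 3])   # ['HSD', 'HSQ', 'HDQ', 'SDQ']
```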

Outlook

Our MATLAB experiments confirm the predictions of the error analysis regarding the behavior of Algorithm New and its GMRES-IR variant. They also show that the number of GMRES iterations in GMRES-IR can indeed be small.

Iterative refinement in three precisions therefore offers great promise for speeding up the solution of Ax = b. Instead of solving the system by an LU factorization at the working precision, we can factorize A at a lower precision (half precision in the scenario above) and apply iterative refinement in three precisions, thereby obtaining a more accurate solution at potentially half the cost.

Full details of this work can be found in

Erin Carson and Nicholas J. Higham, Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions, MIMS EPrint 2017.24, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, July 2017.