Can We Solve Linear Algebra Problems at Extreme Scale and Low Precisions?

[Image: dugaku_1.jpg]

The Fugaku supercomputer that tops the HPL-AI mixed-precision benchmark in the June 2021 TOP500 list. It solved a linear system of order 10^7 using IEEE half precision arithmetic for most of the computations.

The largest dense linear systems being solved today are of order n = 10^7, and future exascale computer systems will be able to tackle even larger problems. Rounding error analysis shows that the computed solution satisfies a componentwise backward error bound that, under favorable assumptions, is of order nu, where u is the unit roundoff of the floating-point arithmetic: u \approx 10^{-16} for double precision and u \approx 10^{-8} for single precision. This backward error bound cannot guarantee any stability for the single precision solution of today's largest problems, and it suggests a loss of half the digits in the backward error for double precision.
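To make these two statements concrete, here is the order of magnitude of the bound nu for n = 10^7:

\begin{aligned}
\text{single precision:} \quad nu &\approx 10^{7} \times 10^{-8} = 10^{-1},\\
\text{double precision:} \quad nu &\approx 10^{7} \times 10^{-16} = 10^{-9}.
\end{aligned}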

Half precision floating-point arithmetic is now readily available in hardware, in both the IEEE binary16 format and the bfloat16 format, and it is increasingly being used in machine learning and in scientific computing more generally. For the computation of the inner product of two n-vectors the backward error bound is again of order nu, and this bound exceeds 1 for n \ge 684 for both half precision formats, suggesting a potentially complete loss of numerical stability. Yet inner products with n \ge 684 are successfully used in half precision computations in practice.
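As a simple illustration (a sketch, not part of the analysis), the following NumPy code simulates a recursively accumulated binary16 inner product by rounding every operation to float16; the observed backward error is typically far smaller than the worst-case bound nu.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                              # large enough that n*u > 1 for binary16
x = rng.standard_normal(n).astype(np.float16)
y = rng.standard_normal(n).astype(np.float16)

# Simulate a recursive fp16 inner product: round every multiply and add to binary16.
s = np.float16(0.0)
for a, b in zip(x, y):
    s = np.float16(s + np.float16(a * b))

exact = np.dot(x.astype(np.float64), y.astype(np.float64))   # double precision reference
denom = np.dot(np.abs(x).astype(np.float64), np.abs(y).astype(np.float64))
u = 2.0**-11                                                  # unit roundoff of binary16
print("observed backward error:", abs(float(s) - exact) / denom)
print("worst-case bound n*u:   ", n * u)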

The error bounds I have referred to are upper bounds and so bound the worst case over all possible rounding errors. Their main purpose is to reveal potential instabilities rather than to provide realistic error estimates. Yet we do need to know the limits of what we can compute, and for mission-critical applications we need to be able to guarantee a successful computation.

Can we understand the behavior of linear algebra algorithms at extreme scale and in low precision floating-point arithmetics?

To a large extent the answer is yes if we exploit three different features to obtain smaller error bounds.

Blocked Algorithms

Many algorithms are implemented in blocked form. For example, an inner product x^Ty of two n-vectors x and y can be computed as

\begin{aligned}
s_i &= x((i-1)b+1:ib)^T y((i-1)b+1:ib), \quad i = 1:k,\\
s   &= s_1 + s_2 + \dots + s_k,
\end{aligned}

where n = kb and b \ll n is the block size. The inner product has been broken into k smaller inner products of size b, which are computed independently and then summed. Many linear algebra algorithms are blocked in an analogous way, where the blocking is into submatrices with b rows or b columns (or both). Careful error analysis shows that a blocked algorithm has an error bound about a factor of b smaller than that for the corresponding unblocked algorithm. Practical block sizes for matrix algorithms are typically 128 or 256, so blocking brings a substantial reduction in the error bounds.
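Here is a minimal NumPy sketch of this blocked scheme (blocked_dot is just an illustrative name, not a library routine), with the block inner products and the final sum accumulated in single precision.

import numpy as np

def blocked_dot(x, y, b=256, dtype=np.float32):
    # Blocked inner product: k = n/b block inner products of size b, then their sum,
    # all accumulated in the working precision `dtype`.
    xb, yb = x.astype(dtype), y.astype(dtype)
    block_sums = [np.dot(xb[i:i+b], yb[i:i+b]) for i in range(0, x.size, b)]
    s = dtype(0.0)
    for sk in block_sums:
        s = dtype(s + sk)
    return s

rng = np.random.default_rng(1)
n = 10**6
x = -0.25 + rng.standard_normal(n)     # data of the same form as in the figure below
y = -0.25 + rng.standard_normal(n)

s_blocked = blocked_dot(x, y, b=256)
exact = np.dot(x, y)                   # double precision reference
denom = np.dot(np.abs(x), np.abs(y))
print("backward error:", abs(float(s_blocked) - exact) / denom)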

[Image: ip_rand1.jpg]

Backward errors for the inner product of two vectors with elements of the form -0.25 + randn, computed in single precision in MATLAB with block size 256.

In fact, one can do even better than an error bound of order (n/b)u. By computing the sum s= s_1 + s_2 + \dots + s_k with a more accurate summation method the error constant is further reduced to bu + O(u^2) (this is the FABsum method of Blanchard et al. (2020)).
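A minimal sketch of the FABsum idea (illustrative code, not that of Blanchard et al.): compute the block inner products fast in the working precision, then combine the k block sums with a more accurate method, here compensated (Kahan) summation.

import numpy as np

def kahan_sum(values, dtype=np.float32):
    # Compensated (Kahan) summation in the working precision.
    s = dtype(0.0)
    c = dtype(0.0)          # running compensation for lost low-order bits
    for v in values:
        y = dtype(v) - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

def fabsum_dot(x, y, b=256, dtype=np.float32):
    # Block inner products computed fast in the working precision ...
    xb, yb = x.astype(dtype), y.astype(dtype)
    block_sums = [np.dot(xb[i:i+b], yb[i:i+b]) for i in range(0, x.size, b)]
    # ... then combined with a more accurate summation method (the FABsum idea).
    return kahan_sum(block_sums, dtype=dtype)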

Architectural Features

Intel x86 processors support an 80-bit extended precision format with a 64-bit significand, which is compatible with that specified in the IEEE standard. When a compiler uses this format with 80-bit registers to accumulate sums and inner products it is effectively working with a unit roundoff of 2^{-64} rather than 2^{-53} for double precision, giving error bounds smaller by a factor up to 2^{11} = 2048.

Some processors have a fused multiply–add (FMA) operation, which computes a combined multiplication and addition x + yz with one rounding error instead of two. This results in a reduction in error bounds by a factor 2.
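In the standard model of floating-point arithmetic the difference is one rounding term instead of two:

\begin{aligned}
\mathrm{fl}(x + yz) &= (x + yz)(1+\delta), \quad |\delta| \le u \quad \text{(with an FMA)},\\
\mathrm{fl}(x + \mathrm{fl}(yz)) &= (x + yz(1+\delta_1))(1+\delta_2), \quad |\delta_1|, |\delta_2| \le u \quad \text{(without an FMA)}.
\end{aligned}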

Mixed precision block FMA operations D = C + AB, with matrices A,B,C,D of fixed size, are available on Google tensor processing units, NVIDIA GPUs, and in the ARMv8-A architecture. For half precision inputs these devices can produce results of single precision quality, which can give a significant boost in accuracy when block FMAs are chained together to form a matrix product of arbitrary dimension.
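The following NumPy sketch is a rough software model of such a chained block FMA (illustrative only; specific hardware units differ in detail), with binary16 inputs and single precision accumulation.

import numpy as np

def chained_block_fma(A, B, b=4):
    # Simulate D = C + A*B built from chained block FMAs: binary16 inputs,
    # with products and running sums accumulated in single precision (C = 0 here).
    A16, B16 = A.astype(np.float16), B.astype(np.float16)
    m, n, p = A16.shape[0], A16.shape[1], B16.shape[1]
    D = np.zeros((m, p), dtype=np.float32)          # fp32 accumulator
    for k in range(0, n, b):                        # b is the small fixed block dimension
        # Products of fp16 entries are exactly representable in fp32, so casting to
        # fp32 before the multiply models exact products with fp32 accumulation.
        D += A16[:, k:k+b].astype(np.float32) @ B16[k:k+b, :].astype(np.float32)
    return D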

Probabilistic Bounds

Worst-case rounding error bounds suffer from the problem that they are not attainable for most specific sets of data and are unlikely to be nearly attained. Stewart (1990) noted that

To be realistic, we must prune away the unlikely. What is left is necessarily a probabilistic statement.

Theo Mary and I have recently developed probabilistic rounding error analysis, which makes probabilistic assumptions on the rounding errors and derives bounds that hold with a certain probability. The key feature of the bounds is that they are proportional to \sqrt{n}u when a corresponding worst-case bound is proportional to nu. In the most general form of the analysis (Connolly, Higham, and Mary, 2021), the rounding errors are assumed to be mean independent and of mean zero, where mean independence is a weaker assumption than independence.
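A quick experiment comparing the observed error of a single precision inner product with the worst-case bound nu and the \sqrt{n}u scaling of the probabilistic bounds (an illustrative sketch; an optimized BLAS dot product may use several partial accumulators, a mild form of blocking, so observed errors can be smaller still).

import numpy as np

u = 2.0**-24                            # unit roundoff of single precision
rng = np.random.default_rng(2)
for n in [10**4, 10**5, 10**6]:         # push toward 10**7 if memory allows
    x32 = rng.standard_normal(n).astype(np.float32)
    y32 = rng.standard_normal(n).astype(np.float32)
    s32 = np.dot(x32, y32)                                        # single precision
    ref = np.dot(x32.astype(np.float64), y32.astype(np.float64))  # double precision reference
    denom = np.dot(np.abs(x32).astype(np.float64), np.abs(y32).astype(np.float64))
    err = abs(float(s32) - ref) / denom
    print(f"n = {n:8d}   error = {err:9.2e}   sqrt(n)*u = {np.sqrt(n)*u:9.2e}   n*u = {n*u:9.2e}")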

Putting the Pieces Together

The different features we have described can be combined to obtain significantly smaller error bounds. If we use a blocked algorithm with block size b \ll n then in an inner product the standard error bound of order nu reduces to a probabilistic bound of order (\sqrt{n/b})u, which is a significant reduction. Block FMAs and extended precision registers provide further reductions.

For example, for a linear system of order 10^7 solved in single precision with a block size of 256, the probabilistic error bound is of order 10^{-5} versus 1 for the standard worst-case bound. If FABsum is used then the bound is further reduced.
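The arithmetic behind these figures, with u = 2^{-24} \approx 6 \times 10^{-8} for single precision, n = 10^7, and b = 256, is

\begin{aligned}
nu &\approx 10^{7} \times 6\times 10^{-8} \approx 0.6, \\
\sqrt{n/b}\,u &\approx \sqrt{10^{7}/256} \times 6\times 10^{-8} \approx 198 \times 6\times 10^{-8} \approx 1.2\times 10^{-5}.
\end{aligned}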

Our conclusion is that we can successfully solve linear algebra problems of greater size and at lower precisions than the standard rounding error analysis suggests. A priori bounds will always be pessimistic, though. One should compute a posteriori residuals or backward errors (depending on the problem) in order to assess the quality of a numerical solution.

For full details of the work summarized here, see Higham (2021).

References
