At least five floating-point arithmetics are available in mainstream hardware:
- the IEEE double precision (fp64), single precision (fp32), and half precision (fp16) formats,
- bfloat16, and
- tf32, introduced with the recently announced NVIDIA A100 GPU, which is based on the NVIDIA Ampere architecture.
Only fp32 and fp64 are available on current Intel processors and most programming environments support only these two precisions. For many of us it is therefore necessary to simulate other precisions if we wish to explore their properties and the behavior of algorithms when run at these precisions.
Srikara Pranesh and I have written a MATLAB function chop that allows arithmetic of arbitrary precision and exponent range (both lower than fp64) to be simulated. The idea is very simple: variables are stored in fp32 or fp64 and then rounded to the required precision and exponent range after each operation by calling chop. The precision and range, as well as various other parameters, can be changed at any point in the computation.
Our approach can be compared with defining a new data type and storage format and converting to and from single precision or double precision in order to carry out the computations. Our approach avoids the overheads of repeatedly converting to and from the special storage format. It gains speed and generality at the cost of requiring chop calls to be inserted in the code.
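To make the idea concrete, here is a minimal sketch of instrumenting a dot product so that it runs in simulated fp16 (this is illustrative code, assuming chop is on the MATLAB path; the options structure is explained below):

```matlab
% Sketch: an inner product accumulated in simulated fp16.
% Every floating-point operation is followed by a call to chop.
options.format = 'h';              % IEEE half precision
x = rand(8,1); y = rand(8,1);
s = 0;
for i = 1:8
    p = chop(x(i)*y(i),options);   % round the product to fp16
    s = chop(s + p,options);       % round the running sum to fp16
end
s
```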
Chop is available from this GitHub repository and from MathWorks File Exchange. The motivation behind chop is described in this open access paper in SIAM J. Sci. Comput. We have made use of chop to simulate fp16 and bfloat16 in several recent papers, including ones on mixed precision algorithms, probabilistic rounding error analysis, and fast and accurate summation.
How to Use Chop
A call to chop has the form chop(X,options), where the structure options specifies how to round the elements of the real array X to a lower precision arithmetic. X should be a single precision or double precision array and the output will have the same type. The structure options controls various aspects of the rounding, as we now explain.
- The arithmetic format is specified by options.format, which is one of:
- ‘b’, ‘bfloat16’: bfloat16,
- ‘h’, ‘half’, ‘fp16’: IEEE half precision (the default),
- ‘s’, ‘single’, ‘fp32’: IEEE single precision,
- ‘d’, ‘double’, ‘fp64’: IEEE double precision,
- ‘c’, ‘custom’: custom format.
In the last case the (base 2) format is defined by options.params, which is a 2-vector [t,emax], where t is the number of bits in the significand (including the hidden bit) and emax is the maximum value of the exponent.
- options.subnormal specifies whether subnormal numbers are supported (if they are not, subnormals are flushed to zero):
- 0 = do not support subnormals (the default for bfloat16),
- 1 = support subnormals (the default for fp16, fp32, and fp64).
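The effect of this option is easy to see on the smallest positive subnormal fp16 number, 2^(-24) (a sketch, assuming chop is on the path):

```matlab
% Sketch: effect of options.subnormal in fp16.
% The smallest positive subnormal fp16 number is 2^(-24).
opt.format = 'h';
x = 2^(-24);
opt.subnormal = 1; y1 = chop(x,opt)   % kept: subnormals supported
opt.subnormal = 0; y2 = chop(x,opt)   % flushed to zero
```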
- The form of rounding is specified by options.round, which is one of:
- 1: round to nearest using round to even last bit to break ties (the default);
- 2: round towards plus infinity (round up);
- 3: round towards minus infinity (round down);
- 4: round towards zero;
- 5: stochastic rounding—round to the next larger or next smaller floating-point number with probability proportional to 1 minus the distance to those floating-point numbers;
- 6: stochastic rounding—round to the next larger or next smaller floating-point number with equal probability.
For stochastic rounding, exact floating-point numbers are not changed.
- If options.flip = 1 (default 0) then each element of the rounded result has, with probability options.p (default 0.5), a randomly chosen bit in its significand flipped. This option is useful for simulating soft errors.
- If options.explim = 0 (default 1) then emax (the maximal exponent) for the specified arithmetic is ignored, so overflow, underflow, or subnormal numbers will be produced only if necessary for the data type of X. This option is useful for exploring low precisions independent of range limitations.
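For example, 70000 exceeds the largest fp16 number, 65504, so it overflows unless the exponent limit is switched off (a sketch, assuming chop is on the path):

```matlab
% Sketch: effect of options.explim.
opt.format = 'h';
x = 70000;
opt.explim = 1; y1 = chop(x,opt)   % overflows to Inf
opt.explim = 0; y2 = chop(x,opt)   % fp16 precision, fp64 range: 70016
```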
To avoid repeatedly passing the options argument, one can issue a call chop([],options) and subsequent calls will use the previously given options parameters. The value of options in use can be inspected with [~,options] = chop.
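A sketch of this calling sequence (assuming chop is on the path):

```matlab
% Sketch: store the options once, then reuse them.
opt.format = 'b';        % bfloat16
chop([],opt);            % store the options
y = chop(1/3)            % uses the stored options
[~,stored] = chop;       % inspect the options currently in use
```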
In the rest of this post we give examples of the use of chop.

Fp16 versus Bfloat16
```matlab
x = 1/3;
opt_h.format = 'h'; xh = chop(x,opt_h); err_h = (x - xh)/x
opt_b.format = 'b'; xb = chop(x,opt_b); err_b = (x - xb)/x
x = 70000;
xh = chop(x,opt_h)
xb = chop(x,opt_b)
```
produces the output
```
err_h =
   2.4414e-04
err_b =
  -1.9531e-03
xh =
   Inf
xb =
       70144
```
which illustrates the higher precision but smaller range of fp16 compared with bfloat16.
```matlab
opt.format = 'h'; x = 0.1; rng(2)
for k = 1:6
    opt.round = k;
    y(k) = chop(x,opt);
end
errors = y - x
diff_y = diff(y)
```
rounds 0.1 to fp16 using six different rounding modes. The output is
```
errors =
  -2.4414e-05   3.6621e-05  -2.4414e-05  -2.4414e-05  -2.4414e-05   3.6621e-05
diff_y =
   6.1035e-05  -6.1035e-05            0            0   6.1035e-05
```
Rounding maps a number to the next larger or next smaller floating-point number, so there are only two possibilities for the error, and they have opposite signs. The diff_y values are consistent with the spacing of the fp16 floating-point numbers around 0.1, which is 2^(-14) = 6.1035e-05.
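The spacing can be computed directly: 0.1 lies in the binade [2^(-4), 2^(-3)), where the fp16 numbers (t = 11 significand bits) are spaced 2^(-4)*2^(1-t) apart:

```matlab
% Spacing of fp16 numbers in the binade containing 0.1.
t = 11;                    % significand bits, including the hidden bit
spacing = 2^(-4)*2^(1-t)   % 2^(-14) = 6.1035e-05, matching diff_y above
```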
The overheads of chop can be minimized by choosing a suitable implementation of a computation. Matrix multiplication provides a good example. The multiplication C = A*B of an m-by-n matrix A with an n-by-p matrix B involves 2mnp floating-point operations, each of which needs to be chopped. The following code uses only 2n calls to chop, because it processes an outer product in a single call, taking advantage of the fact that chop can take a matrix argument.
```matlab
A = rand(m,n); B = rand(n,p); C = zeros(m,p);
for i = 1:n
    C = chop(C + chop(A(:,i)*B(i,:)));
end
```
The recently announced NVIDIA A100 GPU supports a new floating-point format called tf32 in its tensor cores. Tf32 is a 19-bit format that has 11 bits in the significand (including the hidden bit) and 8 bits in the exponent. So it has the precision of fp16 and the range of bfloat16, effectively combining the best features of these two formats. We can simulate tf32 using the custom format of chop. The code
```matlab
opt.format = 'custom';
% tf32. Significand: 10 bits plus 1 hidden, exponent: 8 bits.
t = 11; emax = 127;
opt.params = [t,emax];
x = 1/3;   xt = chop(x,opt); err_t = (x - xt)/x
x = 70000; xt = chop(x,opt)
```
produces the output
```
err_t =
   2.4414e-04
xt =
       70016
```
The error is the same as for fp16 but the number 70000 now rounds to a finite floating-point number.
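The custom format can recreate the built-in formats too, which gives a handy consistency check (a sketch, assuming chop is on the path; the bfloat16 parameters t = 8 and emax = 127 are standard):

```matlab
% Sketch: bfloat16 recreated via the custom format. bfloat16 has 8
% significand bits (including the hidden bit) and emax = 127, so the
% custom format [8,127] should round identically to format 'b'.
opt_b.format = 'b';
opt_c.format = 'custom'; opt_c.params = [8,127];
x = 1/3;
same = isequal(chop(x,opt_b),chop(x,opt_c))
```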