# Simulating Low Precision Floating-Point Arithmetics in MATLAB

At least five floating-point arithmetics are available in mainstream hardware:

• the IEEE double precision (fp64), single precision (fp32), and half precision (fp16) formats,
• bfloat16, and
• tf32, introduced in the recently announced NVIDIA A100 GPU, which is based on the NVIDIA Ampere architecture.

Only fp32 and fp64 are available on current Intel processors and most programming environments support only these two precisions. For many of us it is therefore necessary to simulate other precisions if we wish to explore their properties and the behavior of algorithms when run at these precisions.

Srikara Pranesh and I have written a MATLAB function `chop` that allows arithmetic of arbitrary precision and exponent range (both lower than fp64) to be simulated. The idea is very simple: variables are stored in fp32 or fp64 and then rounded to the required precision and exponent range after each operation by calling `chop`. The precision and range, as well as various other parameters, can be changed at any point in the computation.

Our approach can be compared with defining a new data type and storage format and converting to and from single precision or double precision in order to carry out the computations. Our approach avoids the overheads of repeatedly converting to and from the special storage format. It gains speed and generality at the cost of requiring chop calls to be inserted in the code.

Chop is available from this GitHub repository and from MathWorks File Exchange. The motivation behind chop is described in this open access paper in SIAM J. Sci. Comput. We have made use of chop to simulate fp16 and bfloat16 in several recent papers, including on mixed precision algorithms, probabilistic rounding error analysis, and fast and accurate summation.

## How to Use Chop

A call to chop has the form `chop(X,options)`, where the structure `options` specifies how to round the elements of the real array `X` to a lower precision arithmetic. `X` should be a single precision or double precision array and the output will have the same type. The structure `options` controls various aspects of the rounding, as we now explain.

1. The arithmetic format is specified by `options.format`, which is one of
• `'b'`, `'bfloat16'`: bfloat16,
• `'h'`, `'half'`, `'fp16'`: IEEE half precision (the default),
• `'s'`, `'single'`, `'fp32'`: IEEE single precision,
• `'d'`, `'double'`, `'fp64'`: IEEE double precision,
• `'c'`, `'custom'`: custom format.

In the last case the (base 2) format is defined by `options.params`, which is a 2-vector `[t,emax]`, where `t` is the number of bits in the significand (including the hidden bit) and `emax` is the maximum value of the exponent.

2. `options.subnormal` specifies whether subnormal numbers are supported (if they are not, subnormals are flushed to zero):
• 0 = do not support subnormals (the default for bfloat16),
• 1 = support subnormals (the default for fp16, fp32, and fp64).
3. The form of rounding is specified by `options.round`:
• 1: round to nearest using round to even last bit to break ties (the default);
• 2: round towards plus infinity (round up);
• 3: round towards minus infinity (round down);
• 4: round towards zero;
• 5: stochastic rounding—round to the next larger or next smaller floating-point number with probability proportional to 1 minus the distance to those floating-point numbers;
• 6: stochastic rounding—round to the next larger or next smaller floating-point number with equal probability.

For stochastic rounding, exact floating-point numbers are not changed.

4. If `options.flip = 1` (default 0) then each element of the rounded result has, with probability `options.p` (default 0.5), a randomly chosen bit in its significand flipped. This option is useful for simulating soft errors.
5. If `options.explim = 0` (default 1) then `emax` (the maximal exponent) for the specified arithmetic is ignored, so overflow, underflow, or subnormal numbers will be produced only if necessary for the data type of `X`. This option is useful for exploring low precisions independent of range limitations.
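The `subnormal` and `explim` options are easy to see in action. The following sketch (assuming `chop` is on the MATLAB path) rounds the smallest fp16 subnormal, which is $2^{(e_{\min}-t+1)} = 2^{-24}$ since fp16 has $e_{\min} = -14$ and $t = 11$, with and without subnormal support:

```matlab
x = 2^(-24);            % smallest fp16 subnormal
opt.format = 'h';
opt.subnormal = 1;
chop(x,opt)             % returned unchanged: x is exactly representable
opt.subnormal = 0;
chop(x,opt)             % flushed to zero: subnormals not supported
opt.subnormal = 1; opt.explim = 0;
chop(1e10,opt)          % no overflow to Inf: the fp16 exponent limit is ignored
```

In the last call the value is still rounded to 11 significant bits, but the fp16 range restriction is switched off.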

To avoid repeatedly passing the `options` argument, one can issue a call `chop([],options)` and the subsequent calls will use the previously given `options` parameters. The value of `options` can be inspected with `[~,options] = chop`.
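As a sketch of this usage pattern (the variable name `cur` is ours, chosen for illustration):

```matlab
opt.format = 'b';   % bfloat16
chop([],opt)        % store the options for later calls
y = chop(1/3)       % subsequent calls reuse the stored options
[~,cur] = chop;     % inspect the options currently in force
cur.format          % 'b'
```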

In the rest of this post we give examples of the use of `chop`.

## Fp16 versus Bfloat16

The code

```
x = 1/3;
opt_h.format = 'h'; xh = chop(x,opt_h); err_h = (x - xh)/x
opt_b.format = 'b'; xb = chop(x,opt_b); err_b = (x - xb)/x
x = 70000;
xh = chop(x,opt_h)
xb = chop(x,opt_b)
```

produces the output

```
err_h =
2.4414e-04
err_b =
-1.9531e-03
xh =
Inf
xb =
70144
```

which illustrates the higher precision but smaller range of fp16 compared with bfloat16.
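The overflow is easy to predict from the format parameters. As a quick check, the largest finite numbers in the two formats can be computed directly from their `t` and `emax` values (no `chop` call needed):

```matlab
% xmax = 2^emax * (2 - 2^(1-t))
xmax_fp16 = 2^15*(2 - 2^(-10))     % 65504: 70000 overflows to Inf
xmax_bf16 = 2^127*(2 - 2^(-7))     % about 3.39e38: 70000 is well within range
```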

## Rounding Modes

The code

```
opt.format = 'h'; x = 0.1; rng(2)
for k = 1:6
opt.round = k;
y(k) = chop(x,opt);
end
errors = y - x
diff_y = diff(y)
```

rounds 0.1 to fp16 using six different rounding modes. The output is

```
errors =
-2.4414e-05   3.6621e-05  -2.4414e-05  -2.4414e-05  -2.4414e-05   3.6621e-05
diff_y =
6.1035e-05  -6.1035e-05            0            0   6.1035e-05
```

Rounding maps a number to the next larger or next smaller floating-point number, so there are only two possibilities for the error, and they have opposite signs. The `diff_y` values are consistent with the spacing of the fp16 floating-point numbers around 0.1, which is $2^{-11}/8$.
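The quoted spacing can be derived from the format parameters: fp16 has $t = 11$ significand bits, and 0.1 lies in the binade $[2^{-4}, 2^{-3})$, where the spacing of the floating-point numbers is $2^{1-t} \times 2^{-4}$:

```matlab
t = 11;
spacing = 2^(1-t)*2^(-4)   % 2^(-14) = 2^(-11)/8 = 6.1035e-05
```

This matches the nonzero entries of `diff_y`.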

## Matrix Multiplication

The overheads of `chop` can be minimized by choosing a suitable implementation of a computation. Matrix multiplication provides a good example. The multiplication `C = A*B` of $n$-by-$n$ matrices involves $2n^3$ floating-point operations, each of which needs to be chopped. The following code uses only $2n$ calls to `chop`, because it processes an outer product in a single call, taking advantage of the fact that `chop` can take a matrix argument.

```
m = 8; n = 8; p = 8;            % any dimensions will do
A = rand(m,n); B = rand(n,p);
opt.format = 'h'; chop([],opt)  % fix the arithmetic for the calls below
C = zeros(m,p);
for i = 1:n
    C = chop(C + chop(A(:,i)*B(i,:)));
end
```

## Tf32 Arithmetic

The recently announced NVIDIA A100 GPU supports a new floating-point format called tf32 in its tensor cores. Tf32 is a 19-bit format that has 11 bits in the significand (including the hidden bit) and 8 bits in the exponent. So it has the precision of fp16 and the range of bfloat16, effectively combining the best features of these two formats. We can simulate tf32 using the custom format of `chop`. The code

```
opt.format = 'custom';
% tf32. Significand: 10 bits plus 1 hidden, exponent: 8 bits.
t = 11; emax = 127;
opt.params = [t,emax];
x = 1/3;
xt = chop(x,opt); err_t = (x - xt)/x
x = 70000;
xt = chop(x,opt)
```

produces the output

```
err_t =
2.4414e-04
xt =
70016
```

The error is the same as for fp16 but the number 70000 now rounds to a finite floating-point number.
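The format parameters also give the key constants of the simulated tf32 arithmetic, which can be computed directly as a sanity check on the claim that tf32 combines the precision of fp16 with the range of bfloat16:

```matlab
t = 11; emax = 127;
u = 2^(-t)                     % unit roundoff 4.8828e-04, same as fp16
xmax = 2^emax*(2 - 2^(1-t))    % about 3.40e38, same order as bfloat16 and fp32
```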