# What Is Bfloat16 Arithmetic?

Bfloat16 is a floating-point number format proposed by Google. The name stands for “Brain Floating Point Format” and it originates from the Google Brain artificial intelligence research group at Google.

Bfloat16 is a 16-bit, base 2 storage format that allocates 8 bits for the significand and 8 bits for the exponent. It contrasts with the IEEE fp16 (half precision) format, which allocates 11 bits for the significand but only 5 bits for the exponent. In both cases the implicit leading bit of the significand is not stored, hence the “+1” in this diagram:

The motivation for bfloat16, with its large exponent range, was that “neural networks are far more sensitive to the size of the exponent than that of the mantissa” (Wang and Kanwar, 2019).

Bfloat16 uses the same number of bits for the exponent as the IEEE fp32 (single precision) format. This makes conversion between fp32 and bfloat16 easy (the exponent is kept unchanged and the significand is rounded or truncated from 24 bits to 8) and the possibility of overflow in the conversion is largely avoided. Overflow can still happen, though (depending on the rounding mode): the significand of fp32 is longer, so the largest fp32 number exceeds the largest bfloat16 number, as can be seen in the following table. Here, the precision of the arithmetic is measured by the unit roundoff, which is $2^{-t}$, where $t$ is the number of bits in the significand.

Note that although the table shows the minimum positive subnormal number for bfloat16, current implementations of bfloat16 do not appear to support subnormal numbers (this is not always clear from the documentation).

As the unit roundoff values in the table show, bfloat16 numbers have the equivalent of about three decimal digits of precision, which is very low compared with the eight and sixteen digits, respectively, of fp32 and fp64 (double precision).

The next table gives the number of numbers in the bfloat16, fp16, and fp32 systems. It shows that the bfloat16 number system is very small compared with fp32, containing only about 65,000 numbers.

The spacing of the bfloat16 numbers is large far from 1. For example, 65280, 65536, and 66048 are three consecutive bfloat16 numbers.

At the time of writing, bfloat16 is available, or announced, on four platforms or architectures.

• The Google Tensor Processing Units (TPUs, versions 2 and 3) use bfloat16 within the matrix multiplication units. In version 3 of the TPU the matrix multiplication units carry out the multiplication of 128-by-128 matrices.
• The NVIDIA A100 GPU, based on the NVIDIA Ampere architecture, supports bfloat16 in its tensor cores through block fused multiply-adds (FMAs) $C + A*B$ with 8-by-8 $A$ and 8-by-4 $B$.
• Intel has published a specification for bfloat16 and how it intends to implement it in hardware. The specification includes an FMA unit that takes as input two bfloat16 numbers $a$ and $b$ and an fp32 number $c$ and computes $c + a*b$ at fp32 precision, returning an fp32 number.
• The Arm A64 instruction set supports bfloat16. In particular, it includes a block FMA $C + A*B$ with 2-by-4 $A$ and 4-by-2 $B$.

The pros and cons of bfloat16 arithmetic versus IEEE fp16 arithmetic are

• bfloat16 has about one less (roughly three versus four) digit of equivalent decimal precision than fp16,
• bfloat16 has a much wider range than fp16, and
• current bfloat16 implementations do not support subnormal numbers, while fp16 does.

If you wish to experiment with bfloat16 but do not have access to hardware that supports it you will need to simulate it. In MATLAB this can be done with the chop function written by me and Srikara Pranesh.

## References

This is a minimal set of references, which contain further useful references within.

# What Is IEEE Standard Arithmetic?

The IEEE Standard 754, published in 1985 and revised in 2008 and 2019, is a standard for binary and decimal floating-point arithmetic. The standard for decimal arithmetic (IEEE Standard 854) was separate when it was first published in 1987, but it was included with the binary standard from 2008. We focus here on the binary part of the standard.

The standard specifies floating-point number formats, the results of the basic floating-point operations and comparisons, rounding modes, floating-point exceptions and their handling, and conversion between different arithmetic formats.

A binary floating-point number is represented as

$y = \pm m \times 2^{e-t},$

where $t$ is the precision and $e\in [e_{\min},e_{\max}]$ is the exponent. The significand $m$ is an integer satisfying $m \le 2^t-1$. Numbers with $m \ge 2^{t-1}$ are called normalized. Subnormal numbers, for which $0 < m <2^{t-1}$ and $e = e_{\min}$, are supported.

Four formats are defined, whose key parameters are summarized in the following table. The second column shows the number of bits allocated to store the significand and the exponent. I use the prefix “fp” instead of the prefix “binary” used in the standard. The unit roundoff is $u = 2^{-t}$.

Fp32 (single precision) and fp64 (double precision) were in the 1985 standard; fp16 (half precision) and fp128 (quadruple precision) were introduced in 2008. Fp16 is defined only as a storage format, though it is widely used for computation.

The size of these different number systems varies greatly. The next table shows the number of normalized and subnormal numbers in each system.

We see that while one can easily carry out a computation on every fp16 number (to check that the square root function is correctly computed, for example), it is impractical to do so for every double precision number.

A key feature of the standard is that it is a closed system, thanks to the inclusion of NaN (Not a Number) and $\infty$ (usually written as inf in programing languages) as floating-point numbers: every arithmetic operation produces a number in the system. A NaN is generated by operations such as

$0/0, \quad 0 \times \infty, \quad \infty/\infty, \quad (+\infty) + (-\infty), \quad \sqrt{-1}.$

Arithmetic operations involving a NaN return a NaN as the answer. The number $\infty$ obeys the usual mathematical conventions regarding infinity, such as

$\infty+\infty = \infty, \quad (-1)\times\infty = -\infty, \quad ({\textrm{finite}})/\infty = 0.$

This means, for example, that $1 + 1/x$ evaluates as $1$ when $x = \infty$.

The standard specifies that all arithmetic operations (including square root) are to be performed as if they were first calculated to infinite precision and then rounded to the target format. A number is rounded to the next larger or next smaller floating-point number according to one of four rounding modes:

• round to the nearest floating-point number, with rounding to even (rounding to the number with a zero least significant bit) in the case of a tie;
• round towards plus infinity and round towards minus infinity (used in interval arithmetic); and
• round towards zero (truncation, or chopping).

For round to nearest it follows that

$\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1+\delta), \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\},$

where $\mathrm{f\kern.2ptl}$ denotes the computed result. The standard also includes a fused multiply-add operation (FMA), $x*y + z$. The definition requires it to be computed with just one rounding error, so that $\mathrm{f\kern.2ptl}(x*y+z)$ is the rounded version of $x*y+z$, and hence satisfies

$\mathrm{f\kern.2ptl}(x*y + z) = (x*y + z)(1+\delta), \quad |\delta|\le u.$

FMAs are supported in some hardware and are usually executed at the same speed as a single addition or multiplication.

The standard recommends the provision of correctly rounded exponentiation ($x^y$) and transcendental functions ($\exp$, $\log$, $\sin$, $\mathrm{acos}$, etc.) and defines domains and special values for them, but these functions are not required.

A new feature of the 2019 standard is augmented arithmetic operations, which compute $\mathrm{fl}(x \mathbin{\mathrm{op}}y)$ along with the error $x\mathbin{\mathrm{op}}y - \mathrm{fl}(x \mathbin{\mathrm{op}}y)$, for $\mathbin{\mathrm{op}} = +,-,*$. These operations are useful for implementing compensated summation and other special high accuracy algorithms.

William (“Velvel”) Kahan of the University of California at Berkeley received the 1989 ACM Turing Award for his contributions to computer architecture and numerical analysis, and in particular for his work on IEEE floating-point arithmetic standards 754 and 854.

## References

This is a minimal set of references, which contain further useful references within.

# Half Precision Arithmetic: fp16 Versus bfloat16

The 2008 revision of the IEEE Standard for Floating-Point Arithmetic introduced a half precision 16-bit floating point format, known as fp16, as a storage format. Various manufacturers have adopted fp16 for computation, using the obvious extension of the rules for the fp32 (single precision) and fp64 (double precision) formats. For example, fp16 is supported by the NVIDIA P100 and V100 GPUs and the AMD Radeon Instinct MI25 GPU, as well as the A64FX Arm processor that will power the Fujitsu Post-K exascale computer.

## Bfloat16

Fp16 has the drawback for scientific computing of having a limited range, its largest positive number being $6.55 \times 10^4$. This has led to the development of an alternative 16-bit format that trades precision for range. The bfloat16 format is used by Google in its tensor processing units. Intel, which plans to support bfloat16 in its forthcoming Nervana Neural Network Processor, has recently (November 2018) published a white paper that gives a precise definition of the format.

The allocation of bits to the exponent and significand for bfloat16, fp16, and fp32 is shown in this table, where the implicit leading bit of a normalized number is counted in the significand.

Format Significand Exponent
bfloat16 8 bits 8 bits
fp16 11 bits 5 bits
fp32 24 bits 8 bits

Bfloat16 has three fewer bits in the significand than fp16, but three more in the exponent. And it has the same exponent size as fp32. Consequently, converting from fp32 to bfloat16 is easy: the exponent is kept the same and the significand is rounded or truncated from 24 bits to 8; hence overflow and underflow are not possible in the conversion.

On the other hand, when we convert from fp32 to the much narrower fp16 format overflow and underflow can readily happen, necessitating the development of techniques for rescaling before conversionâ€”see the recent EPrint Squeezing a Matrix Into Half Precision, with an Application to Solving Linear Systems by me and Sri Pranesh.

The drawback of bfloat16 is its lesser precision: essentially 3 significant decimal digits versus 4 for fp16. The next table shows the unit roundoff $u$, smallest positive (subnormal) number xmins, smallest normalized positive number xmin, and largest finite number xmax for the three formats.

$u$ xmins xmin xmax
bfloat16 3.91e-03 (*) 1.18e-38 3.39e+38
fp16 4.88e-04 5.96e-08 6.10e-05 6.55e+04
fp32 5.96e-08 1.40e-45 1.18e-38 3.40e+38

(*) Unlike the fp16 format, Intel’s bfloat16 does not support subnormal numbers. If subnormal numbers were supported in the same way as in IEEE arithmetic, xmins would be 9.18e-41.

The values in this table (and those for fp64 and fp128) are generated by the MATLAB function float_params that I have made available on GitHub and at MathWorks File Exchange.

## Harmonic Series

An interesting way to compare these different precisions is in summation of the harmonic series $1 + 1/2 + 1/3 + \cdots$. The series diverges, but when summed in the natural order in floating-point arithmetic it converges, because the partial sums grow while the addends decrease and eventually the addend is small enough that it does not change the partial sum. Here is a table showing the computed sum of the harmonic series for different precisions, along with how many terms are added before the sum becomes constant.

Arithmetic Computed Sum Number of terms
bfloat16 $5.0625$ $65$
fp16 $7.0859$ $513$
fp32 $15.404$ $2097152$
fp64 $34.122$ $2.81\dots\times 10^{14}$

The differences are striking! I determined the first three values in MATLAB. The fp64 value is reported by Malone based on a computation that took 24 days, and he also gives analysis to estimate the limiting sum and corresponding number of terms for fp64.

The NVIDIA V100 has tensor cores that can carry out the computation D = C + A*B in one clock cycle for 4-by-4 matrices A, B, and C; this is a 4-by-4 fused multiply-add (FMA) operation. Moreover, C and D can be in fp32. The benefits that the speed and accuracy of the tensor cores can bring over plain fp16 is demonstrated in Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers.

Intel’s bfloat16 format supports a scalar FMA d = c + a*b, where c and d are in fp32.

## Conclusion

A few years ago we had just single precision and double precision arithmetic. With the introduction of fp16 and fp128 in the IEEE standard in 2008, and now bfloat16 by Google and Intel, the floating-point landscape is becoming much more interesting.

# The Rise of Mixed Precision Arithmetic

For the last 30 years, most floating point calculations in scientific computing have been carried out in 64-bit IEEE double precision arithmetic, which provides the elementary operations of addition, subtraction, multiplication, and division at a relative accuracy of about $10^{-16}$. We are now seeing growing use of mixed precision, in which different floating point precisions are combined in order to deliver a result of the required accuracy at minimal cost.

Single precision arithmetic (32 bits) is an attractive alternative to double precision because it halves the costs of storing and transferring data, and on Intel chips the SSE extensions allow single precision arithmetic to run twice as fast as double.

Quadruple precision arithmetic, which was included in the 2008 revision of the IEEE standard, is supported by some compilers, and it can be implemented in terms of double precision arithmetic via double-double arithmetic. Arbitrary precision floating point arithmetic is available through, for example, the GNU MPFR library, the mpmath library for Python, the core data type BigFloat in the new language Julia, VPA arithmetic in the MATLAB Symbolic Math Toolbox, or the Advanpix Multiprecision Computing Toolbox for MATLAB.

Half precision arithmetic, in which a number occupies 16 bits, is supported by the IEEE standard for storage but not for computation. It has been argued that for deep learning half precision, with its relative accuracy of about $10^{-4}$, is good enough for training and running neural networks. Here are some of the ways in which extra precision is currently being used.

• Iterative refinement, in the traditional form that first became popular in the 1960s, improves the quality of an approximate solution to a linear system via updates obtained from residuals computed in extra precision.
• When an algorithm suffers instability it may be possible to overcome it by using extra precision in just a few, key places. This has been done recently in eigensolvers and for matrix orthonormalization.
• Any iterative algorithm that accepts an arbitrary starting point can be run once at a given precision and the solution used to “warm start” a second run of the same algorithm at higher precision. This idea has been used recently in linear programming.
• Numerical integration of differential equations over long time periods may need higher precision in order to allow the phenomena of interest to be observed. A recent example is in the study of Kerr (rotating) black holes, where the underlying hyperbolic partial differential equation is solved using quadruple precision arithmetic running on GPUs.
• When one is developing error bounds or testing algorithms, one needs in principle the exact solution. In practice, a solution computed at high precision and rounded to the working precision is usually adequate, and this is an approach I frequently use in my work in numerical linear algebra.

As the relative costs and ease of computing at different precisions evolve, due to changing architectures and software, as well as the disruptive influence of accelerators such as GPUs, we will see an increasing development and use of mixed precision algorithms. In some ways this is analogous to the increasing interoperability of programming languages (illustrated by C++, Julia, and Python, for example): one uses the main tool (precision) one would like to work with and brings in other tools (precisions) as necessary in order to complete the task.

Update: linear programming link updated December 18, 2018.

# Tiny Relative Errors

Let $x\ne0$ and $y$ be distinct floating point numbers. How small can the relative difference between $x$ and $y$ be? For IEEE double precision arithmetic the answer is $u = 2^{-53} \approx 1.1 \times 10^{-16}$, which is called the unit roundoff.

What if we now let $x$ and $y$ be vectors and define the relative difference as $d(x,y) = \|x-y\|_{\infty}/\|x\|_{\infty}$, where the infinity norm is $\|x\|_{\infty} = \max_i |x_i|$? It does not seem to be well known that $d(x,y)$ can be much less than $u$. I have observed this phenomenon numerous times over the years but had not previously stopped to wonder how it could happen. An example is

$x = \begin{bmatrix}1 \\ 10^{-22} \end{bmatrix}, \quad y = \begin{bmatrix}1 \\ 2\times 10^{-22} \end{bmatrix}, \quad d(x,y) = 10^{-22}$.

Notice that the largest element(s) in magnitude of $x$ agrees with the corresponding element(s) of $y$. This is, in fact, the only way that $d(x,y) < u$ is possible.

Theorem (Dingle & Higham, 2013).
Let $x\ne0$ and $y$ be $n$-vectors of normalized floating point numbers.
If $d(x,y) < u$ then $x_k = y_k$ for all $k$ such that $|x_k| = \|x\|_{\infty}$.

This result, and the scalar case mentioned above, are proved in

Nicholas J. Higham and Nicholas J. Dingle, Reducing the Influence of Tiny Normwise Relative Errors on Performance Profiles, MIMS EPrint 2011.90; to appear in ACM Trans. Math. Soft.

Performance profiles, introduced by Dolan and MorÃ© in 2002, are a popular way to compare competing algorithms according to particular measures of performance, such as relative error or execution time. We show in the paper above that tiny relative errors can result in misleading performance profiles and show how a simple transformation of the data can lead to much more useful profiles. For more, see the paper or the talk Numerical Issues in Testing Linear Algebra Algorithms that I gave at the SIAM Conference on Computational Science and Engineering in Boston last week.