What Is Bfloat16 Arithmetic?

Bfloat16 is a floating-point number format proposed by Google. The name stands for “Brain Floating Point Format” and originates from the Google Brain artificial intelligence research group.

Bfloat16 is a 16-bit, base 2 storage format that allocates 8 bits for the significand and 8 bits for the exponent. It contrasts with the IEEE fp16 (half precision) format, which allocates 11 bits for the significand but only 5 bits for the exponent. In both cases the implicit leading bit of the significand is not stored, so the stored fields are 1 sign bit, 8 exponent bits, and 7 (+1 implicit) significand bits for bfloat16, and 1 sign bit, 5 exponent bits, and 10 (+1 implicit) significand bits for fp16.

The motivation for bfloat16, with its large exponent range, was that “neural networks are far more sensitive to the size of the exponent than that of the mantissa” (Wang and Kanwar, 2019).

Bfloat16 uses the same number of bits for the exponent as the IEEE fp32 (single precision) format. This makes conversion between fp32 and bfloat16 easy (the exponent is kept unchanged and the significand is rounded or truncated from 24 bits to 8) and the possibility of overflow in the conversion is largely avoided. Overflow can still happen, though (depending on the rounding mode): the significand of fp32 is longer, so the largest fp32 number exceeds the largest bfloat16 number, as can be seen in the following table. Here, the precision of the arithmetic is measured by the unit roundoff, which is $2^{-t}$, where $t$ is the number of bits in the significand.
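The conversion just described can be sketched at the bit level. The following Python code is an illustration only, not any vendor's implementation: the helper names are hypothetical, NaNs are not handled, the input is assumed to be representable in fp32, and the result is returned as an ordinary Python float whose value is a bfloat16 number.

```python
import math
import struct

def fp32_to_bits(x):
    """Return the 32-bit pattern of x stored as an fp32 number."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_fp32(b):
    """Return the fp32 number with the 32-bit pattern b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def bfloat16_truncate(x):
    """Keep the top 16 bits of the fp32 pattern (round toward zero)."""
    return bits_to_fp32(fp32_to_bits(x) & 0xFFFF0000)

def bfloat16_nearest(x):
    """Round to nearest, ties to even, on the top 16 bits (NaNs not handled)."""
    b = fp32_to_bits(x)
    lsb = (b >> 16) & 1                  # last bit that will be kept
    b = (b + 0x7FFF + lsb) & 0xFFFF0000  # a carry may propagate into the exponent
    return bits_to_fp32(b)

fp32_max = (2 - 2.0**-23) * 2.0**127     # largest finite fp32 number
print(bfloat16_truncate(fp32_max))       # largest finite bfloat16 number
print(bfloat16_nearest(fp32_max))        # rounds up past it, giving inf
```

The last two lines show the dependence on the rounding mode: truncation maps the largest fp32 number to the largest bfloat16 number, whereas round to nearest overflows to infinity.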

Note that although the table shows the minimum positive subnormal number for bfloat16, current implementations of bfloat16 do not appear to support subnormal numbers (this is not always clear from the documentation).

As the unit roundoff values in the table show, bfloat16 numbers have the equivalent of about three decimal digits of precision, which is very low compared with the eight and sixteen digits, respectively, of fp32 and fp64 (double precision).

The next table gives the number of numbers in the bfloat16, fp16, and fp32 systems. It shows that the bfloat16 number system is very small compared with fp32, containing only about 65,000 numbers.

The spacing of the bfloat16 numbers is large far from 1. For example, 65280, 65536, and 66048 are three consecutive bfloat16 numbers.
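The spacing can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the argument is assumed to be a positive normalized bfloat16 number, and the limits of the exponent range are ignored.

```python
import math

T = 8  # bfloat16 significand bits, counting the implicit leading bit

def gap_above(x):
    """Distance from the bfloat16 number x to the next larger one."""
    _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - T)

def gap_below(x):
    """Distance from the bfloat16 number x to the next smaller one."""
    m, e = math.frexp(x)
    # Just below a power of two the numbers are twice as dense.
    return math.ldexp(1.0, e - T - (1 if m == 0.5 else 0))

x = 65536.0
print(x - gap_below(x), x, x + gap_above(x))   # 65280.0 65536.0 66048.0
```

Since 65536 is a power of 2, the gap below it (256) is half the gap above it (512).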

At the time of writing, bfloat16 is available, or announced, on four platforms or architectures.

The Google Tensor Processing Units (TPUs, versions 2 and 3) use bfloat16 within the matrix multiplication units. In version 3 of the TPU the matrix multiplication units carry out the multiplication of 128-by-128 matrices.
The NVIDIA A100 GPU, based on the NVIDIA Ampere architecture, supports bfloat16 in its tensor cores through block fused multiply-adds (FMAs) $C + A*B$ with 8-by-8 $A$ and 8-by-4 $B$.
Intel has published a specification for bfloat16 and how it intends to implement it in hardware. The specification includes an FMA unit that takes as input two bfloat16 numbers $a$ and $b$ and an fp32 number $c$ and computes $c + a*b$ at fp32 precision, returning an fp32 number.
The Arm A64 instruction set supports bfloat16. In particular, it includes a block FMA $C + A*B$ with 2-by-4 $A$ and 4-by-2 $B$.

The pros and cons of bfloat16 arithmetic versus IEEE fp16 arithmetic are

bfloat16 has about one fewer digit of equivalent decimal precision than fp16 (roughly three versus four),
bfloat16 has a much wider range than fp16, and
current bfloat16 implementations do not support subnormal numbers, while fp16 does.

If you wish to experiment with bfloat16 but do not have access to hardware that supports it, you will need to simulate it. In MATLAB this can be done with the chop function written by me and Srikara Pranesh.

References

This is a minimal set of references, which contain further useful references within.

Arm A64 Instruction Set Architecture Armv8, for Armv8-A Architecture Profile, ARM Limited, 2019.
Intel Corporation, BFLOAT16—Hardware Numerics Definition, 2018.
IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (Revision of IEEE 754-2008), The Institute of Electrical and Electronics Engineers, New York, 2019.
NVIDIA Corporation, NVIDIA A100 Tensor Core GPU Architecture, 2020.

Related Blog Posts

BFloat16 Processing for Neural Networks on Armv8-A by Nigel Stephens (2019)
BFloat16: The Secret To High Performance on Cloud TPUs by Shibo Wang and Pankaj Kanwar (2019)
A Multiprecision World (2017)
Half Precision Arithmetic: fp16 Versus bfloat16 (2018)
The Rise of Mixed Precision Arithmetic (2015)
Simulating Low Precision Floating-Point Arithmetics in MATLAB (2020)
What Is Floating-Point Arithmetic? (2020)
What Is IEEE Standard Arithmetic? (2020)

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Bfloat16

Fp16 has the drawback for scientific computing of having a limited range, its largest positive number being $6.55 \times 10^4$. This has led to the development of an alternative 16-bit format that trades precision for range. The bfloat16 format is used by Google in its tensor processing units. Intel, which plans to support bfloat16 in its forthcoming Nervana Neural Network Processor, has recently (November 2018) published a white paper that gives a precise definition of the format.

The allocation of bits to the exponent and significand for bfloat16, fp16, and fp32 is shown in this table, where the implicit leading bit of a normalized number is counted in the significand.

Format	Significand	Exponent
bfloat16	8 bits	8 bits
fp16	11 bits	5 bits
fp32	24 bits	8 bits

Bfloat16 has three fewer bits in the significand than fp16, but three more in the exponent, and it has the same exponent size as fp32. Consequently, converting from fp32 to bfloat16 is easy: the exponent is kept the same and the significand is rounded or truncated from 24 bits to 8. Underflow cannot occur for normalized numbers, and overflow is largely avoided: it can happen only when rounding (as opposed to truncation) carries a number near the top of the fp32 range past the largest bfloat16 number.

On the other hand, when we convert from fp32 to the much narrower fp16 format, overflow and underflow can readily happen, necessitating the development of techniques for rescaling before conversion; see the recent EPrint “Squeezing a Matrix Into Half Precision, with an Application to Solving Linear Systems” by me and Sri Pranesh.

The drawback of bfloat16 is its lesser precision: essentially 3 significant decimal digits versus 4 for fp16. The next table shows the unit roundoff $u$, smallest positive (subnormal) number xmins, smallest normalized positive number xmin, and largest finite number xmax for the three formats.

Format	$u$	xmins	xmin	xmax
bfloat16	3.91e-03	(*)	1.18e-38	3.39e+38
fp16	4.88e-04	5.96e-08	6.10e-05	6.55e+04
fp32	5.96e-08	1.40e-45	1.18e-38	3.40e+38

(*) Unlike the fp16 format, Intel’s bfloat16 does not support subnormal numbers. If subnormal numbers were supported in the same way as in IEEE arithmetic, xmins would be 9.18e-41.

The values in this table (and those for fp64 and fp128) are generated by the MATLAB function float_params that I have made available on GitHub and at MathWorks File Exchange.
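For readers without MATLAB, the table entries can be recomputed from the significand length $t$ and the maximum exponent of each format. The Python sketch below is not the float_params function itself: the name float_parameters is hypothetical, and IEEE-style subnormal numbers are assumed for every format, so for bfloat16 it reports the hypothetical xmins mentioned in the footnote.

```python
from collections import namedtuple

Params = namedtuple("Params", "u xmins xmin xmax")

def float_parameters(t, emax):
    """Parameters of a binary format with t significand bits (implicit bit
    included) and maximum exponent emax, assuming IEEE-style subnormals."""
    emin = 1 - emax
    return Params(
        u=2.0**-t,                            # unit roundoff
        xmins=2.0**(emin - t + 1),            # smallest positive subnormal
        xmin=2.0**emin,                       # smallest positive normalized
        xmax=2.0**emax * (2 - 2.0**(1 - t)),  # largest finite
    )

# (t, emax) for each format
formats = {"bfloat16": (8, 127), "fp16": (11, 15), "fp32": (24, 127)}
for name, (t, emax) in formats.items():
    p = float_parameters(t, emax)
    print(f"{name:9s} {p.u:.2e} {p.xmins:.2e} {p.xmin:.2e} {p.xmax:.2e}")
```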

Harmonic Series

An interesting way to compare these different precisions is in summation of the harmonic series $1 + 1/2 + 1/3 + \cdots$. The series diverges, but when summed in the natural order in floating-point arithmetic it converges, because the partial sums grow while the addends decrease and eventually the addend is small enough that it does not change the partial sum. Here is a table showing the computed sum of the harmonic series for different precisions, along with how many terms are added before the sum becomes constant.

Arithmetic	Computed Sum	Number of terms
bfloat16	$5.0625$	$65$
fp16	$7.0859$	$513$
fp32	$15.404$	$2097152$
fp64	$34.122$	$2.81\dots\times 10^{14}$

The differences are striking! I determined the first three values in MATLAB. The fp64 value is reported by Malone based on a computation that took 24 days, and he also gives analysis to estimate the limiting sum and corresponding number of terms for fp64.
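The experiment can be imitated in Python. The following is a sketch under stated assumptions rather than the MATLAB code used for the table: low precision is simulated by rounding every term and every partial sum to $t$ significant bits (round to nearest, ties to even), the exponent range is ignored (harmless here, since the quantities involved are well inside the range of both formats), and the function names are hypothetical. The count returned is the index of the first term that fails to change the sum.

```python
import math

def round_to_bits(x, t):
    """Round x to t significant bits, ties to even; exponent range ignored."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(math.ldexp(m, t)), e - t)

def harmonic_sum(t):
    """Sum 1 + 1/2 + 1/3 + ... in t-bit arithmetic until the sum stagnates."""
    s, k = 0.0, 0
    while True:
        k += 1
        term = round_to_bits(1.0 / k, t)
        new_s = round_to_bits(s + term, t)
        if new_s == s:                 # this term no longer changes the sum
            return s, k
        s = new_s

print(harmonic_sum(8))    # bfloat16: (5.0625, 65)
print(harmonic_sum(11))   # fp16
```

Calling harmonic_sum(24) imitates fp32 in the same way, but needs about two million iterations.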

Fused Multiply-Add

The NVIDIA V100 has tensor cores that can carry out the computation D = C + A*B in one clock cycle for 4-by-4 matrices A, B, and C; this is a 4-by-4 fused multiply-add (FMA) operation. Moreover, C and D can be in fp32. The benefits that the speed and accuracy of the tensor cores can bring over plain fp16 are demonstrated in “Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers”.

Intel’s bfloat16 format supports a scalar FMA d = c + a*b, where c and d are in fp32.
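To see why accumulating in fp32 matters, here is a small Python sketch of such a scalar FMA. It is an illustration under assumptions, not Intel's specification: the function names are hypothetical, a single rounding to fp32 is assumed, and exponent range limits for bfloat16 are ignored. The key fact used is that the product of two bfloat16 numbers has at most 16 significant bits and so is exactly representable in fp32.

```python
import math
import struct

def round_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 number."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_bf16(x):
    """Round to 8 significant bits, ties to even; exponent range ignored."""
    if x == 0.0:
        return x
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(math.ldexp(m, 8)), e - 8)

def fma_bf16_fp32(a, b, c):
    """d = c + a*b for bfloat16 a, b and fp32 c, with the result in fp32."""
    # a*b is exact in fp64; for the values below the fp64 sum is exact too,
    # so the only rounding is the final one to fp32.
    return round_fp32(c + a * b)

a = b = 1 + 2.0**-7                    # a bfloat16 number
c = 1.0
d_fma = fma_bf16_fp32(a, b, c)
d_bf16 = round_bf16(c + round_bf16(a * b))   # every operation rounded to bfloat16
print(d_fma, d_bf16)
```

In this example the FMA result is exact, whereas rounding the product to bfloat16 first loses the term $2^{-14}$.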
