What Is Bfloat16 Arithmetic?

Bfloat16 is a floating-point number format proposed by Google. The name stands for “Brain Floating Point Format” and reflects the format’s origin in the Google Brain artificial intelligence research group.

Bfloat16 is a 16-bit, base 2 storage format that allocates 8 bits for the significand and 8 bits for the exponent. It contrasts with the IEEE fp16 (half precision) format, which allocates 11 bits for the significand but only 5 bits for the exponent. In both cases the implicit leading bit of the significand is not stored, hence the “+1” in this diagram:

bfloat16-fp16.jpg

The motivation for bfloat16, with its large exponent range, was that “neural networks are far more sensitive to the size of the exponent than that of the mantissa” (Wang and Kanwar, 2019).

Bfloat16 uses the same number of bits for the exponent as the IEEE fp32 (single precision) format. This makes conversion between fp32 and bfloat16 easy (the exponent is kept unchanged and the significand is rounded or truncated from 24 bits to 8) and the possibility of overflow in the conversion is largely avoided. Overflow can still happen, though (depending on the rounding mode): the significand of fp32 is longer, so the largest fp32 number exceeds the largest bfloat16 number, as can be seen in the following table. Here, the precision of the arithmetic is measured by the unit roundoff, which is 2^{-t}, where t is the number of bits in the significand.

bfloat16_table.jpg

Note that although the table shows the minimum positive subnormal number for bfloat16, current implementations of bfloat16 do not appear to support subnormal numbers (this is not always clear from the documentation).

As the unit roundoff values in the table show, bfloat16 numbers have the equivalent of about three decimal digits of precision, which is very low compared with the eight and sixteen digits, respectively, of fp32 and fp64 (double precision).

The next table gives the number of numbers in the bfloat16, fp16, and fp32 systems. It shows that the bfloat16 number system is very small compared with fp32, containing only about 65,000 numbers.

flpt_count_bfloat16.jpg

The spacing of the bfloat16 numbers is large far from 1. For example, 65280, 65536, and 66048 are three consecutive bfloat16 numbers.
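
This spacing is easy to check experimentally. The following sketch simulates bfloat16 in MATLAB by simple truncation, zeroing the low 16 bits of the fp32 bit pattern (an actual implementation might round to nearest instead of truncating):

% Minimal sketch: simulate bfloat16 by truncation, i.e., keep the sign and
% 8-bit exponent of an fp32 number and chop the significand to 8 bits by
% zeroing the low 16 bits of the bit pattern.
to_bf16 = @(x) typecast(bitshift(bitshift(typecast(single(x),'uint32'),-16),16),'single');
to_bf16(65300)   % 65280, the bfloat16 number below 65536
to_bf16(65800)   % 65536
to_bf16(66300)   % 66048, the bfloat16 number above 65536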

At the time of writing, bfloat16 is available, or announced, on four platforms or architectures.

  • The Google Tensor Processing Units (TPUs, versions 2 and 3) use bfloat16 within the matrix multiplication units. In version 3 of the TPU the matrix multiplication units carry out the multiplication of 128-by-128 matrices.
  • The NVIDIA A100 GPU, based on the NVIDIA Ampere architecture, supports bfloat16 in its tensor cores through block fused multiply-adds (FMAs) C + A*B with 8-by-8 A and 8-by-4 B.
  • Intel has published a specification for bfloat16 and how it intends to implement it in hardware. The specification includes an FMA unit that takes as input two bfloat16 numbers a and b and an fp32 number c and computes c + a*b at fp32 precision, returning an fp32 number.
  • The Arm A64 instruction set supports bfloat16. In particular, it includes a block FMA C + A*B with 2-by-4 A and 4-by-2 B.

The pros and cons of bfloat16 arithmetic versus IEEE fp16 arithmetic are

  • bfloat16 has about one fewer digit (roughly three versus four) of equivalent decimal precision than fp16,
  • bfloat16 has a much wider range than fp16, and
  • current bfloat16 implementations do not support subnormal numbers, whereas fp16 does.

If you wish to experiment with bfloat16 but do not have access to hardware that supports it you will need to simulate it. In MATLAB this can be done with the chop function written by me and Srikara Pranesh.
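
As a rough sketch of how a call might look (I am assuming the name of the format option here; see the chop documentation for the definitive interface):

x = pi;
options.format = 'bfloat16';   % format name assumed from the chop documentation
options.round = 1;             % round to nearest
y = chop(x,options)            % 3.140625, the bfloat16 number nearest to pi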

References

This is a minimal set of references, which contain further useful references within.

What Is the Matrix Exponential?

The exponential of a square matrix A is defined by the power series (introduced by Laguerre in 1867)

\mathrm{e}^A = I + A + \displaystyle\frac{A^2}{2!} +                \frac{A^3}{3!} + \cdots.

That the series converges follows from the convergence of the series for scalars. Various other formulas are available, such as

\mathrm{e}^A = \displaystyle\lim_{s\to \infty} (I+A/s)^s.

The matrix exponential is always nonsingular and (\mathrm{e}^A)^{-1} = \mathrm{e}^{-A}.

Much interest lies in the connection between \mathrm{e}^{A+B} and \mathrm{e}^A  \mathrm{e}^B. It is easy to show that \mathrm{e}^{A+B} = \mathrm{e}^A \mathrm{e}^B if A and B commute, but commutativity is not necessary for the equality to hold. Series expansions are available that relate \mathrm{e}^{A+B} to \mathrm{e}^A \mathrm{e}^B for general A and B, including the Baker–Campbell–Hausdorff formula and the Zassenhaus formula, both of which involve the commutator [A,B] = AB - BA. For Hermitian A and B the inequality \mathrm{trace}(\mathrm{e}^{A+B}) \le \mathrm{trace}(\mathrm{e}^A \mathrm{e}^B) was proved independently by Golden and Thompson in 1965.

Especially important is the relation

\mathrm{e}^A = \bigl(\mathrm{e}^{A/2^s}\bigr)^{2^s},

for integer s, which is used in the scaling and squaring method for computing the matrix exponential.
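
For illustration only (expm in MATLAB uses a carefully designed Padé-based scaling and squaring algorithm), here is a minimal sketch of the idea with a truncated Taylor series and an arbitrary random matrix:

A = randn(4);
s = max(0,ceil(log2(norm(A,1))));     % crude choice: make norm(A)/2^s of order 1
X = A/2^s;
E = eye(4); T = eye(4);
for k = 1:20                          % truncated Taylor series for e^X
    T = T*X/k;
    E = E + T;
end
for k = 1:s, E = E*E; end             % undo the scaling: square s times
norm(E - expm(A),1)/norm(expm(A),1)   % small relative difference from expm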

Another important property of the matrix exponential is that it maps skew-symmetric matrices to orthogonal ones. Indeed if A = - A^T then

\bigl(\mathrm{e}^A\bigr)^{-1}    = \mathrm{e}^{-A}    = \mathrm{e}^{A^T}    = \bigl(\mathrm{e}^A\bigr)^T.

This is a special case of the fact that the exponential maps elements of a Lie algebra into the corresponding Lie group.
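
A quick numerical check in MATLAB with a random skew-symmetric matrix:

S = randn(4); S = S - S';   % S is skew-symmetric
Q = expm(S);
norm(Q'*Q - eye(4),1)       % of the order of the unit roundoff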

The matrix exponential plays a fundamental role in linear ordinary differential equations (ODEs). The vector ODE

\displaystyle\frac{dy}{dt} = A y, \quad y(0) = c

has solution y(t) = \mathrm{e}^{At} c, while the solution of the ODE in n \times n matrices

\displaystyle\frac{dY}{dt} = AY + YB, \quad Y(0) = C

is Y(t) = \mathrm{e}^{At}C\mathrm{e}^{Bt}.
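
A small sketch comparing the formula with a numerical ODE solver, for an arbitrary illustrative A and c:

A = [0 1; -2 -3]; c = [1; 0];
[t,y] = ode45(@(t,y) A*y, [0 1], c);   % numerical solution of dy/dt = A*y
norm(y(end,:)' - expm(A*t(end))*c)     % small (within the ode45 tolerance)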

In control theory, the matrix exponential is used in converting from continuous time dynamical systems to discrete time ones. Another application of the matrix exponential is in centrality measures for nodes in networks.

Many methods have been proposed for computing the matrix exponential. See the references for details.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is a Matrix Square Root?

A square root of an n\times n matrix A is any matrix X such that X^2 = A.

For a scalar a (n = 1), there are two square roots (which are equal if a = 0), and they are real if and only if a is real and nonnegative. For n \ge 2, depending on the matrix there can be no square roots, finitely many, or infinitely many. The matrix

A = \begin{bmatrix}      0  &  1 \\      0  &  0      \end{bmatrix}

is easily seen to have no square roots. The matrix

D =  \mathrm{diag}(1,2) =  \begin{bmatrix}                                1  &  0 \\                                0  &  2                               \end{bmatrix}

has four square roots, \mathrm{diag}(\pm 1, \pm\sqrt{2}). The identity matrix

\begin{bmatrix}      1  &  0 \\      0  &  1      \end{bmatrix}

has infinitely many square roots (namely the 2\times 2 involutory matrices), including \mathrm{diag}(\pm 1, \pm 1), the lower triangular matrix

\begin{bmatrix}      1  &  0 \\      1  &  -1      \end{bmatrix},

and any symmetric orthogonal matrix, such as

\begin{bmatrix}      \cos \theta  &  \sin \theta \\      \sin \theta  & -\cos \theta      \end{bmatrix}, \quad \theta \in[0,2\pi]

(which is a Householder matrix). Clearly, a square root of a diagonal matrix need not be diagonal.

The matrix square root of most practical interest is the one whose eigenvalues lie in the right half-plane, which is called the principal square root, written A^{1/2}. If A is nonsingular and has no eigenvalues on the negative real axis then A has a unique principal square root. For the diagonal matrix D above, D^{1/2} = \mathrm{diag}(1,\sqrt{2}).

A symmetric positive definite matrix has a unique symmetric positive definite square root. Indeed if A is symmetric positive definite then it has a spectral decomposition A = QDQ^T, where Q is orthogonal and D is diagonal with positive diagonal elements, and then A^{1/2} = Q D^{1/2}Q^T is also symmetric positive definite.
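
Here is a minimal MATLAB sketch of this construction, using the Lehmer matrix from the gallery as an arbitrary symmetric positive definite example:

A = gallery('lehmer',4);          % a symmetric positive definite test matrix
[Q,D] = eig(A);
X = Q*diag(sqrt(diag(D)))*Q';     % symmetric positive definite square root
norm(X^2 - A,1)                   % of the order of the unit roundoff
norm(X - sqrtm(A),1)              % agrees with MATLAB's sqrtm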

If A is nonsingular then it has at least 2^s square roots, where s is the number of distinct eigenvalues. The existence of a square root of a singular matrix depends on the Jordan structure of the zero eigenvalues.

In some contexts involving symmetric positive definite matrices A, such as Kalman filtering, a matrix Y such that A = Y^TY is called a square root, but this is not the standard meaning.

When A has structure one can ask whether a square root having the same structure, or some other related structure, exists. Results are known for (for example)

  • stochastic matrices,
  • M-matrices,
  • skew-Hamiltonian matrices,
  • centrosymmetric matrices, and
  • matrices from an automorphism group.

An important distinction is between square roots of A that can be expressed as a polynomial in A (primary square roots) and those that cannot. Square roots of the latter type arise when A has repeated eigenvalues and two copies of an eigenvalue are mapped to different square roots. In some contexts, a nonprimary square root may be the natural choice. For example, consider the matrix

G(\theta) = \begin{bmatrix}      \cos \theta  &  \sin \theta \\      -\sin \theta  & \cos \theta      \end{bmatrix}, \quad \theta \in[0,2\pi],

which represents a rotation through an angle \theta radians clockwise. The natural square root of G(\theta) is G(\theta/2). For \theta = \pi, this gives the square root

G(\pi/2) = \begin{bmatrix}       0 &  1 \\      -1 &  0      \end{bmatrix}

of

G(\pi) = \begin{bmatrix}       -1 & 0 \\        0 & -1      \end{bmatrix}.
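
A quick numerical check in MATLAB:

G = @(t) [cos(t) sin(t); -sin(t) cos(t)];
norm(G(pi/2)^2 - G(pi),1)    % of the order of the unit roundoff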

The matrix square root arises in many applications, often in connection with other matrix problems such as the polar decomposition, matrix geometric means, Markov chains (roots of transition matrices), quadratic matrix equations, and generalized eigenvalue problems. Most often the matrix is symmetric positive definite, but square roots of nonsymmetric matrices are also needed. Among modern applications, the matrix square root can be found in recent papers on machine learning.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is IEEE Standard Arithmetic?

The IEEE Standard 754, published in 1985 and revised in 2008 and 2019, is a standard for binary and decimal floating-point arithmetic. The standard for decimal arithmetic (IEEE Standard 854) was separate when it was first published in 1987, but it was included with the binary standard from 2008. We focus here on the binary part of the standard.

The standard specifies floating-point number formats, the results of the basic floating-point operations and comparisons, rounding modes, floating-point exceptions and their handling, and conversion between different arithmetic formats.

A binary floating-point number is represented as

y = \pm m \times 2^{e-t},

where t is the precision and e\in [e_{\min},e_{\max}] is the exponent. The significand m is an integer satisfying m \le 2^t-1. Numbers with m \ge 2^{t-1} are called normalized. Subnormal numbers, for which 0 < m <2^{t-1} and e = e_{\min}, are supported.

Four formats are defined, whose key parameters are summarized in the following table. The second column shows the number of bits allocated to store the significand and the exponent. I use the prefix “fp” instead of the prefix “binary” used in the standard. The unit roundoff is u = 2^{-t}.

ieee_params_table.jpg

Fp32 (single precision) and fp64 (double precision) were in the 1985 standard; fp16 (half precision) and fp128 (quadruple precision) were introduced in 2008. Fp16 is defined only as a storage format, though it is widely used for computation.

The size of these different number systems varies greatly. The next table shows the number of normalized and subnormal numbers in each system.

flpt_count.jpg

We see that while one can easily carry out a computation on every fp16 number (to check that the square root function is correctly computed, for example), it is impractical to do so for every double precision number.

A key feature of the standard is that it is a closed system, thanks to the inclusion of NaN (Not a Number) and \infty (usually written as inf in programming languages) as floating-point numbers: every arithmetic operation produces a number in the system. A NaN is generated by operations such as

0/0, \quad 0 \times \infty, \quad \infty/\infty, \quad (+\infty) + (-\infty), \quad \sqrt{-1}.

Arithmetic operations involving a NaN return a NaN as the answer. The number \infty obeys the usual mathematical conventions regarding infinity, such as

\infty+\infty = \infty, \quad (-1)\times\infty = -\infty, \quad   ({\textrm{finite}})/\infty = 0.

This means, for example, that 1 + 1/x evaluates as 1 when x = \infty.
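
These conventions are easy to observe in any IEEE-compliant environment, for example in MATLAB:

x = Inf;
1 + 1/x        % 1
0*Inf          % NaN
Inf - Inf      % NaN
(-1)*Inf       % -Inf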

The standard specifies that all arithmetic operations (including square root) are to be performed as if they were first calculated to infinite precision and then rounded to the target format. A number is rounded to the next larger or next smaller floating-point number according to one of four rounding modes:

  • round to the nearest floating-point number, with rounding to even (rounding to the number with a zero least significant bit) in the case of a tie;
  • round towards plus infinity and round towards minus infinity (used in interval arithmetic); and
  • round towards zero (truncation, or chopping).

For round to nearest it follows that

\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y)     = (x \mathbin{\mathrm{op}} y)(1+\delta),     \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\},

where \mathrm{f\kern.2ptl} denotes the computed result. The standard also includes a fused multiply-add operation (FMA), x*y + z. The definition requires it to be computed with just one rounding error, so that \mathrm{f\kern.2ptl}(x*y+z) is the rounded version of x*y+z, and hence satisfies

\mathrm{f\kern.2ptl}(x*y + z) = (x*y + z)(1+\delta), \quad |\delta|\le u.

FMAs are supported in some hardware and are usually executed at the same speed as a single addition or multiplication.

The standard recommends the provision of correctly rounded exponentiation (x^y) and transcendental functions (\exp, \log, \sin, \mathrm{acos}, etc.) and defines domains and special values for them, but these functions are not required.

A new feature of the 2019 standard is augmented arithmetic operations, which compute \mathrm{fl}(x \mathbin{\mathrm{op}}y) along with the error x\mathbin{\mathrm{op}}y - \mathrm{fl}(x \mathbin{\mathrm{op}}y), for \mathbin{\mathrm{op}} = +,-,*. These operations are useful for implementing compensated summation and other special high accuracy algorithms.
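
In the absence of hardware support for the augmented operations, the error term of an addition can be recovered in software by an error-free transformation; here is a minimal sketch of Knuth's two-sum algorithm in MATLAB:

x = 1; y = 2^-60;
s = x + y;                 % rounded sum: here s = 1
xp = s - y; yp = s - xp;
t = (x - xp) + (y - yp);   % t = 2^-60 recovers the rounding error, so s + t equals x + y exactly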

William (“Velvel”) Kahan of the University of California at Berkeley received the 1989 ACM Turing Award for his contributions to computer architecture and numerical analysis, and in particular for his work on IEEE floating-point arithmetic standards 754 and 854.

References

This is a minimal set of references, which contain further useful references within.

What Is Floating-Point Arithmetic?

A floating-point number system F is a finite subset of the real line comprising numbers of the form

y = \pm m \times \beta^{e-t},

where \beta is the base, t is the precision, and e\in [e_{\min},e_{\max}] is the exponent. The system is completely defined by the four integers \beta, t, e_{\min}, and e_{\max}. The significand m satisfies 0 \le m \le \beta^t-1. Normalized numbers are those for which m \ge \beta^{t-1}, and they have a unique representation. Subnormal numbers are those with 0 < m < \beta^{t-1} and e = e_{\min}.

An alternative representation of y\in F is

y = \pm \beta^e               \left( \displaystyle\frac{d_1}{\beta} +                      \frac{d_2}{\beta^2} + \cdots +                      \frac{d_t}{\beta^t} \right)         = \pm \beta^e \times .d_1 d_2 \dots d_t,

where each digit d_i satisfies 0 \le d_i \le \beta-1 and d_1 \ne 0 for normalized numbers.

The floating-point numbers are not equally spaced, but they have roughly constant relative spacing (varying by up to a factor \beta).

Here are the normalized nonnegative numbers in a toy system with \beta = 2, t = 3, and e \in [-1, 3].

flpt_numbers_fig.jpg

Three key properties that hold in general for binary arithmetic are visible in this example.

  • The spacing of the numbers increases by a factor 2 at every power of 2.
  • The spacing of the numbers between 1/2 and 1 is u = 2^{-t}, which is called the unit roundoff. The spacing of the numbers between 1 and 2 is \epsilon = 2^{1-t}, which is called the machine epsilon. Note that \epsilon = 2u.
  • There is a gap between 0 and the smallest normalized number, which is 2^{e_{\min}-1}. The subnormal numbers fill this gap with numbers having the same spacing as those between 2^{e_{\min}-1} and 2^{e_{\min}}, namely 2^{e_{\min}-t}. The next diagram shows the complete set of nonnegative normalized and subnormal numbers in the toy system.

flpt_numbers_all_fig.jpg
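
The complete set of nonnegative numbers in the toy system can also be generated directly from the definition; a minimal sketch:

t = 3; emin = -1; emax = 3;              % parameters of the toy system
x = [];
for e = emin:emax
    x = [x, (2^(t-1):2^t-1)*2^(e-t)];    % normalized numbers for this exponent
end
x = [0, (1:2^(t-1)-1)*2^(emin-t), x]     % prepend zero and the subnormal numbers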

In MATLAB, eps is the machine epsilon, eps(x) is the distance from x to the next larger (in magnitude) floating-point number, realmax is the largest finite number, and realmin is the smallest normalized positive number.
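
For example, in IEEE double precision (the MATLAB default):

eps        % 2^(-52) = 2.2204e-16, the machine epsilon
eps(2)     % 2^(-51): the spacing doubles at the power of 2
realmin    % 2^(-1022), the smallest normalized positive number
realmax    % (2 - 2^(-52))*2^1023, the largest finite number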

A real number x is mapped into F by rounding, and the result is denoted by \mathrm{f\kern.2ptl}(x). If x exceeds the largest number in F then we say that \mathrm{f\kern.2ptl}(x) overflows, and in IEEE arithmetic it is represented by Inf. If x\in F then \mathrm{f\kern.2ptl}(x) = x; otherwise, x lies between two floating-point numbers and we need a rule for deciding which one to round x to. The usual rule, known as round to nearest, is to round to whichever number is nearer. If x is midway between two floating-point numbers then we need a tie-breaking rule, which is usually to round to the number with an even last digit. If x\ne0 and \mathrm{f\kern.2ptl}(x) = 0 then \mathrm{f\kern.2ptl}(x) is said to underflow.

For round to nearest it can be shown that

\mathrm{f\kern.2ptl}(x)  = x(1+\delta), \quad |\delta|\le u.

This result shows that rounding introduces a relative error no larger than u.

The elementary floating-point operations +, -, *, /, and \sqrt{} are usually defined to return the correctly rounded exact result, so they satisfy

\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y)     = (x \mathbin{\mathrm{op}} y)(1+\delta),     \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\}.

Most floating-point arithmetics adhere to the IEEE standard, which defines several floating-point formats and four different rounding modes.

Another form of finite precision arithmetic is fixed-point arithmetic, in which numbers have the same form as F but with a fixed exponent e, so all the numbers are equally spaced. In most scientific computations scale factors must be introduced in order to be able to represent the range of numbers occurring. Fixed-point arithmetic is mainly used on special purpose devices such as FPGAs and in embedded systems.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is Rounding?

Rounding is the transformation of a number expressed in a particular base to a number with fewer digits. For example, in base 10 we might round the number x = 7.146 to 7.15, which can be described as rounding to three significant digits or two decimal places. Rounding does not change a number if it already has the requisite number of digits.

The three main uses of rounding are

  • to simplify a number in base 10 for human consumption,
  • to represent a constant such as 1/3, \pi, or \sqrt{7} in floating-point arithmetic,
  • to convert the result of an elementary operation (an addition, multiplication, or division) on floating-point numbers back into a floating-point number.

The floating-point numbers may be those used on a computer (base 2) or a pocket calculator (base 10).

Rounding can be done in several ways.

Round to Nearest

The most common form of rounding is to round to the nearest number with the specified number of significant digits or decimal places. In the example above, the two nearest numbers to x with three significant digits are 7.14 and 7.15, at distances 0.006 and 0.004, respectively, from x. The nearest of these two numbers, 7.15, is chosen.

What happens if the two candidate numbers are equally close? We need a rule for breaking the tie. The most common choices are

  • round to even: choose the number with an even last digit,
  • round to odd: choose the number with an odd last digit.

If we round 1.85 to two significant digits, the result is 1.8 with round to even and 1.9 with round to odd.

There are several reasons for preferring to break ties with round to even.

  • In bases 2 and 10 a subsequent rounding to one less place does not involve a tie. Thus we have the rounding sequence 2.445, 2.44, 2.4, 2 with round to even, but 2.445, 2.45, 2.5, 3 with round to odd.
  • For base 2, round to even results in integers more often, as a consequence of producing a zero least significant bit.
  • In base 10, after round to even a rounded number can be halved without error.

IEEE Standard 754 for floating-point arithmetic supports three tie-breaking methods: round to even (the default), round to the number with larger magnitude, and round towards zero (introduced in the 2019 revision for use with the standard’s new augmented operations).

The tie-breaking rule taught in UK schools, for decimal arithmetic, is to round up on ties. The rounding rule then becomes: round down if the first digit to be dropped is 4 or less and otherwise round up.

Round Towards Plus or Minus Infinity

Another possibility is to round to the next larger number with the specified number of digits, which is known as round towards plus infinity (or round up). Then 1.85 rounds to 1.9 and -2.34 rounds to -2.3. Similarly, with round towards minus infinity (or round down) we round to the next smaller number, so that 1.85 rounds to 1.8 and -2.34 rounds to -2.4.

This form of rounding is used in interval arithmetic, where an interval guaranteed to contain the exact result is computed in floating-point arithmetic.

Round Towards Zero

In this form of rounding we round towards zero, that is, we round x down if x > 0 and round it up if x < 0. This is also known as chopping, or truncation.

Stochastic Rounding

Stochastic rounding was proposed in the 1950s and is attracting renewed interest, especially in machine learning. It rounds up or down randomly. It comes in two forms. The first form rounds up or down with equal probability 1/2. To describe the second form, let x be the given number and let x_1 and x_2, with x_1 \le x \le x_2, be the two candidates for the result of rounding. We round up to x_2 with probability (x-x_1)/(x_2-x_1) and down to x_1 with probability (x_2-x)/(x_2-x_1); note that these probabilities sum to 1. In floating-point arithmetic, stochastic rounding overcomes the problem that can arise in summing a set of numbers whereby some individual summands are so small that they do not contribute to the computed sum even though they contribute to the exact sum.
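
Here is a minimal sketch of the second form in MATLAB, rounding onto a grid of equally spaced numbers (a simplification: a floating-point implementation works on the significand of the numbers involved):

sr = @(x,h) h*(floor(x/h) + (rand < (x/h - floor(x/h))));  % round x onto a grid of spacing h
mean(arrayfun(@(k) sr(0.7,1), 1:1e5))                      % approximately 0.7: the expected value is preserved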

The diagrams below illustrate round to nearest (RN), round towards zero (RZ), round towards plus infinity (\mathrm{R}^{+\infty}), and round towards minus infinity (\mathrm{R}^{-\infty}). They show the number x to be rounded in four different configurations with respect to the origin and the midpoint (drawn with a dotted line) of the interval between the two candidate rounding results (drawn with a solid line). The red arrows point to the two possible results of rounding.

rounding_fig.jpg

Real World Rounding

The European Commission’s rules for converting currencies of Member States into Euros (from the time of the creation of the Euro) specify that “half-way results are rounded up” (rounded to plus infinity).

The International Association of Athletics Federations (IAAF) specifies in Rule 165 of its Competition Rules 2018–2019 that all times of track races up to 10,000m should be recorded to a precision of 0.01 second, with rounding to plus infinity. In 2006, the athlete Justin Gatlin was wrongly credited with breaking the 100m world record when his official time of 9.766 seconds was rounded down to 9.76 seconds. Under the IAAF rules it should have been rounded up to 9.77 seconds, matching the world record set by Asafa Powell the year before. The error was discovered several days after the race.

In meteorology, rounding to nearest with ties broken by rounding to odd is favoured. Hunt suggests that the reason is to avoid falsely indicating that it is freezing. Thus 0.5^\circC and 32.5^\circF round to 1^\circC and 33^\circF instead of 0^\circC and 32^\circF.

Useful Tool

The \LaTeX package siunitx has the ability to round numbers (in base 10) to a specified number of decimal places or significant figures.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

  • What Is Floating-Point Arithmetic? (2020)—forthcoming
  • What Is IEEE Standard Arithmetic? (2020)—forthcoming

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is a Random Orthogonal Matrix?

Various explicit parametrized formulas are available for constructing orthogonal matrices. To construct a random orthogonal matrix we can take such a formula and assign random values to the parameters. For example, a Householder matrix H = I - 2uu^T/(u^Tu) is orthogonal and symmetric and we can choose the nonzero vector u randomly. Such an example is rather special, though, as it is a rank-1 perturbation of the identity matrix.

What is usually meant by a random orthogonal matrix is a matrix distributed according to the Haar measure over the group of orthogonal matrices. The Haar measure provides a uniform distribution over the orthogonal matrices. Indeed it is invariant under multiplication on the left and the right by orthogonal matrices: if Q is from the Haar distribution then so is UQV for any orthogonal (possibly non-random) U and V. A random Householder matrix is not Haar distributed.

A matrix from the Haar distribution can be generated as the orthogonal factor in the QR factorization of a random matrix with elements from the standard normal distribution (mean 0, variance 1). In MATLAB this is done by the code

[Q,R] = qr(randn(n));
Q = Q*diag(sign(diag(R)));

The statement [Q,R] = qr(randn(n)), which returns the orthogonal factor Q, is not enough on its own to give a Haar distributed matrix, because the QR factorization is not unique. The second line adjusts the signs so that Q is from the unique factorization in which the triangular factor R has nonnegative diagonal elements. This construction requires 2n^3 flops.

A more efficient construction is possible, as suggested by Stewart (1980). Let x_k be an (n-k+1)-vector of elements from the standard normal distribution and let H_k be the Householder matrix that reduces x_k to r_{kk}e_1, where e_1 is the first unit vector. Then Q = DH'_1H'_2\dots H'_{n-1} is Haar distributed, where H'_k = \mathrm{diag}(I_{k-1}, H_k), D = \mathrm{diag}(\mathrm{sign}(r_{kk})), and r_{nn} = x_n. This construction expresses Q as the product of n-1 Householder matrices of growing effective dimension, and the product can be formed from right to left in 4n^3/3 flops. The MATLAB statement Q = gallery('qmult',n) carries out this construction.

A similar construction can be made using Givens rotations (Anderson et al., 1987).

Orthogonal matrices from the Haar distribution can also be formed as Q = A (A^TA)^{-1/2}, where the elements of A are from the standard normal distribution. This Q is the orthogonal factor in the polar decomposition A = QH (where H is symmetric positive semidefinite).
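
This Q can be computed via the singular value decomposition; a minimal sketch:

n = 4;
A = randn(n);
[U,S,V] = svd(A);
Q = U*V';                   % orthogonal factor of the polar decomposition A = QH
norm(Q'*Q - eye(n),1)       % of the order of the unit roundoff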

Random orthogonal matrices arise in a variety of applications, including Monte Carlo simulation, random matrix theory, machine learning, and the construction of test matrices with known eigenvalues or singular values.

All these ideas extend to random unitary (complex) matrices. In MATLAB, Haar distributed unitary matrices can be constructed by the code

[Q,R] = qr(complex(randn(n),randn(n)));
Q = Q*diag(sign(diag(R)));

This code exploits the fact that the R factor computed by MATLAB has real diagonal entries. If the diagonal of R were complex then this code would need to be modified to use the complex sign function given by \mathrm{sign}(z) = z/|z|.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is a Correlation Matrix?

In linear algebra terms, a correlation matrix is a symmetric positive semidefinite matrix with unit diagonal. In other words, it is a symmetric matrix with ones on the diagonal whose eigenvalues are all nonnegative.

The term comes from statistics. If x_1, x_2, \dots, x_n are column vectors with m elements, each vector containing samples of a random variable, then the corresponding n\times n covariance matrix V has (i,j) element

v_{ij} = \mathrm{cov}(x_i,x_j) = \displaystyle\frac{1}{m-1}            (x_i - \overline{x}_i)^T (x_j - \overline{x}_j),

where \overline{x}_i is the mean of the elements in x_i. If V has nonzero diagonal elements then we can scale the diagonal to 1 to obtain the corresponding correlation matrix

C = D^{-1/2} V D^{-1/2},

where D = \mathrm{diag}(v_{ii}). The (i,j) element c_{ij} = v_{ii}^{-1/2} v_{ij} v_{jj}^{-1/2} is the correlation between the variables x_i and x_j.
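
In MATLAB the scaling is a one-liner, and the result agrees with what corrcoef computes directly from the data; a minimal sketch:

X = randn(100,4);         % 100 samples of 4 variables
V = cov(X);               % sample covariance matrix
d = sqrt(diag(V));
C = V./(d*d');            % correlation matrix: agrees with corrcoef(X) up to roundoff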

Here are a few facts.

  • The elements of a correlation matrix lie on the interval [-1, 1].
  • The eigenvalues of a correlation matrix lie on the interval [0,n].
  • The eigenvalues of a correlation matrix sum to n (since the eigenvalues of a matrix sum to its trace).
  • The maximal possible determinant of a correlation matrix is 1.

It is usually not easy to tell whether a given matrix is a correlation matrix. For example, the matrix

A = \begin{bmatrix}      1  &  1 &   0\\      1  &  1 &   1\\      0  &  1 &   1      \end{bmatrix}

is not a correlation matrix: it has eigenvalues -0.4142, 1.0000, 2.4142. The only value of a_{13} and a_{31} that makes A a correlation matrix is 1.

A particularly simple class of correlation matrices is the one-parameter class A_n with every off-diagonal element equal to w, illustrated for n = 3 by

A_3 = \begin{bmatrix}      1  &  w &   w\\      w  &  1 &   w\\      w  &  w &   1      \end{bmatrix}.

The matrix A_n is a correlation matrix for -1/(n-1) \le w \le 1.

In some applications it is required to generate random correlation matrices, for example in Monte-Carlo simulations in finance. A method for generating random correlation matrices with a specified eigenvalue distribution was proposed by Bendel and Mickey (1978); Davies and Higham (2000) give improvements to the method. This method is implemented in the MATLAB function gallery('randcorr').

Obtaining or estimating correlations can be difficult in practice. In finance, market data is often missing or stale; different assets may be sampled at different time points (e.g., some daily and others weekly); and the matrices may be generated from different parametrized models that are not consistent. Similar problems arise in many other applications. As a result, correlation matrices obtained in practice may not be positive semidefinite, which can lead to undesirable consequences such as an investment portfolio with negative risk.

In risk management and insurance, matrix entries may be estimated, prescribed by regulations or assigned by expert judgement, but some entries may be unknown.

Two problems therefore commonly arise in connection with correlation matrices.

Nearest Correlation Matrix

Here, we have an approximate correlation matrix A that has some negative eigenvalues and we wish to replace it by the nearest correlation matrix. The natural choice of norm is the Frobenius norm, \|A\|_F = \bigl(\sum_{i,j} a_{ij}^2\bigr)^{1/2}, so we solve the problem

\min \{ \, \|A-C\|_F: C~\textrm{is a correlation matrix} \,\}.

We may also have a requirement that certain elements of C remain fixed. And we may want to weight some elements more than others, by using a weighted Frobenius norm. These are convex optimization problems and have a unique solution that can be computed using the alternating projections method (Higham, 2002) or a Newton algorithm (Qi and Sun, 2006; Borsdorf and Higham, 2010).
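
Here is a minimal sketch of the alternating projections method, with Dykstra's correction, applied to the invalid matrix A above (for illustration only; the Newton-based codes are much faster):

A = [1 1 0; 1 1 1; 0 1 1];                 % the invalid "correlation matrix" above
n = size(A,1); Y = A; dS = zeros(n);
for k = 1:200
    R = Y - dS;                            % apply Dykstra's correction
    [Q,D] = eig((R+R')/2);
    X = Q*max(D,0)*Q';                     % project onto the positive semidefinite cone
    dS = X - R;
    Y = X; Y(1:n+1:end) = 1;               % project onto the unit-diagonal matrices
end
Y                                          % approximately the nearest correlation matrix to A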

Another variation requires C to have factor structure, which means that the off-diagonal agrees with that of a rank-k matrix for some given k (Borsdorf, Higham, and Raydan, 2010). Yet another variation imposes a constraint that C has a certain rank or a rank no larger than a certain value. These problems are non-convex, because of the objective function and the rank constraint, respectively.

Another approach that can be used for restoring definiteness, although it does not in general produce the nearest correlation matrix, is shrinking, which constructs a convex linear combination S(\alpha) = \alpha M + (1-\alpha)A, where M is a target correlation matrix and \alpha\in[0,1] is taken as small as possible subject to S(\alpha) being positive semidefinite (Higham, Strabić, and Šego, 2016). Shrinking can readily incorporate fixed blocks and weighting.

Correlation Matrix Completion

Here, we have a partially specified matrix and we wish to complete it, that is, fill in the missing elements in order to obtain a correlation matrix. It is known that a completion is possible for any set of specified entries if the associated graph is chordal (Grone et al., 1984). In general, if there is one completion there are many, but there is a unique one of maximal determinant, which is elegantly characterized by the property that the inverse contains zeros in the positions of the unspecified entries.

References

This is a minimal set of references, and they cite further useful references.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is a Hadamard Matrix?

A Hadamard matrix is an n\times n matrix with elements \pm 1 and mutually orthogonal columns. For example,

\left[\begin{array}{rrrr}       1 & 1 & 1 & 1\\       1 & -1 & 1 & -1\\       1 & 1 & -1 & -1\\       1 & -1 & -1 & 1 \end{array}\right]

is a Hadamard matrix.

A necessary condition for an n\times n Hadamard matrix to exist with n > 2 is that n is divisible by 4, but it is not known if a Hadamard matrix exists for every such n.

A Hadamard matrix of order 428 was found for the first time in 2005. The smallest multiple of 4 for which a Hadamard matrix has not been found is 668.

A Hadamard matrix satisfies H^T H = nI, so H^{-1} = n^{-1}H^T. It also follows that \det(H) = \pm n^{n/2}. Hadamard’s inequality states that for an n\times n real matrix A, |\det(A)| \le \prod_{k=1}^n \|a_k\|_2, where a_k is the kth column of A. A Hadamard matrix achieves equality in this inequality (as does any matrix with orthogonal columns).

Hadamard matrices can be generated with a recursive (Kronecker product) construction: if H is a Hadamard matrix then so is

\left[\begin{array}{rr}          H & H\\          H & -H   \end{array}\right].

So starting with a Hadamard matrix of size m, one can build up matrices of size 2^km for k = 1,2,\dots. The MATLAB hadamard function uses this technique. It includes the following Hadamard matrix of order 12, for which we simply display the signs of the elements:

\left[\begin{array}{rrrrrrrrrrrr} {}+ & + & + & + & + & + & + & + & + & + & + & +\\ {}+ & - & + & - & + & + & + & - & - & - & + & -\\ {}+ & - & - & + & - & + & + & + & - & - & - & +\\ {}+ & + & - & - & + & - & + & + & + & - & - & -\\ {}+ & - & + & - & - & + & - & + & + & + & - & -\\ {}+ & - & - & + & - & - & + & - & + & + & + & -\\ {}+ & - & - & - & + & - & - & + & - & + & + & +\\ {}+ & + & - & - & - & + & - & - & + & - & + & +\\ {}+ & + & + & - & - & - & + & - & - & + & - & +\\ {}+ & + & + & + & - & - & - & + & - & - & + & -\\ {}+ & - & + & + & + & - & - & - & + & - & - & +\\ {}+ & + & - & + & + & + & - & - & - & + & - & - \end{array}\right].
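
The Kronecker construction is easy to try in MATLAB; this sketch doubles the order of the matrix returned by the hadamard function and checks the defining property H^TH = nI:

H = hadamard(4);            % MATLAB's hadamard function
H8 = [H H; H -H];           % the Kronecker construction doubles the order
norm(H8'*H8 - 8*eye(8),1)   % zero: H8 is a Hadamard matrix of order 8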

Hadamard matrices have applications in optimal design theory, coding theory, and graph theory.

In numerical analysis, Hadamard matrices are of interest because when LU factorization is performed on them they produce a growth factor of at least n, for any form of pivoting. Evidence suggests that the growth factor for complete pivoting is exactly n, but this has not been proved. It has been proved that any n\times n Hadamard matrix has growth factor n for complete pivoting for n = 12 and n = 16.

An interesting property of Hadamard matrices is that the p-norm (the matrix norm subordinate to the vector p-norm) is known explicitly for all p:

\|H\|_p = \max\bigl( n^{1/p}, n^{1-1/p} \bigr), \quad 1\le p\le \infty.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is an Orthogonal Matrix?

A real, square matrix Q is orthogonal if Q^TQ = QQ^T = I (the identity matrix). Equivalently, Q^{-1} = Q^T. The columns of an orthogonal matrix are orthonormal, that is, they have 2-norm (Euclidean length) 1 and are mutually orthogonal. The same is true of the rows.

Important examples of orthogonal matrices are rotations and reflectors. A 2\times 2 rotation matrix has the form

\begin{bmatrix}             c & s \\             -s& c \\      \end{bmatrix},       \quad c^2 + s^2 = 1.

For such a matrix, c = \cos \theta and s = \sin \theta for some \theta, and the multiplication y = Qx for a 2\times 1 vector x represents a rotation through an angle \theta radians. An n\times n rotation matrix is formed by embedding the 2\times 2 matrix into the identity matrix of order n.

A Householder reflector is a matrix of the form H = I - 2uu^T/(u^Tu), where u is a nonzero n-vector. It is orthogonal and symmetric. When applied to a vector it reflects the vector about the hyperplane orthogonal to u. For n = 2, such a matrix has the form

\begin{bmatrix}     c &  s \\      s& -c \\      \end{bmatrix}, \quad c^2 + s^2 = 1.

Here is the 4\times 4 Householder reflector corresponding to u = [1,1,1,1]^T/2:

\frac{1}{2}         \left[\begin{array}{@{\mskip2mu}rrrr@{\mskip2mu}}                        1 &   -1 &   -1 &   -1\\                       -1 &    1 &   -1 &   -1\\                       -1 &   -1 &    1 &   -1\\                       -1 &   -1 &   -1 &    1\\        \end{array}\right].

This is 1/2 times a Hadamard matrix.

Various explicit formulas are known for orthogonal matrices. For example, the n\times n matrices with (i,j) elements

q_{ij} = \displaystyle\frac{2}{\sqrt{2n+1}}        \sin \left(\displaystyle\frac{2ij\pi}{2n+1}\right)

and

q_{ij} =              \sqrt{\displaystyle\frac{2}{n}}\cos              \left(\displaystyle\frac{(i-1/2)(j-1/2)\pi}{n} \right)

are orthogonal. These and other orthogonal matrices, as well as diagonal scalings of orthogonal matrices, are constructed by the MATLAB function gallery('orthog',...).
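
A quick numerical check of the first formula in MATLAB:

n = 5;
[I,J] = ndgrid(1:n);
Q = (2/sqrt(2*n+1))*sin(2*I.*J*pi/(2*n+1));
norm(Q'*Q - eye(n),1)       % of the order of the unit roundoff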

Here are some properties of orthogonal matrices.

  • All the eigenvalues are on the unit circle, that is, they have modulus 1.
  • All the singular values are 1.
  • The 2-norm condition number is 1, so orthogonal matrices are perfectly conditioned.
  • Multiplication by an orthogonal matrix preserves Euclidean length: \|Qx\|_2 = \|x\|_2 for any vector x.
  • The determinant of an orthogonal matrix is \pm 1. A rotation has determinant 1 while a reflection has determinant -1.

Orthogonal matrices can be generated from skew-symmetric ones. If S is skew-symmetric (S = -S^T) then \exp(S) (the matrix exponential) is orthogonal and the Cayley transform (I-S)(I+S)^{-1} is orthogonal as long as S has no eigenvalue equal to -1.
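
A quick check of the Cayley transform with a random skew-symmetric matrix:

S = randn(4); S = S - S';        % skew-symmetric
Q = (eye(4)-S)/(eye(4)+S);       % Cayley transform (I-S)(I+S)^{-1}
norm(Q'*Q - eye(4),1)            % of the order of the unit roundoff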

Unitary matrices are complex square matrices Q for which Q^*Q = QQ^* = I, where Q^* is the conjugate transpose of Q. They have analogous properties to orthogonal matrices.

Related Blog Posts

  • What Is a Hadamard Matrix? (2020)—forthcoming
  • What Is a Random Orthogonal Matrix? (2020)—forthcoming

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.