What Is a Matrix Function?

If you give an array as the argument to a mathematical function in a programming language or problem solving environment you are likely to receive the result of applying the function to each component of the array. For example, here MATLAB exponentiates the integers from 0 to 3:

>> A = [0 1; 2 3], X = exp(A)
A =
    0     1
    2     3
X =
   1.0000e+00   2.7183e+00
   7.3891e+00   2.0086e+01

When the array represents a matrix this componentwise evaluation is not useful, as it does not obey the rules of matrix algebra. To see what properties we might require, consider the matrix square root. This function is notable historically because Cayley considered it in the 1858 paper in which he introduced matrix algebra.

A square root of an n\times n matrix A is a matrix X such that X^2 = A. Any square root X satisfies

AX = X^2 X = X X^2 = XA,

so X commutes with A. Also, A^* = (X^2)^* = (X^*)^2, so X^* is a square root of A^* (here, the superscript * denotes the conjugate transpose).

Furthermore, for any nonsingular matrix Z we have

(Z^{-1}XZ)^2 = Z^{-1}X^2 Z = Z^{-1}A Z.

If we choose Z as a matrix that takes A to its Jordan canonical form J then we have (Z^{-1}XZ)^2 = J, so that Z^{-1}XZ is a square root of J, or in other words X can be expressed as X = Z \sqrt{J} Z^{-1}.

Generalizing from these properties of the matrix square root it is reasonable to ask that a function f(A) of a square matrix A satisfies

  • f(A) commutes with A,
  • f(A^*) = f(A)^*,
  • f(A) = Z f(J)Z^{-1}, where A has the Jordan canonical form A = ZJZ^{-1}.

(These are of course not the only requirements; indeed f(A) \equiv I satisfies all three conditions.)

Many definitions of f(A) can be given that satisfy these and other natural conditions. We will describe just three definitions (a notable omission is a definition in terms of polynomial interpolation).

Power Series

For a function f that has a power series expansion we can define f(A) by substituting A for the variable. It can be shown that the matrix series will be convergent for matrices whose eigenvalues lie within the radius of convergence of the scalar power series. For example, where \rho denotes the spectral radius. This definition is natural for functions that have a power series expansion, but it is rather limited in its applicability.

\begin{aligned}    \cos(A) &= I - \displaystyle\frac{A^2}{2!} + \frac{A^4}{4!} - \frac{A^6}{6!} +                    \cdots,\\   \log(I+A) &= A - \displaystyle\frac{A^2}{2} + \frac{A^3}{3} - \frac{A^4}{4} +                     \cdots, \quad \rho(A)<1,   \end{aligned}

Jordan Canonical Form Definition

If A has the Jordan canonical form

Z^{-1}AZ = J = \mathrm{diag}(J_1, J_2, \dots, J_p),


J_k = J_k(\lambda_k) = \begin{bmatrix}                 \lambda_k & 1         &          &           \\                           & \lambda_k & \ddots   &           \\                           &           & \ddots   &    1      \\                           &           &          & \lambda_k       \end{bmatrix}       = \lambda_k I + N \in \mathbb{C}^{m_k\times m_k},


f(A) := Z f(J) Z^{-1} = Z \mathrm{diag}(f(J_k)) Z^{-1},


f(J_k) :=  \begin{bmatrix}   f(\lambda_k) & f'(\lambda_k) &  \dots  & \displaystyle\frac{f^{(m_k-1)}(\lambda_k)}{(m_k-1)!} \\          & f(\lambda_k)  & \ddots  & \vdots \\          &          & \ddots  & f'(\lambda_k) \\          &          &         & f(\lambda_k)      \end{bmatrix}.

Notice that f(J_k) has the derivatives of f along its diagonals in the upper triangle. Write J_k = \lambda_k I + N, where N is zero apart from a superdiagonal of 1s. The formula for f(J_k) is just the Taylor series expansion

f(J_k) = f(\lambda_k I + N) = f(\lambda_k)I + f'(\lambda_k)N +             \displaystyle\frac{f''(\lambda_k)}{2!}N^2 + \cdots +             \displaystyle\frac{f^{(m_k-1)}(\lambda_k)}{(m_k-1)!}N^{m_k-1}.

The series is finite because N^{m_k} = 0: as N is powered up the superdiagonal of 1s moves towards the right-hand corner until it eventually disappears.

An immediate consequence of this formula is that f(A) is defined only if the necessary derivatives exist: for each eigenvalue \lambda_k we need the existence of the derivatives at \lambda_k of order up to one less than the size of the largest Jordan block in which \lambda_k appears.

When A is a diagonalizable matrix the definition simplifies considerably: if A = ZDZ^{-1} with D = \mathrm{diag}(\lambda_i) then f(A) = Zf(D)Z^{-1}  = Z \mathrm{diag}(\lambda_i) Z^{-1}.

Cauchy Integral Formula

For a function f analytic on and inside a closed contour \Gamma that encloses the spectrum of A the matrix f(A) can be defined by the Cauchy integral formula

f(A) := \displaystyle\frac{1}{2\pi \mathrm{i}}      \int_\Gamma f(z) (zI-A)^{-1} \,\mathrm{d}z.

This formula is mathematically elegant and can be used to provide short proofs of certain theoretical results.

Equivalence of Definitions

These and several other definitions turn out to be equivalent under suitable conditions. This was recognized by Rinehart, who wrote in 1955

“There have been proposed in the literature since 1880 eight distinct definitions of a matric function, by Weyr, Sylvester and Buchheim, Giorgi, Cartan, Fantappiè, Cipolla, Schwerdtfeger and Richter … All of the definitions except those of Weyr and Cipolla are essentially equivalent.”

Computing a Matrix Function in MATLAB

In MATLAB, matrix functions are distinguished from componentwise array evaluation by the postfix “m” on a function name. Thus expm, logm, and sqrtm are matrix function counterparts of exp, log, and sqrt. Compare the example at the start of this article with

>> A = [0 1; 2 3], X = expm(A)
A =
    0     1
    2     3
X =
   5.2892e+00   8.4033e+00
   1.6807e+01   3.0499e+01

The MATLAB function funm calculates f(A) for general matrix functions f (subject to some restrictions).


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Posted in what-is | Tagged | Leave a comment

Randsvd Matrices with Large Growth Factors

Sixty years ago James Wilkinson published his backward error analysis of Gaussian elimination for solving a linear system Ax = b, where A is a nonsingular n\times n matrix. He showed that in floating-point arithmetic the computed solution \widehat{x} satisfies

(A+\Delta A) \widehat{x} = b, \qquad      \|\Delta A\|_{\infty} \le  p(n) \rho_n  u \|A\|_{\infty},

where u is the unit roundoff and p is a low degree polynomial. The term \rho_n is the growth factor, defined by

\rho_n = \displaystyle\frac{\max_{i,j,k} |a_{ij}^{(k)}|}               {\max_{i,j}|a_{ij}|} \ge 1,

where the a_{ij}^{(k)} are the elements at the kth stage of Gaussian elimination. The growth factor measures how much elements grow during the elimination. We would like the product p(n)\rho_n to be of order 1, so that \Delta A is a small relative perturbation of A. We therefore need \rho_n not to be too large.

With partial pivoting, in which row interchanges are used to ensure that at each stage the pivot element is the largest in its column, Wilkinson showed that \rho_n \le 2^{n-1} and that equality is possible. Such exponential growth implies a large \Delta A (unless we are lucky), meaning a severe loss of numerical stability. However, seventy years of digital computing experience have shown that \rho_n is usually of modest size in practice. Explaining why this is the case is one of the outstanding problems in numerical analysis.

It is easy to experiment with growth factors in MATLAB. I will use the function

function g = gf(A)
%GF     Approximate growth factor.
%   g = GF(A) is an approximation to the
%   growth factor for LU factorization
%   with partial pivoting.
[~,U] = lu(A);
g = max(abs(U),[],'all')/max(abs(A),[],'all');

It computes a lower bound on the growth factor (since it only considers k=n in the numerator in the definition), but it is entirely adequate for our purposes here. Let’s compute the growth factor for a random matrix of order 10,000 with elements from the standard normal distribution (mean 0, variance 1):

>> rng(1); n = 10000; gf(randn(n))
ans =

Growth of 61 is unremarkable for a matrix of this size. Now we try a matrix of the same size generated by the gallery('randsvd') function:

>> A = gallery('randsvd',n,1e6,2,[],[],1);
>> gf(A)
ans =

This function generates an n\times n matrix with known singular value distribution and with singular vector matrices that are random orthogonal matrices from the Haar distribution. The parameter 1e6 specifies the 2-norm condition number, while the 2 (the mode parameter) specifies that there is only one small singular value, so the singular values are 1 repeated n-1 times and 1e-6. Growth of 975 is exceptional! These matrices have been in MATLAB since the 1990s, but this large growth property has apparently not been noticed before.

It turns out that mode 2 randsvd matrices generate with high probability growth factors of size at least n/(4 \log n) for any condition number and for any pivoting strategy, not just partial pivoting. One way to check this is to randomly permute the columns of A before doing the LU factorization with partial pivoting:

>> gf(A(:,randperm(n)))
ans =

Here is a plot showing the maximum over 12 randsvd matrices for each n of the growth factors for three different pivoting strategies, along with the maximum growth factors for partial pivoting for rand and randn matrices. The black curve is n/(4 \log n). This plot emphasizes the unusually large growth for mode 2 randsvd matrices.


What is the explanation for this large growth? It stems from three facts.

  • Haar distributed orthogonal matrices have the property that that their elements are fairly small with high probability, as shown by Jiang in 2005.
  • If the largest entries in magnitude of A and A^{-1} are both small, in the sense that their product is \theta \ll 1, then A will produce a growth factor of at least 1/\theta for any pivoting strategy. This was proved by Des Higham and I in the paper Large Growth Factors in Gaussian Elimination with Pivoting.
  • If W is an orthogonal matrix generating large growth then a rank-1 perturbation of 2-norm at most 1 tends to preserve the large growth.

For full details see the new EPrint Random Matrices Generating Large Growth in LU Factorization with Pivoting by Des Higham, Srikara Pranesh and me.

Is growth of order n a problem in practice? It can be for two reasons.

  • The largest dense linear systems Ax = b solved today are of dimension n = 10^7. If we work in single precision then nu \approx 1 and so LU factorization can potentially be completely unstable if there is growth of order n.
  • For IEEE half precision arithmetic growth of order n will cause overflow once n exceeds 10^5 / \max_{i,j} |a_{ij}|. It was overflow in half precision LU factorization on randsvd matrices that alerted us to the large growth.
Posted in research | Tagged , | 2 Comments

What Is Bfloat16 Arithmetic?

Bfloat16 is a floating-point number format proposed by Google. The name stands for “Brain Floating Point Format” and it originates from the Google Brain artificial intelligence research group at Google.

Bfloat16 is a 16-bit, base 2 storage format that allocates 8 bits for the significand and 8 bits for the exponent. It contrasts with the IEEE fp16 (half precision) format, which allocates 11 bits for the significand but only 5 bits for the exponent. In both cases the implicit leading bit of the significand is not stored, hence the “+1” in this diagram:


The motivation for bfloat16, with its large exponent range, was that “neural networks are far more sensitive to the size of the exponent than that of the mantissa” (Wang and Kanwar, 2019).

Bfloat16 uses the same number of bits for the exponent as the IEEE fp32 (single precision) format. This makes conversion between fp32 and bfloat16 easy (the exponent is kept unchanged and the significand is rounded or truncated from 24 bits to 8) and the possibility of overflow in the conversion is largely avoided. Overflow can still happen, though (depending on the rounding mode): the significand of fp32 is longer, so the largest fp32 number exceeds the largest bfloat16 number, as can be seen in the following table. Here, the precision of the arithmetic is measured by the unit roundoff, which is 2^{-t}, where t is the number of bits in the significand.


Note that although the table shows the minimum positive subnormal number for bfloat16, current implementations of bfloat16 do not appear to support subnormal numbers (this is not always clear from the documentation).

As the unit roundoff values in the table show, bfloat16 numbers have the equivalent of about three decimal digits of precision, which is very low compared with the eight and sixteen digits, respectively, of fp32 and fp64 (double precision).

The next table gives the number of numbers in the bfloat16, fp16, and fp32 systems. It shows that the bfloat16 number system is very small compared with fp32, containing only about 65,000 numbers.


The spacing of the bfloat16 numbers is large far from 1. For example, 65280, 65536, and 66048 are three consecutive bfloat16 numbers.

At the time of writing, bfloat16 is available, or announced, on four platforms or architectures.

  • The Google Tensor Processing Units (TPUs, versions 2 and 3) use bfloat16 within the matrix multiplication units. In version 3 of the TPU the matrix multiplication units carry out the multiplication of 128-by-128 matrices.
  • The NVIDIA A100 GPU, based on the NVIDIA Ampere architecture, supports bfloat16 in its tensor cores through block fused multiply-adds (FMAs) C + A*B with 8-by-8 A and 8-by-4 B.
  • Intel has published a specification for bfloat16 and how it intends to implement it in hardware. The specification includes an FMA unit that takes as input two bfloat16 numbers a and b and an fp32 number c and computes c + a*b at fp32 precision, returning an fp32 number.
  • The Arm A64 instruction set supports bfloat16. In particular, it includes a block FMA C + A*B with 2-by-4 A and 4-by-2 B.

The pros and cons of bfloat16 arithmetic versus IEEE fp16 arithmetic are

  • bfloat16 has about one less (roughly three versus four) digit of equivalent decimal precision than fp16,
  • bfloat16 has a much wider range than fp16, and
  • current bfloat16 implementations do not support subnormal numbers, while fp16 does.

If you wish to experiment with bfloat16 but do not have access to hardware that supports it you will need to simulate it. In MATLAB this can be done with the chop function written by me and Srikara Pranesh.


This is a minimal set of references, which contain further useful references within.

Posted in what-is | Tagged , , | Leave a comment

What Is the Matrix Exponential?

The exponential of a square matrix A is defined by the power series (introduced by Laguerre in 1867)

\mathrm{e}^A = I + A + \displaystyle\frac{A^2}{2!} +                \frac{A^3}{3!} + \cdots.

That the series converges follows from the convergence of the series for scalars. Various other formulas are available, such as

\mathrm{e}^A = \displaystyle\lim_{s\to \infty} (I+A/s)^s.

The matrix exponential is always nonsingular and (\mathrm{e}^A)^{-1} = \mathrm{e}^{-A}.

Much interest lies in the connection between \mathrm{e}^{A+B} and \mathrm{e}^A  \mathrm{e}^B. It is easy to show that \mathrm{e}^{A+B} = \mathrm{e}^A \mathrm{e}^B if A and B commute, but commutativity is not necessary for the equality to hold. Series expansions are available that relate \mathrm{e}^{A+B} to \mathrm{e}^A \mathrm{e}^B for general A and B, including the Baker–Campbell–Hausdorff formula and the Zassenhaus formula, both of which involve the commutator [A,B] = AB - BA. For Hermitian A and B the inequality \mathrm{trace}(\mathrm{e}^{A+B}) \le \mathrm{trace}(\mathrm{e}^A \mathrm{e}^B) was proved independently by Golden and Thompson in 1965.

Especially important is the relation

\mathrm{e}^A = \bigl(\mathrm{e}^{A/2^s}\bigr)^{2^s},

for integer s, which is used in the scaling and squaring method for computing the matrix exponential.

Another important property of the matrix exponential is that it maps skew-symmetric matrices to orthogonal ones. Indeed if A = - A^T then

\bigl(\mathrm{e}^A\bigr)^{-1}    = \mathrm{e}^{-A}    = \mathrm{e}^{A^T}    = \bigl(\mathrm{e}^A\bigr)^T.

This is a special case of the fact that the exponential maps elements of a Lie algebra into the corresponding Lie group.

The matrix exponential plays a fundamental role in linear ordinary differential equations (ODEs). The vector ODE

\displaystyle\frac{dy}{dt} = A y, \quad y(0) = c

has solution y(t) = \mathrm{e}^{At} c, while the solution of the ODE in n \times n matrices

\displaystyle\frac{dY}{dt} = AY + YB, \quad Y(0) = C

is Y(t) = \mathrm{e}^{At}Ce^{Bt}.

In control theory, the matrix exponential is used in converting from continuous time dynamical systems to discrete time ones. Another application of the matrix exponential is in centrality measures for nodes in networks.

Many methods have been proposed for computing the matrix exponential. See the references for details.


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Posted in what-is | Tagged | 1 Comment

Simulating Low Precision Floating-Point Arithmetics in MATLAB

At least five floating-point arithmetics are available in mainstream hardware:

  • the IEEE double precision (fp64), single precision (fp32), and half precision (fp16) formats,
  • bfloat16, and
  • tf32, introduced in the recently announced NVIDIA A100, which uses the NVIDIA Ampere GPU architecture.

Only fp32 and fp64 are available on current Intel processors and most programming environments support only these two precisions. For many of us it is therefore necessary to simulate other precisions if we wish to explore their properties and the behavior of algorithms when run at these precisions.


Srikara Pranesh and I have written a MATLAB function chop that allows arithmetic of arbitrary precision and exponent range (both lower than fp64) to be simulated. The idea is very simple: variables are stored in fp32 or fp64 and then rounded to the required precision and exponent range after each operation by calling chop. The precision and range, as well as various other parameters, can be changed at any point in the computation.

Our approach can be compared with defining a new data type and storage format and converting to and from single precision or double precision in order to carry out the computations. Our approach avoids the overheads of repeatedly converting to and from the special storage format. It gains speed and generality at the cost of requiring chop calls to be inserted in the code.

Chop is available from this GitHub repository and from MathWorks File Exchange. The motivation behind chop is described in this open access paper in SIAM. J. Sci. Comput. We have made use of chop to simulate fp16 and bfloat16 in several recent papers, including on mixed precision algorithms, probabilistic rounding error analysis, and fast and accurate summation.

How to Use Chop

A call to chop has the form chop(X,options), where the structure options spceifies how to round the elements of the real array X to a lower precision arithmetic. X should be a single precision or double precision array and the output will have the same type. The structure options controls various aspects of the rounding, as we now explain.

  1. The arithmetic format is specified by options.format, which is one of
    • ‘b’, ‘bfloat16’: bfloat16,
    • ‘h’, ‘half’, ‘fp16’: IEEE half precision (the default),
    • ‘s’, ‘single’, ‘fp32’: IEEE single precision,
    • ‘d’, ‘double’, ‘fp64’: IEEE double precision,
    • ‘c’, ‘custom’: custom format.

    In the last case the (base 2) format is defined by options.params, which is a 2-vector [t,emax], where t is the number of bits in the significand (including the hidden bit) and emax is the maximum value of the exponent.

  2. options.subnormal specifies whether subnormal numbers are supported (if they are not, subnormals are flushed to zero):
    • 0 = do not support subnormals (the default for bfloat16),
    • 1 = support subnormals (the default for fp16, fp32, and fp64).
  3. The form of rounding is specified by options.round:
    • 1: round to nearest using round to even last bit to break ties (the default);
    • 2: round towards plus infinity (round up);
    • 3: round towards minus infinity (round down);
    • 4: round towards zero;
    • 5: stochastic rounding—round to the next larger or next smaller floating-point number with probability proportional to 1 minus the distance to those floating-point numbers;
    • 6: stochastic rounding—round to the next larger or next smaller floating-point number with equal probability.

    For stochastic rounding, exact floating-point numbers are not changed.

  4. If options.flip = 1 (default 0) then each element of the rounded result has, with probability options.p (default 0.5), a randomly chosen bit in its significand flipped. This option is useful for simulating soft errors.
  5. If options.explim = 0 (default 1) then emax (the maximal exponent) for the specified arithmetic is ignored, so overflow, underflow, or subnormal numbers will be produced only if necessary for the data type of X. This option is useful for exploring low precisions independent of range limitations.

To avoid repeatedly passing the options argument, one can issue a call chop([],options) and the subsequent calls will use the previously given options parameters. The value of options can be inspected with [~,options] = chop.

In the rest of this post we give examples of the use of chop.

Fp16 versus Bfloat

The code

x = 1/3;
opt_h.format = 'h'; xh = chop(x,opt_h); err_h = (x - xh)/x
opt_b.format = 'b'; xb = chop(x,opt_b); err_b = (x - xb)/x
x = 70000;
xh = chop(x,opt_h)
xb = chop(x,opt_b)

produces the output

err_h =
err_b =
xh =
xb =

which illustrates the higher precison but smaller range of fp16 compared with bfloat16.

Rounding Modes

The code

opt.format = 'h'; x = 0.1; rng(2)
for k = 1:6
opt.round = k;
y(k) = chop(x,opt);
errors = y - x
diff_y = diff(y)

rounds 0.1 to fp16 using six different rounding modes. The output is

errors =
-2.4414e-05   3.6621e-05  -2.4414e-05  -2.4414e-05  -2.4414e-05
diff_y =
6.1035e-05  -6.1035e-05            0            0   6.1035e-05

Rounding maps a number to the next larger or next smaller floating-point number, so there are only two possibilties for the error, and they have opposite signs. The diff_y values are consistent with the spacing of the fp16 floating-point numbers around 0.1, which is 2^{-11}/8.

Matrix Multiplication

The overheads of chop can be minimized by choosing a suitable implementation of a computation. Matrix multiplication provides a good example. The multiplication C = A*B of n-by-n matrices involves 2n^3 floating-point operations, each of which needs to be chopped. The following code uses only 2n calls to chop, because it processes an outer product in a single call, taking advantage of the fact that chop can take a matrix argument.

A = rand(m,n); B = rand(n,p);
C = zeros(m,p);
for i = 1:n
C = chop(C + chop(A(:,i)*B(i,:)));

Tf32 Arithmetic

The recently announced NVIDIA A100 GPU supports a new floating-point format called tf32 in its tensor cores. Tf32 is a 19-bit format that has 11 bits in the significand (including the hidden bit) and 8 bits in the exponent. So it has the precision of fp16 and the range of bfloat16, effectively combining the best features of these two formats. We can simulate tf32 using the custom format of chop. The code

opt.format = 'custom';
% tf32. Significand: 10 bits plus 1 hidden, exponent: 8 bits.
t = 11; emax = 127;
opt.params = [t,emax];
x = 1/3;
xt = chop(x,opt); err_t = (x - xt)/x
x = 70000;
xt = chop(x,opt)

produces the output

err_t =
xt =

The error is the same as for fp16 but the number 70000 now rounds to a finite floating-point number.

Posted in software | Tagged | Leave a comment

What Is a Matrix Square Root?

A square root of an n\times n matrix A is any matrix X such that X^2 = A.

For a scalar a (n = 1), there are two square roots (which are equal if a = 0), and they are real if and only if a is real and nonnegative. For n \ge 2, depending on the matrix there can be no square roots, finitely many, or infinitely many. The matrix

A = \begin{bmatrix}      0  &  1 \\      0  &  0      \end{bmatrix}

is easily seen to have no square roots. The matrix

D =  \mathrm{diag}(1,2) =  \begin{bmatrix}                                1  &  0 \\                                0  &  2                               \end{bmatrix}

has four square roots, \mathrm{diag}(\pm 1, \pm\sqrt{2}). The identity matrix

\begin{bmatrix}      1  &  0 \\      0  &  1      \end{bmatrix}

has infinitely many square roots (namely the 2\times 2 involutory matrices), including \mathrm{diag}(\pm 1, \pm 1), the lower triangular matrix

\begin{bmatrix}      1  &  0 \\      1  &  -1      \end{bmatrix},

and any symmetric orthogonal matrix, such as

\begin{bmatrix}      \cos \theta  &  \sin \theta \\      \sin \theta  & -\cos \theta      \end{bmatrix}, \quad \theta \in[0,2\pi]

(which is a Householder matrix). Clearly, a square root of a diagonal matrix need not be diagonal.

The matrix square root of most practical interest is the one whose eigenvalues lie in the right half-plane, which is called the principal square root, written A^{1/2}. If A is nonsingular and has no eigenvalues on the negative real axis then A has a unique principal square root. For the diagonal matrix D above, D^{1/2} = \mathrm{diag}(1,\sqrt{2}).

A symmetric positive definite matrix has a unique symmetric positive definite square root. Indeed if A is symmetric positive definite then it has a spectral decomposition A = QDQ^T, where Q is orthogonal and D is diagonal with positive diagonal elements, and then A^{1/2} = Q D^{1/2}Q^T is also symmetric positive definite.

If A is nonsingular then it has at least 2^s square roots, where s is the number of distinct eigenvalues. The existence of a square root of a singular matrix depends on the Jordan structure of the zero eigenvalues.

In some contexts involving symmetric positive definite matrices A, such as Kalman filtering, a matrix Y such that A = Y^TY is called a square root, but this is not the standard meaning.

When A has structure one can ask whether a square root having the same structure, or some other related structure, exists. Results are known for (for example)

  • stochastic matrices,
  • M-matrices,
  • skew-Hamiltonian matrices,
  • centrosymmetric matrices, and
  • matrices from an automorphism group.

An important distinction is between square roots of A that can be expressed as a polynomial in A (primary square roots) and those that cannot. Square roots of the latter type arise when A has repeated eigenvalues and two copies of an eigenvalue are mapped to different square roots. In some contexts, a nonprimary square root may be the natural choice. For example, consider the matrix

G(\theta) = \begin{bmatrix}      \cos \theta  &  \sin \theta \\      -\sin \theta  & \cos \theta      \end{bmatrix}, \quad \theta \in[0,2\pi],

which represents a rotation through an angle \theta radians clockwise. The natural square root of G(\theta) is G(\theta/2). For \theta = \pi, this gives the square root

G(\pi/2) = \begin{bmatrix}       0 &  1 \\      -1 &  0      \end{bmatrix}


G(\pi) = \begin{bmatrix}       -1 & 0 \\        0 & -1      \end{bmatrix}.

The matrix square root arises in many applications, often in connection with other matrix problems such as the polar decomposition, matrix geometric means, Markov chains (roots of transition matrices), quadratic matrix equations, and generalized eigenvalue problems. Most often the matrix is symmetric positive definite, but square roots of nonsymmetric matrices are also needed. Among modern applications, the matrix square root can be found in recent papers on machine learning.


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Posted in what-is | Tagged | 3 Comments

1984 Symposium in Honour of James Wilkinson

In September 1984 a two-day Symposium on Computational Mathematics–State of the Art was held at Argonne National Laboratory in honour of James Hardy Wilkinson on his 65th birthday.

Wilkinson was one of the leading figures in 20th century numerical analysis. He developed the theory and practice of backward error analysis for floating-point computation, and developed, analyzed, and implemented in software many algorithms in numerical linear algebra. Among his many honours, Wilkinson was a Fellow of the Royal Society (elected 1969) and a winner of both the Turing Prize (1970) and the SIAM John von Neumman Lecture award (1970). His legacy endures and in 2019 we celebrated the centenary of his birth with the conference Advances in Numerical Linear Algebra.

The 1984 symposium provided “an overview of the state of the art in several of the major areas of computational mathematics” and “was particularly appropriate for this occasion in view of the many fundamental contributions in computational mathematics made by Professor Wilkinson throughout his distinguished career”.

The symposium attracted more than 260 attendees and comprised ten invited talks by distinguished researchers:

  • Some Problems in the Scaling of Matrices, G. W. Stewart
  • Matrix Calculations in Optimization Algorithms, M. J. D. Powell
  • Numerical Solution of Differential-Algebraic Equations, C. W. Gear
  • The Requirements of Interactive Data Analysis Systems, Peter J. Huber
  • Linear Algebra Problems in Multivariate Approximation Theory, C. de Boor
  • Computational linear Algebra in Continuation and Two Point Boundary Value Problems, H. B. Keller
  • Second Thoughts on the Mathematical Software Effort: A Perspective, W. J. Cody
  • Enhanced Resolution in Shock Wave Calculations, James Glimm
  • Givens’ Eigenvector Recurrence Revisited, Beresford Parlett
  • The State of the Art in Error Analysis, James H. Wilkinson

Slides from the talks are available in an informal proceedings available here in PDF form and on Google Books.

Jack Dongarra has provided me with contact prints of photos taken at the workshop. I have scanned some of these and they are shown below.

One of the photos shows Jim Wilkinson with an Apple Mac that was presented to him. He used it to type up several of his later papers. Sadly, Jim died just over two years after the symposium.

Posted in conferences | Tagged | 2 Comments

Singular Values of Rank-1 Perturbations of an Orthogonal Matrix

What effect does a rank-1 perturbation of norm 1 to an n\times n orthogonal matrix have on the extremal singular values of the matrix? Here, and throughout this post, the norm is the 2-norm. The largest singular value of the perturbed matrix is bounded by 2, as can be seen by taking norms, so let us concentrate on the smallest singular value.

Consider first a perturbation of the identity matrix: B = I + xy^T, for unit norm x and y. The matrix B has eigenvalues 1 (repeated n-1 times) and 1 + y^Tx. The matrix is singular—and hence has a zero singular value—precisely when y^Tx = -1, which is the smallest value that the inner product y^Tx can take.

Another example is B = A + yy^T, where A = I - 2yy^T and y has unit norm, so that A is a Householder matrix. Here, B = I - yy^T is singular with null vector y, so it has a zero singular value,

Let’s take a random orthogonal matrix and perturb it with a random rank-1 matrix of unit norm. We use the following MATLAB code.

n = 100; rng(1)
A = gallery('qmult',n); % Random Haar distrib. orthogonal matrix.
x = randn(n,1); y = randn(n,1);
x = x/norm(x); y = y/norm(y);
B = A + x*y';
svd_B = svd(B);
max_svd_B = max(svd_B), min_svd_B = min(svd_B)

The output is

max_svd_B =
min_svd_B =

We started with a matrix having singular values all equal to 1 and now have a matrix with largest singular value a little larger than 1 and smallest singular value a little smaller than 1. If we keep running this code the extremal singular values of B do not change much; for example, the next run gives

max_svd_B =
min_svd_B =

A rank-1 perturbation of unit norm could make A singular, as we saw above, but our random perturbations are producing a well conditioned matrix.

What is the explanation? First, note that a rank-1 perturbation to an orthogonal matrix A can only change two of the singular values, because the singular values are the square roots of the eigenvalues of A^T A, which is the identity plus a rank-2 matrix. So n-2 singular values remain at 1.

A result of Benaych-Georges and Nadakuditi (2012) says that for large n the largest and smallest singular values of B tend to (1+\sqrt{5})/2 = 1.618\dots and (-1+\sqrt{5})/2 = 0.618\dots respectively! As our example shows, n does not have to be large for these limits to be approximations correct to roughly the first digit.

The result in question requires the original orthogonal matrix to be from the Haar distribution, and such matrices can be generated by A = gallery('qmult',n) or by the construction

[Q,R] = qr(randn(n));
Q = Q*diag(sign(diag(R)));

(See What Is a Random Orthogonal Matrix?.) The result also requires u and v to be unit-norm random vectors with independent entries from the same distribution.

However, as the next example shows, the perturbed singular values can be close to the values that the Benaych-Georges and Nadakuditi result predicts even when the conditions of the result are violated:

n = 100; rng(1)
A = gallery('orthog',n);   % Random orthogonal matrix (not Haar).
x = rand(n,1); y = (1:n)'; % Non-random y.
x = x/norm(x); y = y/norm(y);
B = A + x*y';
svd_B = svd(B);
max_svd_B = max(svd_B), min_svd_B = min(svd_B)
max_svd_B =
min_svd_B =

The question of the conditioning of a rank-1 perturbation of an orthogonal matrix arises in the recent EPrint Random Matrices Generating Large Growth in LU Factorization with Pivoting.

Posted in research | Tagged , | Leave a comment

Which Fountain Pen Ink for Writing Mathematics?

If, like me, you sometimes prefer to write mathematics on paper before typing into LaTeX, you have the choice of pencil or pen as your writing tool. I’ve written before about writing in pencil. My current tools of choice are fountain pens.

A fountain pen, the ink it contains and the paper used are three independent variables that combine in a complicated way to affect the experience of writing and the results produced. Here, I focus solely on inks and ask what ink one should choose depending on different mathematics-focused requirements.

I give links to reviews on the excellent Mountain of Ink and Pen Addict websites and other pen/ink websites. Unless otherwise stated, I have tried the inks myself.


Fast Drying Time

When we write down mathematical working, perhaps in the course of doing research, we are likely to need to go back and make changes to what we have just written. A fast drying ink helps avoid smudges, and it also means we can start writing on the reverse side of a page without pausing to let the ink dry or using blotting paper. Left-handed writers may always need such an ink, depending on their particular style of writing. My favourite fast-drying ink is Noodler’s Bernanke Blue, which is named after Ben Bernanke, former chairman of the Federal Reserve. Downsides are feathering (where the ink spreads out to create rough edges) and bleeding (where ink soaks through on the other side of the page) on lesser grades of paper.

Water Resistance

As mathematicians are major consumers of coffee, what we write risks being spilled on. (Indeed this is so normal an occurrence that Hanno Rein has written a LaTeX package to simulate a coffee stain.) An ink should be reasonably water resistant if spills are not to blur what we have written. Many popular inks have poor water resistance. One of the most water resistant is Noodler’s Bulletproof Black, which also “resists all the known tools of a forger, UV light, UV light wands, bleaches, alcohols, solvent…”. It’s great for writing cheques, as well as mathematics. (But it may not be the blackest black ink, if that matters to you.) Another water resistant ink is Noodler’s Baystate Blue (see below).

Interesting Names

Sometimes, writing with an interestingly named pen, paper or ink can provide inspiration. While many Coloverse inks have a space or physics theme, I am not aware of any mathematically named inks.

Vibrant Inks

Vibrant inks are great for annotating a printout or writing on paper with heavy lines, squares or dots. An outstanding ink in this respect is Noodler’s Baystate Blue, an incredibly intense, bright blue that jumps off the page. The downsides are bleeding and staining. An ink to be used with caution! I also like the much better behaved Pilot Iroshizuku Fuyu-gaki (orange) and Pilot Iroshizuku Kon-peki (blue).


For Marking (Grading)

For marking scripts one obviously needs to use a different colour than the student has used. Traditionally, this is red. A couple of my favourites for marking are Waterman Audacious Red and Cult Pens Deep Dark Orange.

Shading, Sheen and Shimmer

Many inks have one or more of the properties of shading, sheen and shimmer. Shading is where an ink appears darker or lighter in different parts of a stroke; sheen is where an ink shines with a different colour; and shimmer is where particles have been added to the ink that cause it to shimmer or glisten when it catches the light. The strength of all these effects is strongly dependent on the pen, the nib size and the paper, and for shimmer the pen needs to be moved around to distribute the particles before writing. (Sheen and shimmer do not show up well on the scans shown in this post.) Among my favourites are Diamine Majestic Blue (blue with purple sheen), Robert Oster Fire and Ice (teal with shading and pink sheen) and Diamine Cocoa Shimmer (brown with gold shimmer). For “monster sheen”, try Organic Studios Nitrogen Royal Blue, but note that it is hard to clean out of a pen.


Nonstandard Surfaces

Much mathematics is written on bar mats, napkins, paper towels, backs of envelopes, newspapers, tablecloths, and anything that is to hand, especially when mathematicians work together away from their desks. These surfaces are not fountain-pen friendly. One promising ink for such situations is Noodler’s X-Feather, which was created to combat feathering on cheap paper. (I have not tried this ink.)

All Rounders

Finally, three good all-round inks that I’ve used a lot when working on my books and papers are Diamine Asa Blue, Diamine Oxford Blue, and Diamine Red Dragon.


Posted in writing | Leave a comment

What Is IEEE Standard Arithmetic?

The IEEE Standard 754, published in 1985 and revised in 2008 and 2019, is a standard for binary and decimal floating-point arithmetic. The standard for decimal arithmetic (IEEE Standard 854) was separate when it was first published in 1987, but it was included with the binary standard from 2008. We focus here on the binary part of the standard.

The standard specifies floating-point number formats, the results of the basic floating-point operations and comparisons, rounding modes, floating-point exceptions and their handling, and conversion between different arithmetic formats.

A binary floating-point number is represented as

y = \pm m \times 2^{e-t},

where t is the precision and e\in [e_{\min},e_{\max}] is the exponent. The significand m is an integer satisfying m \le 2^t-1. Numbers with m \ge 2^{t-1} are called normalized. Subnormal numbers, for which 0 < m <2^{t-1} and e = e_{\min}, are supported.

Four formats are defined, whose key parameters are summarized in the following table. The second column shows the number of bits allocated to store the significand and the exponent. I use the prefix “fp” instead of the prefix “binary” used in the standard. The unit roundoff is u = 2^{-t}.

ieee_params_table.jpg Fp32 (single precision) and fp64 (double precision) were in the 1985 standard; fp16 (half precision) and fp128 (quadruple precision) were introduced in 2008. Fp16 is defined only as a storage format, though it is widely used for computation.

The size of these different number systems varies greatly. The next table shows the number of normalized and subnormal numbers in each system.


We see that while one can easily carry out a computation on every fp16 number (to check that the square root function is correctly computed, for example), it is impractical to do so for every double precision number.

A key feature of the standard is that it is a closed system, thanks to the inclusion of NaN (Not a Number) and \infty (usually written as inf in programing languages) as floating-point numbers: every arithmetic operation produces a number in the system. A NaN is generated by operations such as

0/0, \quad 0 \times \infty, \quad \infty/\infty, \quad (+\infty) + (-\infty), \quad \sqrt{-1}.

Arithmetic operations involving a NaN return a NaN as the answer. The number \infty obeys the usual mathematical conventions regarding infinity, such as

\infty+\infty = \infty, \quad (-1)\times\infty = -\infty, \quad   ({\textrm{finite}})/\infty = 0.

This means, for example, that 1 + 1/x evaluates as 1 when x = \infty.

The standard specifies that all arithmetic operations (including square root) are to be performed as if they were first calculated to infinite precision and then rounded to the target format. A number is rounded to the next larger or next smaller floating-point number according to one of four rounding modes:

  • round to the nearest floating-point number, with rounding to even (rounding to the number with a zero least significant bit) in the case of a tie;
  • round towards plus infinity and round towards minus infinity (used in interval arithmetic); and
  • round towards zero (truncation, or chopping).

For round to nearest it follows that

\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y)     = (x \mathbin{\mathrm{op}} y)(1+\delta),     \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\},

where \mathrm{f\kern.2ptl} denotes the computed result. The standard also includes a fused multiply-add operation (FMA), x*y + z. The definition requires it to be computed with just one rounding error, so that \mathrm{f\kern.2ptl}(x*y+z) is the rounded version of x*y+z, and hence satisfies

\mathrm{f\kern.2ptl}(x*y + z) = (x*y + z)(1+\delta), \quad |\delta|\le u.

FMAs are supported in some hardware and are usually executed at the same speed as a single addition or multiplication.

The standard recommends the provision of correctly rounded exponentiation (x^y) and transcendental functions (\exp, \log, \sin, \mathrm{acos}, etc.) and defines domains and special values for them, but these functions are not required.

A new feature of the 2019 standard is augmented arithmetic operations, which compute \mathrm{fl}(x \mathbin{\mathrm{op}}y) along with the error x\mathbin{\mathrm{op}}y - \mathrm{fl}(x \mathbin{\mathrm{op}}y), for \mathbin{\mathrm{op}} = +,-,*. These operations are useful for implementing compensated summation and other special high accuracy algorithms.

William (“Velvel”) Kahan of the University of California at Berkeley received the 1989 ACM Turing Award for his contributions to computer architecture and numerical analysis, and in particular for his work on IEEE floating-point arithmetic standards 754 and 854.


This is a minimal set of references, which contain further useful references within.

Posted in what-is | Tagged | Leave a comment