SIAM CSE21 Minisymposium on Reduced Precision Arithmetic and Stochastic Rounding

This minisymposium took place at the SIAM Conference on Computational Science and Engineering, March 2, 2021. This page makes available slides from some of the talks.

Minisymposium description: Reduced precision floating-point arithmetic, such as IEEE half precision and bfloat16, is increasingly available in hardware. Low precision computations promise major increases in speed and reductions in data communication costs, but they also bring an increased risk of overflow, underflow, and loss of accuracy. One way to improve the results of low precision computations is to use stochastic rounding instead of round to nearest, and this is proving popular in machine learning. This minisymposium will discuss recent advances in exploitation and analysis of reduced precision arithmetic and stochastic rounding.

Algorithms for Stochastically Rounded Elementary Arithmetic Operations in IEEE 754 Floating-Point Arithmetic Massimiliano Fasi, Örebro University, Sweden; Mantas Mikaitis, University of Manchester, United Kingdom. Abstract. Slides.

Reduced Precision Elementary Functions Jean-Michel Muller, ENS Lyon, France. Abstract. Slides.

Effect of Reduced Precision and Stochastic Rounding in the Numerical Solution of Parabolic Equations Matteo Croci and Michael B. Giles, University of Oxford, United Kingdom. Abstract. Slides.

Stochastic Rounding and its Probabilistic Backward Error Analysis Michael P. Connolly and Nicholas J. Higham, University of Manchester, United Kingdom; Theo Mary, Sorbonne Universités and CNRS, France. Abstract. Slides.

Stochastic Rounding in Weather and Climate Models Milan Kloewer, Edmund Paxton, and Matthew Chantry, University of Oxford, United Kingdom Abstract. Slides.

Matrix Rank Relations

Matrix rank is an important concept in linear algebra. While rank deficiency can be a sign of an incompletely or improperly specified problem (a singular system of linear equations, for example), in some problems low rank of a matrix is a desired property or outcome. Here we present some fundamental rank relations in a concise form useful for reference. These are all immediate consequences of the singular value decomposition (SVD), but we give elementary (albeit not entirely self-contained) proofs of them.

The rank of a matrix A\in\mathbb{R}^{m\times n} is the maximum number of linearly independent columns, which is the dimension of the range space of A, \mathrm{range}(A) = \{\, Ax: x \in\mathbb{R}^n \,\}. An important but non-obvious fact is that this is the same as the maximum number of linearly independent rows (see (5) below).

A rank-1 matrix has the form xy^*, where x and y are nonzero vectors. Every column is a multiple of x and every row is a multiple of y^*. A sum of k rank-1 matrices has the form

\notag    A = \displaystyle\sum_{i=1}^{k} x_iy_i^*      =  \begin{bmatrix} x_1 & x_2 & \dots & x_k \end{bmatrix}         \begin{bmatrix} y_1^* \\ y_2^* \\ \vdots\\ y_k^* \end{bmatrix}      \equiv XY^*.         \qquad (0)

Each column of A is a linear combination of the vectors x_1, x_2, …, x_k, so A has at most k linearly independent columns, that is, A has rank at most k. In fact, \mathrm{rank}(A) = k if X and Y have rank k, as follows from (4) below. Any rank-k matrix can be written in the form (0) with X and Y of rank k; indeed this is the full-rank factorization below.

Here are some fundamental rank equalities and inequalities.

Rank-Nullity Theorem

The rank-nullity theorem says that

\notag    \boxed{ \mathrm{rank}(A) +  \mathrm{dim}( \mathrm{null}(A) ) = n,     \quad A\in\mathbb{R}^{m\times n},}

where \mathrm{null}(A) = \{\, x \in\mathbb{R}^n: Ax = 0 \,\} is the null space of A.

Rank Bound

The rank cannot exceed the number of columns, or, by (5) below, the number of rows:

\notag    \boxed{ \mathrm{rank}(A) \le \min(m,n), \quad A\in\mathbb{C}^{m\times n}. }

Rank of a Sum

For any A and B of the same dimension,

\notag     \boxed{|\mathrm{rank}(A) - \mathrm{rank}(B)| \le    \mathrm{rank}(A+B) \le \mathrm{rank}(A) + \mathrm{rank}(B).} \qquad (1)

The upper bound follows from the fact that the dimension of the sum of two subspaces cannot exceed the sum of the dimensions of the subspaces. Interestingly, the upper bound is also a corollary of the bound (3) for the rank of a matrix product, because

\notag   \begin{aligned}    \mathrm{rank}(A+B) &= \mathrm{rank}\biggl(  \begin{bmatrix} A & B  \end{bmatrix}  \begin{bmatrix} I \\ I  \end{bmatrix} \biggr)\\  &\le  \min\biggl(\mathrm{rank}\bigl(\begin{bmatrix} A & B  \end{bmatrix}\bigr),  \mathrm{rank}\biggl(\begin{bmatrix} I \\ I  \end{bmatrix} \biggr)\biggr)\\  &\le \mathrm{rank}\bigl(\begin{bmatrix} A & B  \end{bmatrix}\bigr)\\   &\le \mathrm{rank}(A) + \mathrm{rank}(B).   \end{aligned}

For the lower bound, writing A = -B + A+B and applying the upper bound gives \mathrm{rank}(A) \le \mathrm{rank}(-B) + \mathrm{rank}(A+B) = \mathrm{rank}(B) + \mathrm{rank}(A+B), and likewise with the roles of A and B interchanged.

Rank of A and A^*A

For any A,

\notag    \boxed{\mathrm{rank}(A^*A) = \mathrm{rank}(A).}  \qquad (2)

Indeed Ax = 0 implies A^*Ax = 0, and A^*Ax = 0 implies 0 = x^*A^*Ax = (Ax)^*(Ax), which implies Ax = 0. Hence the null spaces of A and A^*A are the same. The equality (2) follows from the rank-nullity theorem.

Rank of a General Product

For any A and B for which the product AB is defined,

\notag   \boxed{\mathrm{rank}(AB) \le \min\bigl( \mathrm{rank}(A), \mathrm{rank}(B) \bigr).}     \qquad (3)

If B = [b_1,\dots,b_n] then AB = [Ab_1,\dots,Ab_n], so the columns of AB are linear combinations of those of A and so AB cannot have more linearly independent columns than A, that is, \mathrm{rank}(AB) \le \mathrm{rank}(A). Using (5) below, we then have

\notag  \mathrm{rank}(AB) = \mathrm{rank}(B^*A^*)   \le \mathrm{rank}(B^*) = \mathrm{rank}(B).

The latter inequality can be proved without using (5) (our proof of which uses (3)), as follows. Suppose \mathrm{rank}(B) < \mathrm{rank}(AB) = r. Let the columns of Y span \mathrm{range}(AB), so that Y has r columns and Y = ABZ for some matrix Z with r columns. Now \mathrm{rank}(BZ) \le \mathrm{rank}(B) < r by the first part, so BZg = 0 for some nonzero g. But then Yg = ABZg = 0, which contradicts the linear independence of the columns of Y, so we must have \mathrm{rank}(B) \ge \mathrm{rank}(AB).

Rank of a Product of Full-Rank Matrices

We have

\notag    \boxed{  \mathrm{rank}(AB) = r, \quad     A\in\mathbb{C}^{m\times r}, \;     B\in\mathbb{C}^{r\times n}, \;     \mathrm{rank}(A) = \mathrm{rank}(B) = r .}     \qquad (4)

We note that A^*A and BB^* are both nonsingular r\times r matrices by (2), so their product has rank r. Using (3),

\notag   r = \mathrm{rank}(A^*A BB^*)     \le \mathrm{rank}(A B) \le r,

and hence \mathrm{rank}(A B) = r.

Another important relation is

\notag   \boxed{ \mathrm{rank}(XAY ) = \mathrm{rank}(A), \quad    X\in\mathbb{C}^{m\times m} \;\mathrm{and}\;    Y\in\mathbb{C}^{n\times n}\; \mathrm{nonsingular}. }

This is a consequence of the equality \mathrm{range}(XAY) = X\mathrm{range}(A)Y for nonsingular X and Y.

Ranks of A and A^*

By (2) and (3) we have \mathrm{rank}(A) = \mathrm{rank}(A^*A) \le \mathrm{rank}(A^*). Interchanging the roles of A and A^* gives \mathrm{rank}(A^*) \le \mathrm{rank}(A) and so

\notag   \boxed{ \mathrm{rank}(A^*) = \mathrm{rank}(A). } \qquad (5)

In other words, the rank of A is equal to the maximum number of linearly independent rows as well as the maximum number of linearly independent columns.

Full-Rank Factorization

A\in\mathbb{C}^{m \times n} has rank r if and only if A = GH for some G\in\mathbb{C}^{m \times r} and H\in\mathbb{C}^{r \times n}, both of rank r, and this is called a full-rank factorization. The existence of such a factorization implies that \mathrm{rank}(A) = r by (4). Conversely, suppose that A has rank r. Let the columns of X\in\mathbb{C}^{m \times r} form a basis for the range space of A. Then there are r-vectors y_j such that a_j = Xy_j, j = 1\colon n, and with Y = [y_1,y_2,\dots, y_n] we have A = XY. Finally, r = \mathrm{rank}(A)     = \mathrm{rank}(XY)     \le \mathrm{rank}(Y) by (3), and since \mathrm{rank}(Y) \le r we have \mathrm{rank}(Y) = r.

Rank and Minors

A characterization of rank that is sometimes used as the definition is that it is the size of the largest nonsingular square submatrix. Equivalently, the rank is the size of the largest nonzero minor, where a minor of size k is the determinant of a k\times k submatrix.

rank(AB) and rank(BA)

Although AB and BA have some properties in common when both products are defined (notably they have the same nonzero eigenvalues), \mathrm{rank}(AB) is not always equal to \mathrm{rank}(BA). A simple example is A = x and B = y^* with x and y orthogonal vectors: AB = xy^* but BA = y^*x = 0. An example with square A and B is

\notag   \begin{gathered}   A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad   B = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \\    \mathrm{rank}(AB) =    \mathrm{rank}\biggl( \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}    \biggr) = 0, \quad    \mathrm{rank}(BA) =    \mathrm{rank}\biggl( \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}    \biggr) = 1.   \end{gathered}

Note that A = e_1e_2^T and B = e_1e_1^T, where e_i has 1 in the ith position and zeros everywhere else. Such matrices are easy to manipulate in this form (e.g., AB = e_1 (e_2^Te_1)e_1^T = 0) and are useful for constructing examples.

How to Find Rank

If we have a full-rank factorization of A then we can read off the rank from the dimensions of the factors. But finding a full-rank factorization is a nontrivial task. The ultimate full-rank factorization is the SVD

\notag     A = U\Sigma V^T,

where U\in\mathbb{R}^{m\times m} and V\in\mathbb{R}^{n\times n} are orthogonal, \Sigma = \mathrm{diag}(\sigma_1,\dots, \sigma_p)\in\mathbb{R}^{m\times n}, where p = \min(m,n), and \sigma_1\ge \sigma_2\ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_p = 0. The rank of A is r, the number of nonzero singular values.

In floating-point arithmetic, the standard algorithms for computing the SVD are numerically stable, that is, the computed singular values are the exact singular values of a matrix A + \Delta A with \|\Delta A\|_2 \le c_{m,n}u\|A\|, where c_{m,n} is a constant and u is the unit roundoff. Unfortunately, A + \Delta A will typically be full rank when A is rank deficient. For example, consider this computation.

>> n = 4; A = zeros(n); A(:) = 1:n^2, svd(A)
A =
     1     5     9    13
     2     6    10    14
     3     7    11    15
     4     8    12    16
ans =

The matrix has rank 2 and the two zero singular values are approximated by computed singular values of order 10^{-15}. In general, we have no way to know whether tiny computed singular values signify exactly zero singular values. In practice, one typically defines a numerical rank based on a threshold and regards computed singular values less than the threshold as zero. Indeed the MATLAB rank function computes the rank as the number of singular values exceeding 2u \max(m,n)\widehat{\sigma}_1, where \widehat{\sigma}_1 is the largest computed singular value. If the data from which the matrix is constructed is uncertain then the definition of numerical rank should take into account the level of uncertainty in the data. Dealing with rank deficiency in the presence of data errors and in finite precision arithmetic is a tricky business.


An excellent reference for further rank relations is Horn and Johnson. Stewart describes some of the issues associated with rank-deficient matrices in practical computation.

Diagonally Perturbing a Symmetric Matrix to Make It Positive Definite

Suppose A is a matrix that is symmetric but not positive definite. What is the best way to perturb the diagonal to make A positive definite? We want to compute a vector d such that

\notag     A(d) = A + D, \quad D = \mathrm{diag}(d),

is positive definite. Since the positive definite matrices form an open set there is no minimal d, so we relax the requirement to be that A is positive semidefinite. The perturbation D needs to make any negative eigenvalues of A become nonnegative. We will require all the entries of d to be nonnegative. Denote the eigenvalues of A by \lambda_n(A) \le \lambda_{n-1}(A) \le \cdots \le \lambda_1(A) and assume that \lambda_n(A) < 0.

A natural choice is to take D to be a multiple of the identity matrix. For d_i \equiv \delta, A(d) has eigenvalues \lambda_i + \delta, and so the smallest possible \delta is \delta = - \lambda_n(A). This choice of D shifts all the diagonal elements by the same amount, which might be undesirable for a matrix with widely varying diagonal elements.

When the diagonal entries of A are positive another possibility is to take d_i = \alpha a_{ii}, so that each diagonal entry undergoes a relative perturbation of size \alpha. Write D_A = \mathrm{diag}(a_{ii}) and note that C = D_A^{-1/2}A D_A^{-1/2} is symmetric with unit diagonal. Then

\notag     A + \alpha D_A = D_A^{1/2}(C + \alpha I)D_A^{1/2}.

Since A + \alpha D_A is positive semidefinite if and only if C + \alpha I is positive semidefinite, the smallest possible \alpha is \alpha = -\lambda_n(C).

More generally, we can treat the d_i as n independent variables and ask for the solution of the optimization problem

\notag    \min \|d\| ~~\mathrm{subject~to}~~ A + \mathrm{diag}(d)               ~\mathrm{positive~semidefinite}, ~d \ge 0.   \qquad(\dagger)

Of particular interest are the norms \|d\|_1 = \sum_i d_i = \mathrm{trace}(D) (since d\ge0) and \|d\|_\infty = \max_i d_i.

If A+D is positive semidefinite then from standard eigenvalue inequalities,

\notag       0 \le \lambda_n(A+D) \le  \lambda_n(A) +  \lambda_1(D),

so that

\notag       \max_i d_i \ge -\lambda_n(A).

Since d_i \equiv -\lambda_n(A) satisfies the constraints of (\dagger), this means that this d solves (\dagger) for the \infty-norm, though the solution is obviously not unique in general.

For the 1– and 2-norms, (\dagger) does not have an explicit solution, but it can be solved by semidefinite programming techniques.

Another approach to finding a suitable D is to compute a modified Cholesky factorization. Given a symmetric A, such a method computes a perturbation E such that A + E = R^TR for an upper triangular R with positive diagonal elements, so that A + E is positive definite. The methods of Gill, Murray, and Wright (1981) and Schnabel and Eskow (1990) compute a diagonal E. The cost in flops is essentially the same as that of computing a Cholesky factorization (n^3/3) flops), so this approach is likely to require fewer flops than computing the minimum eigenvalue or solving an optimization problem, but the perturbations produces will not be optimal.


We take the 5\times 5 Fiedler matrix

>> A = gallery('fiedler',5)
A =
     0     1     2     3     4
     1     0     1     2     3
     2     1     0     1     2
     3     2     1     0     1
     4     3     2     1     0

The smallest eigenvalue is -5.2361, so A+D is positive semidefinite for D_1 = 5.2361I. The Gill–Murray–Wright method gives D_{\mathrm{GMW}} with diagonal elements

2.0000e+01   4.6159e+00   1.3194e+00   2.5327e+00   1.0600e+01

and has \lambda_5(A+D_{\mathrm{GMW}}) = 0.5196 while the Schnabel–Eskow method gives D_{\mathrm{SE}} with diagonal elements

6     6     6     6     6

and has \lambda_5(A+D_{\mathrm{SE}}) = 0.7639. If we increase the diagonal elements of D_1 by 0.5 to give comparable smallest eigenvalues for the perturbed matrices then we have

  \Vert d\Vert_{\infty} \Vert d \Vert_1
Shift 5.2361 26.180
Gill-Murray-Wright 20.000 39.068
Schnabel–Eskow 6.000 30.000


When Does Thresholding Preserve Positive Definiteness?

Does a symmetric positive definite matrix remain positive definite when we set one or more elements to zero? This question arises in thresholding, in which elements of absolute value less than some tolerance are set to zero. Thresholding is used in some applications to remove correlations thought to be spurious, so that only statistically significant ones are retained.

We will focus on the case where just one element is changed and consider an arbitrary target value rather than zero. Given an n\times n symmetric positive definite matrix A we define A(t) to be the matrix resulting from adding t to the (i,j) and (j,i) elements and we ask when is A(t) positive definite. We can write

\notag   A(t) = A + t(e_i^{}e_j^T + e_j^{}e_i^T) \equiv A + tE_{ij},

where e_i is the ith column of the identity matrix. The perturbation E_{ij} has rank 2, with eigenvalues -1, 1, and 0 repeated n-2 times. Hence we can write E_{ij} in the form E_{ij} = pp^T - qq^T, where p^Tp = q^Tq = 1 and p^Tq = 0. Adding pp^T to A causes each eigenvalue to increase or stay the same, while subtracting qq^T decreases or leaves unchanged each eigenvalue. However, more is true: after each of these rank-1 perturbations the eigenvalues of the original and perturbed matrices interlace, by Weyl’s theorem. Hence, with the eigenvalues of A ordered as \lambda_n(A) \le \cdots \le \lambda_1(A), we have (Horn and Johnson, Cor. 4.3.7)

\notag \begin{aligned}   \lambda_n(A(t)) &\le \lambda_{n-1}(A), \\   \lambda_{i+1}(A) &\le \lambda_i(A(t)) \le \lambda_{i-1}(A),    \quad i = 2\colon n-1, \\   \lambda_2(A) &\le \lambda_1(A(t)). \end{aligned}

Because A is positive definite these inequalities imply that \lambda_{n-1}(A(t)) \ge \lambda_n(A) > 0, so A(t) has at most one negative eigenvalue. Since \det(A(t)) is the product of the eigenvalues of A(t) this means that A(t) is positive definite precisely when \det(A(t)) > 0.

There is a simple expression for \det(A(t)), which follows from a lemma of Chan (1984), as explained by Georgescu, Higham, and Peters (2018):

\notag  \det(A(t)) = \det(A)\big(1+ 2t b_{ij} + t^2(b_{ij}^2-b_{ii}b_{jj})\big),

where B = A^{-1}. Hence the condition for A(t) to be positive definite is

\notag  q_{ij}(t) = 1 + 2t b_{ij} + t^2(b_{ij}^2-b_{ii}b_{jj}) > 0.

We can factorize

\notag     q_{ij}(t) = \Bigl( t\bigl(b_{ij}  - \sqrt{b_{ii}b_{jj}}\bigr) + 1 \Bigr)                 \Bigl( t\bigl(b_{ij}  + \sqrt{b_{ii}b_{jj}}\bigr) + 1 \Bigr),

so q_{ij}(t) > 0 for

\notag    t\in \left( \displaystyle\frac{-1}{ \sqrt{b_{ii}b_{jj}} + b_{ij} },                 \displaystyle\frac{1}{ \sqrt{b_{ii}b_{jj}} - b_{ij} } \right) =: I_{ij},

where the endpoints are finite because B, like A, is positive definite and so |b_{ij}| < \sqrt{b_{ii}b_{jj}}.

The condition for A to remain positive definite when a_{ij} is set to zero is q_{ij}(-a_{ij}) > 0, or equivalently -a_{ij} \in I_{ij}. To check either of these conditions we need just b_{ij}, b_{ii}, and b_{jj}. These elements can be computed without computing the whole inverse by solving the equations Ab_k = e_k for k = i,j, for the kth column b_k of B, making use of a Cholesky factorization of A.

As an example, we consider the 4\times 4 Lehmer matrix, which has (i,j) element i/j for i \ge j:

\notag   A = \begin{bmatrix}         1           & \frac{1}{2}  & \frac{1}{3} & \frac{1}{4} \\[3pt]         \frac{1}{2} &           1  & \frac{2}{3} & \frac{1}{2} \\[3pt]         \frac{1}{3} &  \frac{2}{3} & 1           & \frac{3}{4} \\[3pt]         \frac{1}{4} &  \frac{1}{2} & \frac{3}{4} &  1         \end{bmatrix}.

The smallest eigenvalue of A is 0.208. Any off-diagonal element except the (2,4) element can be zeroed without destroying positive definiteness, and if the (2,4) element is zeroed then the new matrix has smallest eigenvalue -0.0249. For i=2 and j=4, the following plot shows in red \lambda_{\min}(A(t)) and in blue q_{24}(t); the black dots are the endpoints of the closure of the interval I_{24} = (-0.453,0.453) and the vertical black line is the value -a_{24}. Clearly, -a_{24} lies outside I_{24}, which is why zeroing this element causes a loss of positive definiteness. Note that I_{24} also tells us that we can increase a_{24} to any number less than 0.953 without losing definiteness.


Given a positive definite matrix and a set S of elements to be modified we may wish to determine subsets (including a maximal subset) of S for which the modifications preserve definiteness. Efficiently determining these subsets appears to be an open problem.

In practical applications thresholding may lead to an indefinite matrix. Definiteness must then be restored to obtain a valid correlation matrix. One way to do this is to find the nearest correlation matrix in the Frobenius norm such that the zeroed elements remain zero. This can be done by the alternating projections method with a projection to keep the zeroed elements fixed. Since the nearest correlation matrix is positive semidefinite, it is also desirable to to incorporate a lower bound \delta > 0 on the smallest eigenvalue, which corresponds to another projection. Both these projections are supported in the algorithm of Higham and Strabić (2016), implemented in the code at For the Lehmer matrix, the nearest correlation matrix with zero (2,4) element and eigenvalues at least \delta = 0.01 is (to four significant figures)

\notag   \begin{bmatrix}    1       &    0.4946  &    0.3403  &    0.2445  \\    0.4946  &    1       &    0.6439  &    0       \\    0.3403  &    0.6439  &    1       &    0.7266  \\    0.2445  &    0       &    0.7266  &    1    \end{bmatrix}.

A related question is for what patterns of elements that are set to zero is positive definiteness guaranteed to be preserved for all positive definite A? Clearly, setting all the off-diagonal elements to zero preserves definiteness, since the diagonal of a positive definite matrix is positive. Guillot and Rajaratnam (2012) show that the answer to the question is that the new matrix must be a symmetric permutation of a block diagonal matrix. However, for particular A this restriction does not necessarily hold, as the Lehmer matrix example shows.


What Is a Unitarily Invariant Norm?

A norm on \mathbb{C}^{m \times n} is unitarily invariant if \|UAV\| = \|A\| for all unitary U\in\mathbb{C}^{m \times m} and V\in\mathbb{C}^{n\times n} and for all A\in\mathbb{C}^{m \times n}. One can restrict the definition to real matrices, though the term unitarily invariant is still typically used.

Two widely used matrix norms are unitarily invariant: the 2-norm and the Frobenius norm. The unitary invariance follows from the definitions. For the 2-norm, for any unitary U and V, using the fact that \|Uz\|_2 = \|z\|_2, we obtain

\notag \begin{aligned}    \|UAV\|_2 &= \max_{x\ne0} \frac{\|UAVx\|_2}{\|x\|_2}              = \max_{x\ne0} \frac{\|AVx\|_2}{\|x\|_2}\\              &= \max_{x\ne0} \frac{\|Ay\|_2}{\|V^*y\|_2} \quad(y = Vx)\\              &= \max_{y\ne0} \frac{\|Ay\|_2}{\|y\|_2}  = \|A\|_2. \end{aligned}

For the Frobenius norm, using \|A\|_F^2 = \mathrm{trace}(A^*A),

\notag \begin{aligned}    \|UAV\|_F^2 &= \mathrm{trace}(V^*A^*U^* \cdot UAV)\\                &= \mathrm{trace}(V^*AAV)\\                &= \mathrm{trace}(A^*A) = \|A\|_F^2, \end{aligned}

since the trace is invariant under similarity transformations.

More insight into unitarily invariant norms comes from recognizing a connection with the singular value decomposition

\notag   A = P\Sigma Q^*, \quad P^*P = I_m, \quad Q^*Q = I_n, \quad                 \Sigma = \mathrm{diag}(\sigma_i), \quad     \sigma_1 \ge \cdots \ge \sigma_q \ge 0.

Clearly, \|A\| = \|\Sigma\|, so \|A\| depends only on the singular values. Indeed, for the 2-norm and the Frobenius norm we have \|A\|_2 = \sigma_1 and \|A\|_F = (\sum_{i=1}^q \sigma_i^2)^{1/2}. Here, and throughout this article, q = \min(m,n). Another implication of the singular value dependence is that \|A\| = \|A^*\| for all A for any unitarily invariant norm.

There is a beautiful characterization of unitarily invariant norms in terms of symmetric gauge functions, which are functions f:\mathbb{R}^q\to\mathbb{R}^q such that f is an absolute norm on \mathbb{R}^q and f(Px) = f(x) for any permutation matrix P and all x. An absolute norm is one with the property that \|x\| = \||x|\| for all x, and this condition is equivalent to the monotonicity condition that |x| \le |y| implies \|x\| \le \|y\| for all x and y.


A norm on \mathbb{C}^{m \times n} is unitarily invariant if and only if \|A\| = f(\sigma_1, \dots, \sigma_q) for all A for some symmetric gauge function f, where \sigma_1, \dots, \sigma_q are the singular values of A.

The matrix 2-norm and the Frobenius norm correspond to f being the vector \infty-norm and the 2-norm, respectively. More generally, we can take for f any vector p-norm, obtaining the class of Schatten p-norms:

\notag   \|A\|_p^S  = \biggl(\displaystyle\sum_{i=1}^q \sigma_i^p\biggr)^{1/p}, \quad 1\le    p\le\infty.

The Schatten 1-norm is the sum of the singular values, \|A\|_1^S = \sum_{i=1}^q \sigma_i, which is called the trace norm or nuclear norm. It can act as a proxy for the rank of a matrix. The trace norm can be expressed as \|A\|_1^S = \mathrm{trace}(H)            = \mathrm{trace}\bigl((A^*A)^{1/2}\bigr), where A = UH is a polar decomposition.

Another class of unitarily invariant norms is the Ky Fan k-norms

\notag   \|A\|_{(k)} = \displaystyle\sum_{i=1}^k \sigma_i, \quad 1\le k \le n.

We have \|A\|_{(n)} = \|A\|_1^S and \|A\|_{(1)} = \|A\|_2.

Among unitarily invariant norms, the 2-norm and the Frobenius norm are widely usd in numerical analysis and matrix analysis. The nuclear norm is used in problems involving matrix rank minimization, such as matrix completion problems.

The benefit of the concept of unitarily invariant norm is that one can prove certain results for this whole class of norms, obtaining results for the particular norms of interest as special cases. Here are three important examples.

  • For A\in\mathbb{C}^{n\times n}, the matrix (A+A^*)/2 is the nearest Hermitian matrix to A in any unitarily invariant norm.
  • For A\in\mathbb{C}^{m\times n} (m\ge n), the unitary polar factor is the nearest matrix with orthonormal columns to A in any unitarily invariant norm.
  • The best rank-k approximation to A\in\mathbb{C}^{m\times n} in any unitarily invariant norm is obtained by setting all the singular values beyond the kth to zero in the SVD of A. This result, which generalizes the Eckart-Young theorem, which covers the 2- and Frobenius norm instances, is an easy consequence of the following result of Mirsky (1960).


    Let A,B\in\mathbb{C}^{m\times n} have SVDs with diagonal matrices \Sigma_A, \Sigma_B \in\mathbb{R}^{m\times n}, where the diagonal elements are arranged in nonincreasing order. Then \|A-B\| \ge \|\Sigma_A-\Sigma_B\| for every unitarily invariant norm.

We also give a useful matrix norm inequality. For any matrices A, B, and C for which the product ABC is defined,

\notag     \|ABC\| \le \|A\|_2 \mskip2mu \|B\| \mskip2mu \|C\|_2

holds for any unitarily invariant norm, and in fact, any two of the norms on the right-hand side can be 2-norms.


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from and in PDF form from the GitHub repository

What Is the Nearest Positive Semidefinite Matrix?

Given a symmetric matrix and a nonnegative number \delta, what is the nearest symmetric matrix whose eigenvalues are all at least \delta? In other words, how can we project a symmetric matrix onto the set of symmetric positive semidefinite matrices with eigenvalues at least \delta?

Distance can be measured in any norm, but the most common choice is the Frobenius norm, \|A\|_F = (\sum_{i,j = 1}^n |a_{ij}|^2)^{1/2}.

This problem occurs in a very wide range of applications. A typical scenario is that a covariance matrix is approximated in some way that does not guarantee positive semidefinitess, for example by treating blocks of the matrix independently. In machine learning, some methods use indefinite kernels and these can require an indefinite similarity matrix to be replaced by a semidefinite one. When \delta > 0, the problem arises when the matrix is positive definite but ill conditioned and a matrix of smaller condition number is required, or when rounding errors result in small negative eigenvalues and a “safely positive definite” matrix is wanted.

The following theorem gives the solution to the problem for the Frobenius norm.

Theorem (Cheng and Higham, 1998).

Let the symmetric matrix A\in\mathbb{R}^{n \times n} have the spectral decomposition A = Q \mskip1mu\mathrm{diag}(\lambda_i) \mskip1muQ^T and let \delta \ge 0. The unique matrix with smallest eigenvalue at least \delta nearest to A in the Frobenius norm is given by

\notag        X = Q \mskip1mu \mathrm{diag}(\tau_i) \mskip1mu Q^T, \quad  \tau_i = \begin{cases} \lambda_i, & \lambda_i \geq \delta\\                         \delta, & \lambda_i < \delta.           \end{cases}

The theorem says that there is a unique nearest matrix and that is has the same eigenvectors as A with any eigenvalues less than \delta increased to \delta. When \delta = 0, another way to express this result is in terms of the polar decomposition A = UH, where U is orthogonal and H is symmetric positive definite (with H = Q \mskip1mu\mathrm{diag}(|\lambda_i|)Q^T, since A is symmetric): X = (A + H)/2.

One can pose the same nearness problem for nonsymmetric A. For the Frobenius norm there is again a unique nearest symmetric matrix whose eigenvalues are all at least \delta, and it is given by applying the theorem to the symmetric part B of A (which is the nearest symmetric matrix to A), where

\notag   A = \frac{1}{2} (A + A^T ) + \frac{1}{2} (A - A^T )  \equiv B + C,    \quad(B = B^T, ~C = -C^T).

For the 2-norm the answer is known for nonsymmetric A only for \delta = 0, so we take \delta = 0 from this point on. We write X \ge 0 to denote that the symmetric matrix X is positive semidefinite and use a superscript 1/2 to denote the positive semidefinite square root.

Theorem (Halmos, 1972).

The 2-norm distance from A\in\mathbb{R}^{n \times n} to the nearest positive semidefinite matrix is

\notag   d_2(A) = \min\{\, r \ge 0: r^2I + C^2 \ge 0 ~\mathrm{and}~ G(r) \ge 0 \,\},


\notag    G(r) = B + \bigl(r^2I + C^2\bigr)^{1/2},

and a nearest matrix is

\notag   X = B + (d_2(A)^2I + C^2)^{1/2}.

When A is symmetric, so that C = 0, the nearest matrix and the distance reduce to

\notag   X = B + d_2(A)I, \quad d_2(A) = \max\{\, |\lambda_i| : \lambda_i < 0 \,\}.

Clearly, X is not, in general, unique because when A is symmetric, instead of shifting B by a multiple of I we could just perturb the negative eigenvalue to zero, as with the optimal Frobenius norm perturbation, without changing the 2-norm of the perturbation. A distinguishing feature of X in Halmos’s theorem is that X-P \ge 0 for any other nearest positive semidefinite matrix P (Bouldin, 1973, Theorem 4.2); thus X has the minimum number of zero eigenvalues over all nearest positive semidefinite matrices.

The Frobenius norm solution X_F is an approximate minimizer of the 2-norm distance since (Higham, 1988)

\notag   d_2(A) \le \| A - X_F \|_F \le 2 d_2(A).

Halmos’s theorem simplifies the computation of d_2(A) because it reduces the minimization problem to one dimension. However, the problem is nonlinear and has no closed form solution for general A, so iterative methods are needed to compute d_2(A). Two algorithms are developed in Higham (1988). Both are based on the following properties of the matrix G(r) in the theorem: \lambda_{\min}(G(r)) is monotone increasing on [\,\rho(C), \infty), and either \lambda_{\min}(G(\rho(C))) \ge 0, in which case d_2(A) = \rho(C), or \lambda_{\min}(G(r)) has a unique zero r^* = d_2(A) > \rho(C). The first algorithm uses the bisection method, determining the sign of \lambda_{\min}(G(r)) by attempting a Cholesky factorization of G(r): the sign is nonnegative if the factorization exists and negative otherwise. This approach is attractive if d_2(A) is required only to low accuracy. For higher accuracy computations it is better to apply a more rapidly converging zero finder to f(r) = \lambda_{\min} (G(r)) = 0. A hybrid Newton–bisection method is used in Higham (1988).


For a numerical example (in MATLAB), we take the Jordan block

\notag  A =  \begin{bmatrix}  0  & 1  & 0  & 0  & 0  \\  0  & 0  & 1  & 0  & 0  \\  0  & 0  & 0  & 1  & 0  \\  0  & 0  & 0  & 0  & 1  \\  0  & 0  & 0  & 0  & 0  \end{bmatrix}.

The symmetric part of A, B = (A+A^T)/2, has two negative eigenvalues and a zero eigemnvalue:

>> A = gallery('jordbloc',5,0); As = sym(A); eig_B = eig((As+As')/2)'
eig_B =
[-1/2, 0, 1/2, -3^(1/2)/2, 3^(1/2)/2]
>> double(eig_B)
ans =
  -5.0000e-01            0   5.0000e-01  -8.6603e-01   8.6603e-01

The nearest symmetric positive semidefinite matrix to A in the Frobenius norm can be computed in MATLAB as

B = (A + A')/2; [Q,D] = eig(B); d = diag(D); X_F = Q*diag(max(d,0))*Q';

We can improve this code by using the implicit expansion feature of MATLAB to avoid forming a diagonal matrix. Since the computed result is not exactly symmetric because of rounding errors, we also need to replace it by the nearest symmetric matrix:

[Q,d] = eig(B,'vector'); X_F = Q*(max(d,0).*Q'); X_F = (X_F + X_F')/2;

We obtain

X_F = 
 1.972e-01   2.500e-01   1.443e-01  -4.163e-17  -5.283e-02  
 2.500e-01   3.415e-01   2.500e-01   9.151e-02  -1.665e-16  
 1.443e-01   2.500e-01   2.887e-01   2.500e-01   1.443e-01  
-4.163e-17   9.151e-02   2.500e-01   3.415e-01   2.500e-01  
-5.283e-02  -1.665e-16   1.443e-01   2.500e-01   1.972e-01

which has eigenvalues

-3.517e-17   7.246e-17   9.519e-17   5.000e-01   8.660e-01  

Notice that the three eigenvalues that would be zero in exact arithmetic are of order the unit roundoff. A nearest matrix in the 2-norm, as given by Halmos’s theorem. is

X_2 = 
 8.336e-01   5.000e-01   1.711e-01   0.000e+00  -1.756e-02  
 5.000e-01   6.625e-01   5.000e-01   1.887e-01   0.000e+00  
 1.711e-01   5.000e-01   6.450e-01   5.000e-01   1.711e-01  
 0.000e+00   1.887e-01   5.000e-01   6.625e-01   5.000e-01  
-1.756e-02   0.000e+00   1.711e-01   5.000e-01   8.336e-01 

and its eigenvalues are

-7.608e-17   1.281e-01   5.436e-01   1.197e+00   1.769e+00  

The distances are

\notag  \begin{aligned}   1.732  &= \|A - X_F\|_F < \|A - X_2\|_F = 2.207,\\   0.9872 &= \|A - X_2\|_2 < \|A - X_F\|_2 = 1.0355.   \end{aligned}


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from and in PDF form from the GitHub repository

What Is the Nearest Symmetric Matrix?

What is the nearest symmetric matrix to a real, nonsymmetric square matrix A? This question arises whenever a symmetric matrix is approximated by formulas that do not respect the symmetry. For example, a finite difference approximation to a Hessian matrix can be nonsymmetric even though the Hessian is symmetric. In some cases, lack of symmetry is caused by rounding errors. The natural way to symmetrize A is to replace it by (A + A^T)/2. Is this the best choice?

As our criterion of optimality we take that \| A - X \| is minimized over symmetric X for some suitable norm. Fan and Hoffman (1955) showed that (A + A^T)/2 is a solution in any unitarily invariant norm. A norm is unitarily invariant if \|A\| = \|UAV\| for all unitary U and V. Such a norm depends only on the singular values of A, and hence \|A\| = \|A^T\| since A and A^T have the same singular values. The most important examples of unitarily invariant norms are the 2-norm and the Frobenius norm.

The proof that (A+A^T)/2 is optimal is simple. For any symmetric Y,

\notag  \begin{aligned}   \Bigl\| A - \frac{1}{2}(A + A^T) \Bigr\|         &=  \frac{1}{2} \|A - A^T \|          = \frac{1}{2} \| A - Y + Y^T - A^T \| \\         &\le \frac{1}{2} \| A - Y \| + \frac{1}{2} \|(Y - A)^T \| \\         &= \| A - Y\|.   \end{aligned}

Simple examples of a matrix and a nearest symmetric matrix are

\notag    A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \;\;    X = \begin{bmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{bmatrix},\qquad    A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \;\;    X = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.

Note that any A can be written

\notag   A = \frac{1}{2} (A + A^T ) + \frac{1}{2} (A - A^T )  \equiv B + C,

where B and C are the symmetric part and the skew-symmetric part of A, respectively, so the nearest symmetric matrix to A is the symmetric part of A.

For the Frobenius norm, (A + A^T)/2 is the unique nearest symmetric matrix, which follows from the fact that \|S + K\|_F^2 = \|S\|_F^2 + \|K\|_F^2 for symmetric S and skew-symmetric K. For the 2-norm, however, the nearest symmetric matrix is not unique in general. An example of non-uniqueness is

\notag    A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad    X = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{bmatrix}, \quad    Y = \begin{bmatrix} \theta & 0 & 0 \\ 0 & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{bmatrix},

for which \|A - X\|_2 = 0.5, and \|A - Y\|_2 = 0.5 for any \theta such that |\theta| \le 0.5.

Entirely analogous to the above is the nearest skew-symmetric matrix problem, for which the solution is the skew-symmetric part for any unitarily invariant norm. For complex matrices, these results generalize in the obvious way: (A + A^*)/2 is the nearest Hermitian matrix to A and (A - A^*)/2 is the nearest skew-Hermitian matrix to A in any unitarily invariant norm.


This article is part of the “What Is” series, available from and in PDF form from the GitHub repository

What Is the Softmax Function?

The softmax function takes as input a real n-vector x and returns the vector g with elements given by

\notag   \qquad\qquad   g_j(x) =  \displaystyle\frac{\mathrm{e}^{x_j}}{\sum_{i=1}^n \mathrm{e}^{x_i}}, \quad          j=1\colon n.   \qquad\qquad (*)

It arises in machine learning, game theory, and statistics. Since \mathrm{e}^{x_j} \ge 0 and \sum_{j=1}^n g_j = 1, the softmax function is often used to convert a vector x into a vector of probabilities, with the more positive entries giving the larger probabilities.

The softmax function is the gradient of the log-sum-exp function

\notag      \mathrm{lse}(x) = \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},

where \log is the natural logarithm, that is, g_j(x) = (\partial/\partial x_j) \mathrm{lse}(x).

The following plots show the two components of softmax for n = 2. Note that they are constant on lines x_1 - x_2 = \mathrm{constant}, as shown by the contours.


Here are some examples:

>> softmax([-1 0 1])
ans =
>> softmax([-1 0 10])
ans =

Note how softmax increases the relative weighting of the larger components over the smaller ones. The MATLAB function softmax used here is available at

A concise alternative formula, which removes the denominator of (*) by rewriting it as the exponential of \mathrm{lse}(x) and moving it into the numerator, is

\notag   \qquad\qquad     g_j = \exp\bigl(x_j - \mathrm{lse}(x)\bigr).   \qquad\qquad (\#)

Straightforward evaluation of softmax from either (*) or (\#) is not recommended, because of the possibility of overflow. Overflow can be avoided in (*) by shifting the components of x, just as for the log-sum-exp function, to obtain

\notag   \qquad\qquad   g_j(x) =  \displaystyle\frac{\mathrm{e}^{x_j-\max(x)}}{\sum_{i=1}^n \mathrm{e}^{x_i-\max(x)}}, \quad             j=1\colon n.   \qquad\qquad (\dagger)

where \max(x) = \max_i x_i. It can be shown that computing softmax via this formula is numerically reliable. The shifted version of (\#) tends to be less accurate, so (\dagger) is preferred.


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from and in PDF form from the GitHub repository

What Is the Log-Sum-Exp Function?

The log-sum-exp function takes as input a real n-vector x and returns the scalar

\notag      \mathrm{lse}(x) = \log \displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},

where \log is the natural logarithm. It provides an approximation to the largest element of x, which is given by the \max function, \max(x) = \max_i x_i. Indeed,

\notag   \mathrm{e}^{\max(x)} \le \displaystyle\sum_{i=1}^n \mathrm{e}^{x_i}                        \le n \mskip1mu \mathrm{e}^{\max(x)},

and on taking logs we obtain

\notag   \qquad\qquad       \max(x) \le \mathrm{lse}(x) \le \max(x) + \log n.   \qquad\qquad (*)

The log-sum-exp function can be thought of as a smoothed version of the max function, because whereas the max function is not differentiable at points where the maximum is achieved in two different components, the log-sum-exp function is infinitely differentiable everywhere. The following plots of \mathrm{lse}(x) and \max(x) for n = 2 show this connection.


The log-sum-exp function appears in a variety of settings, including statistics, optimization, and machine learning.

For the special case where x = [0~t]^T, we obtain the function f(t) = \log(1+\mathrm{e}^t), which is known as the softplus function in machine learning. The softplus function approximates the ReLU (rectified linear unit) activation function \max(t,0) and satisfies, by (*),

\notag     \max(t,0) \le f(t) \le \max(t,0) + \log 2.

Two points are worth noting.

  • While \log(x_1 + x_2) \ne \log x_1 + \log x_2, in general, we do (trivially) have \log(x_1 + x_2) = \mathrm{lse}(\log x_1,\log x_2), and more generally \log(x_1 + x_2 + \cdots + x_n) = \mathrm{lse}(\log x_1,\log   x_2,\dots,\log x_n).
  • The log-sum-exp function is not to be confused with the exp-sum-log function: \exp \sum_{i=1}^n \log x_i = x_1x_2\dots x_n.

Here are some examples:

>> format long e
>> logsumexp([1 2 3])
ans =

>> logsumexp([1 2 30])
ans =

>> logsumexp([1 2 -3])
ans =

The MATLAB function logsumexp used here is available at

Straightforward evaluation of log-sum-exp from its definition is not recommended, because of the possibility of overflow. Indeed, \exp(x) overflows for x = 12, x = 89, and x = 710 in IEEE half, single, and double precision arithmetic, respectively. Overflow can be avoided by writing

\notag \begin{aligned}   \mathrm{lse}(x) &= \log \sum_{i=1}^n \mathrm{e}^{x_i}          = \log \sum_{i=1}^n \mathrm{e}^a \mathrm{e}^{x_i - a}          = \log  \left(\mathrm{e}^a\sum_{i=1}^n \mathrm{e}^{x_i - a}\right), \end{aligned}

which gives

\notag     \mathrm{lse}(x) = a + \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i - a}.

We take a = \max(x), so that all exponentiations are of nonpositive numbers and therefore overflow is avoided. Any underflows are harmless. A refinement is to write

\notag     \qquad\qquad     \mathrm{lse}(x) = \max(x) + \mathrm{log1p}\Biggl( \displaystyle\sum_{i=1 \atop i\ne k}^n                                 \mathrm{e}^{x_i - \max(x)}\Biggr),     \qquad\qquad (\#)

where x_k = \max(x) (if there is more than one such k, we can take any of them). Here, \mathrm{log1p}(x) = \log(1+x) is a function provided in MATLAB and various other languages that accurately evaluates \log(1+x) even when x is small, in which case 1+x would suffer a loss of precision if it was explicitly computed.

Whereas the original formula involves the logarithm of a sum of nonnegative quantities, when \max(x) < 0 the shifted formula (\#) computes \mathrm{lse}(x) as the sum of two terms of opposite sign, so could potentially suffer from numerical cancellation. It can be shown by rounding error analysis, however, that computing log-sum-exp via (\#) is numerically reliable.


This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from and in PDF form from the GitHub repository

Top Ten Posts of 2020


According to the WordPress statistics, this blog received over 64,000 visitors and 106,000 views in 2020.

Here are the ten most-viewed posts published during the year. All are in the What Is series that I started in March 2020, and which will continue in 2021.

  1. What is Numerical Stability?
  2. What Is a Condition Number?
  3. What Is a Symmetric Positive Definite Matrix?
  4. What Is the Sylvester Equation?
  5. What Is Backward Error?
  6. What Is a Matrix Square Root?
  7. What Is a Fréchet Derivative?
  8. What Is the Complex Step Approximation?
  9. What Is a Cholesky Factorization?
  10. What Is the Hilbert Matrix?